Russian Federation
A. F. Mozhaisky Military Space Academy (VKA) (Department of Mathematics and Software, Professor)
Russian Federation
This paper discusses the transition from traditional data warehouses to data lakes in geographic information systems (GIS) using the Lambda architecture. It provides an overview of the key transition steps, including planning, data collection and processing, data querying, data analytics, and metadata management. Particular attention is paid to the interaction between data lakes and GIS, and sample big data processing code based on the Lambda architecture is presented. The advantages of using data lakes in GIS and the possibilities of integrating modern data processing technologies are considered.
data lakes; data warehouses; Lambda architecture; geographic information systems; metadata; big data processing; data integration; data analytics; transition from data warehouses
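As a rough illustration of the Lambda architecture named in the abstract, the following minimal Python sketch shows how its three layers could fit together for point data in a GIS setting. It is not the code from the paper: the `Observation` record, the one-degree grid tiling, and every function name here are hypothetical choices made for the example. The batch layer recomputes a view from the full master dataset, the speed layer updates a small real-time view incrementally, and the serving layer merges both at query time.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical point record: longitude and latitude in degrees."""
    lon: float
    lat: float


def grid_cell(obs, cell_size_deg=1.0):
    """Map a point to the integer index of its grid cell (illustrative tiling)."""
    return (int(obs.lon // cell_size_deg), int(obs.lat // cell_size_deg))


def batch_view(master_dataset):
    """Batch layer: recompute per-cell counts from the whole master dataset."""
    return Counter(grid_cell(o) for o in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count records that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, obs):
        self.realtime_view[grid_cell(obs)] += 1


def query(cell, batch, speed):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch[cell] + speed.realtime_view[cell]


# Records already stored in the data lake (assumed sample values).
master = [Observation(30.3, 59.9), Observation(30.7, 59.5), Observation(37.6, 55.7)]
batch = batch_view(master)

# A record that arrives after the batch view was computed.
speed = SpeedLayer()
speed.ingest(Observation(30.1, 59.2))

print(query((30, 59), batch, speed))  # two batch records plus one recent record: 3
```

In a production system the batch view would typically be built by a distributed engine over files in the lake and the speed layer by a stream processor; the in-memory counters above only stand in for those components.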