Abstract
Data lakes are storage repositories that hold large amounts of data in their native format, whether structured, semi-structured, or unstructured, to be used when needed. Data lakes are open to a wide range of use cases, such as carrying out advanced analytics and extracting knowledge patterns. However, simply dumping all the data into a data lake would only lead to a so-called data swamp. To prevent such a situation, enterprises can adopt best practices, among them building and maintaining metadata. In recent years there has been a growing body of research on managing metadata in data lake environments. Existing research efforts deal separately with different activities such as metadata modeling, metadata capture and extraction, and metadata usage. Nevertheless, despite its importance, a global view of the research landscape on metadata management for data lakes is still missing. This survey brings together different facets of metadata management in data lakes and presents a global view along with the technological implications and the required features for building successful metadata management systems. In addition, this survey summarizes and discusses research gaps, open problems, and the main challenges facing both industrialists and academics. This survey pertains to the broader field of Big Data and especially to the data platforms that manage enterprise big data assets. Furthermore, considering the parallels between data lakes and digital libraries regarding their dependence on metadata for content management, this study could offer the digital library community valuable insights and a technological outlook on metadata management.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 An API, or Application Programming Interface, is a set of rules or protocols that let software applications communicate with each other to exchange data, features and functionality, https://www.ibm.com/topics/api [Accessed: 06-Apr-2024].
2 Integrity constraints are rules that help to maintain the accuracy and consistency of data in a database, https://www.knowledgehut.com/blog/database/integrity-constraints-in-dbms [Accessed: 09-Apr-2024].
3 Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data, https://aws.amazon.com/what-is/hadoop/ [Accessed: 06-Apr-2024].
4 A raw data zone is an area where all types of data are ingested without processing and stored in their native format [77].
5 UML stands for Unified Modeling Language, a specification defining a graphical language for visualizing, specifying, constructing, and documenting the artifacts of distributed object systems, https://www.omg.org/spec/UML/2.5.1/About-UML [Accessed: 06-Apr-2024].
6 Apache Tika is a toolkit that detects and extracts metadata and text from over a thousand different file types, https://tika.apache.org/ [Accessed: 06-Apr-2024].
7 Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL, https://hive.apache.org/ [Accessed: 06-Apr-2024].
8 Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions, www.wikipedia.org, [Accessed: 28-Mar-2024].
9 A graphical representation of entities and relationships of the entity-relationship model.
10 Graph Theory is a general theory of mathematical structures and their relations, https://en.wikipedia.org/ [Accessed: 09-Apr-2024].
11 Data Vault and other Ensemble Modeling patterns (EMP) are data modeling approaches optimized for enterprise data integration, data historization, big data, streaming, and all situations requiring highly flexible data structures, http://dvstandards.com/ [Accessed: 29-Mar-2024].
12 NoSQL stands for Not Only SQL, an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases, www.wikipedia.org, [Accessed: 29-Mar-2024].
13 https://azure.microsoft.com/en-us/products/data-catalog/, [Accessed: 09-Apr-2024].
14 JSONiq is a query language specifically designed for the popular JSON data model, https://www.jsoniq.org/, [Accessed: 07-Apr-2024].
15 SPARQL is a query language for the Resource Description Framework (RDF), https://www.w3.org/TR/rdf-sparql-query/, [Accessed: 07-Apr-2024].
16 In computer science, an inverted index is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents, www.wikipedia.com, [Accessed: 07-Apr-2024].
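As an illustration of note 16, the following minimal Python sketch builds an inverted index over a handful of short documents. The function name `build_inverted_index`, the lowercase whitespace tokenization, and the sample documents are assumptions made purely for illustration; they are not taken from the survey or from any particular data lake system.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted list of ids of documents containing it.

    `documents` is assumed to be a dict of {doc_id: text}; tokenization is a
    plain whitespace split, which is a simplification.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Convert the sets to sorted lists so lookups return a stable order.
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents, keyed by an integer id.
docs = {
    1: "data lake metadata",
    2: "metadata catalog",
    3: "data swamp",
}

index = build_inverted_index(docs)
print(index["metadata"])  # [1, 2]
```

Looking up a word in `index` returns the ids of the documents that contain it, which is the content-to-location mapping the note describes.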