Research Article

Metadata Management in Data Lake Environments: A Survey

Published online: 15 Jul 2024
 

Abstract

Data lakes are storage repositories that hold large amounts of data in their native format, whether structured, semi-structured, or unstructured, to be used when needed. Data lakes support a wide range of use cases, such as advanced analytics and knowledge pattern extraction. However, simply dumping all the data into a data lake would only lead to a so-called data swamp. To prevent such a situation, enterprises can adopt best practices, one of which is to build and maintain metadata. In recent years, there has been a growing body of research on managing metadata in data lake environments. Existing research efforts deal separately with different activities such as metadata modeling, metadata capture and extraction, and metadata usage. Nevertheless, despite its importance, a global view of the research landscape on metadata management for data lakes is still missing. This survey brings together different facets of metadata management in data lakes and presents a global view along with the technological implications and the features required for building successful metadata management systems. In addition, this survey summarizes and discusses research gaps, open problems, and the main challenges facing both industry practitioners and academics. This survey pertains to the broader field of Big Data, and especially to the data platforms that manage enterprise big data assets. Furthermore, given the parallels between data lakes and digital libraries in their dependence on metadata for content management, this study could offer the digital library community valuable insights and a technological outlook on metadata management.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 An API, or Application Programming Interface, is a set of rules or protocols that let software applications communicate with each other to exchange data, features and functionality, https://www.ibm.com/topics/api [Accessed: 06-Apr-2024].

2 Integrity constraints are rules that help to maintain the accuracy and consistency of data in a database, https://www.knowledgehut.com/blog/database/integrity-constraints-in-dbms [Accessed: 09-Apr-2024].

3 Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data, https://aws.amazon.com/what-is/hadoop/ [Accessed: 06-Apr-2024].

4 A raw data zone is an area where all types of data are ingested without processing and stored in their native format [77].

5 UML stands for Unified Modeling Language, a specification defining a graphical language for visualizing, specifying, constructing, and documenting the artifacts of distributed object systems, https://www.omg.org/spec/UML/2.5.1/About-UML [Accessed: 06-Apr-2024].

6 Apache Tika is a toolkit that detects and extracts metadata and text from over a thousand different file types, https://tika.apache.org/ [Accessed: 06-Apr-2024].

7 Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL, https://hive.apache.org/ [Accessed: 06-Apr-2024].

8 Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions, www.wikipedia.org, [Accessed: 28-Mar-2024].

9 A graphical representation of entities and relationships of the entity-relationship model.

10 Graph Theory is a general theory of mathematical structures and their relations, https://en.wikipedia.org/ [Accessed: 09-Apr-2024].

11 Data Vault and other Ensemble Modeling patterns (EMP) are data modeling approaches optimized for enterprise data integration, data historization, big data, streaming, and all situations requiring highly flexible data structures, http://dvstandards.com/ [Accessed: 29-Mar-2024].

12 NoSQL stands for Not Only SQL, an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases, www.wikipedia.org [Accessed: 29-Mar-2024].

14 JSONiq is a query language specifically designed for the popular JSON data model, https://www.jsoniq.org/, [Accessed: 07-Apr-2024].

15 SPARQL is a query language for the Resource Description Framework RDF, https://www.w3.org/TR/rdf-sparql-query/, [Accessed: 07-Apr-2024].

16 In computer science, an inverted index is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents, www.wikipedia.com, [Accessed: 07-Apr-2024].
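
As an illustration of note 16, the following is a minimal sketch of an inverted index over a small set of documents. It is not taken from the survey: the function names `build_inverted_index` and `lookup`, the sample documents, and the simple word tokenization are all assumptions made for this example.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the sorted ids of the documents containing it.

    `documents` is a hypothetical dict of {document id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization (an assumption): runs of ASCII letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def lookup(index, word):
    """Return the ids of documents containing `word`, or an empty list."""
    return index.get(word.lower(), [])


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "Data lakes store raw data",
    "d2": "Metadata keeps data lakes usable",
}
index = build_inverted_index(docs)
```

Here `lookup(index, "data")` returns both document ids, while `lookup(index, "metadata")` returns only `"d2"`. The mapping runs from content to location, which is the reverse of reading each document to find its words.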
