Metadata Management in Data Lake Environments: A Survey: Journal of Library Metadata: Vol 0, No 0

Abstract

Data lakes are storage repositories that hold large amounts of data in its native format, whether structured, semi-structured, or unstructured, to be used when needed. Data lakes are open to a wide range of use cases, such as carrying out advanced analytics and extracting knowledge patterns. However, simply dumping all the data into a data lake would only lead to a so-called data swamp. To prevent such a situation, enterprises can adopt best practices, among which is building and maintaining metadata. In recent years, there has been a growing body of research on managing metadata in data lake environments. Existing research efforts deal separately with different activities such as metadata modeling, metadata capture and extraction, and metadata usage. Nevertheless, despite its importance, a global view of the research landscape on metadata management for data lakes is still missing. This survey brings together different facets of metadata management in data lakes and presents a global view along with the technological implications and the required features for building successful metadata management systems. It also summarizes and discusses research gaps, open problems, and main challenges facing both industry practitioners and academics. This survey pertains to the broader field of Big Data and especially to the data platforms that manage enterprise big data assets. Furthermore, considering the parallels between data lakes and digital libraries regarding their dependence on metadata for content management, this study could offer valuable insights to the digital library community by giving it a technological outlook on metadata management.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 An API, or Application Programming Interface, is a set of rules or protocols that let software applications communicate with each other to exchange data, features and functionality, https://www.ibm.com/topics/api [Accessed: 06-Apr-2024].

2 Integrity constraints are rules that help to maintain the accuracy and consistency of data in a database, https://www.knowledgehut.com/blog/database/integrity-constraints-in-dbms [Accessed: 09-Apr-2024].

3 Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data, https://aws.amazon.com/what-is/hadoop/ [Accessed: 06-Apr-2024].

4 A raw data zone is an area where all types of data are ingested without processing and stored in their native format [77].

5 UML stands for Unified Modeling Language, a specification defining a graphical language for visualizing, specifying, constructing, and documenting the artifacts of distributed object systems, https://www.omg.org/spec/UML/2.5.1/About-UML [Accessed: 06-Apr-2024].

6 Apache Tika is a toolkit that detects and extracts metadata and text from over a thousand different file types, https://tika.apache.org/ [Accessed: 06-Apr-2024].

7 Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL, https://hive.apache.org/ [Accessed: 06-Apr-2024].

8 Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions, www.wikipedia.org, [Accessed: 28-Mar-2024].

9 A graphical representation of entities and relationships of the entity-relationship model.

10 Graph theory is a general theory of mathematical structures and their relations, https://en.wikipedia.org/ [Accessed: 09-Apr-2024].

11 Data Vault and other Ensemble Modeling patterns (EMP) are data modeling approaches optimized for enterprise data integration, data historization, big data, streaming, and all situations requiring highly flexible data structures, http://dvstandards.com/ [Accessed: 29-Mar-2024].

12 NoSQL stands for Not Only SQL, an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases, www.wikipedia.org, [Accessed: 29-Mar-2024].

13 https://azure.microsoft.com/en-us/products/data-catalog/, [Accessed: 09-Apr-2024].

14 JSONiq is a query language specifically designed for the popular JSON data model, https://www.jsoniq.org/, [Accessed: 07-Apr-2024].

15 SPARQL is a query language for the Resource Description Framework RDF, https://www.w3.org/TR/rdf-sparql-query/, [Accessed: 07-Apr-2024].

16 In computer science, an inverted index is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents, www.wikipedia.com, [Accessed: 07-Apr-2024].
