What is a “data lake”?

A data lake is a central repository in which structured and unstructured data of any volume can be stored. Below, we explain exactly what this means and what the advantages and disadvantages are.

The term data lake describes a very large data store that holds data from a wide variety of sources. The key difference from conventional databases is that a data lake stores data in its original raw format. This can be structured as well as unstructured data – it does not need to be validated or reformatted before storage. Structuring or reformatting is deferred until the data in question is actually needed, a principle known as schema-on-read. In this way, the data lake can be fed from a wide variety of sources and, ideally, used for flexible analyses in the Big Data environment.
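The raw-storage and schema-on-read idea can be sketched in a few lines of Python. The example below uses a local directory as a stand-in for a data lake; file names, formats, and field names are purely illustrative.

```python
import json
import tempfile
from pathlib import Path

# A local directory acts as a stand-in for the lake's raw zone
# (the layout and names are illustrative assumptions).
lake = Path(tempfile.mkdtemp()) / "raw"
lake.mkdir(parents=True)

# Ingest: data from different sources is written in its original format,
# with no validation or reformatting at write time.
(lake / "clickstream.jsonl").write_text(
    '{"user": "a", "page": "/home"}\n{"user": "b", "page": "/shop"}\n'
)
(lake / "sensor.csv").write_text("sensor_id,temp\n1,21.5\n2,19.0\n")

# Schema-on-read: structure is imposed only when the data is needed.
clicks = [json.loads(line)
          for line in (lake / "clickstream.jsonl").read_text().splitlines()]
header, *rows = (lake / "sensor.csv").read_text().splitlines()
readings = [dict(zip(header.split(","), row.split(","))) for row in rows]

print(clicks[0]["page"])    # -> /home
print(readings[1]["temp"])  # -> 19.0
```

The write path stays cheap because nothing is parsed or validated on ingest; each consumer decides later how to interpret the raw bytes.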

The concept of the data lake is supported by many frameworks and file systems for Big Data applications as well as by the distributed storage of data. For example, data lakes can be implemented with the Apache Hadoop Distributed File System (HDFS). Alternatively, they can also be implemented with cloud services such as Azure Data Lake and Amazon Web Services (AWS).

Requirements for a Data Lake

To serve the applications built on top of it, a data lake must meet the following requirements:

  • It must be possible to store a wide variety of data types and formats in order to avoid distributed data silos.
  • Common frameworks and protocols of database systems and Big Data applications must be supported in order to allow the most flexible use of the data.
  • Data protection and data security must be ensured through role-based access control, data encryption, and mechanisms for backing up and restoring data.
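One of the security measures listed above, role-based access control, can be sketched as a simple mapping from roles to the actions they may perform in each lake zone. The role names, zone names, and permission table below are illustrative assumptions, not the API of any real product.

```python
# Hypothetical permission table: each role is granted a set of
# actions per lake zone (all names are illustrative).
PERMISSIONS = {
    "data_engineer": {"raw": {"read", "write"}, "curated": {"read", "write"}},
    "analyst":       {"curated": {"read"}},
}

def is_allowed(role: str, zone: str, action: str) -> bool:
    """Return True if the given role may perform the action in the zone."""
    return action in PERMISSIONS.get(role, {}).get(zone, set())

print(is_allowed("analyst", "curated", "read"))  # -> True
print(is_allowed("analyst", "raw", "read"))      # -> False
```

In practice such checks sit in the storage layer (e.g. HDFS permissions or cloud IAM policies) rather than in application code, but the role-to-zone mapping is the same idea.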

Advantages and disadvantages

➕ More meaningful, in-depth analyses thanks to the large amount of information available
➕ Fast write operations, since data is stored in its raw format without prior structuring or reformatting
➕ Low computing-power requirements, even when storing large amounts of data
➕ No restriction on analysis possibilities, since all data is included

➖ High demands on data protection and data security (the more data and interrelationships stored, the greater the need for protection)

Source: BigData-Insider