Hadoop Data Lake – Explanation & Overview

What is a Hadoop data lake?

Hadoop is a key building block of many data lake architectures: a Hadoop data lake is one built on a platform of Hadoop clusters. Hadoop is particularly popular for data lakes because it is open source (an Apache Software Foundation project), which can significantly reduce the cost of building large-scale data stores.

Data stored on Hadoop clusters is typically non-relational and can include JSON objects, log files, images, and web posts. This kind of architecture is not built for transaction processing; instead, it is geared toward supporting analytics applications.
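As a minimal sketch of what "non-relational" means in practice, the Java snippet below writes a raw JSON event straight into the lake through Hadoop's FileSystem API. The NameNode address (hdfs://namenode:8020), the paths, and the event contents are all illustrative assumptions, and the sketch presumes the hadoop-client library is on the classpath.

```java
// Minimal sketch: landing a raw JSON event in a Hadoop data lake.
// The cluster address and paths below are hypothetical placeholders.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawEventWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode endpoint

        try (FileSystem fs = FileSystem.get(conf)) {
            // The event is stored exactly as produced: no table, no schema.
            String jsonEvent = "{\"user\":\"u42\",\"action\":\"page_view\",\"ts\":1700000000}";
            Path target = new Path("/lake/raw/events/2024-01-15.json"); // illustrative path
            try (FSDataOutputStream out = fs.create(target, true)) {
                out.write(jsonEvent.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```

Nothing enforces a schema at write time; structure is only imposed later, when an analytics job reads the data back (the "schema-on-read" approach that distinguishes lakes from transactional systems).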

In Hadoop data lakes, the data is most commonly stored in the Hadoop Distributed File System (HDFS). HDFS enables parallel processing because, as data is ingested, it is broken into blocks and distributed across the nodes of a cluster. Hadoop data lakes can also hold a mix of structured, semi-structured, and unstructured data, which can make them more suitable for certain workloads than more narrowly focused data warehouses.
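To make that block distribution concrete, the sketch below asks HDFS where the blocks of one file actually live. The cluster endpoint and file path are again hypothetical; FileSystem.getFileBlockLocations() is the HDFS client call that reports each block's offset, length, and hosting DataNodes.

```java
// Minimal sketch: inspecting how HDFS has split a file into blocks
// and spread them across DataNodes. Endpoint and path are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode endpoint

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/lake/raw/clickstream.log"));
            // Each BlockLocation is one segment of the file; getHosts() lists
            // the DataNodes holding a replica of that segment.
            for (BlockLocation block :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```

It is exactly this block-to-host mapping that processing frameworks such as MapReduce use to schedule work on, or near, the node that already holds each block, which is what allows all segments of a file to be processed simultaneously.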

Hadoop is not the only way to build a data lake; other examples include Azure Data Lake Store and the Amazon S3 cloud object store. No definition of a data lake stems solely from the technology it uses, so it is possible that future data lake architectures won't involve Hadoop at all.

Another important distinction in big data architecture is that of data warehouse vs. data lake, which comes down to how much data is held and how structured it is. A data lake is a large repository, often petabytes in scale, that holds raw data as blobs or files. A data warehouse, by contrast, is far more curated: the data it holds is usually processed and refined, making it easier and faster to query for business intelligence.