Data Lakes Requirements

Data lakes will surround, and in some cases drown, the data warehouse, and we’ll see significant technology innovations, data lake products, methodologies, and reference architectures that turn the promise of broader data access and big data insights into a reality. But big data products and solutions must mature and go beyond the role of being primarily developer tools for highly skilled programmers. The enterprise data lake will allow organizations to track, manage, and leverage data they’ve never had access to in the past. New enterprise data management strategies are already leading to more predictive and prescriptive analytics that are driving improved customer service experiences, cost savings, and an overall competitive advantage when there is the right alignment with key business initiatives.

So whether your enterprise data warehouse is on life support or moving into maintenance mode, it will most likely continue to do what it’s good at for the time being: operational and historical reporting and analysis (aka rear view mirror). As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle the new requirements. Thinking that you can simply hire more developers to write code or rely on your legacy rows-and-columns-centric tools is a recipe to sink into a data swamp instead of swimming in a data lake.

First, let’s define what a data lake is.

What is a Data Lake?

A data lake is a large, centralized repository of structured and unstructured data that is stored in its raw, native format. Data lakes are designed to provide a scalable and flexible platform for storing and analyzing large amounts of data, and they are often used by organizations to store data from a variety of sources, such as sensors, social media feeds, and transactional systems. The data in a data lake can be processed and analyzed using a wide range of tools and technologies, including batch processing systems, real-time stream processing engines, and interactive query engines. The goal of a data lake is to provide a single, central repository for all of an organization’s data, where it can be easily accessed, queried, and analyzed to support a wide range of use cases.

Great. Now that we know what a data lake is, let’s get more technical.

Here are eight enterprise data management requirements that must be addressed in order to get maximum value from your big data technology investments and data lake products.

8 Enterprise Data Management Requirements for Your Data Lake

1) Storage and Data Formats

Traditional data warehousing focused on relational databases as the primary data and storage format. A key concept of the data lake is the ability to reliably store a large amount of data. Such data volumes are typically much larger than what can be handled in traditional relational databases, or much larger than what can be handled in a cost-effective manner. To this end, the underlying data storage must be scalable and reliable. The Hadoop Distributed File System (HDFS) and associated Hadoop data management tools have matured and are now the leading data storage technology that enables the reliable persistence of large-scale data. However, other storage and data lake products can also provide the data store backend for the data lake. Open-source systems such as Cassandra, HBase, and MongoDB can provide reliable storage for the data lake. Alternatively, cloud-based storage services can also be used as a data store backend. Such services include Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage.

Unlike relational databases, big data storage does not usually dictate a data storage format. That is, big data storage supports arbitrary data formats that are understood by the applications that use the data. For example, data may be stored in CSV, RCFile, ORC, or Parquet to name a few. In addition, various compression techniques — such as GZip, LZO, and Snappy — can be applied to data files to improve space and network bandwidth utilization. This makes data lake storage much more flexible. Multiple formats and compression techniques can be used in the same data lake to best support specific data and query requirements.
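To make the point about formats and compression concrete, here is a minimal sketch that writes the same records in two text formats and applies GZip to each. It uses only the Python standard library, so it covers CSV and JSON lines rather than ORC or Parquet, and the sensor records are made-up sample data, not anything from a real lake.

```python
import csv
import gzip
import io
import json

# Hypothetical sample records; a real lake would hold far more data.
records = [{"sensor_id": i % 5, "reading": 20.0 + (i % 7)} for i in range(1000)]


def to_csv_bytes(rows):
    """Serialize rows to CSV text, encoded as UTF-8 bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sensor_id", "reading"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def to_json_lines_bytes(rows):
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")


csv_raw = to_csv_bytes(records)
json_raw = to_json_lines_bytes(records)
csv_gz = gzip.compress(csv_raw)
json_gz = gzip.compress(json_raw)

# Compare how much space each format and compression choice takes.
for name, payload in [("csv", csv_raw), ("csv+gzip", csv_gz),
                      ("jsonl", json_raw), ("jsonl+gzip", json_gz)]:
    print(f"{name:12s} {len(payload):6d} bytes")
```

The same data can sit in the lake in any of these four forms; which one is best depends on who reads it and how often.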

2) Ingest and Delivery

Data lakes need mechanisms for getting data into and out of the backend storage platform. In traditional data warehouses, data is inserted and queried using some form of SQL and a database driver, possibly via ODBC or JDBC. While compatibility drivers do exist to access Hadoop data, the variety of data formats requires more flexible tooling to accommodate the different formats. Open source tools such as Sqoop and Flume provide low-level interfaces for pulling in data from relational databases and log data respectively. In addition, custom MapReduce programs and scripts are currently used to import data from APIs and other data sources. Commercial tools provide pre-built connectors and broad data format support to mix and match data sources to data repositories in the data lake.

Given the variety of data formats for Hadoop data, a comprehensive schema management tool does not yet exist. Hive’s metastore extended via HCatalog provides a relational schema manager for Hadoop data. Yet, not all data formats can be described in HCatalog. To date, quite a bit of Hadoop data is defined inside the applications themselves, perhaps using JSON, Avro, RCFile, or Parquet. Just like with data endpoints and data formats, the right commercial tools can help describe the lake data and surface the schemas to the end users more readily.

3) Discovery and Preparation

Due to the flexibility of data formats in Hadoop data management tools and other data lake backend storage platforms, it is common to dump data into the lake before fully understanding the schema of the data. In fact, a lot of lake data may be highly unstructured. In any case, the cost-effectiveness of Hadoop data makes it possible to prepare the data after it has been acquired. This is more ELT (extract, load, transform) than traditional ETL (extract, transform, load). However, at some point the format of the data must be understood before useful work can be done with a data set.
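As a rough illustration of discovering a format after the data has already been loaded, the sketch below scans raw JSON lines and reports which fields appear and which value types each one carries. The sample lines and the infer_schema helper are hypothetical; real discovery tools do far more than this.

```python
import json

# Hypothetical raw lines as they might land in the lake before any modeling.
raw_lines = [
    '{"user": "ann", "amount": 12.5}',
    '{"user": "bob", "amount": 7, "coupon": "SPRING"}',
    'not valid json',
    '{"user": "cho", "amount": "n/a"}',
]


def infer_schema(lines):
    """Return ({field: set of observed type names}, count of unusable lines)."""
    schema = {}
    rejected = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected += 1
            continue
        if not isinstance(record, dict):
            rejected += 1
            continue
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema, rejected


schema, rejected = infer_schema(raw_lines)
for field in sorted(schema):
    print(field, sorted(schema[field]))
print("rejected lines:", rejected)
```

A field that shows up with several types, such as amount here, is exactly the kind of finding that tells you cleansing is needed before analytics can start.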

In the open-source ecosystem, discovery and preparation can be done at the command line with scripting languages, such as Python and Pig. Ultimately, native MapReduce jobs, Pig, or Hive can be used to extract useful data out of semi-structured data. This new, accessible data can be used by further analytic queries or machine-learning algorithms. In addition, the prepared data can be delivered to traditional relational databases so that conventional business intelligence tools can directly query it.

Commercial offerings in the data discovery and basic data preparation space offer web-based interfaces (although some are basic on-premises tools for so-called “data blending”) for investigating raw data and then devising strategies for cleansing and pulling out relevant data. Such commercial tools range from “lightweight” spreadsheet-like interfaces to heuristic-based analysis interfaces that help guide data discovery and extraction.

4) Transformations and Analytics

Not only are systems like Hadoop more flexible in the types of data that can be stored, but they are also more flexible in the types of queries and computations that can be performed on the stored data. SQL is a powerful language for querying and transforming relational data but is not appropriate for queries on non-relational data and for employing iterative machine learning algorithms and other arbitrary computations. Tools like Hive, Impala, and Spark SQL bring SQL-like queries to Hadoop data. However, tools like Cascading, Crunch, and Pig bring more flexible data processing to Hadoop data. Most of these tools are powered by one of the two most widely-used data processing engines: MapReduce or Spark.

In the data lake we see three types of transformations and analytics: simple transformations, analytic queries, and ad-hoc computation. Simple transformations include tasks such as data preparation, data cleansing, and filtering. Analytic queries are used to provide a summary view of a data set, perhaps cross-referencing other data sets. Finally, ad-hoc computation can be used to support a variety of algorithms, for example, building a search index or classification via machine learning. Often such algorithms are iterative in nature and require several passes over the data.
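The three categories can be shown side by side on a tiny in-memory data set. This is only a sketch in plain Python with made-up order records; in a real lake each step would run on an engine such as MapReduce or Spark. The two_means function is a hypothetical stand-in for an iterative algorithm (a one-dimensional k-means with two clusters).

```python
from collections import defaultdict

# Hypothetical order events; None marks a missing amount.
orders = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": None},
    {"region": "east", "amount": 30.0},
    {"region": "west", "amount": 20.0},
]

# 1) Simple transformation: cleanse by dropping records with no amount.
clean = [o for o in orders if o["amount"] is not None]

# 2) Analytic query: summarize the total amount per region.
totals = defaultdict(float)
for o in clean:
    totals[o["region"]] += o["amount"]


# 3) Ad-hoc computation: an iterative algorithm that makes several
#    passes over the data (here, 1-D k-means with two clusters).
def two_means(values, iterations=10):
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        if near_high:
            high = sum(near_high) / len(near_high)
    return low, high


centers = two_means([o["amount"] for o in clean])
print(dict(totals), centers)
```

Note how the third step loops over the same values repeatedly, which is the access pattern that is hard to express in a single SQL query.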

5) Streaming

Traditional data warehouses support batch analytic queries. However, in the open source ecosystem as well as in commercial products we are seeing a convergence toward hybrid batch and streaming architectures. For example, Spark supports both batch processing and stream processing with Spark Streaming. Apache Flink is another project aiming to combine batch and stream processing. This is a natural progression because fundamentally it is possible to use very similar APIs and languages to specify a batch or streaming computation. It is no longer necessary to have two completely disparate systems. In fact, a unified architecture makes it easier to discover different types of data sources.

Hybrid batch and streaming architectures will also prove to be extremely beneficial when it comes to IoT data. Streaming can be used to analyze and react to data in real-time as well as to ingest data into the data lake for batch processing. Modern, high-performance messaging systems such as Apache Kafka can be used to help in the unification of batch and streaming. Integration tools can help feed Kafka, process Kafka data in a streaming fashion, and also feed a data lake with filtered and aggregated data.
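The idea that one API can describe both a batch and a streaming computation can be sketched with ordinary Python iterables. Nothing here uses Spark, Flink, or Kafka; sensor_stream is a hypothetical stand-in for a message consumer, and running_counts is an illustrative function name.

```python
from typing import Dict, Iterable, Iterator


def running_counts(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated count per device after every event.

    Works on any iterable, so the same logic serves a finite batch
    (a list) and an unbounded stream (a generator).
    """
    counts: Dict[str, int] = {}
    for device in events:
        counts[device] = counts.get(device, 0) + 1
        yield dict(counts)  # snapshot, safe for the caller to keep


def sensor_stream() -> Iterator[str]:
    """Hypothetical stand-in for a message consumer, e.g. a Kafka topic."""
    for device in ["thermostat", "camera", "thermostat"]:
        yield device


# Batch: consume everything, keep only the final result.
batch_result = list(running_counts(["thermostat", "camera", "thermostat"]))[-1]

# Streaming: react to each intermediate result as it arrives.
stream_results = []
for snapshot in running_counts(sensor_stream()):
    stream_results.append(snapshot)

print(batch_result, stream_results[-1])
```

The computation is written once; only the way the caller consumes the results differs between the batch and the streaming case.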

6) Scheduling and Workflow

Orchestration in the data lake is a mandatory requirement. Scheduling refers to launching jobs at specified times or in response to an external trigger. Workflow refers to specifying job dependencies and providing a means to execute jobs in a way that the dependencies are respected. A job could be a form of data acquisition, data transformation, or data delivery. In the context of a data lake, scheduling and workflow both need to interface with the underlying data storage and data processing platforms. For the enterprise, scheduling and workflow should be defined via a graphical user interface and not through the command line.

The open-source ecosystem provides some low-level tools such as Oozie, Azkaban, and Luigi. These tools provide command line interfaces and file-based configuration. They are useful mainly for orchestrating work primarily within Hadoop.
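The core of workflow, running jobs so that dependencies are respected, comes down to ordering a dependency graph. Here is a minimal sketch using graphlib from the Python standard library (Python 3.9 or later). The job names and the run_workflow helper are hypothetical, and this is not how Oozie, Azkaban, or Luigi are configured.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs mapped to the set of jobs they depend on.
dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_orders_customers": {"ingest_orders", "ingest_customers"},
    "deliver_report": {"join_orders_customers"},
}


def run_workflow(deps, run_job):
    """Run every job only after all of its dependencies have finished."""
    order = []
    for job in TopologicalSorter(deps).static_order():
        run_job(job)
        order.append(job)
    return order


executed = run_workflow(dependencies, lambda job: print("running", job))
```

A scheduler would wrap a call like this in a time-based or event-based trigger; a production tool would also add retries, parallelism, and failure handling.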

Commercial data integration tools provide high-level interfaces to scheduling and workflow, making such tasks more accessible to a wider range of IT professionals.

7) Metadata and Governance

Two areas that are still less mature in data lake products such as Hadoop are metadata and governance. Metadata refers to update and access requests as well as schema. These capabilities are provided in the context of the conventional relational data warehouse, where updates are more easily tracked and schema is more constrained.

Work in open source on metadata and governance is progressing, but there is not widespread agreement on a particular implementation. For example, Apache Sentry helps enforce role-based authorization for Hadoop data. It works with some, but not all, Hadoop data management tools.

Enterprises looking to better manage metadata and governance currently employ custom solutions or simply live with limited functionality in this regard. Recently LinkedIn open-sourced an internal tool called WhereHows that may improve the ability to collect, discover, and understand metadata in the data lake. Look to see commercial data integration solution providers develop new ways to manage metadata and governance in the enterprise data lake.

8) Security

Security in the various data lake backends is also evolving, and it is addressed at different levels. Hadoop supports Kerberos authentication and UNIX-style authorization via file and directory permissions. Apache Sentry and Cloudera’s RecordService are two approaches to fine-grained authorization within Hadoop data files. There is no universal agreement on an approach to authorization; consequently, not all Hadoop tools support all of the different approaches. This makes it difficult to standardize at the moment because the authorization approach you select will restrict the tools that you can use.
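For readers less familiar with UNIX-style authorization, the sketch below models how a read request is decided from a file’s permission bits. It is a plain-Python model rather than a call into any Hadoop API, and the users, groups, and can_read helper are hypothetical. It also shows why this model is coarse: access is decided per file, not per row or column.

```python
import stat


def can_read(mode, owner, group, user, user_groups):
    """UNIX-style read check: exactly one permission class applies.

    The owner class is checked first, then group, then other.
    """
    if user == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# Hypothetical file: owner may read and write, group may read (mode 640).
mode = 0o640
owner_ok = can_read(mode, "etl", "analysts", "etl", {"etl"})
group_ok = can_read(mode, "etl", "analysts", "dana", {"analysts"})
other_ok = can_read(mode, "etl", "analysts", "guest", {"guests"})
print(owner_ok, group_ok, other_ok)
```

Fine-grained approaches such as the ones named above exist precisely because a single allow-or-deny answer for the whole file is often not enough.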

A lack of a standard makes it difficult for commercial products to provide comprehensive support at this time. However, in the interim, commercial products can serve as a gateway to the data lake and provide a good amount of security functionality that can help enterprises meet their security requirements in the short term, then adopt standardized mechanisms as they become available.

Tools Used to Manage and Analyze Data in a Data Lake

Some common tools that are used to manage and analyze data in a data lake include:

Apache Hadoop: An open-source framework that is commonly used to build and manage data lakes. It includes a distributed storage system (HDFS) for storing data, as well as a range of tools for processing and analyzing data, such as MapReduce, Pig, and Hive.
Apache Spark: An open-source, distributed computing system that is designed for high-speed, large-scale data processing. It is often used to analyze data in a data lake, and it includes a range of tools and libraries for working with data, such as SQL and machine learning libraries.
Elasticsearch: A search and analytics engine that is commonly used to index and query data in a data lake. It is designed to handle large volumes of data and to provide fast and flexible search capabilities.
Amazon S3: A cloud-based storage service that is often used to store data in a data lake. It is scalable, durable, and secure, and it includes a range of features that make it easy to manage and analyze data at scale.
Tableau: A popular business intelligence and data visualization tool that is often used to explore and analyze data in a data lake. It allows users to create interactive dashboards and visualizations that can help them make sense of complex data sets.

Services Used to Build and Manage a Data Lake

Data lake infrastructure refers to the hardware, software, and services that are used to build and manage a data lake. These typically include:

Distributed storage systems, such as Apache Hadoop HDFS or Amazon S3, which are used to store large amounts of data in a scalable and fault-tolerant manner.
Data processing and analytics tools, such as Apache Spark or Elasticsearch, which are used to perform various operations on the data, such as cleaning, transforming, and aggregating it.
Data governance and security tools, such as Apache Ranger or AWS IAM, which are used to control access to the data, ensure its quality, and protect it from unauthorized access or tampering.
Data integration and ETL tools, such as Apache NiFi or Talend, which are used to extract data from various sources, transform it into a consistent format, and load it into the data lake.
Data visualization and reporting tools, such as Tableau or Qlik, which are used to explore and analyze the data in the data lake, and to create dashboards and reports that can be shared with others.

The Bottom Line

There is no shortage of hype around the promise of big data, the data lake, data lake products, and the new technologies that are now available to harness the power of the platform. As the market matures, it’s going to be increasingly important to begin with the end in mind and build a strategic plan that will scale and grow as your requirements evolve. Look for a modern data integration provider that has technical depth and breadth in the new world, as well as hands-on experience with enterprise deployments and partnerships. Don’t settle for the same old, same old data integration as you build your vision for an enterprise data lake to power next-generation analytics and insights.
