Eight Data Management Requirements for the Enterprise Data Lake

This article originally appeared as a slide show on ITBusinessEdge: Data Lakes – 8 Data Management Requirements.

2016 is the year of the data lake. It will surround, and in some cases drown, the data warehouse, and we’ll see significant technology innovations, methodologies and reference architectures that turn the promise of broader data access and big data insights into a reality. But big data solutions must mature and go beyond being primarily developer tools for highly skilled programmers. The enterprise data lake will allow organizations to track, manage and leverage data they’ve never had access to in the past. New data management strategies are already leading to more predictive and prescriptive analytics, which drive improved customer service experiences, cost savings and an overall competitive advantage when they are aligned with key business initiatives.

So whether your enterprise data warehouse is on life support or moving into maintenance mode, it will most likely continue to do what it’s good at for the time being: operational and historical reporting and analysis (aka rear view mirror). As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle the new requirements. Thinking that you can simply hire more developers to write code or rely on your legacy rows-and-columns-centric tools is a recipe to sink in a data swamp instead of swimming in a data lake. Here are eight enterprise data management requirements that must be addressed in order to get maximum value from your big data technology investments.

1) Storage and Data Formats

Traditional data warehousing focused on relational databases as the primary data and storage format. A key concept of the data lake is the ability to reliably store a large amount of data. Such data volumes are typically much larger than what traditional relational databases can handle, or at least larger than what they can handle cost-effectively. To this end, the underlying data storage must be scalable and reliable. The Hadoop Distributed File System (HDFS) has matured and is now the leading data storage technology that enables the reliable persistence of large-scale data. However, other storage technologies can also provide the data store backend for the data lake. Open source systems such as Cassandra, HBase, and MongoDB can provide reliable storage for the data lake. Alternatively, cloud-based storage services can also be used as a data store backend. Such services include Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage.

Unlike relational databases, big data storage does not usually dictate a data storage format. That is, big data storage supports arbitrary data formats that are understood by the applications that use the data. For example, data may be stored in CSV, RCFile, ORC, or Parquet to name a few. In addition, various compression techniques — such as GZip, LZO, and Snappy — can be applied to data files to improve space and network bandwidth utilization. This makes data lake storage much more flexible. Multiple formats and compression techniques can be used in the same data lake to best support specific data and query requirements.
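As a rough illustration of mixing formats and compression in one store, the sketch below serializes the same records as CSV and as newline-delimited JSON, then gzip-compresses each. It uses only the Python standard library; columnar formats such as ORC and Parquet need third-party libraries and are left out. The record fields and helper names are hypothetical.

```python
import csv
import gzip
import io
import json

# Hypothetical sample records; the field names are illustrative only.
records = [{"user_id": i, "event": "click", "value": i * 0.5} for i in range(1000)]

def to_csv_bytes(rows):
    """Serialize rows as CSV text encoded to UTF-8 bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")

def to_json_lines_bytes(rows):
    """Serialize rows as newline-delimited JSON encoded to UTF-8 bytes."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")

csv_raw = to_csv_bytes(records)
json_raw = to_json_lines_bytes(records)
csv_gz = gzip.compress(csv_raw)
json_gz = gzip.compress(json_raw)

# Print the size of each variant so the trade-offs can be compared.
for name, blob in [("csv", csv_raw), ("csv+gzip", csv_gz),
                   ("jsonl", json_raw), ("jsonl+gzip", json_gz)]:
    print(f"{name:12s} {len(blob):8d} bytes")
```

The same logical data ends up with four different footprints. That is the flexibility described above: the storage layer does not care which encoding is used, so the choice can follow the query and bandwidth requirements of each data set.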

2) Ingest and Delivery

Data lakes need mechanisms for getting data into and out of the backend storage platform. In traditional data warehouses, data is inserted and queried using some form of SQL and a database driver, possibly via ODBC or JDBC. While compatibility drivers do exist to access Hadoop data, the variety of data formats requires more flexible tooling. Open source tools such as Sqoop and Flume provide low-level interfaces for pulling in data from relational databases and log data respectively. In addition, custom MapReduce programs and scripts are currently used to import data from APIs and other data sources. Commercial tools provide pre-built connectors and broad data format support to mix and match data sources to data repositories in the data lake.

Given the variety of data formats for Hadoop data, a comprehensive schema management tool does not yet exist. Hive’s metastore extended via HCatalog provides a relational schema manager for Hadoop data. Yet not all data formats can be described in HCatalog. To date, quite a bit of Hadoop data is defined inside applications themselves, perhaps using JSON, Avro, RCFile, or Parquet. Just as with data endpoints and data formats, the right commercial tools can help describe the lake data and surface the schemas to end users more readily.

3) Discovery and Preparation

Due to the flexibility of data formats in Hadoop and other data lake backend storage platforms, it is common to dump data into the lake before fully understanding the schema of the data. In fact, a lot of lake data may be highly unstructured. In any case, the cost-effectiveness of Hadoop data storage makes it possible to prepare the data after it has been acquired. This is more ELT (extract, load, transform) than traditional ETL (extract, transform, load). At some point, however, doing useful work with a data set requires understanding its format.
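To make the idea of understanding the format after loading more concrete, here is a minimal Python sketch that scans raw lines already sitting in the lake and reports which fields appear and which value types each one carries. The sample lines and the infer_schema helper are hypothetical, and the sketch assumes the raw data is newline-delimited JSON.

```python
import json
from collections import defaultdict

# Hypothetical raw lines, as they might land in the lake before any schema is agreed.
raw_lines = [
    '{"id": 1, "name": "alice", "score": 9.5}',
    '{"id": 2, "name": "bob"}',
    '{"id": "3", "name": "carol", "score": null}',
    'not json at all',
]

def infer_schema(lines):
    """Return ({field: sorted type names}, number of lines that could not be used)."""
    seen = defaultdict(set)
    bad = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad += 1
            continue
        if not isinstance(record, dict):
            # Valid JSON, but not an object with named fields.
            bad += 1
            continue
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {k: sorted(v) for k, v in seen.items()}, bad

schema, rejected = infer_schema(raw_lines)
print(schema)
print("rejected lines:", rejected)
```

A field that shows up with more than one type, such as id here, is exactly the kind of finding that drives the later cleansing and preparation steps.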

In the open-source ecosystem, discovery and preparation can be done at the command line with scripting languages, such as Python and Pig. Ultimately, native MapReduce jobs, Pig, or Hive can be used to extract useful data out of semi-structured data. This new, accessible data can be used by further analytic queries or machine-learning algorithms. In addition, the prepared data can be delivered to traditional relational databases so that conventional business intelligence tools can directly query it.

Commercial offerings in the data discovery and basic data preparation space offer web-based interfaces (although some are basic on-premises tools for so-called “data blending”) for investigating raw data and then devising strategies for cleansing and pulling out relevant data. Such commercial tools range from “lightweight” spreadsheet-like interfaces to heuristic-based analysis interfaces that help guide data discovery and extraction.

4) Transformations and Analytics

Not only are systems like Hadoop more flexible in the types of data that can be stored, they are also more flexible in the types of queries and computations that can be performed on the stored data. SQL is a powerful language for querying and transforming relational data, but it is not appropriate for queries on non-relational data or for employing iterative machine learning algorithms and other arbitrary computations. Tools like Hive, Impala, and Spark SQL bring SQL-like queries to Hadoop data. However, tools like Cascading, Crunch, and Pig bring more flexible data processing to Hadoop data. Most of these tools are powered by one of the two most widely used data processing engines: MapReduce or Spark.

In the data lake we see three types of transformations and analytics: simple transformations, analytics queries, and ad-hoc computation. Simple transformations include tasks such as data preparation, data cleansing, and filtering. Analytic queries will be used to provide a summary view of a data set, perhaps cross-referencing other data sets. Finally, ad-hoc computation can be used to support a variety of algorithms, for example, building a search index or classification via machine learning. Often such algorithms are iterative in nature and require several passes over the data.
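The three categories can be sketched on a tiny in-memory data set. This is plain Python rather than Hive, Pig, or Spark, and the records, field names, and functions are hypothetical. The iterative step is a one-dimensional k-means with two clusters, chosen only because it needs several passes over the data.

```python
from collections import defaultdict

# Hypothetical raw records; names and values are illustrative only.
raw = [
    {"region": " east ", "amount": "10"},
    {"region": "west", "amount": "200"},
    {"region": "east", "amount": "12"},
    {"region": "west", "amount": "bad"},
    {"region": "west", "amount": "220"},
]

def clean(records):
    """Simple transformation: trim strings, parse numbers, drop unparseable rows."""
    for r in records:
        try:
            yield {"region": r["region"].strip(), "amount": float(r["amount"])}
        except ValueError:
            continue

def total_by_region(records):
    """Analytic query: a summary view, similar to SQL GROUP BY with SUM."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

def two_means(values, passes=10):
    """Ad-hoc computation: 1-D k-means with k=2, making several passes over the data."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo, hi  # all values identical, nothing to separate
    for _ in range(passes):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo = sum(near_lo) / len(near_lo)
        hi = sum(near_hi) / len(near_hi)
    return lo, hi

rows = list(clean(raw))
totals = total_by_region(rows)
centers = two_means([r["amount"] for r in rows])
print(totals, centers)
```

The first two functions map naturally onto SQL-like tools. The third needs to loop over the data repeatedly, which is where more general processing engines come in.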

5) Streaming

Traditional data warehouses support batch analytic queries. However, in the open source ecosystem as well as in commercial products, we are seeing a convergence toward hybrid batch and streaming architectures. For example, Spark supports both batch processing and stream processing, the latter with Spark Streaming. Apache Flink is another project aiming to combine batch and stream processing. This is a natural progression because fundamentally it is possible to use very similar APIs and languages to specify a batch or streaming computation. It is no longer necessary to have two completely disparate systems. In fact, a unified architecture makes it easier to discover different types of data sources.

Hybrid batch and streaming architectures will also prove to be extremely beneficial when it comes to IoT data. Streaming can be used to analyze and react to data in real time as well as to ingest data into the data lake for batch processing. Modern, high performance messaging systems such as Apache Kafka can be used to help in the unification of batch and streaming. Integration tools can help feed Kafka, process Kafka data in a streaming fashion, and also feed a data lake with filtered and aggregated data.
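The claim that batch and streaming computations can share an API can be illustrated with a small Python sketch. The same summarize function is applied once to a stored batch and once per window of an incoming stream. The function names and sensor readings are hypothetical, and the count-based window is a simplification: real engines usually window by time.

```python
from itertools import islice

def summarize(readings):
    """Shared logic: count and mean of a non-empty group of numeric readings."""
    readings = list(readings)
    return {"count": len(readings), "mean": sum(readings) / len(readings)}

def batch_job(stored_readings):
    """Batch: one summary over everything already stored in the lake."""
    return summarize(stored_readings)

def streaming_job(source, window_size):
    """Streaming: yield one summary per fixed-size (tumbling) window as data arrives."""
    iterator = iter(source)
    while True:
        window = list(islice(iterator, window_size))
        if not window:
            return
        yield summarize(window)

# Hypothetical sensor readings standing in for IoT data.
readings = [20.0, 22.0, 21.0, 23.0, 30.0, 32.0]
batch_result = batch_job(readings)
stream_results = list(streaming_job(iter(readings), window_size=2))
print(batch_result)
print(stream_results)
```

Because the business logic lives in one function, the streaming path can react to each window as it arrives while the batch path recomputes over the full history, without maintaining two separate code bases.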

6) Scheduling and Workflow

Orchestration in the data lake is a mandatory requirement. Scheduling refers to launching jobs at specified times or in response to an external trigger. Workflow refers to specifying job dependencies and providing a means to execute jobs in a way that the dependencies are respected. A job could be a form of data acquisition, data transformation, or data delivery. In the context of a data lake, scheduling and workflow both need to interface with the underlying data storage and data processing platforms. For the enterprise, scheduling and workflow should be defined via a graphical user interface and not through the command line.

The open source ecosystem provides some low-level tools such as Oozie, Azkaban, and Luigi. These tools provide command line interfaces and file-based configuration. They are useful mainly for orchestrating work within Hadoop.
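The core of any workflow tool is running jobs in an order that respects their dependencies. A minimal sketch of that idea, using the standard library's graphlib module (Python 3.9 or later), is shown here. The job names and the run_workflow helper are hypothetical, and this is not how Oozie, Azkaban, or Luigi are configured.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_orders_customers": {"ingest_orders", "ingest_customers"},
    "deliver_report": {"join_orders_customers"},
}

def run_workflow(dependencies, run_job):
    """Run every job after all of its dependencies; return the execution order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for job in order:
        run_job(job)
    return order

executed = []
order = run_workflow(workflow, executed.append)
print(order)
```

TopologicalSorter raises a CycleError if the dependencies are circular, which is the kind of validation a workflow tool needs before it launches anything. A scheduler would then sit on top of this, deciding when run_workflow is triggered.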

Commercial data integration tools provide high-level interfaces to scheduling and workflow, making such tasks more accessible to a wider range of IT professionals.

7) Metadata and Governance

Two areas that are still less mature in data lake technologies such as Hadoop are metadata and governance. Metadata refers to update and access requests as well as schema. These capabilities are provided in the context of the conventional relational data warehouse, where updates are more easily tracked and schema is more constrained.

Work in open source on metadata and governance is progressing, but there is not widespread agreement on a particular implementation. For example, Apache Sentry helps enforce role-based authorization to Hadoop data. It works with some, but not all, Hadoop tools.

Enterprises looking to better manage metadata and governance currently employ custom solutions or simply live with limited functionality in this regard. Recently LinkedIn open sourced an internal tool called WhereHows that may improve the ability to collect, discover, and understand metadata in the data lake. Look to see commercial data integration solution providers develop new ways to manage metadata and governance in the enterprise data lake.

8) Security

Security in the various data lake backends is also evolving, and it is addressed at different levels. Hadoop supports Kerberos authentication and UNIX-style authorization via file and directory permissions. Apache Sentry and Cloudera’s RecordService are two approaches to fine-grained authorization within Hadoop data files. There is no universal agreement on an approach to authorization; consequently, not all Hadoop tools support all of the different approaches. This makes it difficult to standardize at the moment, because the authorization approach you select restricts the tools you can use.
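UNIX-style authorization comes down to checking owner, group, and other permission bits in that order. The Python sketch below shows the read check only, under simplifying assumptions: the can_read function, user names, and group names are hypothetical, and superuser bypass and access control lists are ignored.

```python
import stat

def can_read(mode, file_owner, file_group, user, user_groups):
    """Apply the owner, then group, then other read bit, as UNIX-style checks do."""
    if user == file_owner:
        return bool(mode & stat.S_IRUSR)
    if file_group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

# Hypothetical file: mode 0o640, owned by user "etl" and group "analysts".
mode = 0o640
print(can_read(mode, "etl", "analysts", "etl", set()))            # owner
print(can_read(mode, "etl", "analysts", "ann", {"analysts"}))     # group member
print(can_read(mode, "etl", "analysts", "guest", {"public"}))     # everyone else
```

This model works at the level of whole files and directories. Finer-grained control, such as restricting individual columns or rows inside a data file, is what the additional authorization approaches mentioned above aim to provide.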

A lack of a standard makes it difficult for commercial products to provide comprehensive support at this time. However, in the interim, commercial products can serve as a gateway to the data lake and provide a good amount of security functionality that can help enterprises meet their security requirements in the short term, then adopt standardized mechanisms as they become available.

The Bottom Line
There is no shortage of hype around the promise of big data and the data lake and the new technologies that are now available to harness the power of the platform. As the market matures, it’s going to be increasingly important to begin with the end in mind and build a strategic plan that will scale and grow as your requirements also evolve. Look for a modern data integration provider that has technological depth and breadth in the new world, as well as hands-on experience with enterprise deployments and partnerships. Don’t settle for same old, same old data integration as you build your vision for an enterprise data lake to power next-generation analytics and insights.

Next Steps: