JSON is the New CSV and Streams are the New Batch

This is the 2nd post in the series from Mark Madsen’s whitepaper: Will the Data Lake Drown the Data Warehouse? In the first post, Mark outlined the differences between the data lake and the traditional data warehouse, concluding: “The core capability of a data lake, and the source of much of its value, is the ability to process arbitrary data.”

In this post, Mark reviews the new environment and new requirements:

“The pre-Hadoop environments, including integration tools that were built to handle structured rows and columns, limit the type of data that can be processed. Requirements in the new ecosystem that tend to cause the most trouble for traditional environments are variable data structure, streaming data and nonrelational datasets.

Data Structures: JSON is the New CSV

The most common way to exchange data between applications is not a database connector but a flat file in comma-separated value (CSV) format, often transferred via FTP. One of the big shifts in application design over the past ten years was the move to REST APIs with payloads formatted in JSON, an increasingly common data format. When combined with streaming infrastructure, this design shift reduces the need for old-style file integration. JSON and APIs are becoming the new CSV and FTP.

Most enterprise data integration tools were built assuming use of a relational database. This works well for data coming from transactional applications. It works less well for logs, event streams and human-authored data, which do not have the regular structure of rows, columns and tables that databases and integration tools require. As a result, these tools have difficulty working with JSON and must do extra work to process and store it.

The reverse is not true. Newer data integration tools can easily represent tables in JSON, whereas nested structures in JSON are difficult to represent in tables. Flexible representation of data enables late binding for data structures and data types.
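The asymmetry can be illustrated with a minimal Python sketch. The sample data, the dotted column naming and the helper names `flatten` and `to_rows` are assumptions made for illustration; they do not come from the whitepaper or from any particular tool.

```python
import json

# A table is a list of uniform rows, so JSON can represent it directly.
columns = ["id", "name"]
rows = [(1, "Ada"), (2, "Grace")]
as_json = json.dumps([dict(zip(columns, row)) for row in rows])

# A nested document has no single obvious tabular form.
order = {
    "id": 7,
    "customer": {"name": "Ada"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
}


def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted column names; lists are left untouched."""
    out = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=name + "."))
        else:
            out[name] = value
    return out


def to_rows(doc, list_field):
    """Emit one row per element of list_field, repeating the parent columns."""
    parent = flatten({k: v for k, v in doc.items() if k != list_field})
    return [
        {**parent, **flatten(item, prefix=list_field + ".")}
        for item in doc[list_field]
    ]


for row in to_rows(order, "items"):
    print(row)
```

Going from the table to JSON takes one line. Going the other way forces design decisions (which list to explode, how to name nested columns, whether to repeat parent values), and a different document shape may need different answers.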

This is a key advantage of JSON when compared to the early binding and static typing used by older data integration tools. One simple field change upstream can break a dataflow in the older tools, whereas the more flexible new environment may be able to continue uninterrupted.
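As a rough sketch of the difference, the Python below contrasts a reader that checks an incoming message against a fixed schema with one that binds only the fields it needs at read time. The field names, the `EXPECTED_COLUMNS` schema and both function names are hypothetical, chosen only to illustrate the idea.

```python
import json

# Hypothetical fixed schema that an early-binding dataflow was built against.
EXPECTED_COLUMNS = ("id", "amount")


def early_bound(line):
    """Reject any message whose fields differ from the fixed schema."""
    record = json.loads(line)
    if set(record) != set(EXPECTED_COLUMNS):
        raise ValueError(f"schema mismatch: {sorted(record)}")
    return record["id"], float(record["amount"])


def late_bound(line):
    """Bind only the fields needed at read time; ignore anything else."""
    record = json.loads(line)
    return record.get("id"), float(record.get("amount", 0))


# Upstream adds a field the downstream flow never asked for.
changed = '{"id": 1, "amount": 9.5, "currency": "EUR"}'

print(late_bound(changed))  # keeps working
try:
    early_bound(changed)
except ValueError as err:
    print("early-bound flow stopped:", err)
```

The late-bound reader trades safety for resilience: it keeps running after the upstream change, but it will also silently accept a message that is missing `amount`, so some validation is usually still needed further downstream.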

JSON is not the best format for storing data, however. This means tools are needed to translate data from JSON to more efficient storage formats in Hadoop, and from those formats back to JSON for applications. Much of the web and non-transactional data is sent today as JSON messages. The more flexible Hadoop and streaming technologies are a better match for transporting and processing this data than conventional data integration tools.
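A minimal sketch of that translation step, in pure Python: row-oriented JSON records are pivoted into a column-oriented layout and back again. The in-memory dictionary of columns is only a stand-in, assumed here for illustration; in practice the target would be a columnar file format such as Parquet or ORC, written with a proper library. The function names and sample records are hypothetical.

```python
import json


def to_columns(records):
    """Pivot a list of JSON-style records into a dict of column lists."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    # Fields missing from a record are stored as None in that column.
    return {name: [record.get(name) for record in records] for name in names}


def to_records(columns):
    """Pivot column lists back into records, dropping None placeholders."""
    length = len(next(iter(columns.values()), []))
    return [
        {name: values[i] for name, values in columns.items() if values[i] is not None}
        for i in range(length)
    ]


messages = [
    {"id": 1, "user_agent": "curl"},
    {"id": 2},  # variable structure: this message has no user_agent
]

columns = to_columns(messages)
print(columns)
print(json.dumps(to_records(columns)))
```

Note the limitation of this simplified layout: a field that is absent and a field that is explicitly null both become `None`, so the distinction is lost on the way back to JSON.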

Streams are the New Batch

Often, the initial sources of data in a data lake come from event streams and can be processed continuously rather than in batch. As a rule, a data warehouse is a poor place to process data that must be available in less than a few minutes. The architecture was designed for periodic incremental loads, not for a continuous stream of data. A data lake should support multiple speeds from near real-time to high latency batch.

Batch processing is actually a subset of stream processing. It is easy to persist data for a time and then run a job to process it as a batch. It is not as easy to take a batch system and make it efficiently process data one message at a time. A batch engine can’t keep up with streaming requirements, but tools that have been designed to process streaming data can behave like a batch engine.
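The claim that batch is a special case of streaming can be sketched in a few lines of Python. This is an illustration under simple assumptions (an in-memory iterable stands in for the message source, and the function names are invented), not a description of any particular engine.

```python
from typing import Callable, Iterable, Iterator, List


def process_stream(messages: Iterable, handle: Callable) -> Iterator:
    """Handle each message as it arrives, one at a time."""
    for message in messages:
        yield handle(message)


def micro_batches(messages: Iterable, size: int) -> Iterator[List]:
    """Persist messages until `size` have arrived, then release them as a batch."""
    batch: List = []
    for message in messages:
        batch.append(message)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush whatever is left when the stream ends
        yield batch


def double(x):
    return x * 2


# Streaming: results appear per message.
print(list(process_stream(iter([1, 2, 3]), double)))

# Batch behaviour built on the same stream: buffer first, then process each batch.
for batch in micro_batches(range(5), size=2):
    print([double(x) for x in batch])
```

Turning a stream into batches only requires a buffer. The opposite direction has no equally simple adapter, because a batch job's startup and scheduling cost would be paid on every message.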

Streaming data also implies that data volume can fluctuate, from a small trickle during one hour to a flood in the next. The fixed server models and capacity planning of a traditional architecture do not translate well to dynamic server scaling as message volume grows and shrinks. This requires rethinking how one scales data collection and integration.”
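One way to picture the scaling problem is a small Python sketch that sizes a worker pool from the current message rate instead of a fixed plan. The function name, the per-worker capacity of 500 messages per second and the pool limits are all hypothetical values chosen for illustration.

```python
import math


def workers_needed(messages_per_sec, per_worker_capacity,
                   min_workers=1, max_workers=50):
    """Return a worker count that tracks the current message rate."""
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    wanted = math.ceil(messages_per_sec / per_worker_capacity)
    # Clamp to the pool limits so the count never drops to zero or runs away.
    return max(min_workers, min(max_workers, wanted))


# A trickle in one hour, a flood in the next (assumed rates).
for rate in (20, 12_000, 1_000_000):
    print(rate, "msg/s ->", workers_needed(rate, per_worker_capacity=500), "workers")
```

A fixed-server plan has to pick one number in advance and either wastes capacity during the trickle or falls behind during the flood; recomputing the count from observed volume is the basic idea behind dynamic scaling.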

——

The paper Will the Data Lake Drown the Data Warehouse? goes on to note that, “different datasets drive new engines.” In the next post in this series, Mark will describe the new data lake architecture, diving into some of the concepts he covered in the companion data lake whitepaper: How to Build an Enterprise Data Lake: Important Considerations Before You Jump In. Be sure to also check out the recent webinar presentation and recording with SnapLogic here and learn more about SnapLogic for big data integration here.