Big Data Integration: Understanding Big Data Streaming Computation
In the first post in this series, I introduced The Case for Hybrid Batch and Streaming Architecture for Data Integration and focused on understanding batch computation. In this post I’ll focus on understanding streaming computation. In the final post I’ll review the benefits of unifying batch and streaming for data integration.
Understanding Streaming Computation
Streaming data processing engines vary in functionality and implementation strategy. In the simplest view, a streaming engine processes data as it arrives, in contrast to a batch system, which must have all the data present before starting a computation. The goal of the streaming computation may be to filter out unneeded data or to transform incoming data before sending the result on to its final destination. If each piece of streaming data can be acted on independently, then the memory requirements of the stream processing nodes stay bounded as long as the streaming computation can keep up with the incoming data. It is also often unnecessary, or even undesirable, to persist incoming stream data to disk.
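To make the per-record case concrete, here is a minimal sketch in Python, not tied to any particular engine. The record format and the `transform` function are illustrative assumptions; the point is that because no state is kept across records, memory use stays constant no matter how long the stream runs.

```python
import json

def transform(records):
    """Filter and transform each record independently.

    No state is carried between records, so memory use is bounded
    regardless of how many records flow through the stream."""
    for raw in records:
        event = json.loads(raw)
        if event.get("level") == "debug":   # filter out unneeded data
            continue
        # transform before sending the result on to its destination
        yield {"ts": event["ts"], "msg": event["msg"].strip()}

stream = iter([
    '{"ts": 1, "level": "info",  "msg": " started "}',
    '{"ts": 2, "level": "debug", "msg": "noise"}',
    '{"ts": 3, "level": "error", "msg": " failed "}',
])
out = list(transform(stream))
```

In a real deployment `records` would be an unbounded source (a socket, a Kafka topic) rather than a list, but the processing loop looks the same.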
Another, more sophisticated, view of stream processing involves performing computations on sets of incoming data, such as aggregating to find a sum or average. This is useful whenever you want to respond to incoming data in real time in order to identify behavior or detect problems. A set of data may be defined by a fixed number of data elements, such as every 10 messages. Alternatively, a set may be defined as the data elements that arrive within a specified time period, such as every 30 seconds. Finally, a streaming data set may be defined by a sliding window, in which a computation is performed on, say, the last 30 seconds of data every 10 seconds, producing overlapping data sets.
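The sliding-window case can be sketched as follows, again in engine-neutral Python over a finite list of timestamped events (a real engine would evaluate windows incrementally over an unbounded stream). The defaults mirror the example above: a 30-second window evaluated every 10 seconds.

```python
def sliding_averages(events, window=30, slide=10):
    """Average the values in each sliding window.

    events: list of (timestamp, value) pairs.
    Returns (window_end, average) for each window covering
    [window_end - window, window_end), stepped by `slide` seconds,
    so successive windows overlap when slide < window."""
    if not events:
        return []
    last_ts = max(ts for ts, _ in events)
    results = []
    end = slide
    while end <= last_ts + slide:
        vals = [v for ts, v in events if end - window <= ts < end]
        if vals:
            results.append((end, sum(vals) / len(vals)))
        end += slide
    return results

windows = sliding_averages([(5, 10), (12, 20), (25, 30), (35, 40)])
```

Note how the event at timestamp 12 contributes to three different windows; that overlap is exactly what distinguishes a sliding window from a fixed (tumbling) one.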
Just as in batch-oriented data processing, streaming computation requires a means of handling failures during execution. Current stream processing engines such as Apache Storm, Apache Flink, and Spark Streaming all take different approaches to fault tolerance. One approach is to periodically save checkpoints from the stream nodes to persistent storage and restore from a checkpoint if a failure occurs; Apache Flink uses a form of checkpointing to tolerate failure. Another approach is to recover from failure by replaying the incoming data stream from reliable storage, such as Apache Kafka; Apache Storm uses a replay approach. Yet another approach, taken by Spark Streaming, is to read incoming data into micro-batches that are replicated within the memory of the stream processing cluster, so that incoming stream data can be recovered from a replica.
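The replay approach can be illustrated with a toy sketch. `ReplayableLog` here is a hypothetical stand-in for a durable log such as a Kafka topic, where retained messages can be re-read from any offset; it is not a real Kafka API.

```python
class ReplayableLog:
    """Toy stand-in for a durable log (e.g. a Kafka topic): messages
    are retained and can be re-read from any offset after a failure."""
    def __init__(self, messages):
        self.messages = list(messages)

    def read_from(self, offset):
        return enumerate(self.messages[offset:], start=offset)

def run_consumer(log, process, committed_offset=0):
    """Process messages, committing the offset after each one.

    If the consumer crashes, restarting it with the last committed
    offset replays any messages processed but not yet committed."""
    for offset, msg in log.read_from(committed_offset):
        process(msg)
        committed_offset = offset + 1   # commit progress after processing
    return committed_offset

log = ReplayableLog(["a", "b", "c", "d"])
seen = []
committed = run_consumer(log, seen.append)            # first run
committed = run_consumer(log, seen.append, committed) # restart replays nothing new
```

Because the commit happens after processing, a crash between the two steps causes that message to be processed again on restart, which is why replay-based recovery naturally yields at-least-once rather than exactly-once behavior.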
Different stream processing engines also have different processing semantics, which are influenced by their approaches to fault tolerance. Both Storm and Flink can process each piece of data as soon as it arrives, whereas Spark Streaming must first collect incoming data into micro-batches before processing it. The micro-batch approach can, however, yield higher overall throughput at the cost of latency. Another semantic difference is at-least-once versus exactly-once message processing. Spark Streaming and Flink provide exactly-once semantics, which are easier for a programmer to reason about. Storm provides at-least-once semantics by default, though exactly-once semantics can be achieved using the Trident API layer for Storm.
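One common way to get an exactly-once effect on top of at-least-once delivery is idempotent processing: track the ids of messages already applied and skip duplicates. The sketch below assumes each message carries a unique id, and that in a real system the id set and the state would be persisted together (for example, in one transaction) so they survive failures consistently.

```python
def process_exactly_once(messages, seen_ids, state):
    """Apply each message at most once despite redelivery.

    At-least-once delivery can hand us the same message twice after a
    replay; skipping already-seen ids makes the *net effect* on `state`
    exactly-once."""
    for msg_id, amount in messages:
        if msg_id in seen_ids:
            continue                 # duplicate from a replay; already applied
        state["total"] += amount
        seen_ids.add(msg_id)

state = {"total": 0}
seen_ids = set()
# A replay delivered message 2 twice, but it is only applied once.
process_exactly_once([(1, 10), (2, 5), (2, 5), (3, 1)], seen_ids, state)
```

This is essentially the kind of bookkeeping that layers like Trident handle for you, which is why exactly-once semantics come with extra coordination cost.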
While these streaming engines overlap to some degree in functionality, each is better suited to certain types of problems; at this point there is no single universal framework that solves all streaming problems. Increasingly, data integration tasks will need to be executed in both a streaming context and a batch context.
In the final post in this series, I’ll outline the benefits of unifying batch and streaming for enterprise data integration.