Unifying Batch and Streaming for Data Integration
So far in this series of posts, we have outlined the foundations of batch and streaming computation, as well as some of the open source frameworks that target each. A unified batch and streaming programming model can improve productivity because users have fewer environments to learn and maintain. In some cases, it also allows greater code reuse, because some computations applied to batch data can also be applied to streaming data. Similarly, a unified data processing engine can reduce operational complexity.
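As a rough illustration of the code reuse point, the sketch below writes one transformation against a generic iterable and applies it to both a bounded list (batch) and a generator standing in for a stream. The names `clean_records` and `event_stream` and the record fields are hypothetical, chosen only for this example; they do not come from any particular engine.

```python
from typing import Iterable, Iterator


def clean_records(records: Iterable[dict]) -> Iterator[dict]:
    """Drop records without an 'id' and normalize the 'name' field."""
    for record in records:
        if record.get("id") is None:
            continue
        yield {**record, "name": str(record.get("name", "")).strip().lower()}


def event_stream() -> Iterator[dict]:
    """Stand-in for an unbounded source; a real stream would not end."""
    yield {"id": 1, "name": " Alice "}
    yield {"id": None, "name": "ignored"}
    yield {"id": 2, "name": "BOB"}


# Batch: a bounded collection processed in one pass.
batch_input = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "BOB"}]
batch_result = list(clean_records(batch_input))

# Streaming: the same function consumes records lazily as they arrive.
stream_result = list(clean_records(event_stream()))
```

Because the transformation depends only on iteration, it does not need to know whether its input is bounded. Real engines add concerns this sketch ignores, such as windowing, state, and fault tolerance.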
Data processing projects such as Spark (with Spark Streaming) and Flink attempt to provide some unification of batch and streaming semantics. These engines are still evolving in both performance and flexibility, and several issues remain to be resolved for data integration tasks:
- Varying Semantics for Different Modes: While some data processing engines are attempting to become universal across batch and streaming, each engine has different semantics for its batch mode and its streaming mode. For some applications, micro-batching with Spark Streaming may be appropriate, but other applications may need to process each message as soon as it arrives. Furthermore, these data processing engines have different operational requirements that may have implications for deployment within an existing IT infrastructure.
- Robustness and Maturity: These data processing engines are still relatively young in terms of robustness and maturity. MapReduce has benefitted from many more years of production use, debugging, and open source development. So, while the emerging engines show a lot of promise, there will be some gaps in their applicability and stability. Continued adoption will generate more feedback and ultimately harden these engines.
- Basic Connectivity: Finally, these engines provide relatively low-level interfaces and APIs for constructing computations. From an integration perspective, many types of operations can be specified at a higher level. Also, integration is about both processing data and connecting to a large number of different data endpoints. The data processing engines provide only basic connectivity, and only to a limited set of endpoints.
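The contrast in the first bullet between per-message processing and micro-batching can be sketched in a few lines. This is a simplified illustration, not how any engine is implemented: it groups messages by count, whereas Spark Streaming forms micro-batches by time interval, and the function names and the doubling operation are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def per_message(messages: Iterable[int]) -> Iterator[int]:
    """Emit one result per message as soon as that message is read."""
    for message in messages:
        yield message * 2


def micro_batches(messages: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Collect up to batch_size messages, then process them together.

    Assumption: batches are formed by count here for simplicity;
    a time-based interval would be closer to Spark Streaming.
    """
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield [message * 2 for message in batch]


per_message_results = list(per_message(range(5)))
micro_batch_results = list(micro_batches(range(5), 2))
```

The computation is the same in both cases; what differs is when results become available. Per-message processing emits each result immediately, while micro-batching holds results until a batch is complete, trading latency for the chance to amortize per-batch overhead.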
At SnapLogic, we see the unification of batch and streaming occurring at multiple levels. Emerging data processing engines are providing unified semantics, but these engines are too low level for widespread data integration use. Our Elastic Integration Platform leverages a unified data processing engine, while providing a high-level programming model for interfacing with multiple engine types. This gives organizations the ability to take advantage of modern hybrid engines and connect their data endpoints using a higher-level, visual programming model.