Understanding Streaming Computation
Streaming data processing engines vary in functionality and implementation strategy. In one view, a streaming engine processes data as it arrives, in contrast to a batch system, which must have all the data present before starting a computation. The goal of the streaming computation may be to filter out unneeded data or to transform incoming data before sending the results on to their final destination. If each piece of streaming data can be acted on independently, then the memory requirements of the stream processing nodes can be bounded as long as the streaming computation keeps up with the incoming data. Also, it is often neither necessary nor desirable to persist incoming stream data to disk.
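To make the record-at-a-time idea concrete, here is a minimal sketch in Python. It is not tied to any particular streaming engine; the names `stream_pipeline` and `log_events`, and the shape of the records, are assumptions made up for illustration. The point is only that each record is filtered and transformed independently, so the pipeline never holds more than one record in memory at a time.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]


def stream_pipeline(
    records: Iterable[Record],
    keep: Callable[[Record], bool],
    transform: Callable[[Record], Record],
) -> Iterator[Record]:
    """Filter and transform records one at a time as they arrive."""
    for record in records:
        if keep(record):             # drop unneeded data
            yield transform(record)  # reshape what is left


def log_events() -> Iterator[Record]:
    """Hypothetical source; a real one would read from a socket or queue."""
    yield {"level": "DEBUG", "msg": "cache hit"}
    yield {"level": "ERROR", "msg": "disk full"}
    yield {"level": "INFO", "msg": "started"}


# The destination here is just a list; nothing is persisted to disk.
results = list(
    stream_pipeline(
        log_events(),
        keep=lambda r: r["level"] != "DEBUG",
        transform=lambda r: {"level": r["level"], "msg": r["msg"].upper()},
    )
)
```

Because the source is a generator, the same function would work unchanged on an unbounded input, as long as the consumer keeps up with the producer.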
The Case for a Hybrid Batch and Streaming Architecture for Data Integration
Data integration is not optional. It is a fundamental technology that binds systems and data together to drive the business. The importance of data integration is self-evident. However, in the changing world of IT, effective data integration approaches and technology seem to be out of reach for even the most innovative and well-funded enterprises. The gap seems to be more about understanding than capabilities. Let’s fix that problem.
Modern data integration requires both reliable batch and reliable streaming computation to support essential business processes. Traditionally, in the enterprise software space, batch ETL (Extract, Transform, and Load) and streaming CEP (Complex Event Processing) were two completely different products with different means of formulating computations. Until recently, in the open source software space for big data, batch and streaming were also addressed separately, for example with MapReduce for batch and Storm for streams. Now we are seeing more data processing engines that attempt to provide models for both batch and streaming, such as Apache Spark and Apache Flink. In a series of posts, I’ll explain the need for a unified programming model and an underlying hybrid data processing architecture that accommodates both batch and streaming computation for data integration. For data integration, however, this model must sit at a level that abstracts away specific data processing engines.
In today’s business world, big data is generating a big buzz. Besides searching, storing, and scaling, one thing that clearly stands out is stream processing. That’s where Apache Kafka comes in.
At a high level, Kafka can be described as a publish-and-subscribe messaging system. Like any other messaging system, Kafka maintains feeds of messages in topics. Producers write data into topics, and consumers read data out of these topics. For the sake of simplicity, I have linked to the Kafka documentation here.
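The producer, topic, and consumer roles can be illustrated with a small in-memory toy. This is not the Kafka client API, and the `ToyBroker` class with its `publish` and `poll` methods is entirely hypothetical. It only sketches the idea that a topic behaves like an append-only feed and that each consumer reads from its own position in that feed.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class ToyBroker:
    """In-memory stand-in for a publish/subscribe system (illustration only)."""

    def __init__(self) -> None:
        # topic name -> append-only list of messages
        self._topics: DefaultDict[str, List[str]] = defaultdict(list)
        # (consumer name, topic name) -> index of the next unread message
        self._offsets: DefaultDict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: str) -> None:
        """Producer side: append a message to the end of a topic."""
        self._topics[topic].append(message)

    def poll(self, consumer: str, topic: str) -> List[str]:
        """Consumer side: return messages this consumer has not yet read."""
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)  # advance the position
        return log[start:]


broker = ToyBroker()
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")

billing_first = broker.poll("billing", "orders")   # both messages so far
broker.publish("orders", "order-3")
billing_second = broker.poll("billing", "orders")  # only the new message
audit_all = broker.poll("audit", "orders")         # a new consumer sees everything
```

Reading does not remove messages from the topic in this sketch, which is why the second consumer still sees the full feed. A real deployment adds partitioning, persistence, and replication, none of which this toy attempts to model.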