Applying machine learning to data integration

By Gregory D. Benson

Few tasks are more personally rewarding than working with brilliant graduate students on research problems that have practical applications. This is exactly what I get to do as both a Professor of Computer Science at the University of San Francisco and as Chief Scientist at SnapLogic. Each semester, SnapLogic sponsors student research and development projects for USF CS project classes, and I am given the freedom to work with these students on new technology and exploratory projects that we believe will eventually impact the SnapLogic Enterprise Integration Cloud Platform. Iris and the Integration Assistant, which applies machine learning to the creation of data integration pipelines, represents one of these research projects that pushes the boundaries of self-service data integration.

For the past seven years, these research projects have provided SnapLogic Labs with bright minds and at the same time given USF students exposure to problems found in real-world commercial software. I have been able to leverage my past 19 years of research and teaching at USF in parallel and distributed computing to help formulate research areas that enable students to bridge their academic experience with problems found in large-scale software that runs in the cloud. Project successes include Predictive Field Linking, the first SnapLogic MapReduce implementation called SnapReduce, and the Document Model for data integration. It is a mutually beneficial relationship.

During the research phase of Labs projects, the students have access to the SnapLogic engineering team, and can ask questions and get feedback. This collaboration allows the students to ramp up quickly with our codebase and gets the engineering team familiar with the students. Once we have prototyped and demonstrated the potential for a research project we transition the code to production. But the relationship doesn’t end there – students who did the research work are usually hired on to help with transitioning the prototype to production code.

The SnapLogic Philosophy
Iris technology was born to help an increasing number of business users design and implement data integration tasks that previously required extensive programming skills. Most companies must manage a growing number of data sources and cloud applications as well as growing data volumes, and data integration platforms help businesses connect and transform all of this disparate data. The SnapLogic philosophy has always been to provide truly self-service integration through visual programming. Iris and the Integration Assistant further advance this philosophy by learning from the successes and failures of thousands of pipelines and billions of executions on the SnapLogic platform.

The Project
Two years ago, I led a project that refined our metadata architecture, and last year I proposed a machine learning project for USF students. At the time, I had only some rough ideas about what we could achieve. The plan was to spend the first part of the project doing data science on the SnapLogic metadata to see what patterns we could find and what opportunities there were for applying machine learning.

One of the USF graduate students working on the project, Thanawut “Jump” Anapiriyakul, discovered that we could learn from past pipeline definitions in our metadata to help recommend likely next Snaps during pipeline creation. Jump experimented with several machine learning algorithms to find the ones that give the best recommendation accuracy. We later combined the pipeline definition with Snap execution history to further improve recommendation accuracy. The end result: Pipeline creation is now much faster with the Integration Assistant.

The exciting thing about the Iris technology is that we have created an internal metadata architecture that supports not only the Integration Assistant but also the data science needed to further leverage historical user activity and pipeline executions to power future applications of machine learning in the SnapLogic Enterprise Cloud. In my view, true self-service in data integration will only be possible through the application of machine learning and artificial intelligence as we are doing at SnapLogic.

As for the students who work on SnapLogic projects, most are offered internships and many eventually become full-time software engineers at SnapLogic. It is very rewarding to continue to work with my students after they graduate. After ceremonies this May at USF, Jump will join SnapLogic full-time this summer, working with the team on extending Iris and its capabilities.

I look forward to writing more about Iris and our recent technology advances in the weeks to come. In the meantime, you can check out my past posts on JSON-centric iPaaS and Hybrid Batch and Streaming Architecture for Data Integration.

Gregory D. Benson is a Professor in the Department of Computer Science at the University of San Francisco and Chief Scientist at SnapLogic. Follow him on Twitter @gregorydbenson.

Unifying Batch and Streaming for Data Integration


So far in this series of posts, we have outlined the foundations of batch and streaming computation as well as some of the open source frameworks targeted at each. A unified batch and streaming programming model can yield greater productivity because users have fewer environments to learn and maintain. In some cases, it also enables greater code reuse, because some of the computations written for batch can also be applied to streaming. Similarly, a unified data processing engine can reduce operational complexity.

Data processing projects like Spark (with Spark Streaming) and Flink attempt to provide some unification of batch and streaming semantics. These engines are still evolving in both performance and flexibility, and several issues remain to be resolved for data integration tasks.

Big Data Integration: Understanding Streaming Computation

In the first post in this series, I introduced The Case for Hybrid Batch and Streaming Architecture for Data Integration and focused on understanding batch computation. In this post I’ll focus on understanding streaming computation. In the final post I’ll review the benefits of unifying batch and streaming for data integration.

Understanding Streaming Computation
Streaming data processing engines have varying functionality and implementation strategies. In one view, a streaming engine can process data as it arrives, in contrast to a batch system that must have all the data present before starting a computation. The goal of the streaming computation may be to filter out unneeded data or to transform incoming data before sending the results on to their final destination. If each piece of streaming data can be acted on independently, then the memory requirements of the stream processing nodes can be constrained, as long as the streaming computation can keep up with the incoming data. Also, it is often not necessary or desirable to persist incoming stream data to disk.

The Case for a Hybrid Batch and Streaming Architecture for Data Integration

Modern data integration requires both reliable batch and reliable streaming computation to support essential business processes. Traditionally, in the enterprise software space, batch ETL (Extract, Transform and Load) and streaming CEP (Complex Event Processing) were two completely different products with different means of formulating computations. Until recently, in the open source software space for big data, batch and streaming were addressed separately, such as MapReduce for batch and Storm for streams. Now we are seeing more data processing engines that attempt to provide models for both batch and streaming, such as Apache Spark and Apache Flink. In this series of posts, I'll explain the need for a unified programming model and an underlying hybrid data processing architecture that accommodates both batch and streaming computation for data integration. For data integration, however, this model must be at a level that abstracts away specific data processing engines.

SnapLogic Big Data Processing Platforms

One of our goals at SnapLogic is to match data flow execution requirements with an appropriate execution platform. Different data platforms have different benefits. The goal of this post is to explain the nature of data flow pipelines and how to choose an appropriate data platform. In addition to categorizing pipelines, I will explain our currently supported execution targets and our planned support for Apache Spark.

First, some preliminaries. All data processed by SnapLogic pipelines is handled natively in an internal JSON format. We call this document-oriented processing. Even flat, record-oriented data is converted into JSON for internal processing, which lets us handle both flat and hierarchical data seamlessly. Pipelines are constructed from Snaps. Each Snap encapsulates specific application or technology functionality, and Snaps are connected together to carry out a data flow process. Pipelines are constructed with our visual Designer. Some Snaps provide connectivity, such as connecting to databases or cloud applications. Some Snaps allow for data transformation, such as filtering out documents, adding or removing fields, or modifying fields. We also have Snaps that perform more complex operations such as sort, join, and aggregate.
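To make the document model concrete, here is a minimal, hypothetical Python sketch (not SnapLogic code) of how flat, record-oriented data might be turned into JSON-style documents that flow between Snaps:

```python
import csv
import io
import json

def records_to_documents(csv_text):
    """Convert flat CSV records into JSON-style documents (Python dicts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        # Each flat record becomes a document; nested structures are allowed too.
        yield dict(record)

csv_text = "id,name,city\n1,Ada,London\n2,Alan,Manchester\n"
for doc in records_to_documents(csv_text):
    print(json.dumps(doc))
```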

Given this setup, we can categorize pipelines into two types: streaming and accumulating. In a streaming pipeline, documents can flow independently. The processing of one document is not dependent on another document as they flow through the pipeline. Such streaming pipelines have low memory requirements because documents can exit the pipeline once they have reached the last Snap. In contrast, an accumulating pipeline requires that all documents from the input source must be collected before result documents can be emitted from a pipeline. Pipelines with sort, join and aggregate are accumulating pipelines. In some cases, a pipeline can be partially accumulating. Such accumulating pipelines can have high memory requirements depending on the number of documents coming in from an input source.
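The distinction can be illustrated with a small Python sketch (purely illustrative, not how Snaps are implemented): a streaming stage can emit each document as it is processed, while an accumulating stage such as a sort must buffer every document before emitting any results.

```python
def filter_stage(documents, predicate):
    """Streaming: each document flows through independently, constant memory."""
    for doc in documents:
        if predicate(doc):
            yield doc

def sort_stage(documents, key):
    """Accumulating: all documents must be collected before any are emitted."""
    buffered = list(documents)          # memory grows with the input size
    return sorted(buffered, key=key)

docs = [{"id": 3}, {"id": 1}, {"id": 2}]
streamed = filter_stage(docs, lambda d: d["id"] > 1)       # lazily evaluated
accumulated = sort_stage(streamed, key=lambda d: d["id"])  # forces buffering
print(accumulated)
```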

Now let’s turn to execution platforms. SnapLogic has an internal data processing platform called a Snaplex. Think of a Snaplex as a collection of processing nodes or containers that can execute SnapLogic pipelines. We have a few flavors of Snaplexes:

  • A Cloudplex is a Snaplex that we host in the cloud; it can autoscale as pipeline load increases.
  • A Groundplex is a fixed set of nodes installed on-premises or in a customer VPC. With a Groundplex, customers can do all of their data processing behind their firewall so that data does not leave their infrastructure.

We are also expanding our support for external data platforms. We have recently released our Hadooplex technology, which allows SnapLogic customers to use Hadoop as an execution target for SnapLogic pipelines. A Hadooplex leverages YARN to schedule Snaplex containers on Hadoop nodes in order to execute pipelines. In this way, we can autoscale inside a Hadoop cluster. Recently we introduced SnapReduce 2.0, which enables a Hadooplex to translate SnapLogic pipelines into MapReduce jobs. A user builds a designated SnapReduce pipeline and specifies HDFS files as input and output. These pipelines are compiled to MapReduce jobs that execute on very large data sets living in HDFS. (Check out the demonstration in our recent cloud and big data analytics webinar.)

Finally, as we announced last week as part of Cloudera’s real-time streaming announcement, we’ve begun work on our support for Spark as a target big data platform. A Sparkplex will be able to utilize SnapLogic’s extensive connectivity to bring data into and out of Spark RDDs (Resilient Distributed Datasets). In addition, similar to SnapReduce, we will allow users to compile SnapLogic pipelines into Spark code so the pipelines can run as Spark jobs. We will support both streaming and batch Spark jobs. By including Spark in our data platform support, we will give our customers a comprehensive set of options for pipeline execution.
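To give a rough feel for what such a compilation amounts to, here is a hypothetical PySpark sketch (illustrative only, not generated SnapLogic code) of a small filter-and-transform pipeline expressed as RDD operations:

```python
from pyspark import SparkContext

# Hypothetical stand-in for documents read by an upstream connectivity Snap.
docs = [{"id": 1, "amount": 10}, {"id": 2, "amount": -3}, {"id": 3, "amount": 7}]

sc = SparkContext("local[*]", "pipeline-sketch")
rdd = sc.parallelize(docs)                      # documents loaded into an RDD

# A filter Snap followed by a transform Snap, expressed as RDD operations.
result = (rdd.filter(lambda d: d["amount"] > 0)
             .map(lambda d: {**d, "amount_usd": d["amount"] * 1.0})
             .collect())
print(result)
sc.stop()
```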

Choosing the right big data platform will depend on many factors: data size, latency requirements, connectivity and pipeline type (streaming versus accumulating). Here are some guidelines for choosing a particular big data integration platform:

Cloudplex

  • Cloud-to-cloud data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Groundplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Hadooplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines can operate on arbitrary data sizes via MapReduce

Sparkplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Allow for Spark connectivity to all SnapLogic accounts
  • Streaming unlimited documents
  • Accumulating pipelines can operate on data sizes that can fit in Spark cluster memory

Note that recent work in the Spark community has increased support for out-of-core computations, such as sorting. This means that accumulating pipelines that are currently only suitable for MapReduce execution may be supported in Spark as out-of-core Spark support becomes more general. The Hadooplex and Sparkplex also provide reliable execution, so that long-running pipelines are guaranteed to complete.

At SnapLogic, our goal is to allow customers to create and execute arbitrary data flow pipelines on the most appropriate data platform. In addition, we provide a simple and consistent graphical UI for developing pipelines which can then execute on any supported platform. Our platform agnostic approach decouples data processing specification from data processing execution. As your data volume increases or latency requirements change, the same pipeline can execute on larger data and at a faster rate just by changing the target data platform. Ultimately, SnapLogic allows you to adapt to your data requirements and doesn’t lock you into a specific big data platform.

Reliability in the SnapLogic iPaaS

Dependable system operation is a requirement for any serious integration platform as a service (iPaaS). Often, reliability or fault tolerance is listed as a feature, but it is hard to get a sense of what this means in practical terms. For a data integration project, reliability can be challenging because it must connect disparate external services, each of which can fail on its own. In a previous blog post, we discussed how SnapLogic Integration Cloud pipelines can be constructed to manage endpoint failures with our guaranteed delivery mechanism. In this post, we are going to look at some of the techniques we use to ensure the reliable execution of the services we control.

We broadly divide the SnapLogic architecture into two categories: the data plane and the control plane. The data plane is encapsulated within a Snaplex and the control plane is a set of replicated distributed servers. This design separation is useful both for data isolation and for reliability, because we can easily employ different approaches to fault tolerance in the two planes.

Data Plane: Snaplex and Pipeline Redundancy
The Snaplex is a cluster of one or more pipeline execution nodes. A Snaplex can reside either in the SnapLogic Integration Cloud or on-premises. The Snaplex is designed to support autoscaling in the presence of increased pipeline load. In addition, the Monitoring Dashboard monitors the health of all Snaplex nodes. In this way, Snaplex node failure can be detected early so that future pipelines are not scheduled on the faulty node. For cloud-based Snaplexes, also known as Cloudplexes, node failures are detected automatically and replacement nodes are made available seamlessly. For on-premises Snaplexes, aka Groundplexes, admin users are notified of the faulty node so that a replacement can be made.

If a Snaplex node fails during a pipeline execution, the pipeline will be marked as failed. Developers can choose to retry failed pipelines, or in some cases, such as recurring scheduled pipelines, the failed run may simply be ignored. Dividing long-running pipelines into several shorter pipelines can limit exposure to node failure. For highly critical integrations, it is possible to build and run replicated pipelines concurrently; in this way, a single failed replica won’t interrupt the integration. As an alternative, a pipeline can be constructed to stage data in the SnapLogic File System (SLFS) or in an alternate data store such as AWS S3. Staging data can mitigate the need to re-acquire data from a slow data source, such as AWS Glacier. Also, some data sources have higher transfer costs or transfer limits that would make it prohibitive to request data multiple times in the presence of failures on the upstream end point in a pipeline.

Control Plane: Service Reliability
SnapLogic’s “control plane” resides in the SnapLogic Integration Cloud, which is hosted in AWS. By decoupling control from data processing, we provide differentiated approaches to reliability. All control plane services are replicated for both scalability and for reliability. All REST-based front end servers sit behind the AWS ELB (Elastic Load Balancing) service. If any control plane service fails, there will always be a pool of replicated services available that can service client and internal requests. Here is an example where redundancy helps both with reliability and scalability.

We employ ZooKeeper to implement our reliable scheduling service. An important aspect of the SnapLogic iPaaS is the ability to create scheduled integrations, so it is important that these scheduled tasks are initiated at the specified times or at the required intervals. We implement the scheduling service as a collection of servers. All the servers can accept incoming CRUD requests on tasks, but only one server is elected as the leader; we use a ZooKeeper-based leader election algorithm for this purpose. In this way, if the leader fails, a new leader is elected immediately and resumes scheduling tasks on time, ensuring that no scheduled task is missed. In addition to using ZooKeeper for leader election, we also use it to allow the follower schedulers to notify the leader of task updates.
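As a rough illustration of this pattern (not our production code), a ZooKeeper-backed election can be expressed with the kazoo Python client; whichever scheduler node wins the election runs the scheduling loop, and if it fails another node is elected and resumes. The hosts and paths below are illustrative assumptions:

```python
from kazoo.client import KazooClient

def run_scheduler_loop():
    # Only the elected leader reaches this point; followers block in run()
    # until the current leader's session is lost.
    print("Elected leader: scheduling tasks...")
    # ... dispatch scheduled tasks here ...

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

# All scheduler nodes contend on the same election path; exactly one wins.
election = zk.Election("/scheduler/leader", identifier="scheduler-node-1")
election.run(run_scheduler_loop)  # blocks; a newly elected node resumes scheduling
```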

We also utilize a suite of replicated data storage technologies to ensure that control data and metadata are stored reliably. We currently use MongoDB clusters for metadata and encrypted AWS S3 buckets for implementing SLFS. We don’t expose S3 directly, but rather provide a virtual hierarchical view of the data. This allows us to track and properly authorize access to the SLFS data.

For MongoDB, we have developed a reliable read-modify-write strategy that handles metadata updates in a non-blocking manner using findAndModify. Our approach results in highly efficient, non-conflicting updates, yet remains safe in the presence of a write conflict. In a future post we will provide a technical description of how this works.
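In spirit (the exact scheme will be covered in that future post), such a strategy can be sketched with PyMongo using findAndModify and a version field: apply an update only if the version read is still current, and retry on conflict. The collection name and fields below are illustrative assumptions:

```python
from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")   # hypothetical cluster URI
metadata = client.example_db.metadata                # hypothetical collection

def update_metadata(doc_id, modify):
    """Optimistic read-modify-write: retry until no other writer interferes."""
    while True:
        current = metadata.find_one({"_id": doc_id})
        updated_fields = modify(current)
        # findAndModify (find_one_and_update in PyMongo) applies the change
        # only if the version we read is still the version on the server.
        result = metadata.find_one_and_update(
            {"_id": doc_id, "version": current["version"]},
            {"$set": updated_fields, "$inc": {"version": 1}},
            return_document=ReturnDocument.AFTER,
        )
        if result is not None:
            return result        # no conflict: update succeeded
        # Another writer won the race; re-read and try again.
```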

The Benefits of Software-Defined Integration
By dividing the SnapLogic elastic iPaaS architecture into the data plane and the control plane, we can employ effective, but different, reliability strategies in each. In the data plane, we both identify and correct Snaplex server failures and allow users to implement highly reliable pipelines as needed. In the control plane, we use a combination of server replication, load balancing, and ZooKeeper to ensure reliable system execution. Our one-size-does-not-fit-all approach allows us to modularize reliability and employ targeted testing strategies. Reliability is not a product feature, but an intrinsic design property of every aspect of the SnapLogic Integration Cloud.

Managing Errors in an iPaaS

In the world of distributed web services and Big Data, errors and faults cannot be treated as rare occurrences. Rather, they are commonplace and must be understood and managed. Errors can take the form of bad or missing data, and faults can arise from web service congestion or failures. The role of an iPaaS (Integration Platform as a Service) is to help users perform integration tasks in the presence of errors and faults.

First, let’s be clear on the difference between errors and faults:

  • An integration error usually refers to some form of data inconsistency due to data corruption, missing data, or unexpected data formats. The communication channels and integration pipeline execution all work as expected, but the data itself has a problem.
  • A fault, on the other hand, is the result of a connection or service failure. In some cases, the iPaaS servers themselves may experience failures that result in faults, which must also be managed.

Error Handling

The SnapLogic Integration Cloud architecture provides both data error handling and fault tolerance in order to ensure the reliable execution of integration data flows, called pipelines. Pipelines can be designed to handle bad data using error views, and pipeline segments can be boxed in a way that provides guaranteed delivery in the presence of network or service end point failure. In the data plane, our Snaplex clusters are designed to detect node failure and to ensure there is always a stable set of Snaplex nodes available to run pipelines. Finally, we also provide redundancy and reliability throughout the control plane. In this blog, we will share details on error handling. Please be on the lookout for an additional blog post that explains how SnapLogic handles fault tolerance and resiliency in the architecture.


Error Views
SnapLogic Integration Cloud pipelines consist of Snaps that are connected together using views. Snaps can have zero or more input views and zero or more output views. For example, a database reader will have zero input views and one output view, while a router Snap will have one input view and multiple output views. In addition to the output views, most Snaps can be configured to have an error view. Snap errors will usually cause a Snap, and therefore the pipeline, to fail early. However, in some cases a pipeline developer may want to manage error conditions explicitly. A common error is missing or bad data. If there is no error view, missing or bad data will cause the pipeline to fail. With an error view, however, the error condition is treated like data and can be passed on to a pipeline segment, which can then log the error or fix up the bad data. This powerful feature allows developers to seamlessly handle both good and bad data using Snaps and pipeline segments. In some cases, certain types of faults can also be surfaced as errors so that the pipeline developer can build reliable pipeline execution, such as when a database connection fault is realized as a document sent to an error view.
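Conceptually, an error view is just a second output for a Snap. The following plain-Python sketch (illustrative, not the actual Snap API) routes documents that fail validation to an error view instead of failing the pipeline:

```python
def transform_with_error_view(documents, required_fields):
    """Route good documents to the output view and bad ones to the error view."""
    output_view, error_view = [], []
    for doc in documents:
        missing = [f for f in required_fields if f not in doc]
        if missing:
            # The error condition is treated like data: annotate and pass it on.
            error_view.append({"original": doc, "reason": f"missing {missing}"})
        else:
            output_view.append(doc)
    return output_view, error_view

docs = [{"name": "Ada", "email": "ada@example.com"}, {"name": "Alan"}]
good, bad = transform_with_error_view(docs, required_fields=["name", "email"])
print(good)   # flows to the next Snap
print(bad)    # flows to an error-handling pipeline segment
```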

As an example, some records in a CSV file may be missing one or more fields. Our CSV Reader Snap can detect these error records and send them to the error view. The error records can be sent to a pipeline segment that can attempt to clean up the data or log it for later inspection. As another example, the Email Sender Snap has one available error view. If the Email Sender error view is enabled, it will be given a document for each bad address or unsent message. In this way, the pipeline developer can create a report for the bad addresses or use the bad addresses to update a contact database so that no future attempts will be made to send to the address.

Guaranteed Delivery
The SnapLogic Integration Cloud enables the integration of several cloud services. Most modern cloud services are exposed via REST or SOAP interfaces over HTTP. However, the public network can be susceptible to failures that cause service requests to be dropped. To help manage service connection failures, the SnapLogic Integration Cloud supports guaranteed delivery of documents. To enable guaranteed delivery, the pipeline developer marks a streaming pipeline segment; this boxed segment can be configured with a retry policy. Documents that are to be sent to the boxed segment are temporarily held in persistent storage. Once a document has made it through the entire segment, an acknowledgement is sent to the pipeline manager, which allows the document to be removed from the persistent storage. If the segment or a segment endpoint fails, the retry policy is invoked to retrieve the document from persistent storage and send it through the segment again. Retry policies such as linear wait times or exponential backoff can be employed to manage most intermittent endpoint failure scenarios.
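A minimal sketch of the idea (illustrative Python, not the actual implementation): each document is held in a durable store, pushed through the boxed segment, and removed only after the segment acknowledges it, with exponential backoff between retries:

```python
import time

def guaranteed_delivery(documents, segment, max_retries=5, base_delay=1.0):
    """Hold each document until the boxed segment acknowledges it."""
    pending = list(documents)            # stand-in for persistent storage
    while pending:
        doc = pending[0]
        for attempt in range(max_retries):
            try:
                segment(doc)             # send the document through the segment
                pending.pop(0)           # acknowledged: remove from storage
                break
            except ConnectionError:
                # Exponential backoff before retrying the same document.
                time.sleep(base_delay * (2 ** attempt))
        else:
            raise RuntimeError(f"delivery failed after {max_retries} retries")
```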

For more details on the architecture of the SnapLogic Integration Cloud, check out our series of videos explaining our product and platform. Below, you can see the SnapLogic Integration Cloud for Developers: