A Definitionless Definition of Big Data Architecture

“‘When I use a word,’ Humpty Dumpty said, in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’

‘The question is,’ said Alice, ‘whether you can make words mean so many different things.’”

Through the Looking-Glass, Lewis Carroll

“Big Data”, like most buzzwords, has generated many partially overlapping definitions. (In fact, the author has come to the opinion that, just like herds of cows and murders of crows, collections of definitions need their own collective noun. He respectfully submits “opinion”, as in “an opinion of definitions”, as the natural choice.) This post is not about adding another definition of Big Data. It is about considering the operational and architectural implications of calling something Big Data.

http://www.xkcd.com/1429/
Copyright XKCD.

So grab your definition(s) of choice and a representative handful of your data, and consider the following:

New SnapLogic Community: For Developers By Developers

Today’s post is from SnapLogic summer intern, Rishabh Mehan: My name is Rishabh Mehan and I’m currently a student at New York Institute of Technology. I’ve been doing computer programming/software development for 8 years and this summer I’ve been working at SnapLogic as an intern. My main focus has been the new SnapLogic Developer Community, which went live with our Summer 2014 release.

One of the things that excited me the most about what we’re working on at SnapLogic (other than Elastic Integration, Big Data and powering cloud analytics, of course) is that we’re setting out to give our customers the ability to move to the cloud and expand the kinds of data and application integrations that are possible.

Our new SnapLogic Developer Community was created to make it easier for developers to expand the current list of Snaps according to their needs, as well as to have the ability to create a completely new Snap. With a very simple approach, the SnapLogic Developer Community provides a knowledge base and environment to share ideas for developing on our cloud integration platform.

The Developer Community provides a base for collaborative learning, and our team and other developers will always be there to help you, as well as to ask you for help. This is how developers work. The Community is currently organized into three segments:

  1. Get Set
  2. Get Started
  3. Get Collaborative

Get Set

  • Brief overview of the architecture
  • Introduction to the technology and terminology
  • Snaps and pipelines
  • Set up the on-premises Snaplex

Get Started

  • Set up your developer environment
  • Snap Development
  • Demo Snaps and guides
  • Documentation for your reference

Get Collaborative

  • Community forum to discuss your issues
  • Post your responses and help others
  • Learn about what other developers are doing

After being provisioned as a Developer in your SnapLogic organization, you are all set to enter the Developer Community and go through each and every document available.

Additionally, we have developed easy multi-platform installers for developers, which help you set up your own on-premises Snaplex and develop without depending on any other resources. The package also provides you with the Snap Developer Kit (SDK) and our Snaps for developers. You can easily reference them, use them and, if you’d like to, modify them.

Here’s an example of a SnapLogic Windows Installer:


All of the documentation will guide you through the process, so even if you don’t know anything about Snap development, you really don’t have to worry. So log in and get started today. We’re looking forward to hearing your feedback!

Top 5 SnapLogic Elastic iPaaS Architecture Posts

Over the last few months we’ve written a lot about the architecture of SnapLogic’s new elastic integration platform as a service (iPaaS). In 2010, the company recognized that in order to meet the changing application and data integration requirements of the Social, Mobile, Analytics (Big Data), Cloud and Internet of Things (SMACT) era, a new platform would be required. A few of the organizing principles that went into building the SnapLogic Elastic Integration Platform included:

  • It had to be able to handle structured and unstructured data;
  • It had to be built for data and API streaming, but also have the ability to handle batch data integration requirements;
  • It had to go beyond point-to-point integration and deliver cloud-to-cloud and cloud-to-ground orchestration;
  • It had to be built for hybrid integration use cases, with a software-defined architecture;
  • It had to be delivered as a multi-tenant cloud service;
  • It had to be able to scale out to take advantage of cloud and big data infrastructures;
  • And finally, it had to have a user experience designed for “citizen integrators” without sacrificing power, performance and the ability to handle complex, multi-point data and application integration use cases.

Here are the top 5 posts we’ve written in the past few months (based on blog traffic) that will help you better understand the architecture of the SnapLogic Elastic Integration Platform, what we value, and why we believe heritage matters when it comes to cloud integration. (Not surprisingly, 3 of the 5 were authored by our Chief Scientist, Greg Benson.)

  1. Technical Advantages of JSON-centric iPaaS
  2. Managing Errors in an iPaaS
  3. Reliability in the SnapLogic iPaaS
  4. Talking Cloud Integration @ iRobot
  5. What to Look for in a Modern Integration Platform

We’re glad you find this information useful. You can also download this technical whitepaper for a deeper dive on the SnapLogic Elastic Integration Platform.

Be sure to subscribe to our blog to receive email updates for new posts, and get ready for some exciting news about the SnapLogic Summer 2014 release that we’ll be announcing tomorrow. It’s going to be our most significant release so far!

Reliability in the SnapLogic iPaaS

Dependable system operation is a requirement for any serious integration platform as a service (iPaaS). Reliability or fault tolerance is often listed as a feature, but it is hard to get a sense of what this means in practical terms. For a data integration project, reliability can be challenging because the platform must connect disparate external services, each of which can fail on its own. In a previous blog post, we discussed how SnapLogic Integration Cloud pipelines can be constructed to manage end point failures with our guaranteed delivery mechanism. In this post, we are going to look at some of the techniques we use to ensure the reliable execution of the services we control.

We broadly divide the SnapLogic architecture into two categories: the data plane and the control plane. The data plane is encapsulated within a Snaplex, and the control plane is a set of replicated distributed servers. This design separation is useful both for data isolation and for reliability, because we can employ different approaches to fault tolerance in each plane.

Data Plane: Snaplex and Pipeline Redundancy
The Snaplex is a cluster of one or more pipeline execution nodes. A Snaplex can reside either in the SnapLogic Integration Cloud or on-premises. The Snaplex is designed to support autoscaling in the presence of increased pipeline load. In addition, the Monitoring Dashboard monitors the health of all Snaplex nodes. In this way, Snaplex node failure can be detected early so that future pipelines are not scheduled on the faulty node. For cloud-based Snaplexes, also known as Cloudplexes, node failures are detected automatically and replacement nodes are made available seamlessly. For on-premises Snaplexes, aka Groundplexes, admin users are notified of the faulty node so that a replacement can be made.
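The post doesn’t show how the Dashboard tracks node health, but heartbeat-based failure detection is easy to sketch. The following Python snippet is a minimal illustration under assumed node names and an assumed timeout, not the actual Dashboard implementation:

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before a node is considered faulty

# Last heartbeat timestamp reported by each Snaplex node (illustrative data).
last_heartbeat = {"node-a": time.time(), "node-b": time.time() - 120}

def healthy_nodes(now=None):
    """Return the nodes that have reported a heartbeat recently enough."""
    now = now or time.time()
    return [node for node, ts in last_heartbeat.items()
            if now - ts < HEARTBEAT_TIMEOUT]

# A scheduler would only place new pipeline executions on healthy nodes.
print(healthy_nodes())  # e.g. ['node-a']
```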

If a Snaplex node fails during a pipeline execution, the pipeline will be marked as failed. Developers can choose to retry failed pipelines, or in some cases, such as recurring scheduled pipelines, simply ignore the failed run. Dividing long-running pipelines into several shorter pipelines can limit exposure to node failure. For highly critical integrations it is possible to build and run replicated pipelines concurrently, so that a single failed replica won’t interrupt the integration. As an alternative, a pipeline can be constructed to stage data in the SnapLogic File System (SLFS) or in an alternate data store such as AWS S3. Staging can remove the need to re-acquire data from a slow source such as AWS Glacier. It also helps with data sources that have high transfer costs or transfer limits, which would make it prohibitive to request the same data multiple times when an upstream end point in a pipeline fails.
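As a rough illustration of the staging idea (not how a SnapLogic pipeline is actually written), the sketch below pulls data from a slow source once, stages it in S3 with boto3, and lets downstream retries read from the staged copy instead of hitting the source again. The bucket name, key, and fetch function are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-staging-bucket"              # hypothetical staging bucket
STAGE_KEY = "staged/source-extract.json"  # hypothetical staging key

def fetch_from_slow_source():
    """Placeholder for an expensive or rate-limited upstream read (e.g. Glacier)."""
    return b'{"records": [1, 2, 3]}'

def get_staged_data():
    """Read the staged copy if it exists; otherwise fetch once and stage it."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=STAGE_KEY)
        return obj["Body"].read()
    except s3.exceptions.NoSuchKey:
        data = fetch_from_slow_source()
        s3.put_object(Bucket=BUCKET, Key=STAGE_KEY, Body=data)
        return data

# Downstream processing (and any retries) work from the staged copy,
# so the slow or costly source is read at most once.
payload = get_staged_data()
```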

Control Plane: Service Reliability
SnapLogic’s “control plane” resides in the SnapLogic Integration Cloud, which is hosted in AWS. By decoupling control from data processing, we can provide differentiated approaches to reliability. All control plane services are replicated for both scalability and reliability. All REST-based front-end servers sit behind the AWS Elastic Load Balancing (ELB) service, so if any control plane service fails, there is always a pool of replicated services available to serve client and internal requests. The scheduling service described next is an example of how redundancy helps with both reliability and scalability.

We employ ZooKeeper to implement our reliable scheduling service. An important aspect of the SnapLogic iPaaS is the ability to create scheduled integrations, and it is important that these scheduled tasks are initiated at the specified times or intervals. We implement the scheduling service as a collection of servers. All of the servers can accept incoming CRUD requests on tasks, but only one server is elected as the leader, using a ZooKeeper-based leader election algorithm. If the leader fails, a new leader is elected immediately and resumes scheduling tasks on time, so that no scheduled task is missed. In addition to using ZooKeeper for leader election, we also use it to allow the follower schedulers to notify the leader of task updates.
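The post doesn’t show the scheduler code, but ZooKeeper’s leader-election recipe is available in client libraries such as kazoo for Python. A minimal sketch, with an assumed ZooKeeper address and node identifier, looks like this:

```python
from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble (address is illustrative).
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

def run_scheduler():
    """Only the elected leader executes this; it fires scheduled tasks on time."""
    print("I am the leader; scheduling tasks...")
    # ... scheduling loop would go here ...

# Each scheduler server contends for leadership under the same election path.
# If the current leader dies, another contender is elected and run_scheduler()
# is invoked on it, so no scheduled task is left unowned for long.
election = zk.Election("/scheduler/leader", identifier="scheduler-node-1")
election.run(run_scheduler)  # blocks until elected, then calls run_scheduler
```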

We also utilize a suite of replicated data storage technologies to ensure that control data and metadata are stored reliably. We currently use MongoDB clusters for metadata and encrypted AWS S3 buckets to implement SLFS. We don’t expose S3 directly, but rather provide a virtual hierarchical view of the data. This allows us to track and properly authorize access to the SLFS data.

For MongoDB we have developed a reliable read-modify-write strategy to handle metadata updates in a non-blocking manner using findAndModify. Our approach results in highly efficient, non-conflicting updates, yet remains safe in the presence of a write conflict. In a future post we will provide a technical description of how this works.
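Until that post appears, here is a hedged sketch of what an optimistic, non-blocking read-modify-write loop built on findAndModify can look like in Python with pymongo. The collection, fields, and update logic are made-up examples, not SnapLogic’s metadata schema:

```python
from pymongo import MongoClient, ReturnDocument

coll = MongoClient()["metadata"]["pipelines"]   # illustrative database/collection

def update_pipeline(pipeline_id, modify):
    """Optimistic read-modify-write: retry if a concurrent writer wins the race."""
    while True:
        doc = coll.find_one({"_id": pipeline_id})
        new_doc = modify(dict(doc))
        new_doc.pop("_id", None)                 # _id is immutable, don't $set it
        new_doc["version"] = doc["version"] + 1
        # find_one_and_update is pymongo's wrapper around findAndModify; the
        # version check in the filter makes the write conditional, so a
        # conflicting concurrent update just triggers another loop iteration.
        result = coll.find_one_and_update(
            {"_id": pipeline_id, "version": doc["version"]},
            {"$set": new_doc},
            return_document=ReturnDocument.AFTER,
        )
        if result is not None:
            return result
```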

The Benefits of Software-Defined Integration
By dividing the SnapLogic elastic iPaaS architecture into a data plane and a control plane, we can employ effective, but different, reliability strategies in each. In the data plane we both identify and correct Snaplex server failures and allow users to implement highly reliable pipelines as needed. In the control plane we use a combination of server replication, load balancing and ZooKeeper to ensure reliable system execution. Our “one size does not fit all” approach allows us to modularize reliability and employ targeted testing strategies. Reliability is not a product feature, but an intrinsic design property in every aspect of the SnapLogic Integration Cloud.

Powering Elastic iPaaS with JSON and REST

In an earlier post we summarized the key concepts behind SnapLogic “Snaps.”

  • Snaps are modular collections of integration components built for a specific application or data source.
  • They shield business users and developers from much of the complexity of the underlying application, data model and service.
  • Snaps are easy to build and modify and are available for analytics and big data sources, identity management, social media, online storage, ERP, databases and technologies such as XML, JSON, Oauth, SOAP and REST.

You can check out what Snaps are available on the SnapStore.

In this post, we’ll cover a related topic from our technical whitepaper: Web Standards. Purpose-built for the cloud, the SnapLogic Integration Cloud has embraced modern technologies and made them a native part of the platform. JavaScript Object Notation (JSON) and Representational State Transfer (REST) are the key building blocks of our scalable web architecture.

JSON:
The internal representation of data inside a SnapLogic Integration Cloud pipeline is JSON. This lightweight data-interchange format gives the cloud integration platform the flexibility to handle structured as well as unstructured data. In Greg Benson’s post Technical Advantages of JSON-centric iPaaS, he notes:

“At SnapLogic we recognize that while modern web services and data stores are heading toward JSON, businesses still use relational databases for normalized, transactional data. The great thing about documents is that they are a superset of relational records. When converting a record into a document we combine the column names from the schema with the field data to create a key/value document. This allows us to consume records and output records as needed, but still get all the advantages of the document model. Furthermore, we support traditional ETL operations such as JOIN, AGGREGATE, and SORT on documents. This allows primarily relational data to be treated seamlessly, but also extends these ETL operations in a way that support hierarchical documents.”
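The record-to-document conversion described in the quote amounts to zipping a relational schema with row values. A toy sketch in Python, with an invented table and columns, shows the idea:

```python
import json

# A relational result set: column names from the schema plus row tuples.
columns = ["order_id", "customer", "total"]
rows = [(1001, "Acme Corp", 250.00),
        (1002, "Globex", 75.50)]

# Each record becomes a key/value document, a natural superset of the row.
documents = [dict(zip(columns, row)) for row in rows]

print(json.dumps(documents[0]))
# {"order_id": 1001, "customer": "Acme Corp", "total": 250.0}
```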

REST:
By default, every deployed pipeline is eligible for invocation via REST. For an administrator, exposing a pipeline as an API is a matter of flipping a switch and granting the requisite permissions (authentication and authorization) to clients. Typical clients of these APIs are trading partners and mobile consumers looking to consume business data or business processes. For example, a trading partner may need real-time insight into inventory to ensure they can make commitments to customers who expect certain product inventory levels. Or, a customer may want to check the status of their order through a mobile application; this lookup involves querying your shipping module with the shipment ID.
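From the client’s point of view, calling a pipeline exposed as an API looks like any other HTTP request. Here is a hedged example using Python’s requests library; the URL, token, and parameter name are placeholders, not actual SnapLogic endpoints:

```python
import requests

# Hypothetical endpoint for a pipeline exposed as an API, plus a bearer token
# issued by the administrator; both are placeholders.
PIPELINE_URL = "https://integration.example.com/pipelines/shipment_status/run"
TOKEN = "example-bearer-token"

response = requests.get(
    PIPELINE_URL,
    params={"shipment_id": "SHP-12345"},          # query parameter read by the pipeline
    headers={"Authorization": f"Bearer {TOKEN}"},  # authn/authz granted by the admin
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. the shipment status document produced by the pipeline
```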

In this Integration Developer News review of the SnapLogic Integration Cloud Spring 2014, Maneesh Joshi notes:

“We’ve found customers are really interested in APIs for a number of reasons. They want to expose their business process as APIs and people find they can also use them to monitor the performance and reliability for their cloud integration [with off-premises SaaS].”

He describes SnapLogic’s API strategy this way:

“For developers, you can turn a SnapLogic pipeline quickly into an API, which lets them cut way down on coding and complexity on integrating mobile apps with backend systems. For operations, we provide a graphical dashboard that lets IT monitor these integrations for scale, performance and whatever.”

Download this technical whitepaper to learn more about the SnapLogic Integration Cloud architecture. Speaking of web standards, here’s a demonstration of the SnapLogic XSLT Snap:

Managing Errors in an iPaaS

In the world of distributed web services and Big Data, errors and faults cannot be treated as rare occurrences. They are commonplace and must be understood and managed. Errors can take the form of bad or missing data, and faults can arise from web service congestion or failures. The role of an iPaaS (Integration Platform as a Service) is to help users perform integration tasks in the presence of errors and faults.

First, let’s be clear on the difference between errors and faults:

  • An integration error usually refers to some form of data inconsistency due to data corruption, missing data, or unexpected data formats. The communication channels and integration pipeline execution all work as expected, but the data itself has a problem.
  • A fault, on the other hand, is the result of a connection or service failure. In some cases, the iPaaS servers themselves may experience failures that result in faults, which must also be managed.

Error Handling

The SnapLogic Integration Cloud architecture provides both data error handling and fault tolerance in order to ensure the reliable execution of integration data flows, called pipelines. Pipelines can be designed to handle bad data using error views, and pipeline segments can be boxed in a way to provide guaranteed delivery in the presence of network or service end point failure. In the data plane, our Snaplex clusters are designed to detect node failure and to ensure there is always a stable set of Snaplex nodes available to run pipelines. Finally, we also provide redundancy and reliability throughout the control plane. In this blog, we will share details on error handling. Please be on the lookout for an additional blog post that explains how SnapLogic handles fault tolerance and resiliency in the architecture.

Elastic Snaplex

Error Views
SnapLogic Integration Cloud pipelines consist of Snaps that are connected together using views. Snaps can have zero or more input views and zero or more output views. For example, a database reader will have zero input views and one output view. A router Snap will have one input view and multiple output views. In addition to the output views, most Snaps can be configured to have an error view. Snap errors will usually cause a Snap and the pipeline to fail early. However, in some cases, a pipeline developer may want to manage error conditions explicitly. A common error is missing or bad data. If there is no error view, missing or bad data will cause the pipeline to fail. With an error view, however, the error condition is treated like data and can be passed on to a pipeline segment. In this way, a pipeline segment can be used to log the error or to fix up bad data. This powerful feature allows developers to seamlessly handle both good and bad data using Snaps and pipeline segments. In some cases, certain types of faults can be communicated as errors so that the pipeline developer can build reliable pipeline execution; for example, a database connection fault can be surfaced as a document sent to an error view.

As an example, some records in a CSV file may be missing one or more fields. Our CSV Reader Snap can detect these error records and send them to the error view. The error records can be sent to a pipeline segment that can attempt to clean up the data or log it for later inspection. As another example, the Email Sender Snap has one available error view. If the Email Sender error view is enabled, it will be given a document for each bad address or unsent message. In this way, the pipeline developer can create a report for the bad addresses or use the bad addresses to update a contact database so that no future attempts will be made to send to the address.
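Outside of the SnapLogic designer, the same pattern of routing bad records to a separate path is easy to picture. The Python sketch below, with an invented CSV layout and required fields, separates good rows from error rows much like a Snap with an error view would:

```python
import csv
import io

REQUIRED_FIELDS = ["name", "email", "amount"]   # invented schema for the example

raw = "name,email,amount\nAlice,alice@example.com,10\nBob,,5\n"

good, errors = [], []
for row in csv.DictReader(io.StringIO(raw)):
    # Rows missing required fields flow to the "error view"; the rest continue on.
    if all(row.get(f) for f in REQUIRED_FIELDS):
        good.append(row)
    else:
        errors.append({"row": row, "reason": "missing required field"})

print(len(good), "good rows,", len(errors), "error rows")
```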

Guaranteed Delivery
The SnapLogic Integration Cloud enables the integration of several cloud services. Most modern cloud services are exposed via REST or SOAP interfaces over HTTP. However, the public network is susceptible to failures that can cause service requests to be dropped. To help manage service connection failures, the SnapLogic Integration Cloud supports guaranteed delivery of documents. To enable guaranteed delivery, the pipeline developer marks a streaming pipeline segment. This boxed segment can be configured with a retry policy. Documents that are to be sent to the boxed segment are temporarily held in persistent storage. Once a document has made it through the entire segment, an acknowledgement is sent to the pipeline manager, which allows the document to be removed from persistent storage. If the segment or a segment endpoint fails, the retry policy is invoked to retrieve the document from persistent storage and send it through the segment again. Retry policies such as linear wait times or exponential backoff can be employed to manage most intermittent endpoint failure scenarios.
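The retry policies mentioned here are standard backoff patterns. As a hedged illustration, not the platform’s actual retry machinery, this Python sketch persists a document before sending, retries with exponential backoff, and only deletes the staged copy once the whole segment has acknowledged it (the staging directory and send function are placeholders):

```python
import json
import os
import time

STAGING_DIR = "staged_docs"          # illustrative local persistent store
os.makedirs(STAGING_DIR, exist_ok=True)

def send_through_segment(doc):
    """Placeholder for pushing a document through the boxed pipeline segment."""
    ...

def deliver(doc_id, doc, max_retries=5):
    path = os.path.join(STAGING_DIR, f"{doc_id}.json")
    with open(path, "w") as f:
        json.dump(doc, f)            # hold the document until it is acknowledged

    for attempt in range(max_retries):
        try:
            send_through_segment(doc)
            os.remove(path)          # acknowledgement: safe to drop the staged copy
            return True
        except Exception:
            time.sleep(2 ** attempt) # exponential backoff: 1s, 2s, 4s, ...
    return False                     # staged copy remains for a later retry
```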

For more details on the architecture of the SnapLogic Integration Cloud, check out our series of videos explaining our product and platform. Below, you can see the SnapLogic Integration Cloud for Developers:

SnapLogic Integration Cloud Architecture in Review

Earlier in the week I posted the first of a series of whiteboard presentations featuring Andy Buteau, SnapLogic’s Director of Engineering: Going Beyond Point-to-Point Cloud Integration with SnapLogic. In this video, Andy walks through the SnapLogic Integration Cloud Architecture, which includes: