Enterprise IoT: Defensive Development

In the initial post in this series, we talked about how IoT data, once acquired, is still just data. Before diving into building our first demo, it’s worth taking a moment to talk about getting the data to the stage of “once acquired”.

If you’re an existing SnapLogic customer – and if not, why not request a trial? – you’re used to working in an environment where our Snaps work hard to ensure you aren’t confronted by irrational application behavior, and things Just Work. This happens because our talented developers get to spend hundreds or thousands of person-hours creating integrations for a wide variety of applications, APIs, and enterprise data sources.

[Image: xkcd “Good Code” flowchart, courtesy xkcd.com]

For IoT devices, though, we provide thoroughly tested Snaps to enable communication with the device, but we can’t guarantee the device itself won’t do silly things. One such silly thing occurred during the creation of the demo this blog series is going to show you how to build. The demo uses an addressable, color-changing USB LED that can theoretically display any of more than 16 million colors. The included demo software lets you pick any of those colors and see the light change. However, when we integrated it with SnapLogic, we got odd errors back when we asked it to change color.

LNAV: The Log File Navigator

Today I wanted to throw in a shameless plug and highlight a side project from one of our very own engineers, Tim Stack.

For developers, ops, and those who frequently live in the terminal, lnav is a log file viewer that can help you analyze and navigate through your logs. Simply point lnav at a file or a directory and it will decompress the files, detect their format, and index any log messages contained within. The log messages from all files are then displayed in a single view ordered by time, and several hotkeys are available for jumping to messages based on their time or log level. Semantic information is highlighted, and filters may be applied to help focus your investigation. Additional features include pretty-printing of structured data like XML or JSON, a SQL interface for querying logs, and more. To check out the complete list of features and download it for Linux or Mac OS X, head on over to http://lnav.org/


Reliability in the SnapLogic iPaaS

Dependable system operation is a requirement for any serious integration platform as a service (iPaaS). Often, reliability or fault tolerance is listed as a feature, but it is hard to get a sense of what this means in practical terms. For a data integration project, reliability can be challenging because it must connect disparate external services, each of which can fail on its own. In a previous blog post, we discussed how SnapLogic Integration Cloud pipelines can be constructed to handle endpoint failures with our guaranteed delivery mechanism. In this post, we are going to look at some of the techniques we use to ensure the reliable execution of the services we control.

We broadly divide the SnapLogic architecture into two categories: the data plane and the control plane. The data plane is encapsulated within a Snaplex, and the control plane is a set of replicated, distributed servers. This separation is useful both for data isolation and for reliability because we can easily employ different approaches to fault tolerance in each plane.

Data Plane: Snaplex and Pipeline Redundancy
The Snaplex is a cluster of one or more pipeline execution nodes. A Snaplex can reside either in the SnapLogic Integration Cloud or on-premises. The Snaplex is designed to support autoscaling in the presence of increased pipeline load. In addition, the Monitoring Dashboard monitors the health of all Snaplex nodes, so node failures can be detected early and future pipelines are not scheduled on the faulty node. For cloud-based Snaplexes, also known as Cloudplexes, node failures are detected automatically and replacement nodes are made available seamlessly. For on-premises Snaplexes, aka Groundplexes, admin users are notified of the faulty node so that a replacement can be made.

If a Snaplex node fails during a pipeline execution, the pipeline will be marked as failed. Developers can choose to retry failed pipelines, or in some cases, such as recurring scheduled pipelines, the failed run may simply be ignored. Dividing long-running pipelines into several shorter pipelines can limit exposure to node failure. For highly critical integrations, it is possible to build and run replicated pipelines concurrently; that way a single failed replica won’t interrupt the integration. As an alternative, a pipeline can be constructed to stage data in the SnapLogic File System (SLFS) or in an alternate data store such as AWS S3. Staging data can mitigate the need to re-acquire data from a data source, for example, if a data source is slow (such as AWS Glacier). Also, some data sources have higher transfer costs or transfer limits that would make it prohibitively expensive to request data multiple times when the upstream endpoint in a pipeline fails.
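As a rough illustration of the staging idea, here is a minimal Python sketch (using boto3) in which data from a slow or rate-limited source is fetched at most once per run, and retried downstream processing reads the staged copy from S3. The bucket name, key layout, and fetch_from_source callable are assumptions for the example, not SnapLogic APIs.

```python
# Rough sketch of the staging pattern described above: fetch from a slow or
# rate-limited source at most once, stage the result in S3, and let any
# retried downstream processing read the staged copy instead of the source.
# The bucket name, key layout, and fetch_from_source callable are assumptions.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
STAGING_BUCKET = "example-staging-bucket"  # assumed bucket name


def stage_once(run_id: str, fetch_from_source) -> str:
    """Fetch source data at most once per run, staging it under a run-scoped key."""
    key = f"staging/{run_id}/source-data.bin"
    try:
        # If a previous attempt already staged the data, reuse it.
        s3.head_object(Bucket=STAGING_BUCKET, Key=key)
    except ClientError:
        payload = fetch_from_source()  # the expensive or rate-limited call
        s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=payload)
    return key


def read_staged(key: str) -> bytes:
    """Downstream (possibly retried) steps read only the staged copy."""
    return s3.get_object(Bucket=STAGING_BUCKET, Key=key)["Body"].read()
```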

Control Plane: Service Reliability
SnapLogic’s “control plane” resides in the SnapLogic Integration Cloud, which is hosted in AWS. By decoupling control from data processing, we can provide differentiated approaches to reliability. All control plane services are replicated, both for scalability and for reliability. All REST-based front-end servers sit behind the AWS ELB (Elastic Load Balancing) service, so if any control plane service fails, there is always a pool of replicated services available to handle client and internal requests. This is an example where redundancy helps with both reliability and scalability.

We employ ZooKeeper to implement our reliable scheduling service. An important aspect of the SnapLogic iPaaS is the ability to create scheduled integrations, and it is important that these scheduled tasks are initiated at the specified time or at the required intervals. We implement the scheduling service as a collection of servers. All the servers can accept incoming CRUD requests on tasks, but only one server is elected as the leader, using a ZooKeeper-based leader election algorithm. In this way, if the leader fails, a new leader is elected immediately and resumes scheduling tasks on time, so no scheduled task is missed. In addition to leader election, we also use ZooKeeper to let the follower schedulers notify the leader of task updates.
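The post doesn’t include our implementation (and the control plane is not necessarily written in Python), but the leader-election pattern itself is easy to sketch. Below is a minimal example using the kazoo ZooKeeper client’s Election recipe; the ZooKeeper hosts, election path, node identifier, and dispatch_due_tasks function are placeholders for illustration.

```python
# Minimal leader-election sketch using the kazoo ZooKeeper client.
# Every scheduler node runs this; only the elected leader dispatches tasks.
# Hosts, paths, and dispatch_due_tasks() are illustrative placeholders.
import time
from kazoo.client import KazooClient


def dispatch_due_tasks():
    # Placeholder: fire any scheduled tasks whose start time has arrived.
    pass


def lead_scheduler():
    # Runs only while this node holds leadership. If the process dies, its
    # ephemeral znode expires and ZooKeeper lets another node win the election.
    while True:
        dispatch_due_tasks()
        time.sleep(1)


zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Blocks until this node is elected leader, then runs lead_scheduler().
election = zk.Election("/schedulers/leader", identifier="scheduler-node-1")
election.run(lead_scheduler)
```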

We also utilize a suite of replicated data storage technologies to ensure that control data and metadata are stored reliably. We currently use MongoDB clusters for metadata and encrypted AWS S3 buckets to implement SLFS. We don’t expose S3 directly, but rather provide a virtual hierarchical view of the data, which allows us to track and properly authorize access to the SLFS data.
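To give a feel for what a virtual hierarchical view over a flat object store looks like, here is a small sketch that builds a directory-style listing from S3 key prefixes and delimiters. This is only the general idea, not SnapLogic’s SLFS implementation; the bucket name and path convention are assumptions, and authorization checks are omitted.

```python
# Sketch of a virtual hierarchical view over flat S3 keys, assuming
# path-style keys such as "projects/shared/pipeline.slp". This illustrates
# the idea only; it is not SnapLogic's SLFS implementation, and the bucket
# name is an assumption. Authorization checks are omitted.
import boto3

s3 = boto3.client("s3")
SLFS_BUCKET = "example-slfs-bucket"  # assumed bucket name


def list_directory(virtual_path: str):
    """Return (subdirectories, files) directly under a virtual directory."""
    prefix = virtual_path.strip("/") + "/" if virtual_path.strip("/") else ""
    resp = s3.list_objects_v2(Bucket=SLFS_BUCKET, Prefix=prefix, Delimiter="/")
    subdirs = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    files = [o["Key"] for o in resp.get("Contents", []) if o["Key"] != prefix]
    return subdirs, files
```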

For MongoDB, we have developed a reliable read-modify-write strategy that handles metadata updates in a non-blocking manner using findAndModify. Our approach results in highly efficient, non-conflicting updates, yet remains safe in the presence of write conflicts. In a future post we will provide a technical description of how this works.
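The details of our strategy are left for that future post, but the general shape of a non-blocking, findAndModify-based read-modify-write is worth sketching. The version field, database, and collection names below are assumptions rather than our actual schema; PyMongo’s find_one_and_update is the driver-level wrapper around findAndModify.

```python
# Generic optimistic read-modify-write sketch using PyMongo's
# find_one_and_update, the driver-level wrapper around findAndModify.
# The version field, database, and collection names are assumptions.
from pymongo import MongoClient, ReturnDocument

coll = MongoClient()["example_db"]["metadata"]  # assumed database/collection


def update_metadata(doc_id, mutate):
    """Apply mutate(doc) -> dict of changed fields, retrying on write conflicts.

    mutate() must not set the "version" field; it is managed here.
    """
    while True:
        doc = coll.find_one({"_id": doc_id})
        changes = mutate(doc)
        updated = coll.find_one_and_update(
            # The update only matches if no other writer bumped the version.
            {"_id": doc_id, "version": doc.get("version", 0)},
            {"$set": changes, "$inc": {"version": 1}},
            return_document=ReturnDocument.AFTER,
        )
        if updated is not None:  # our update won; no locks were held
            return updated
        # Another writer got there first: re-read and retry.
```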

The Benefits of Software-Defined Integration
By dividing the SnapLogic elastic iPaaS architecture into the data plane and the control plane, we can employ effective but different reliability strategies for each. In the data plane we both identify and correct Snaplex node failures and allow users to implement highly reliable pipelines as needed. In the control plane we use a combination of server replication, load balancing, and ZooKeeper to ensure reliable system execution. Our one-size-does-not-fit-all approach allows us to modularize reliability and employ targeted testing strategies. Reliability is not a product feature, but an intrinsic design feature in every aspect of the SnapLogic Integration Cloud.