Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform

The SnapLogic Elastic Integration Platform connects your enterprise data, applications, and APIs by building drag-and-drop data pipelines. Each pipeline is made up of Snaps, intelligent connectors that users drag onto a canvas and “snap” together like puzzle pieces.

A SnapLogic pipeline being built and configured

These pipelines are executed on a Snaplex, an application that runs on a multitude of platforms: on a customer’s infrastructure, on the SnapLogic cloud, and most recently on Hadoop. A Snaplex that runs on Hadoop can execute pipelines natively in Spark.

The SnapLogic platform is known for its easy-to-use, self-service interface, made possible by our team of dedicated engineers (we’re hiring!). We work to apply the industry’s best practices so that our clients get the best possible end product — and testing is fundamental.

Snaplex Thresholds and Pipeline Queuing

As the integration market continues to mature, there is a constant demand to support and process more complex data and process flows. When applications process large volumes of data, they often run out of resources and become unresponsive, leaving users confused and unhappy. Gauging resource usage and alerting users with appropriate messages are among the most important qualities of well-designed software. In the Winter 2016 release of the SnapLogic Elastic Integration Platform, we introduced pipeline queuing, which lets users define thresholds for their Snaplexes; when a threshold is reached, further execution requests are queued until resources become available.
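As a rough illustration of the mechanism, here is a minimal sketch of threshold-based queuing in Python. The class, method, and threshold names are hypothetical, not SnapLogic’s implementation.

```python
from collections import deque

class PipelineScheduler:
    """Illustrative only: queue pipeline requests once a Snaplex threshold is hit."""

    def __init__(self, max_active):
        self.max_active = max_active   # threshold configured for the Snaplex
        self.active = set()
        self.queue = deque()

    def submit(self, pipeline_id):
        if len(self.active) < self.max_active:
            self.active.add(pipeline_id)       # capacity available: run immediately
        else:
            self.queue.append(pipeline_id)     # threshold reached: queue the request

    def on_pipeline_finished(self, pipeline_id):
        self.active.discard(pipeline_id)
        if self.queue:                         # resources freed: promote the next queued request
            self.active.add(self.queue.popleft())
```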

Puzzle Pieces: Snaplex Names Explained

Welcome to Puzzle Pieces, a periodic series exploring the “Why?” of SnapLogic’s platform. To kick things off, let’s talk Snaplexes, which have sometimes proved puzzling. (Editor’s note: future installments of Puzzle Pieces will be rigorously scrubbed for alliterative excesses).

The SnapLogic Elastic Integration Platform is divided into two main parts: the Control Plane and the Data Plane. As a customer, you come into contact with the Control Plane through the SnapLogic web interface. Behind the scenes, the Control Plane also handles talking to the Data Plane and coordinating the flow of data in pipelines.

The pipelines actually run in the Data Plane. The container that handles running a particular pipeline is called a Snaplex. A Snaplex (or Plex) is a collection of computing resources – perhaps one virtual machine, perhaps an entire server rack. The Snaplex types you may come across are described in the posts that follow.


The SnapLogic Hadooplex: Achieving Elastic Scalability Using YARN

YARN, a major advancement in Hadoop 2.0, is a resource manager that separates out the execution and processing management from the resource management capabilities of MapReduce. Like an operating system on a server, YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform.

Developers are no longer limited to writing multi-pass MapReduce programs, with their drawbacks such as high latency, when a better option can be modeled using a directed acyclic graph (DAG) approach.

Any application, including the likes of Spark, can be deployed onto an existing Hadoop cluster and take advantage of YARN for scheduling and resource allocation. This is also the basic ingredient of a Hadooplex in SnapLogic: it uses YARN to achieve elastic scale-out and scale-in for integration jobs.

The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks.

SnapLogic’s ApplicationMaster is responsible for negotiating resources with the ResourceManager. The control plane in SnapLogic is the brain (read this post on software defined integration); it holds all critical information and helps make logical decisions for scale-out and scale-in. The Hadooplex is the actual application that runs the workload.

In the diagram below, you can see that the Hadooplex reports its workload information to the control plane at regular intervals. The ApplicationMaster gets the load information from the control plane, also at regular intervals.

[Diagram: Hadooplex reporting workload to the control plane]

As the workload increases, the ApplicationMaster asks the YARN ResourceManager to spin up more Hadooplex nodes, one at a time, as shown in the diagram below. This scale-out continues dynamically until either the workload starts decreasing or the maximum number of allowed Hadooplex nodes has been reached.

[Diagram: Hadooplex nodes scaling out]

As the workload decreases, the nodes start spinning down. This is how SnapLogic achieves elastic scaling based on the workload volumes within a Hadoop cluster utilizing the YARN ResourceManager. This is possible only if an application is a native YARN application. (Read about the importance of YARN-native here.)
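The behavior described above amounts to a simple control loop. The sketch below is conceptual only; the real ApplicationMaster uses YARN’s container-request APIs, and the function and parameter names here are assumptions made for illustration.

```python
def adjust_hadooplex_size(workload, current_nodes, max_nodes,
                          scale_out_threshold, request_node, release_idle_node):
    """Conceptual scale-out/scale-in decision, one node at a time (illustrative only)."""
    if workload > scale_out_threshold and current_nodes < max_nodes:
        request_node()        # ask the YARN ResourceManager for one more container
        return current_nodes + 1
    if workload < scale_out_threshold and current_nodes > 1:
        release_idle_node()   # an idle Hadooplex node can spin back down
        return current_nodes - 1
    return current_nodes
```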


This post originally appeared on LinkedIn.

SnapLogic Big Data Processing Platforms

One of our goals at SnapLogic is to match data flow execution requirements with an appropriate execution platform. Different data platforms have different benefits. The goal of this post is to explain the nature of data flow pipelines and how to choose an appropriate data platform. In addition to categorizing pipelines, I will explain our current supported execution targets and our planned support for Apache Spark.

First, some preliminaries. All data processed by SnapLogic pipelines is handled natively in an internal JSON format. We call this document-oriented processing. Even flat, record-oriented data is converted into JSON for internal processing, which lets us handle both flat and hierarchical data seamlessly. Pipelines are constructed from Snaps, and each Snap encapsulates specific application or technology functionality. The Snaps are connected together to carry out a data flow process, and pipelines are built in our visual Designer. Some Snaps provide connectivity, such as connecting to databases or cloud applications. Some Snaps allow for data transformation, such as filtering out documents or adding, removing, and modifying fields. We also have Snaps that perform more complex operations such as sort, join, and aggregate.
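For example, a flat record and its document form might look like this (the field names and nested address are made up for the illustration):

```python
import csv, json, io

flat = "id,name,amount\n42,Acme Corp,1250.00\n"

# Read a flat, record-oriented row...
row = next(csv.DictReader(io.StringIO(flat)))

# ...and treat it as a JSON document, which can also carry nested structure.
document = {
    "id": int(row["id"]),
    "name": row["name"],
    "amount": float(row["amount"]),
    "address": {"city": "San Mateo", "state": "CA"},  # hierarchical data fits naturally
}
print(json.dumps(document, indent=2))
```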

Given this setup, we can categorize pipelines into two types: streaming and accumulating. In a streaming pipeline, documents can flow independently. The processing of one document is not dependent on another document as they flow through the pipeline. Such streaming pipelines have low memory requirements because documents can exit the pipeline once they have reached the last Snap. In contrast, an accumulating pipeline requires that all documents from the input source must be collected before result documents can be emitted from a pipeline. Pipelines with sort, join and aggregate are accumulating pipelines. In some cases, a pipeline can be partially accumulating. Such accumulating pipelines can have high memory requirements depending on the number of documents coming in from an input source.
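A toy sketch makes the distinction concrete: a streaming Snap such as a filter can emit each document as it arrives, while an accumulating Snap such as sort must buffer all of its input before emitting anything. These functions are illustrative stand-ins, not Snap implementations.

```python
def filter_snap(documents, predicate):
    """Streaming: each document flows through independently, with constant memory use."""
    for doc in documents:
        if predicate(doc):
            yield doc

def sort_snap(documents, key):
    """Accumulating: every input document must arrive before any output is emitted."""
    buffered = list(documents)          # memory grows with the size of the input
    return sorted(buffered, key=key)
```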

Now let’s turn to execution platforms. SnapLogic has an internal data processing platform called a Snaplex. Think of a Snaplex as a collection of processing nodes or containers that can execute SnapLogic pipelines. We have a few flavors of Snaplexes:

  • A Cloudplex is a Snaplex that we host in the cloud; it can autoscale as pipeline load increases.
  • A Groundplex is a fixed set of nodes installed on-premises or in a customer VPC. With a Groundplex, customers can do all of their data processing behind their firewall so that data does not leave their infrastructure.

We are also expanding our support for external data platforms. We have recently released our Hadooplex technology, which allows SnapLogic customers to use Hadoop as an execution target for SnapLogic pipelines. A Hadooplex leverages YARN to schedule Snaplex containers on Hadoop nodes in order to execute pipelines. In this way, we can autoscale inside a Hadoop cluster. Recently we introduced SnapReduce 2.0, which enables a Hadooplex to translate SnapLogic pipelines into MapReduce jobs. A user builds a designated SnapReduce pipeline and specifies HDFS files as input and output. These pipelines are compiled to MapReduce jobs to execute on very large data sets that live in HDFS. (Check out the demonstration in our recent cloud and big data analytics webinar.)
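SnapLogic’s actual pipeline-to-MapReduce compilation isn’t shown here, but the shape of the resulting job is the familiar map and reduce pair over records in HDFS. The sketch below is a generic, hypothetical aggregate expressed that way, simulated locally rather than on a cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Emit (key, value) pairs, e.g. grouping sales documents by region."""
    yield record["region"], record["amount"]

def reduce_phase(key, values):
    """Aggregate all values for a key, e.g. the total amount per region."""
    return key, sum(values)

# Local simulation of the MapReduce data flow, for illustration only.
records = [{"region": "west", "amount": 10}, {"region": "east", "amount": 7},
           {"region": "west", "amount": 5}]
pairs = sorted(p for r in records for p in map_phase(r))
results = [reduce_phase(k, (v for _, v in grp))
           for k, grp in groupby(pairs, key=itemgetter(0))]
print(results)  # [('east', 7), ('west', 15)]
```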

Finally, as we announced last week as part of Cloudera’s real-time streaming announcement, we’ve begun work on our support for Spark as a target big data platform. A Sparkplex will be able to utilize SnapLogic’s extensive connectivity to bring data into and out of Spark RDDs (Resilient Distributed Datasets). In addition, similar to SnapReduce, we will allow users to compile SnapLogic pipelines into Spark code so the pipelines can run as Spark jobs. We will support both streaming and batch Spark jobs. By including Spark in our data platform support, we will give our customers a comprehensive set of options for pipeline execution.
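As a rough sketch of what a pipeline running as a Spark job could look like, here is a small PySpark example with a streaming-style filter stage followed by an accumulating sort over an RDD. It only illustrates Spark’s RDD API; it is not Sparkplex-generated code, and the data and field names are invented.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "pipeline-sketch")

documents = sc.parallelize([
    {"id": 1, "status": "ok", "latency": 120},
    {"id": 2, "status": "error", "latency": 340},
    {"id": 3, "status": "ok", "latency": 45},
])

# A filter Snap maps naturally onto a streaming-style RDD transformation...
ok_docs = documents.filter(lambda d: d["status"] == "ok")

# ...while a sort Snap is an accumulating operation over the whole dataset.
by_latency = ok_docs.sortBy(lambda d: d["latency"])

print(by_latency.collect())
sc.stop()
```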

Choosing the right big data platform will depend on many factors: data size, latency requirements, connectivity and pipeline type (streaming versus accumulating). Here are some guidelines for choosing a particular big data integration platform:

Cloudplex

  • Cloud-to-cloud data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Groundplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Hadooplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines can operate on arbitrary data sizes via MapReduce

Sparkplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Allow for Spark connectivity to all SnapLogic accounts
  • Streaming unlimited documents
  • Accumulating pipelines can operate on data sizes that can fit in Spark cluster memory

Note that recent work in the Spark community has increased support for out-of-core computations, such as sorting. This means that accumulating pipelines that are currently only suitable for MapReduce execution may be supported in Spark as out-of-core Spark support becomes more general. The Hadooplex and Sparkplex have added reliable execution benefits so that long-running pipelines are guaranteed to complete.

At SnapLogic, our goal is to allow customers to create and execute arbitrary data flow pipelines on the most appropriate data platform. In addition, we provide a simple and consistent graphical UI for developing pipelines which can then execute on any supported platform. Our platform agnostic approach decouples data processing specification from data processing execution. As your data volume increases or latency requirements change, the same pipeline can execute on larger data and at a faster rate just by changing the target data platform. Ultimately, SnapLogic allows you to adapt to your data requirements and doesn’t lock you into a specific big data platform.

New SnapLogic Community: For Developers By Developers

Today’s post is from SnapLogic summer intern, Rishabh Mehan: My name is Rishabh Mehan and I’m currently a student at New York Institute of Technology. I’ve been doing computer programming/software development for 8 years and this summer I’ve been working at SnapLogic as an intern. My main focus has been the new SnapLogic Developer Community, which went live with our Summer 2014 release.

One of the things that excites me most about what we’re working on at SnapLogic (other than Elastic Integration, Big Data, and powering cloud analytics, of course) is that we’re setting out to give our customers the ability to move to the cloud and expand the kinds of data and application integrations that are possible.

Our new SnapLogic Developer Community was created to make it easier for developers to expand the current list of Snaps according to their needs, as well as to create completely new Snaps. With a very simple approach, the SnapLogic Developer Community provides a knowledge base and an environment to share ideas for developing on our cloud integration platform.

The Developer Community provides a base for collaborative learning, and our team and other developers will always be there to help you, as well as to ask you for help. This is how developers work. The Community is currently organized into three segments:

  1. Get Set
  2. Get Started
  3. Get Collaborative

Get Set

  • Brief overview of the architecture
  • Introduction to the technology and terminology
  • Snaps and pipelines
  • Set up the on-premises Snaplex

Get Started

  • Set up your developer environment
  • Snap Development
  • Demo Snaps and guides
  • Documentation for your reference

Get Collaborative

  • Community forum to discuss your issues
  • Post your responses and help others
  • Learn about what other developers are doing

After being provisioned as a Developer in your SnapLogic organization, you are all set to enter the Developer Community and go through each and every document available.

Additionally, we have developed easy multi-platform installers for developers, which help you set up your own on-premises Snaplex and develop without depending on any other resources. The package also provides you with the Snap Developer Kit (SDK) and our Snaps for developers. You can easily reference them, use them and, if you’d like to, modify them.

Here’s an example of a SnapLogic Windows Installer:

[Screenshot: SnapLogic Windows Installer]

All of the documentation will guide you through the process, so even if you don’t know anything about developing Snaps, you really don’t have to worry. So log in and get started today. We’re looking forward to hearing your feedback!

Software Defined Integration

The primary concept of software-defined networking (SDN) is “the decoupling of the system that makes decisions about where traffic is sent (the control plane) from the underlying systems that forward traffic to the selected destination (the data plane).” The Open Networking Foundation defines the SDN architecture as:

  • Directly programmable
  • Agile
  • Centrally managed
  • Programmatically configured
  • Open standards-based and vendor-neutral

As we outlined in this technical whitepaper, the SnapLogic Integration Cloud is architected on SDN concepts and built with a specific set of values in mind. The “control plane” controls where and how data is processed based on user configuration and preferences. The “data plane” (aka the Snaplex) does the actual processing of data as per instructions received from the control plane. No data is stored in the SnapLogic Integration Cloud; instead, data is streamed between systems via the Snaplex. In this post we’ll provide an overview of the SnapLogic data plane. You can learn more about the control plane here and Snaps here. We’ll dive into security in the next post.

Snaplex: The Data Plane

A Snaplex is the data processing component of the SnapLogic Integration Cloud. It is the “data plane.” Customers can deploy one or many Snaplexes as required to run pipelines and process data. A Snaplex consists of one or more Nodes and comes in two flavors: on-premises (aka “Groundplex”) and in the cloud (aka “Cloudplex”).

  • Cloudplex: All Cloudplexes run inside the SnapLogic Integration Cloud. Customers use the Manager and the Monitoring Dashboard to administer their Cloudplex, while the SnapLogic DevOps team looks after infrastructure key performance indicators (KPIs) such as uptime. Customers needing to run integrations that orchestrate across cloud applications (e.g. Salesforce, ServiceNow, Workday) with no on-premises connections will not require any software to run behind their firewall.
  • Groundplex:  Customers needing on-premises connectivity (e.g. SAP, Oracle, Microsoft Dynamics AX, etc.) will need a Groundplex, which runs behind the firewall. Although they run on private or virtual private data centers, Groundplexes are managed remotely by the SnapLogic Integration Cloud control plane (e.g. heartbeat monitoring, software upgrades, etc.).

The Snaplex can elastically expand and contract based on the data traffic flowing through it. The unit of scalability inside a Snaplex is a Java virtual machine (JVM), referred to as a “Node.” The control plane has built-in “smarts” to automatically scale the Snaplex out and in, in order to handle variable traffic loads. For instance, each Snaplex is initialized with a configurable minimum number of Nodes (say one, for example). Once the utilization of this one node reaches a certain configurable threshold (say a certain number of pipelines running or a certain percentage of CPU or memory utilization per node) due to a spike in traffic, a new Node is automatically spun up to handle the additional workload. Once this excess data traffic has been processed and the second Node becomes idle, it gets “torn down” to scale back the Snaplex to its original size.
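Here is a minimal sketch of that scaling decision, assuming hypothetical per-node metrics and provisioning hooks; the actual control plane logic is not public.

```python
def rebalance_snaplex(nodes, max_pipelines_per_node, max_cpu_pct,
                      spin_up_node, tear_down_node):
    """Hypothetical control-plane check: scale a Snaplex out or back in."""
    busiest = max(nodes, key=lambda n: n["cpu_pct"])
    if (busiest["pipelines"] >= max_pipelines_per_node
            or busiest["cpu_pct"] >= max_cpu_pct):
        spin_up_node()                 # spike in traffic: add another JVM Node
    else:
        idle = [n for n in nodes if n["pipelines"] == 0]
        if idle and len(nodes) > 1:
            tear_down_node(idle[0])    # traffic has subsided: shrink back toward the minimum
```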

For more information on the SnapLogic Integration Cloud architecture and how it works, be sure to download this technical whitepaper.