This week we announced our Series D financing, led by Ignition Partners. Frank Artale, Managing Partner at Ignition, had this to say:

“The inescapable shift towards big data infrastructure and cloud applications in the enterprise presents a tremendous opportunity for a new approach to data connectivity and self-service. SnapLogic’s modern approach to data access, shaping and streaming positions the company well to capitalize on a tremendous market opportunity.”

While we continue to deliver a powerful elastic integration platform as a service (iPaaS) for connecting and synchronizing cloud and on-premises business applications (Salesforce, ServiceNow, SAP, Workday, Zuora, etc.), over the past few releases we’ve broadened the capabilities of our unified platform to address the growing need for big data integration. Greg Benson, our Chief Scientist, summarized SnapLogic’s big data processing platforms in this post and was recently featured in an Integration Developer News article, which reviewed our Fall 2014 release. When it comes to Spark, Greg noted in Cloudera’s most recent announcement that:

“SnapLogic is adding native support for Apache Spark as part of CDH in upcoming releases of our Elastic Integration Platform. In the first phase, we are adding a Spark Snap that takes advantage of a Spark cluster co-located with our Snaplex processing engine. This allows SnapLogic pipelines to stream data into a Spark resilient distributed dataset (RDD). Our goal is to make it easy to deliver data to Spark from disparate sources such as conventional databases, cloud applications, APIs and any SnapLogic-supported destination. Further applications of Spark include combining our SnapReduce computations and Spark computations into coordinated workflows via SnapLogic pipelines and providing the data wrangling capabilities that will allow organizations to double the productivity of their data scientists.”

At #Hadoopworld in New York last week, this Wikibon chart of big data tools and technologies in use in 2014 caught my attention: 52% of respondents listed data integration tools among the big data tools and technologies in use today. But whether it’s requirements for data parsing or streaming, the need to handle new and different data formats and locations (i.e. the cloud), or the demand for self-service from citizen integrators, traditional extract, transform and load (ETL) tools built for rows and columns will struggle with big data integration use cases.

Check out this detailed demonstration of SnapReduce and the SnapLogic Hadooplex to see the kinds of advantages the SnapLogic Elastic Integration Platform delivers:

Architecture

One of our goals at SnapLogic is to match data flow execution requirements with an appropriate execution platform. Different data platforms have different benefits. The goal of this post is to explain the nature of data flow pipelines and how to choose an appropriate data platform. In addition to categorizing pipelines, I will explain our current supported execution targets and our planned support for Apache Spark.

First, some preliminaries. All data processed by SnapLogic pipelines is handled natively in an internal JSON format. We call this document-oriented processing. Even flat, record-oriented data is converted into JSON for internal processing. This lets us handle both flat and hierarchical data seamlessly. Pipelines are constructed from Snaps. Each Snap encapsulates specific application or technology functionality. The Snaps are connected together to carry out a data flow process. Pipelines are constructed with our visual Designer. Some Snaps provide connectivity, such as connecting to databases or cloud applications. Some Snaps allow for data transformation such as filtering out documents, adding or removing fields or modifying fields. We also have Snaps that perform more complex operations such as sort, join and aggregate.
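
To make the document model concrete, here is a minimal illustrative sketch in Python, with dicts standing in for JSON documents; the field names are hypothetical. A flat, record-oriented row and a hierarchical record both flow through a pipeline in the same document form:

# Illustrative only: a flat record and a hierarchical record, both represented
# as JSON-style documents (Python dicts here), so the same pipeline logic can
# process either shape.
flat_record = {"orderId": "1234", "city": "San Mateo", "total": 1197.00}

hierarchical_record = {
    "orderId": "1234",
    "shipTo": {"name": "Mr. Undecided", "city": "San Mateo"},
    "items": [
        {"title": "iphone6", "price": 598.00},
        {"title": "note 4", "price": 599.00},
    ],
}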

Given this setup, we can categorize pipelines into two types: streaming and accumulating. In a streaming pipeline, documents can flow independently. The processing of one document is not dependent on another document as they flow through the pipeline. Such streaming pipelines have low memory requirements because documents can exit the pipeline once they have reached the last Snap. In contrast, an accumulating pipeline requires that all documents from the input source must be collected before result documents can be emitted from a pipeline. Pipelines with sort, join and aggregate are accumulating pipelines. In some cases, a pipeline can be partially accumulating. Such accumulating pipelines can have high memory requirements depending on the number of documents coming in from an input source.
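
To illustrate the distinction, here is a small generic sketch in Python (not SnapLogic code; the field names are hypothetical): the streaming version emits each document as soon as it has been filtered and transformed, while the accumulating version must collect every document before the sort can emit anything.

# A minimal sketch (not SnapLogic's implementation) contrasting the two pipeline types.

def streaming_pipeline(docs):
    # Each document is transformed and emitted independently, so memory stays
    # bounded regardless of how many documents flow through.
    for doc in docs:
        if doc.get("price", 0) > 0:                      # filter step
            doc["price_with_tax"] = doc["price"] * 1.08  # transform step
            yield doc

def accumulating_pipeline(docs):
    # A sort must see every input document before it can emit the first result,
    # so memory grows with the size of the input source.
    all_docs = list(docs)                                # accumulate everything
    return sorted(all_docs, key=lambda d: d.get("price", 0))  # sort step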

Now let’s turn to execution platforms. SnapLogic has an internal data processing platform called a Snaplex. Think of a Snaplex as a collection of processing nodes or containers that can execute SnapLogic pipelines. We have a few flavors of Snaplexes:

  • A Cloudplex is a Snaplex that we host in the cloud; it can autoscale as pipeline load increases.
  • A Groundplex is a fixed set of nodes that are installed on-premises or in a customer VPC. With a Groundplex, customers can do all of their data processing behind their firewall so that data does not leave their infrastructure.

We are also expanding our support for external data platforms. We have recently released our Hadooplex technology, which allows SnapLogic customers to use Hadoop as an execution target for SnapLogic pipelines. A Hadooplex leverages YARN to schedule Snaplex containers on Hadoop nodes in order to execute pipelines. In this way, we can autoscale inside a Hadoop cluster. Recently we introduced SnapReduce 2.0, which enables a Hadooplex to translate SnapLogic pipelines into MapReduce jobs. A user builds a designated SnapReduce pipeline and specifies HDFS files as input and output. These pipelines are compiled to MapReduce jobs that execute on very large data sets living in HDFS. (Check out the demonstration in our recent cloud and big data analytics webinar.)

Finally, as we announced last week as part of Cloudera’s real-time streaming announcement, we’ve begun work on our support for Spark as a target big data platform. A Sparkplex will be able to utilize SnapLogic’s extensive connectivity to bring data into and out of Spark RDDs (Resilient Distributed Datasets). In addition, similar to SnapReduce, we will allow users to compile SnapLogic pipelines into Spark code so the pipelines can run as Spark jobs. We will support both streaming and batch Spark jobs. By including Spark in our data platform support, we will give our customers a comprehensive set of options for pipeline execution.
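
As a rough illustration of the kind of work a Sparkplex would hand off to Spark, here is a hedged PySpark sketch (generic Spark code with made-up documents, not SnapLogic’s actual Spark Snap) that loads JSON-style documents into an RDD and runs a simple aggregation on them:

# Hypothetical sketch: documents delivered from connected sources become records
# in a Spark RDD, where filters and aggregations run in parallel.
from pyspark import SparkContext

sc = SparkContext(appName="pipeline-sketch")

docs = [
    {"orderId": "1234", "city": "San Mateo", "price": 598.00},
    {"orderId": "1235", "city": "San Mateo", "price": 599.00},
]

rdd = sc.parallelize(docs)                                 # documents -> RDD
total_by_city = (rdd.map(lambda d: (d["city"], d["price"]))
                    .reduceByKey(lambda a, b: a + b)       # aggregate analogue
                    .collect())
print(total_by_city)

sc.stop()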

Choosing the right big data platform will depend on many factors: data size, latency requirements, connectivity and pipeline type (streaming versus accumulating). Here are some guidelines for choosing a particular big data integration platform:

Cloudplex

  • Cloud-to-cloud data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Groundplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Hadooplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines can operate on arbitrary data sizes via MapReduce

Sparkplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Spark connectivity to all SnapLogic accounts
  • Streaming unlimited documents
  • Accumulating pipelines can operate on data sizes that can fit in Spark cluster memory

Note that recent work in the Spark community has increased support for out-of-core computations, such as sorting. This means that accumulating pipelines that are currently suitable only for MapReduce execution may be supported in Spark as its out-of-core support becomes more general. The Hadooplex and Sparkplex also add reliable execution benefits, so that long-running pipelines are guaranteed to complete.

At SnapLogic, our goal is to allow customers to create and execute arbitrary data flow pipelines on the most appropriate data platform. In addition, we provide a simple and consistent graphical UI for developing pipelines, which can then execute on any supported platform. Our platform-agnostic approach decouples the data processing specification from data processing execution. As your data volume increases or latency requirements change, the same pipeline can execute on larger data and at a faster rate just by changing the target data platform. Ultimately, SnapLogic allows you to adapt to your data requirements and doesn’t lock you into a specific big data platform.

Last week we wrote about the big news at Dreamforce 2014: Cloud Analytics in the Spotlight: Riding the Wave at #DF14. SnapLogic also hosted a webinar with David Glueck from Bonobos as our featured speaker: BI in the Sky: The New Rules of Cloud Analytics. David is the founder of the data science and engineering team at Bonobos, the largest e-commerce-born apparel brand in the US. A seasoned analytics guru, David held business intelligence roles at Groupon, Netflix, HP, Knightsbridge Consulting, and Cisco before joining Bonobos.

He summarized the three primary reasons Bonobos was an early cloud analytics adopter as speed, flexibility and scale. With the benefit of being born in the cloud and remaining cloud-first from an IT infrastructure perspective, Bonobos relies on the SnapLogic platform to pull data from multiple cloud services, CSVs, APIs, and databases and push it into multiple cloud analytics tools, including GoodData, Amazon Redshift, Tableau, Geckoboard, and Python.

Early in the discussion David stated: “I would rather work with the business than with the hardware.” His focus on insights, alignment and business outcomes instead of tools and technology really came through in his presentation. Just take a look at his advice if cloud-based business intelligence is something your organization is considering:

  • Pick the high-value use case first
  • Know what you need to get an A in
  • Know what you want to accomplish and why
  • Think like an investor

I’ve embedded the presentation below. You can watch the entire webinar with interactive Q&A here. And if you want to learn more about how SnapLogic powers cloud (AWS Redshift, Salesforce Analytics Cloud, etc.) and big data (Cloudera, Hortonworks) analytics, be sure to check out our Analytics Solutions page.

In previous parts of this SnapLogic tips and tricks series, we demonstrated how the XML Generator Snap generates XML based on an XSD, how to map to its JSON schema from upstream Snaps, and how to write the generated content to a file.

In this final part of the series, we will cover how the XML Generator Snap creates one serialized XML string for every input document.

Example 4: POSTing the Generated Content
In this last example, we will POST the generated content to a REST endpoint using the REST POST Snap.
In the screenshot below, the REST POST Snap has its entity set to $xml. It will take the XML content generated by the upstream XML Generator Snap and POST it as the request body to the endpoint.
You may also want to set the Content-Type and Accept headers as shown below.

xml-gen-6

The POST will be executed for every document on the input view. Since there are two documents in total, two POST requests will be executed.
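
For reference, here is a hedged Python sketch of the equivalent HTTP call made for each document (the endpoint URL and header values are placeholders; $xml is the field produced by the upstream XML Generator Snap):

# Sketch of what the REST POST Snap does for each input document: take the
# serialized XML from the document's xml field and POST it as the request body.
import requests

ENDPOINT = "https://example.com/orders"   # placeholder REST endpoint

def post_document(doc):
    headers = {
        "Content-Type": "application/xml",  # header values assumed for illustration
        "Accept": "application/xml",
    }
    response = requests.post(ENDPOINT, data=doc["xml"], headers=headers)
    response.raise_for_status()
    return response

# Two input documents -> two POST requests:
# for doc in input_documents:
#     post_document(doc)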

Series Summary
In summary, the XML Generator Snap enables you to generate XML data, either directly in the Snap using the XML template or dynamically by using data from the input view. It lets you generate the XML by providing an XSD, and it can validate the created XML against the XSD at runtime.


In part two of this series, we covered how to map to the JSON schema upstream. In this post, we will cover how to validate the generated XML against the XSD.

Example 3: Writing the Generated Content to File
Sometimes one wants to write the generated XML to a file. For that use case we provide a DocumentToBinary Snap, which can take the content and convert it to a binary data object that can then be written to a file, e.g., using a File Writer Snap.

xml-gen-5

Above, we map the XML to the content field of the DocumentToBinary Snap and set the Encode or Decode option on the DocumentToBinary Snap to NONE.

This then outputs one binary document for each order, which we can write to a directory. Be careful here: since you would potentially be writing two files to the same destination, you’d want to use the append option (which will be supported soon for SnapLogic’s file system), or you can use an expression such as Date.now() in the file name to write a separate file for each incoming binary data object.
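
As a rough analogue of that file-naming trick, here is a hedged Python sketch (not the File Writer Snap itself) that writes each binary document to its own uniquely named file so the two orders do not collide:

# Sketch: write each binary document to a unique, timestamp-based file name,
# analogous to using Date.now() in the File Writer's file name expression.
import time

def write_binary_documents(binary_docs, directory="."):
    for data in binary_docs:
        filename = f"{directory}/order-{time.time_ns()}.xml"  # unique per document
        with open(filename, "wb") as f:
            f.write(data)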

In our final part of this series, we will demonstrate how the XML Generator Snap creates one serialized XML string for every input document.


In the first part of this series, we explained how to use the XML Generator Snap to generate XML based on an XSD. In this post, we will cover how to map to the JSON schema upstream.

Example 2: Mapping to XML Generator via XSD
Let’s use a JSON Generator to provide the input order data, as defined below:

[
  {
    "items": [
      {
        "title": "iphone6",
        "quantity": 1,
        "price": 598.00
      },
      {
        "title": "note 4",
        "quantity": 1,
        "price": 599.00
      }
    ],
    "address": "some address",
    "city": "San Mateo",
    "orderId": "1234",
    "name": "Mr. Undecided"
  },
  {
    "items": [
      {
        "title": "iphone6",
        "quantity": 1,
        "price": 598.00
      },
      {
        "title": "note 4",
        "quantity": 1,
        "price": 599.00
      },
      {
        "title": "lumina",
        "quantity": 1,
        "price": 0.99
      }
    ],
    "address": "some address",
    "city": "San Mateo",
    "orderId": "1234",
    "name": "Mr. Even more Undecided"
  }
]

We then map the data using the Mapper Snap, which has access to the XSD of the downstream XML Generator Snap of the previous example (now with an added input view).

xml-gen-3

Here we map the items to the item list on the target. Further, we map the city, address, country and name to the shipTo object on the target, and finally we map the name against orderperson and the orderId against @orderId on the target. The @ indicates that we are mapping against an XML attribute.
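
To make the attribute-versus-element distinction concrete, here is a hedged Python sketch (illustrative only, not what the Mapper or XML Generator Snap actually executes) that builds the corresponding XML shape for a trimmed version of the first order, with orderid as an XML attribute and the remaining mappings as child elements:

# Sketch of the element/attribute distinction: orderid becomes an XML attribute
# on <shiporder>, while orderperson, shipto and item fields become child elements.
import xml.etree.ElementTree as ET

order = {"orderId": "1234", "name": "Mr. Undecided",
         "address": "some address", "city": "San Mateo",
         "items": [{"title": "iphone6", "quantity": 1, "price": 598.00}]}

shiporder = ET.Element("shiporder", orderid=order["orderId"])  # @orderId -> attribute
ET.SubElement(shiporder, "orderperson").text = order["name"]   # name -> orderperson element
shipto = ET.SubElement(shiporder, "shipto")
for field in ("name", "address", "city"):
    ET.SubElement(shipto, field).text = order[field]
for it in order["items"]:
    item = ET.SubElement(shiporder, "item")
    ET.SubElement(item, "title").text = it["title"]
    ET.SubElement(item, "quantity").text = str(it["quantity"])
    ET.SubElement(item, "price").text = str(it["price"])

print(ET.tostring(shiporder, encoding="unicode"))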

Hint: the Mapper Snap was enhanced in the Fall 2014 release to allow viewing the input and output data while doing the mappings (at the bottom of the Snap, expanded with the arrow in the middle).

Let’s look at the output of the XML Generator Snap:

xml-gen-4

Here we see that each incoming order document was translated into an XML string. We include the original data from the input view in case it is needed further downstream.
The XML Generator Snap can also validate the generated content if needed, using the “Validate XML” property.

In our next post in this series, we will demonstrate how the XML Generator Snap validates the generated XML against the XSD.


The XML Generator Snap was introduced in the Summer 2014 release. In the Fall release, it was enhanced with the addition of XML generation based on a provided XSD and the suggestion of the JSON schema (based on the XSD) to the upstream Snap. The XML Generator Snap is similar to the XML Formatter Snap, which formats incoming documents into XML; however, the XML Generator Snap allows you to map to the XML content, enabling more specific XML generation. In this four-part series, we will explain how the XML Generator Snap generates XML based on an XSD, how to map data to it from upstream Snaps, how to write the generated content to a file, and how to POST it to a REST endpoint.

Example 1: XML Generation via XSD
For this first example, I created a simple pipeline to generate order data XML directly with the XML Generator Snap.

xml-gen-1

We provide the sample XSD (originating from: http://www.w3schools.com/schema/schema_example.asp) defined as:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

We then suggest the XML root element, which returns {}shiporder.
Finally, we click on Edit XML, which will automatically trigger the XML template generation based on the XSD, as seen below.

xml-gen-2

Now we could replace the variables with our own values to generate the XML on the output view or move on to the next example.

Note: Executing the Snap above will create an xml attribute ($xml) on the output view’s documents, which provides the serialized XML content as a string.

In part two of this series, you will see how to use a JSON Generator to map to the XML Generator XSD.
