One of our goals at SnapLogic is to match data flow execution requirements with an appropriate execution platform. Different data platforms have different benefits. The goal of this post is to explain the nature of data flow pipelines and how to choose an appropriate data platform. In addition to categorizing pipelines, I will explain our currently supported execution targets and our planned support for Apache Spark.

First, some preliminaries. All data processed by SnapLogic pipelines is handled natively in an internal JSON format. We call this document-oriented processing. Even flat, record-oriented data is converted into JSON for internal processing. This lets us handle both flat and hierarchical data seamlessly. Pipelines are constructed from Snaps. Each Snap encapsulates specific application or technology functionality. The Snaps are connected together to carry out a data flow process. Pipelines are constructed with our visual Designer. Some Snaps provide connectivity, such as connecting to databases or cloud applications. Some Snaps allow for data transformation such as filtering out documents, adding or removing fields or modifying fields. We also have Snaps that perform more complex operations such as sort, join and aggregate.
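To make the idea concrete, here is a minimal sketch in plain Python (not SnapLogic code) of how both a flat record and a hierarchical record end up as JSON-style documents; the field names are invented for illustration.

# Minimal sketch of document-oriented processing (illustrative field names only).
import json

# A flat, record-oriented row is converted into a JSON-style document.
flat_row = "1234,Mr. Undecided,San Mateo"
fields = ["orderId", "name", "city"]
document = dict(zip(fields, flat_row.split(",")))

# Hierarchical data is already a natural fit for the same representation.
hierarchical_document = {
    "orderId": "1234",
    "name": "Mr. Undecided",
    "items": [{"title": "iphone6", "quantity": 1, "price": 598.00}],
}

print(json.dumps(document))
print(json.dumps(hierarchical_document))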

Given this setup, we can categorize pipelines into two types: streaming and accumulating. In a streaming pipeline, documents can flow independently; the processing of one document does not depend on another as they flow through the pipeline. Such streaming pipelines have low memory requirements because documents can exit the pipeline once they have reached the last Snap. In contrast, an accumulating pipeline requires that all documents from the input source be collected before result documents can be emitted from the pipeline. Pipelines with sort, join and aggregate are accumulating pipelines. In some cases, a pipeline can be partially accumulating. Such accumulating pipelines can have high memory requirements, depending on the number of documents coming in from an input source.
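As a rough illustration of the memory difference between the two pipeline types, here is an ordinary Python sketch (not Snap code); the document fields are invented.

# Streaming vs. accumulating, sketched in plain Python.
def streaming_filter(documents):
    # Streaming: each document is processed and emitted independently,
    # so memory use stays constant no matter how many documents flow through.
    for doc in documents:
        if doc["price"] > 100:
            yield doc

def accumulating_sort(documents):
    # Accumulating: every document must be collected before any result
    # can be emitted, so memory use grows with the size of the input.
    return sorted(documents, key=lambda doc: doc["price"])

docs = [{"orderId": i, "price": p} for i, p in enumerate([598.00, 0.99, 599.00])]
print(list(streaming_filter(docs)))
print(accumulating_sort(docs))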

Now let’s turn to execution platforms. SnapLogic has an internal data processing platform called a Snaplex. Think of a Snaplex as a collection of processing nodes or containers that can execute SnapLogic pipelines. We have a few flavors of Snaplexes:

  • A Cloudplex is a Snaplex that we host in the cloud, and it can autoscale as pipeline load increases.
  • A Groundplex is a fixed set of nodes that are installed on-premises or in a customer VPC. With a Groundplex, customers can do all of their data processing behind their firewall so that data does not leave their infrastructure.

We are also expanding our support for external data platforms. We have recently released our Hadooplex technology, which allows SnapLogic customers to use Hadoop as an execution target for SnapLogic pipelines. A Hadooplex leverages YARN to schedule Snaplex containers on Hadoop nodes in order to execute pipelines. In this way, we can autoscale inside a Hadoop cluster. Recently we introduced SnapReduce 2.0, which enables a Hadooplex to translate SnapLogic pipelines into MapReduce jobs. A user builds a designated SnapReduce pipeline and specifies HDFS files as input and output. These pipelines are compiled to MapReduce jobs to execute on very large data sets that live in HDFS. (Check out the demonstration in our recent cloud and big data analytics webinar.)
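For readers unfamiliar with the MapReduce model that such compiled pipelines target, here is a generic, hypothetical sketch in Python (Hadoop Streaming style, not actual SnapReduce output); the order data and the aggregation by city are invented for illustration.

# Generic MapReduce sketch: aggregate order totals by city.
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (key, value) pair for each input record.
    for line in lines:
        city, price = line.strip().split(",")
        yield city, float(price)

def reducer(pairs):
    # Reduce phase: values arrive grouped by key; aggregate each group.
    for city, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield city, sum(price for _, price in group)

sample = ["San Mateo,598.00", "San Mateo,599.00", "San Francisco,0.99"]
for city, total in reducer(mapper(sample)):
    print(city, total)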

Finally, as we announced last week as part of Cloudera’s real-time streaming announcement, we’ve begun work on our support for Spark as a target big data platform. A Sparkplex will be able to utilize SnapLogic’s extensive connectivity to bring data into and out of Spark RDDs (Resilient Distributed Datasets). In addition, similar to SnapReduce, we will allow users to compile SnapLogic pipelines into Spark code so the pipelines can run as Spark jobs. We will support both streaming and batch Spark jobs. By including Spark in our data platform support, we will give our customers a comprehensive set of options for pipeline execution.
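To give a feel for the target model, here is a hedged PySpark sketch of the kind of Spark job a compiled pipeline could correspond to; the HDFS paths, record layout, and operations are assumptions for illustration, not SnapLogic-generated code.

# Hedged PySpark sketch; paths and fields are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="pipeline-sketch")

orders = sc.textFile("hdfs:///data/orders.csv")              # load data into an RDD
parsed = orders.map(lambda line: line.split(","))             # per-record transformation
expensive = parsed.filter(lambda rec: float(rec[2]) > 100.0)  # streaming-style filter
totals = (expensive
          .map(lambda rec: (rec[1], float(rec[2])))
          .reduceByKey(lambda a, b: a + b))                   # accumulating-style aggregate
totals.saveAsTextFile("hdfs:///data/order_totals")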

Choosing the right big data platform will depend on many factors: data size, latency requirements, connectivity and pipeline type (streaming versus accumulating). Here are some guidelines for choosing a particular big data integration platform:

Cloudplex

  • Cloud-to-cloud data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Groundplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines in which accumulated data can fit into node memory

Hadooplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Streaming unlimited documents
  • Accumulating pipelines can operate on arbitrary data sizes via MapReduce

Sparkplex

  • Ground-to-ground, ground-to-cloud and cloud-to-ground data flow
  • Allow for Spark connectivity to all SnapLogic accounts
  • Streaming unlimited documents
  • Accumulating pipelines can operate on data sizes that can fit in Spark cluster memory

Note that recent work in the Spark community has increased support for out-of-core computations, such as sorting. This means that accumulating pipelines that are currently only suitable for MapReduce execution may be supported in Spark as out-of-core Spark support becomes more general. The Hadooplex and Sparkplex also provide reliable execution, so that long-running pipelines are guaranteed to complete.

At SnapLogic, our goal is to allow customers to create and execute arbitrary data flow pipelines on the most appropriate data platform. In addition, we provide a simple and consistent graphical UI for developing pipelines which can then execute on any supported platform. Our platform agnostic approach decouples data processing specification from data processing execution. As your data volume increases or latency requirements change, the same pipeline can execute on larger data and at a faster rate just by changing the target data platform. Ultimately, SnapLogic allows you to adapt to your data requirements and doesn’t lock you into a specific big data platform.

Last week we wrote about the big news at Dreamforce 2014: Cloud Analytics in the Spotlight: Riding the Wave at #DF14. SnapLogic also hosted a webinar with David Glueck from Bonobos as our featured speaker: BI in the Sky: The New Rules of Cloud Analytics. David is the founder of the data science and engineering team at Bonobos, the largest e-commerce-born apparel brand in the US. A seasoned analytics guru, David held business intelligence roles at Groupon, Netflix, HP, Knightsbridge Consulting, and Cisco before joining Bonobos.

He summarized the three primary reasons Bonobos was an early cloud analytics adopter as speed, flexibility and scale. With the benefit of being born in the cloud and remaining cloud-first from an IT infrastructure perspective, Bonobos relies upon the SnapLogic platform to pull data from multiple cloud services, CSVs, APIs, and databases and push it into multiple cloud analytics tools, including GoodData, Amazon Redshift, Tableau, Gecko Board, and Python.

Early in the discussion David stated: “I would rather work with the business than with the hardware.” His focus on insights, alignment and business outcomes instead of tools and technology really came through in his presentation. Just take a look at his advice if cloud-based business intelligence is something your organization is considering:

  • Pick the high value use case first
  • Know what you need to get an A in
  • Know what you want to accomplish and why
  • Think like an investor

I’ve embedded the presentation below. You can watch the entire webinar with interactive Q&A here. And if you want to learn more about how SnapLogic powers cloud (AWS Redshift, Salesforce Analytics Cloud, etc.) and big data (Cloudera, Hortonworks) analytics, be sure to check out our Analytics Solutions page.

In previous parts of this SnapLogic tips and tricks series, we have demonstrated how the XML Generator Snap generates XML based on an XSD, how to map to the suggested JSON schema upstream, and how to validate the generated XML against the XSD.

In this final part of the series, we will cover how the XML Generator Snap creates one serialized XML string for every input document.

Example 4: POSTing the Generated Content
In this last example, we will be POSTing the generated content to a REST endpoint using the REST POST Snap.
In the screenshot below, we are using the POST Snap with the Entity set to $xml. This takes the XML content generated by the upstream XML Generator Snap and POSTs it as the request body to the endpoint.
You may also want to set the Content-Type and Accept headers, as shown below.

[Screenshot: xml-gen-6]

The POST will be executed for every document on the input view. There are two documents in total, so two POST requests will be executed.
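Outside of SnapLogic, the equivalent behavior looks roughly like the following Python sketch using the requests library; the endpoint URL and the sample documents are placeholders.

# Rough equivalent of what the REST POST Snap does for each input document.
import requests

documents = [
    {"xml": "<shiporder orderid=\"1234\">...</shiporder>"},
    {"xml": "<shiporder orderid=\"1235\">...</shiporder>"},
]

headers = {"Content-Type": "application/xml", "Accept": "application/xml"}

for doc in documents:
    # One POST request per input document, with the generated XML as the body.
    response = requests.post("https://example.com/orders", data=doc["xml"], headers=headers)
    print(response.status_code)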

Series Summary
In summary, the XML Generator Snap enables you to generate XML data, either directly in the Snap using the XML template or dynamically by using data from the input view. It lets you generate the XML by providing an XSD, and it can validate the created XML against that XSD at runtime.

Additional Resources:

In part two of this series, we covered how to map to the JSON schema upstream. In this post, we will cover how to validate the generated XML against the XSD.

Example 3: Writing the Generated Content to File
Sometimes one wants to write the generated XML to a file. For that use case we provide the DocumentToBinary Snap, which can take the content and convert it into a binary data object, which can then be written to a file, e.g. using a File Writer Snap.

[Screenshot: xml-gen-5]

Above, we map the XML to the content field of the DocumentToBinary Snap and set its Encode or Decode option to NONE.

This then outputs one binary document for each order, which we can write to a directory. Be careful here: you would either want to use the append option, since you could be writing two files to the same destination (append will be supported soon for SnapLogic’s file system), or use an expression such as Date.now() in the file name to write an individual file for each incoming binary data object, as sketched below.
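As a rough analogue of the Date.now() approach outside SnapLogic, here is a hedged Python sketch that writes each generated XML document to its own timestamped file; the directory and naming scheme are assumptions for illustration.

# Write each generated XML document to its own timestamped file.
import os
import time

xml_documents = [
    "<shiporder orderid=\"1234\">...</shiporder>",
    "<shiporder orderid=\"1235\">...</shiporder>",
]

os.makedirs("orders", exist_ok=True)
for xml in xml_documents:
    # A millisecond timestamp in the name avoids overwriting earlier files.
    filename = os.path.join("orders", "shiporder-%d.xml" % int(time.time() * 1000))
    with open(filename, "wb") as f:
        f.write(xml.encode("utf-8"))
    time.sleep(0.002)  # keep timestamps distinct in this toy example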

In our final part of this series, we will demonstrate how the XML Generator Snap creates one serialized XML string for every input document.

Additional Resources:

In the first part of this series, we explained how to use the XML Generator Snap to generate XML based on an XSD. In this post, we will cover how to map to the JSON schema upstream.

Example 2: Mapping to XML Generator via XSD
Let's use a JSON Generator to provide the input order data, as defined below:

[
  {
    "items": [
      {
        "title": "iphone6",
        "quantity": 1,
        "price": 598.00
      },
      {
        "title": "note 4",
        "quantity": 1,
        "price": 599.00
      }
    ],
    "address": "some address",
    "city": "San Mateo",
    "orderId": "1234",
    "name": "Mr. Undecided"
  },
  {
    "items": [
      {
        "title": "iphone6",
        "quantity": 1,
        "price": 598.00
      },
      {
        "title": "note 4",
        "quantity": 1,
        "price": 599.00
      },
      {
        "title": "lumina",
        "quantity": 1,
        "price": 0.99
      }
    ],
    "address": "some address",
    "city": "San Mateo",
    "orderId": "1234",
    "name": "Mr. Even more Undecided"
  }
]

We then map the data using the Mapper Snap, which has access to the XSD of the downstream XML Generator Snap of the previous example (now with an added input view).

[Screenshot: xml-gen-3]

Here we map the items to the item list on the target. Further, we map the city, address, country and name to the shipTo object on the target. Finally, we map the name to orderperson and the orderId to @orderId on the target; the @ indicates that we are mapping to an XML attribute.
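For reference, the mapped document handed to the XML Generator Snap would look roughly like the sketch below. This is illustrative only: the key names follow the XSD and the suggested schema, the @ prefix marks the attribute, and country is shown empty because the sample input does not contain one.

{
  "shiporder": {
    "@orderid": "1234",
    "orderperson": "Mr. Undecided",
    "shipto": {
      "name": "Mr. Undecided",
      "address": "some address",
      "city": "San Mateo",
      "country": ""
    },
    "item": [
      {"title": "iphone6", "quantity": 1, "price": 598.00},
      {"title": "note 4", "quantity": 1, "price": 599.00}
    ]
  }
}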

Hint: the Mapper Snap was enhanced in the Fall 2014 release to allow viewing the input and output data while doing the mappings (at the bottom of the Snap, expanded with the arrow in the middle).

Let's look at the output of the XML Generator Snap:

[Screenshot: xml-gen-4]

Here we see that each incoming order document was translated into an XML string. We include the original data from the input view, in case it is further needed downstream.
The XML Generator Snap can validate the generated content if needed using the “Validate XML” property.

In our next post in this series, we will demonstrate how the XML Generator Snap validates the generated XML against the XSD.

Other Resources:

The XML Generator Snap was introduced in the Summer 2014 release. In the Fall release, it was enhanced with the addition of XML generation based on a provided XSD and the suggestion of the JSON schema (based on that XSD) to the upstream Snap. The XML Generator Snap is similar to the XML Formatter Snap, which formats incoming documents into XML; however, this Snap allows you to map to the XML content, enabling more specific XML generation. In this four-part series, we explain how to use the XML Generator Snap.

Example 1: XML Generation via XSD
For this first example, I created a simple pipeline to generate order data XML directly with the XML Generator Snap.

[Screenshot: xml-gen-1]

We provide the sample XSD (originating from: http://www.w3schools.com/schema/schema_example.asp) defined as:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

We then suggest the XML root element, which returns {}shiporder.
Finally, we click on Edit XML, which automatically triggers the XML template generation based on the XSD, as seen below.

[Screenshot: xml-gen-2]

Now we could replace the variables with our own values to generate the XML on the output view or move on to the next example.

Note: The execution of the Snap above will create an XML attribute on the output view which provides the serialized XML content as a string.

In part two of this series, you will see how to use a JSON Generator to map to the XML Generator XSD.

Other Resources:

This week at Dreamforce 2014, Salesforce announced their new Analytics Cloud. They called the new service “Wave” and the headline copywriters jumped all over it. Here are a few of my favorites:

The announcement was brilliantly orchestrated by Salesforce CEO Marc Benioff, who a few weeks ago “let slip” via Twitter that there would be a new Analytics Cloud announcement at Dreamforce. The day before the press hit the wire, he then tweeted: “Good day to catch a Wave: Salesforce Analytics Cloud” with a link to the new app on the App Store. The coverage since has been tremendous.

Suddenly Cloud Analytics is Red Hot. Oracle made a cloud analytics announcement at OpenWorld. Tableau has introduced a cloud data visualization service. SAP has announced partnerships. Many Cloud Analytics pure-play vendors continue to get funded and grow (although the impact of the Salesforce announcement on analytics partners remains to be seen). And the Amazon Redshift cloud data warehouse is said to be the fastest growing AWS service.

But cloud analytics is not a new topic, particularly in the CRM market. When Oracle acquired Siebel (which had acquired a small company called nQuire in 2001), the analytics product line was seen by many to be the “crown jewel” of their Fusion strategy. This despite the fact that Tom Siebel famously claimed to “not have a thorough analysis of the pipeline here with us today” on a 2002 earnings call where he tried to explain the company’s big miss.

So why cloud analytics and why now? What are the benefits? What are the challenges? We’re going to be digging into these and other questions this week in a webinar with SnapLogic customer Bonobos. I hope you can attend. In the meantime, this article summarizes a great list of 10 drivers of Cloud Business Intelligence that comes from EMA Research:

  1. Lines of business are pushing their own BI agenda and pursuing their own solutions – often cloud-based.
  2. BI users are becoming more diverse and more demanding in their user experience, and cloud can provide a better (and more mobile) UX.
  3. User communities are maturing. Consumers of BI are a ‘new breed of knowledge worker,’ less timid about technology and more sophisticated.
  4. New technology, from in-memory to big data (NoSQL, Hadoop) is forcing a re-evaluation of existing BI infrastructure.
  5. The economics of cloud enable more companies to get involved with BI and allow them to widen the scope of BI projects economically.
  6. Companies are realizing the value of new types of data from new data feeds.
  7. On-premise data warehouses are under strain from these new requirements and data sources.
  8. Data warehouses are decentralizing – cloud is an ideal platform for decentralized BI architectures.
  9. Cloud can minimize the pain of traditional BI projects.
  10. The OPEX-based cloud pricing model (rather than up-front CAPEX) makes cloud BI projects easier to fund and pursue.

In 2007, I joined a cloud business intelligence software company called LucidEra (after my 1.0 attempt at getting Salesforce to develop an analytics product line). Here’s a presentation from 2007, which summarizes some of the benefits we saw on-demand business intelligence delivering back then. While LucidEra was in many ways ahead of its time, the benefits are as true today as they were then: self-service, ease of use, rapid time to value, lower setup costs, etc. But one of the primary challenges that customers and solution providers must still overcome is the need for robust data integration. Ideally the data integration technology is also cloud-based; it must be able to deliver disparate data to end-users in batch and real time; and it must be as easy to set up and easy to use as the cloud analytics solution itself.

But back to the Wave. When Salesforce jumps into a market, it brings great awareness to what’s possible, and the new Analytics Cloud will definitely encourage more and more IT organizations to re-think their approach to data warehousing and business intelligence. I’d like to congratulate my old friend Keith Bigelow and the entire Salesforce Analytics Cloud team on their launch. And I’m excited to say that SnapLogic has partnered with Salesforce to make the new Analytics Cloud possible for our customers. Here’s Keith introducing the primary Analytics Cloud data integration and consulting partners:

[Image: Salesforce Analytics Partners]

You can learn more about our new Snap and Snap Patterns here and see it in action on our Cloud Analytics webinar later this week, which will also demonstrate getting data into and out of Amazon Redshift and big data integration.

Welcome to the new era of Cloud Analytics, powered by next-generation Cloud Integration!