Learning about the Spark Script Snap

SnapLogic provides a big data integration platform as a service (iPaaS) that lets business customers process data in a simple, intuitive, and powerful way. SnapLogic provides a number of modules called Snaps. An individual Snap offers a convenient way to get, manipulate, or output data, and each Snap corresponds to a specific data operation. All the customer needs to do is drag the appropriate Snaps together and configure them, which creates a data pipeline. Customers execute pipelines to handle specific data integration flows.

Figure 1 – SnapLogic Pipeline Example

Continue reading “Learning about the Spark Script Snap”

Ingestion, Transformation and Data Flow Snaps in Spark

In the previous post, we discussed what SnapLogic’s Hadooplex can offer with Spark. Now let’s continue the conversation by seeing what Snaps are available to build Spark Pipelines.

The suite of Snaps available in Spark mode enables us to ingest and land data in the Hadoop ecosystem and to transform that data by leveraging parallel operations such as map, filter, reduce, or join on a Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
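As a rough illustration of those operations, the sketch below uses the Spark Scala API directly to ingest delimited text from HDFS, transform it with map, filter, and reduceByKey, join it with a second dataset, and land the result back in HDFS. The paths and field layout are hypothetical examples, not part of any particular SnapLogic pipeline.

```scala
// Minimal sketch of the RDD operations mentioned above (map, filter,
// reduceByKey, join). Input paths and field layouts are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

    // Ingest: read lines of "userId,amount" from HDFS into an RDD.
    val orders = sc.textFile("hdfs:///data/orders.csv")

    // Transform: parse each line, drop malformed rows, sum amounts per user.
    val totals = orders
      .map(_.split(","))               // map: split CSV fields
      .filter(_.length == 2)           // filter: keep well-formed rows
      .map(f => (f(0), f(1).toDouble)) // map: (userId, amount) pairs
      .reduceByKey(_ + _)              // reduce: total amount per user

    // Join: attach a user name from a second dataset keyed by userId.
    val users = sc.textFile("hdfs:///data/users.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1)))          // (userId, name)
    val named = totals.join(users)     // (userId, (total, name))

    // Land: write the results back to HDFS.
    named.saveAsTextFile("hdfs:///data/order_totals")
    sc.stop()
  }
}
```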

There are various formats available for data storage in HDFS. These file formats support one or more compression codecs, which affect the size of the data stored in the HDFS file system. The choice of file format and compression depends on factors such as the read or write performance required for a specific use case and the desired level of compression for the stored data. Continue reading “Ingestion, Transformation and Data Flow Snaps in Spark”
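To make that trade-off concrete, here is a minimal sketch (assuming Spark 2.x) that lands the same dataset in HDFS twice: once as gzip-compressed text and once as Snappy-compressed Parquet. The source dataset and output paths are hypothetical examples.

```scala
// Sketch comparing two storage choices when landing data in HDFS.
// Paths and the source dataset are hypothetical.
import org.apache.spark.sql.SparkSession

object FileFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("format-sketch").getOrCreate()

    val df = spark.read.json("hdfs:///data/events.json")

    // Row-oriented text output, gzip-compressed: typically smaller on disk,
    // but gzip files are not splittable, which limits read parallelism.
    df.write
      .option("compression", "gzip")
      .csv("hdfs:///data/events_csv_gzip")

    // Columnar Parquet with Snappy: usually a bit larger than gzip, but
    // splittable and faster to scan when queries touch only a few columns.
    df.write
      .option("compression", "snappy")
      .parquet("hdfs:///data/events_parquet_snappy")

    spark.stop()
  }
}
```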

Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform

The SnapLogic Elastic Integration Platform connects your enterprise data, applications, and APIs by building drag-and-drop data pipelines. Each pipeline is made up of Snaps, intelligent connectors that users drag onto a canvas and “snap” together like puzzle pieces.

A SnapLogic pipeline being built and configured

These pipelines are executed on a Snaplex, an application that runs on a multitude of platforms: on a customer’s infrastructure, on the SnapLogic cloud, and most recently on Hadoop. A Snaplex that runs on Hadoop can execute pipelines natively in Spark.

The SnapLogic data management platform is known for its easy-to-use, self-service interface, made possible by our team of dedicated engineers (we’re hiring!). We work to apply the industry’s best practices so that our clients get the best possible end product — and testing is fundamental. Continue reading “Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform”

Top Five iPaaS and Big Data Integration Posts in June

June was a record month on the SnapLogic blog in terms of views and posts – thanks to our readers and contributors! If you have suggested topics, feedback, or are interested in writing a post about hybrid cloud and big data integration, digital transformation, SnapLogic best practices, tips and tricks, or other related topics, please be sure to Contact Us or share your comments below. Continue reading “Top Five iPaaS and Big Data Integration Posts in June”

SnapLogic Travels to San Francisco for Spark Summit 2016

The SnapLogic Big Data Team at Spark Summit in San Francisco

The SnapLogic big data team was at the Spark Summit last week in San Francisco. Around 2,500 people attended this year’s event, which featured several high-profile speakers, including Matei Zaharia, the creator of Spark; Jeff Dean of Google; Andrew Ng of Baidu; and representatives from influential tech companies such as Amazon, Microsoft, and Intel.

Continue reading “SnapLogic Travels to San Francisco for Spark Summit 2016”

The Case for a Hybrid Batch and Streaming Architecture for Data Integration

Modern data integration requires both reliable batch and reliable streaming computation to support essential business processes. Traditionally, in the enterprise software space, batch ETL (Extract, Transform and Load) and streaming CEP (Complex Event Processing) were two completely different products with different means of formulating computations. Until recently, the open source big data space also addressed batch and streaming separately, with MapReduce for batch and Storm for streams. Now we are seeing more data processing engines that attempt to provide models for both batch and streaming, such as Apache Spark and Apache Flink. In a series of posts, I’ll explain the need for a unified programming model and an underlying hybrid data processing architecture that accommodates both batch and streaming computation for data integration. For data integration, however, this model must be at a level that abstracts away specific data processing engines. Continue reading “The Case for a Hybrid Batch and Streaming Architecture for Data Integration”
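As a small illustration of what a unified model can look like within a single engine, the sketch below (assuming Spark 2.x Structured Streaming) applies one DataFrame transformation to both a bounded batch source and an unbounded streaming source. The log paths and the level-count logic are hypothetical examples and are not tied to SnapLogic’s architecture or to Flink.

```scala
// Sketch of the "write the logic once, run it batch or streaming" idea.
// Paths and the level-count example are hypothetical.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object HybridBatchStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hybrid-sketch").getOrCreate()
    import spark.implicits._

    // Business logic written once, against the engine-level DataFrame API:
    // take the first token of each log line as its level and count per level.
    def levelCounts(lines: DataFrame): DataFrame =
      lines.select(split($"value", " ").getItem(0).as("level"))
           .groupBy("level")
           .count()

    // Batch: process an archived, bounded set of log files.
    val archived = spark.read.text("hdfs:///logs/archive")
    levelCounts(archived).show()

    // Streaming: apply the identical logic to files arriving continuously.
    val incoming = spark.readStream.text("hdfs:///logs/incoming")
    levelCounts(incoming).writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```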