Deep Dive into SnapLogic Winter 2017 Snaps Release

By Pavan Venkatesh

Data streams with Confluent and migration to Hadoop: In my previous blog post, I explained how future data movement trends will look. In this post, I’ll dig into some of the exciting things we announced as part of the Winter 2017 (4.8) Snaps release, and how they address those trends for customers who want to move data from different systems to the cloud or migrate to Hadoop.

Major highlights in 2017 Winter release (4.8) include:

  • Support for Confluent Kafka – a distributed messaging system for streaming data
  • Teradata to Hadoop – a quick and easy way to migrate data
  • Enhancements to the Teradata Snap Pack: on the TPT front, customers can quickly load/update/delete data in Teradata
  • The Redshift Multi-Execute Snap – allows multiple statements to be executed sequentially, so customers can maintain business logic
  • Enhancements to the MongoDB Snap Pack (Delete and Update) and the DynamoDB Snap Pack (Delete and Delete Item)
  • Workday Read output enhancements – making it easier for downstream systems to consume
  • NetSuite Snap Pack improvements – users can now submit asynchronous operations
  • Security feature enhancements – including SSL for the MongoDB Snap Pack and invalidating database connection pools when account properties are modified
  • Major performance improvement when writing to an S3 bucket using the S3 File Writer – users can now configure a buffer size in the Snap so that larger blocks are sent to S3 quickly (a sketch of this idea follows the list)
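
To make the buffer-size idea concrete, here is a minimal Java sketch of the same pattern using the AWS SDK’s TransferManager, where a larger minimum part size means bigger blocks are sent to S3 per request. This illustrates the concept only, not SnapLogic’s implementation; the bucket name, key, file, and 32 MB value are all hypothetical.

    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;
    import java.io.File;

    public class BufferedS3Write {
        public static void main(String[] args) throws Exception {
            // A larger part size sends fewer, bigger blocks to S3 per request --
            // the same trade-off as the Snap's configurable buffer size.
            TransferManager tm = TransferManagerBuilder.standard()
                    .withS3Client(AmazonS3ClientBuilder.defaultClient())
                    .withMinimumUploadPartSize(32L * 1024 * 1024)  // hypothetical 32 MB buffer
                    .build();

            Upload upload = tm.upload("my-bucket", "exports/leads.csv", new File("leads.csv"));
            upload.waitForCompletion();  // blocks until the multipart upload finishes
            tm.shutdownNow();
        }
    }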

Confluent Kafka Snap Pack

Kafka is a distributed messaging system based on a publish/subscribe model, offering high throughput and scalability. It is mainly used to ingest data from multiple sources and deliver it to multiple downstream systems. Use cases include website activity tracking, fraud analytics, log aggregation, sales analytics, and others. Confluent is the company that provides the enterprise capability and offering for open-source Kafka.

Here at SnapLogic, we have built Kafka Producer and Consumer Snaps as part of the Confluent Snap Pack. A quick dive into Kafka’s architecture and how it works is a good segue before getting into the Snap Pack and pipeline details.

[Figure: Kafka cluster architecture]

Kafka consists of one or more Producers, which produce messages from one or more upstream systems, and one or more Consumers, which consume those messages as part of downstream systems. A Kafka cluster is made up of one or more servers called Brokers. Messages (a key and value, or just a value) are fed into a higher-level abstraction called a Topic. Each Topic can hold messages from different Producers, and users can define new Topics for new categories of messages. Producers write messages to Topics, and Consumers consume from one or more Topics. Topics in turn are partitioned, replicated, and persisted across Brokers. Messages in a Topic are ordered within a partition, and each message has a sequential ID number called an offset. ZooKeeper usually maintains these offsets; Confluent refers to it as the coordination kernel.
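
To make these concepts concrete, here is a minimal Producer sketch using the standard Apache Kafka Java client (this illustrates Kafka itself, not the Snap; the broker address, Topic name, and message contents are hypothetical):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class LeadProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // Broker to connect to
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // A message is a key/value pair (or just a value) written to a Topic.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("sales-leads", "lead-42", "Acme Corp, warm lead");
            RecordMetadata meta = producer.send(record).get();
            // The Broker assigns each message a sequential offset within its partition.
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            producer.close();
        }
    }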

Kafka also allows multiple Consumers to be configured as a Consumer group when consuming from a Topic.

With over 400 Snaps supporting various on-premises systems (relational databases, files, NoSQL databases, and others) and cloud products (NetSuite, Salesforce, Workday, Redshift, Anaplan, and others), the SnapLogic Elastic Integration Cloud combined with the Confluent Kafka Snap Pack offers a powerful way to move data between systems in a fast, streaming manner, letting customers realize benefits and generate business outcomes quickly.

With respect to the Confluent Kafka Snap Pack, we support Confluent version 3.0.1 (Kafka v0.9). These Snaps abstract away the complexities, so users only have to provide configuration details to build a pipeline that moves data easily. One thing to note: when multiple Consumer Snaps in a pipeline are configured with the same consumer group, each Consumer Snap is assigned a different subset of the partitions in the Topic.
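
A minimal sketch of that behavior with the plain Kafka Java client (again illustrative, not the Snap’s internals; the broker, group, and Topic names are hypothetical): start two copies of this program with the same group.id, and each will receive messages only from its assigned subset of the Topic’s partitions.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LeadConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "lead-loaders");  // same group => partitions are split among members
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("sales-leads"));
            while (true) {
                // poll() returns messages only from this member's assigned partitions
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> rec : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            rec.partition(), rec.offset(), rec.value());
                }
            }
        }
    }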

[Figure: Confluent Kafka Producer Snap settings]

[Figure: Confluent Kafka Consumer Snap settings]

[Figure: Example Confluent Kafka pipeline]

In the above example, I built a pipeline where sales leads (messages) stored in local files and MySQL are sent to a Topic in Confluent Kafka via Confluent Kafka Producer Snaps. The downstream system, Redshift, consumes these messages from that Topic via the Confluent Kafka Consumer Snap, which bulk loads them into Redshift for historical or auditing needs. The messages are also sent to Tableau, as another Consumer, to run analytics on how many leads were generated this year, so customers can compare against last year.

Easy migrations from Teradata to Hadoop

There has been a major shift in which customers are moving from expensive Teradata solutions to Hadoop or other data warehouses. Until now, there has not been an easy way to transfer large amounts of data from Teradata to Hadoop. With this release we have developed a Teradata Export to HDFS Snap with two goals in mind: 1) ease of use and 2) high performance. This Snap uses the Teradata Connector for Hadoop (TDCH v1.5.1). Customers just have to download this connector from the Teradata website, in addition to the regular JDBC JARs; no installation is required on either the Teradata or Hadoop nodes.

TDCH utilizes MapReduce (MR) as its execution engine: queries get submitted to the framework, and the distributed processes launched by MapReduce make JDBC connections to the Teradata database. The data fetched is loaded directly into the defined HDFS location. The degree of parallelism for these TDCH jobs is defined by the number of mappers (a Snap configuration) used by the MapReduce job; the number of mappers also determines the number of files created in the HDFS location.
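
Conceptually, each mapper performs a read along these lines (a simplified, hypothetical Java sketch: TDCH generates and splits the query itself, and the host, credentials, table, and split predicate below are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MapperStyleRead {
        public static void main(String[] args) throws Exception {
            // Each MapReduce task opens its own JDBC connection to Teradata
            // and fetches only its assigned slice of the source table.
            Connection conn = DriverManager.getConnection(
                    "jdbc:teradata://td-host/DATABASE=sales", "user", "password");
            Statement stmt = conn.createStatement();
            // TDCH derives per-mapper splits (for example, by hash or value range);
            // this WHERE clause merely stands in for one such split.
            ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM leads WHERE region_id MOD 4 = 0");
            while (rs.next()) {
                // ... write the row to this mapper's output file in the HDFS target directory
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }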

The Snap account details, along with a sample query that extracts data from Teradata and loads it into HDFS, are shown below.

[Figure: Snap account settings]

[Figure: Teradata Export to HDFS Snap settings]

The resulting pipeline is as follows:

[Figure: Teradata Export to HDFS pipeline]

As you can see above, just one Snap is needed to export data from Teradata and load it into HDFS. Customers can later use the HDFS Reader Snap to read the exported files.

The Winter 2017 release equips customers with lots of benefits, from data streams and easy migrations to enhanced security functionality and better performance. More information on the SnapLogic Winter 2017 (4.8) release can be found in the release notes.

Pavan Venkatesh is Senior Product Manager at SnapLogic. Follow him on Twitter @pavankv.

SnapLogic Introduces Intelligent Connectors for Microsoft Azure Data Lake Store

SnapLogic announced the availability of new pre-built intelligent connectors – called Snaps – for Microsoft Azure Data Lake Store. The new Snaps provide fast, self-service data ingestion and transformation from virtually any source – whether on-premises, in the cloud or in hybrid environments – to Microsoft’s highly-scalable, cloud-based repository for big data analytics workloads. This latest integration between SnapLogic and Microsoft Azure helps enterprise customers gain new insights and unlock business value from their cloud-based big data initiatives.


Learning about the Spark Script Snap

SnapLogic provides a big data integration platform as a service (iPaaS) that lets business customers process data in a simple, intuitive, and powerful way. SnapLogic provides a number of different modules called Snaps. An individual Snap provides a convenient way to get, manipulate, or output data, and each Snap corresponds to a specific data operation. All customers need to do is drag the corresponding Snaps together and configure them, which creates a data pipeline. Customers execute pipelines to handle specific data integration flows.

Figure 1 – SnapLogic Pipeline Example


SnapLogic Summer 2016 Release Now Available

Another release is in the books – today we announced the Summer 2016 SnapLogic platform update, along with several additions and improvements to our Snap library.  The release brings additions for big data integration, self-service integration, and enterprise governance and control.

As our VP Engineering Vaikom Krishnan put it:


“SnapLogic continues to break down the barriers between data and application integration in the enterprise with a converged platform that is built for self-service. The Summer 2016 release further enhances our Snap library and resources for Snap developers to help support our vision of ‘anything, anytime, anywhere’ integration.”

Highlights of this “Snappy” release include:

  • New Snaps for Apache Hive and Teradata
  • Major updates to Snaps for Anaplan and Tableau
  • Enhancements to the Mapper Snap that make it faster and simpler to search, filter and map the entries in a complex schema tree
  • User-defined pipeline parameters can now be logged and retained with runtime history in order for administrators to audit API usage and quickly debug pipeline performance issues
  • A new, seamless way to auto-shard documents across all nodes in a SnapLogic data processing Snaplex, leveraging the power of all nodes and boosting data integration performance 
  • Users can now limit invocation of triggered tasks to one instance at a time for more granular control and to avoid overloading resources.

We’re also excited for the launch of the new Snap developer site. It’s easy to use, mobile-friendly and full of practical guidance for our customers and partners building and maintaining their own Snaps.

For more information on the Summer 2016 release, including demo videos, see: https://www.snaplogic.com/summer2016

Ingestion, Transformation and Data Flow Snaps in Spark

In the previous post, we discussed what SnapLogic’s Hadooplex can offer with Spark. Now let’s continue the conversation by seeing what Snaps are available to build Spark Pipelines.

The suite of Snaps available in Spark mode enables us to ingest and land data in the Hadoop ecosystem and transform it by leveraging parallel operations such as map, filter, reduce, or join on Resilient Distributed Datasets (RDDs), fault-tolerant collections of elements that can be operated on in parallel.
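
For reference, here is what those parallel operations look like in Spark’s plain Java API (a generic sketch, independent of how Spark-mode Snaps generate their jobs; the HDFS path and the assumed CSV column layout are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddOpsSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("rdd-ops-sketch");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // An RDD is a fault-tolerant, partitioned collection processed in parallel.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/leads.csv");

            long usLeads = lines
                    .map(line -> line.split(","))              // transform each record
                    .filter(fields -> "US".equals(fields[2]))  // keep rows whose (assumed) third column is "US"
                    .count();                                  // action that triggers the distributed job

            System.out.println("US leads: " + usLeads);
            sc.close();
        }
    }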

There are various formats available for data storage in HDFS. These file formats support one or more compression formats that affect the size of data stored in the HDFS file system. The choice of file format and compression depends on factors such as the desired read or write performance for a specific use case and the desired level of compression for the stored data.

Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform

The SnapLogic Elastic Integration Platform connects your enterprise data, applications, and APIs by building drag-and-drop data pipelines. Each pipeline is made up of Snaps, intelligent connectors that users drag onto a canvas and “snap” together like puzzle pieces.

A SnapLogic pipeline being built and configured

These pipelines are executed on a Snaplex, an application that runs on a multitude of platforms: on a customer’s infrastructure, on the SnapLogic cloud, and most recently on Hadoop. A Snaplex that runs on Hadoop can execute pipelines natively in Spark.

The SnapLogic platform is known for its easy-to-use, self-service interface, made possible by our team of dedicated engineers (we’re hiring!). We work to apply the industry’s best practices so that our clients get the best possible end product — and testing is fundamental.

SnapLogic’s Latest Release: Spring 2016 has Sprung…

…and it’s looking Kafka-esque. So to speak.

Today SnapLogic announced our Spring 2016 platform and Snap release. Overall, we believe this release will help our customers focus on data insights, not data engineering. It takes a lot of the repetitive, time-consuming activities around data ingest-preparation-delivery and makes them reusable and simple. We also believe that this release will help our customers continue to stay abreast of the ever-changing big data technology ecosystem, and choose the right tools and frameworks for each job.