In this video, learn about SnapLogic eXtreme’s updates, including reusability of PySpark and Spark pipelines, cost efficiency with on-demand and spot instances, and the ability to move data to and from Snowflake.
Hi. In this demo I will walk through updates to SnapLogic eXtreme that you’ll find in the 4.16 release. These enhancements will improve how you manage your big data resource costs, and increase productivity and performance with pipeline reusability capabilities.
Sam, the Integration Specialist, will easily perform a one-time configuration of the big data runtime environment that can be leveraged across multiple projects without any intervention from IT Ops. His role is to set up policies for optimal cost management when running big data projects.
Dave, the Ad-Hoc Integrator, is responsible for setting up capture routines for various sources and storing the gathered data in a data lake such as Amazon S3. His job is to conform and refine massive amounts of data and write it back to the data lake, where it can then be delivered to downstream applications.
This diagram shows how SnapLogic eXtreme fits in as part of our overall SnapLogic Intelligent Integration Platform.
Now, let’s take a look at how this works in practice and take the role of Sam, the Integration Specialist. Let’s go to the SnapLogic Manager view and configure the serverless big data runtime environment.
We’ve introduced this concept of hybrid instances where we have designed the systems to use an intelligent combination of on-demand and spot instances for optimal cost management.
Sam now has the ability to size node capacity for resource-intensive Spark pipeline runs so the cluster does not fail. He also has the ability to allow legacy pipelines to run on older eXtremeplex versions.
In the Advanced tab, he can see how hybrid instances are enabled for auto-scaling, and also pick the instance types for his master, core, and task nodes.
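To make the hybrid on-demand/spot idea concrete, here is a hypothetical sketch of the kind of Amazon EMR instance-fleet request such a configuration boils down to. SnapLogic eXtreme provisions the cluster for you behind the scenes; the instance types, counts, and the helper function below are illustrative assumptions, not the product’s actual internals.

```python
# Hypothetical sketch: master and core nodes on stable on-demand capacity,
# task nodes on cheaper spot capacity that can absorb interruptions.
# Instance types and counts are placeholders.

def hybrid_instance_fleets(master_type="m5.xlarge",
                           core_type="r5.2xlarge",
                           task_type="r5.2xlarge",
                           task_count=4):
    """Build an EMR InstanceFleets definition mixing on-demand and spot."""
    return [
        {"InstanceFleetType": "MASTER",
         "TargetOnDemandCapacity": 1,        # the master must not be reclaimed
         "InstanceTypeConfigs": [{"InstanceType": master_type}]},
        {"InstanceFleetType": "CORE",
         "TargetOnDemandCapacity": 2,        # core nodes hold HDFS data
         "InstanceTypeConfigs": [{"InstanceType": core_type}]},
        {"InstanceFleetType": "TASK",
         "TargetSpotCapacity": task_count,   # auto-scaled, interruption-tolerant work
         "InstanceTypeConfigs": [{"InstanceType": task_type,
                                  "BidPriceAsPercentageOfOnDemandPrice": 100}]},
    ]

# With boto3, this list would be passed as:
#   boto3.client("emr").run_job_flow(..., Instances={"InstanceFleets": hybrid_instance_fleets()})
```

Keeping the master and core fleets on on-demand capacity protects the cluster from failing mid-run, while spot task nodes carry the bulk of the compute at a lower price.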
Now, let’s switch over to Dave, the Ad-Hoc Integrator. He wants to leverage existing PySpark and Java scripts and upload them as eXtreme pipelines.
Here is an example of the PySpark script that he’s uploaded, and here’s an example of a JAR Submit. Let’s compare those to the original Java source code from GitHub.
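For reference, here is a minimal, hypothetical example of the kind of standalone PySpark script that could be uploaded and run as an eXtreme pipeline. The bucket paths, field names, and the `refine` helper are placeholders of my own, not the script shown in the demo.

```python
# A small conform-and-refine job: read raw JSON from the data lake,
# normalize each record, and write the result back as Parquet.

def refine(record):
    """Pure per-record transform: trim identifiers and normalize amounts."""
    return {
        "id": (record.get("id") or "").strip(),
        "amount": float(record.get("amount") or 0.0),
    }

def main():
    # Imported here because pyspark is provided by the cluster runtime.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conform-demo").getOrCreate()
    raw = spark.read.json("s3://my-data-lake/raw/events/")        # placeholder path
    refined = raw.rdd.map(lambda row: refine(row.asDict())).toDF()
    refined.write.mode("overwrite").parquet("s3://my-data-lake/conformed/events/")
    spark.stop()

# On the cluster, the script's entry point simply calls main().
```

Because the transform logic lives in a plain Python function, the same script can be reused unchanged across pipelines and tested locally without a Spark cluster.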
Previously, in order to write to a Snowflake system, Dave would write a Spark mode pipeline to take the data from AWS S3, conform it, refine it and write it back to S3. He would then write a separate standard mode pipeline to take the conformed data from S3 and then write to Snowflake.
Now, he can read from and write into Snowflake within a Spark pipeline that can run in a big data runtime environment like Amazon EMR.
Here Dave’s uniting data from various sources like Store, Store Sales, and Customer demographics; transforming them and populating a table in Snowflake in a single Spark pipeline.
In this pipeline, Dave is taking data from the Snowflake table, performing aggregations, then populating an aggregate table in Snowflake.
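The two Snowflake pipelines above can be sketched in code as well. This is a hedged example of what a read-aggregate-write round trip against Snowflake looks like in PySpark using the open-source spark-snowflake connector; SnapLogic eXtreme generates the equivalent Spark execution from the visual pipeline, and the account, table, and column names here are placeholders.

```python
# Short-form source name registered by the spark-snowflake connector.
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

def snowflake_options(account, user, password, database, schema, warehouse):
    """Build the option map the spark-snowflake connector expects."""
    return {
        "sfURL": f"{account}.snowflakecomputing.com",
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }

def build_aggregate(spark, opts):
    """Read a Snowflake table, aggregate in Spark, write the result back."""
    from pyspark.sql import functions as F

    # Read the conformed sales table straight from Snowflake.
    sales = (spark.read.format(SNOWFLAKE_SOURCE)
             .options(**opts)
             .option("dbtable", "STORE_SALES")      # placeholder table name
             .load())

    # Aggregate inside the Spark pipeline, then populate a new
    # aggregate table in Snowflake.
    agg = sales.groupBy("STORE_ID").agg(F.sum("NET_PAID").alias("TOTAL_PAID"))
    (agg.write.format(SNOWFLAKE_SOURCE)
        .options(**opts)
        .option("dbtable", "STORE_SALES_AGG")       # placeholder table name
        .mode("overwrite")
        .save())
```

Because both the read and the write happen inside one Spark pipeline, there is no longer a need for the separate standard mode pipeline that used to shuttle data from S3 into Snowflake.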
Let’s summarize what we just saw: Sam, the Integration Specialist, did a one-time configuration of the big data runtime environment and set up policies for optimal cost management using hybrid instances.
Dave, the Ad-Hoc Integrator, leveraged existing PySpark and Java scripts by uploading them to run as eXtreme pipelines. He then created integrations that populate data into Snowflake and build aggregate tables, all using a visual, no-code paradigm.
The key benefits of these new capabilities include a lower total cost of ownership through optimal cost management of big data resources; improved performance through reusability, since data engineers can now use existing PySpark and Java scripts in eXtreme; and higher productivity when building pipelines for cloud data warehouse use cases.