In this video, learn how you can leverage SnapLogic eXtreme for big data processing on Azure Databricks and with Amazon Redshift.
August 2019 Release: SnapLogic eXtreme Updates
This demo will focus on SnapLogic eXtreme. First, I will focus on eXtreme support for Azure Databricks and then talk about some general updates around Redshift support and partition support for the Parquet Formatter Snap.
In the first demo, we will use SnapLogic eXtreme to analyze customer returns for a retail chain. This use case is based on the TPC Benchmark DS (TPC-DS) specification, a widely used industry benchmark for decision support that evaluates the performance of big data processing engines. For a retail chain, we need data on customers whose returns exceed the average customer returns for a given store by more than 20% in a given year.
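To make the returns rule concrete, here is a minimal plain-Python sketch of the logic, not the actual eXtreme pipeline. It assumes the rule means "total returns greater than 1.2 times the average customer total for the same store and year", which is how I read the TPC-DS use case. The function name, the tuple layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def heavy_returners(returns, threshold=1.2):
    """Return sorted (customer, store, year) keys whose total returns exceed
    threshold x the average customer total for that store and year.

    `returns` is an iterable of (customer, store, year, amount) tuples.
    The 1.2 default encodes the assumed "more than 20% above average" rule.
    """
    # Step 1: total returns per customer, per store, per year.
    totals = defaultdict(float)
    for customer, store, year, amount in returns:
        totals[(customer, store, year)] += amount

    # Step 2: average of those customer totals for each (store, year).
    grouped = defaultdict(list)
    for (_customer, store, year), total in totals.items():
        grouped[(store, year)].append(total)
    averages = {key: mean(values) for key, values in grouped.items()}

    # Step 3: keep customers above the threshold for their store and year.
    return sorted(
        key for key, total in totals.items()
        if total > threshold * averages[(key[1], key[2])]
    )


# Made-up sample data, for illustration only.
sample = [
    ("c1", "s1", 2000, 100.0),
    ("c2", "s1", 2000, 20.0),
    ("c3", "s1", 2000, 30.0),
    ("c1", "s2", 2000, 10.0),
    ("c2", "s2", 2000, 10.0),
]
flagged = heavy_returners(sample)
print(flagged)
```

In the demo itself, this aggregate, join, and filter logic is built visually from Snaps and runs as a Spark job over far larger tables.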
We will show how Sam, the Integration Specialist, can easily do a one-time configuration of the big data runtime environment in Azure Databricks that can be leveraged across multiple projects without any intervention from IT Operations. All he is going to do is set up policies; he doesn’t have to set anything up physically. Dave, the Ad hoc Integrator, is responsible for setting up capture, conform, and refine routines to gather data from various sources, transform it, and then store it in a data lake, in this case Azure Blob Storage.
Here is the high-level solution architecture that shows how SnapLogic eXtreme fits as part of our overall SnapLogic Intelligent Integration Platform.
In terms of the high-level demo steps, as Sam, the Integration Specialist, I am going to first add the Azure Databricks account details to set up the serverless big data runtime environment. This account will be used to initiate, elastically scale and terminate Azure Databricks clusters.
Then I am going to create and configure a Snaplex of the “eXtremeplex” type just once. Then, as Dave, the Ad hoc Integrator, I am going to use SnapLogic Designer to develop Spark pipelines to process data using Azure Databricks as the runtime environment. I will then execute and monitor the Spark pipelines and let Barry, the business user or Citizen Integrator, review the results in the data lake.
So let’s get right into the demo. As Sam, the Integration Specialist, let’s go to SnapLogic Manager and open up an Azure Databricks account that I created earlier. When I create a new account, I just need to provide the token ID and the Azure Databricks URL. Similarly, let me also show you how I configured the Azure WASB account.
Let’s go to the ‘Snaplexes’ tab. eXtreme performs elastic scale processing on Azure Databricks clusters through a new Snaplex type called ‘eXtremeplex.’ I will perform a one-time configuration, which involves defining the characteristics of the Azure Databricks cluster I want to spin up.
I have already created a Snaplex called “adb_extreme_demo”, so let’s just open it up.
Under Settings, I have set the Account Type to Azure Databricks and chosen the relevant instance type for the worker node.
I will use the WASB account I defined earlier. Now let’s navigate to the Advanced tab.
eXtreme helps reduce operational expenses by providing the option of auto-terminating inactive clusters. In this case, we have set the value to 60 minutes. You can also set the value to 0 if you don’t want the cluster to be auto-terminated. You can also enable auto-scaling by setting the minimum and maximum number of workers.
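As a rough illustration of what these two policies amount to, here is a small Python sketch that builds a cluster specification using the field names I know from the Databricks Clusters REST API (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`). How the eXtremeplex maps its own settings onto the cluster is not shown in this video, so treat the mapping as an assumption. The function name, the node type, and the worker counts are hypothetical.

```python
import json


def cluster_policy(name, node_type, min_workers, max_workers, idle_minutes):
    """Build a Databricks-style cluster spec with auto-scaling and
    auto-termination. This only constructs a dict; it calls no API."""
    if min_workers < 1 or max_workers < min_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    if idle_minutes < 0:
        raise ValueError("idle_minutes must be >= 0")
    return {
        "cluster_name": name,
        "node_type_id": node_type,
        # The cluster may grow and shrink between these two bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Minutes of inactivity before shutdown; 0 disables auto-termination.
        "autotermination_minutes": idle_minutes,
    }


# 60 minutes matches the demo; the other values are illustrative assumptions.
spec = cluster_policy(
    name="adb_extreme_demo",
    node_type="Standard_DS3_v2",
    min_workers=2,
    max_workers=8,
    idle_minutes=60,
)
print(json.dumps(spec, indent=2))
```

The point of the sketch is that both behaviors are declared as policy values once, rather than managed by hand for each run.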
As Dave, the Ad hoc Integrator, I will now use SnapLogic Designer to build a new Spark pipeline.
This pipeline processes data from multiple WASB sources. The ‘Stores Return fact’ table is about 290M rows or 32 GB, the date dimensions table is about 73K rows or 10 MB, and the Customers dimension table is about 12M rows or 1.5 GB. So you can get a sense of the volume of data we are processing here.
With the conventional approach, I would need to write all of this in Java, Scala, or Python. I am not really a hard-core Java programmer, but here I was able to build the integration because I could leverage a visual, no-code paradigm for designing my pipelines, just like I do for Application Integration and Data Integration use cases. I am an Ad hoc Integrator and I just became a Big Data Developer without writing any code. Now that’s awesome.
Let’s go ahead and execute one of the pipelines. The eXtremeplex now triggers the Azure Databricks cluster to be initialized.
Now let’s go to SnapLogic Dashboard, pick an older run of the pipeline, and click on the status to examine the pipeline execution statistics. Here you see about 41 GB of data being processed.
To summarize what we just demonstrated: as Sam, the Integration Specialist, I did a one-time configuration of the big data runtime environment using the Azure Databricks credentials and set up lifecycle management policies around termination and auto-scaling for the Azure Databricks clusters. I did not need to reserve a bunch of nodes for this. I only paid for what I used and did not have to plan for peak capacity. This lowers operating expenses and hence Total Cost of Ownership, because you no longer need to manage these clusters yourself; you can spin them up and tear them down as needed.
Then, as Dave, the Ad hoc Integrator, I used the visual, no-code design paradigm to develop Spark pipelines to process data using Azure Databricks. This improves productivity, as you don’t need to write code in Java, Scala, or Python. All of this adds up to speed and agility: you can do more with fewer resources, improve time to market, and get to insights faster. As a result, you speed up data-driven big data project implementations.
Now let me share another pipeline for the same ‘Retail Chain Analysis’ scenario. In this case, I am reading data from Redshift, transforming it, and writing it to another Redshift table using Amazon EMR as the big data runtime environment. I am now able to do all of this in an eXtreme pipeline.
And here is another eXtreme pipeline that demonstrates partition support for the Parquet Formatter Snap. In this case, we are partitioning the data by “year”, but it could have been another column such as month or day.
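For anyone unfamiliar with partitioned output, here is a small Python sketch of the Hive-style directory layout that Spark-based tools commonly use for partitioned Parquet data, where each partition value becomes a `column=value` directory. Whether the Parquet Formatter Snap produces exactly this layout is an assumption on my part. The function name, the `output` base path, and the sample rows are hypothetical, and the sketch only computes paths; it writes no Parquet files.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_paths(rows, column, base="output"):
    """Group rows by a partition column and map each group to a
    Hive-style directory such as output/year=2000."""
    groups = defaultdict(list)
    for row in rows:
        directory = PurePosixPath(base) / f"{column}={row[column]}"
        # The partition value is carried by the directory name, so it is
        # left out of the row data stored under that directory.
        remaining = {k: v for k, v in row.items() if k != column}
        groups[str(directory)].append(remaining)
    return dict(groups)


# Made-up rows, for illustration only.
rows = [
    {"year": 1999, "store": "s1", "returned": 20.0},
    {"year": 2000, "store": "s1", "returned": 35.0},
    {"year": 2000, "store": "s2", "returned": 12.5},
]
layout = partition_paths(rows, "year")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

With a layout like this, a query that filters on year only needs to read the matching directories instead of scanning the whole dataset.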
Thank you for watching this video. If you would like to know more about SnapLogic eXtreme, please visit snapLogic.com.