In this video, learn about how you can build ML models with minimal data inputs with SnapLogic’s AutoML capabilities and reduce dimensionality with SnapLogic’s Principal Component Analysis capability.
February 2019 Release: SnapLogic Data Science updates
Hi, we introduced three new data engineering Snaps in the 4.16 release. I will show how Adhoc Integrators can use SnapLogic Data Science to train a machine learning model to predict customer sentiment using the Natural Language Processing (NLP) Snaps. We’ll also review how to use the AutoML Snap for automatic machine learning. The NLP and AutoML Snaps are available in SnapLogic’s ML Core Snap Pack. We’ve also added the Principal Component Analysis (PCA) Snap to achieve dimensionality reduction.
I’ve created a pipeline to build out a model to predict customer sentiment. In this pipeline, I am going to read the data set from AWS S3, take a 35% sample of it, and produce three key outputs: a common words JSON file, the actual model file, and the statistics on how well the algorithm did.
The NLP Snaps let me perform natural language processing operations. As part of this process, I am using the Tokenizer Snap to convert sentences into an array of tokens, the Common Words Snap to find the most popular words in the dataset of input sentences, and the Bag of Words Snap to vectorize sentences into a set of numeric fields. This pipeline generates higher productivity and visibility for Data Scientists and Data Engineers.
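The three NLP steps above can be sketched in plain Python. This is only an illustration of the general technique, not the Snaps' actual implementation: the function names, the tokenization rule, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def common_words(token_lists, top_n):
    """Return the top_n most frequent words across all tokenized sentences."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(top_n)]

def bag_of_words(tokens, vocabulary):
    """Turn one tokenized sentence into a numeric vector of word counts."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical input sentences, standing in for the customer reviews.
sentences = [
    "The service was great",
    "The delivery was late",
    "Great product and great service",
]
token_lists = [tokenize(s) for s in sentences]
vocabulary = common_words(token_lists, top_n=3)
vectors = [bag_of_words(tokens, vocabulary) for tokens in token_lists]
```

Each sentence ends up as a fixed-length list of counts, one per common word, which is the kind of numeric input a downstream model can train on.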
Now, let’s look at a pipeline that I have created to execute the model that I built earlier.
First, I need to create an Ultra task, which lets me create REST APIs from SnapLogic pipelines. I’ve called this pipeline “RunSentiAnalysis.” Let’s open this up and see how it’s been configured to run as an Ultra task.
Let’s switch to a sample web page that includes the HTTP endpoint of the Ultra task, which has been put behind a load balancer. I can run it as an Ultra pipeline to predict the customer sentiment.
After entering some sample text, I can obtain more information.
The feedback is almost instantaneous, and it comes back with a measure of the sentiment in a field called “polarity.” A value of 1 implies that the sentiment is positive, and the results give me a polarity value of 0.8, which is quite positive.
Now let’s look at the AutoML Snap. This Snap automates the process of training a large selection of candidate machine learning models from minimal inputs. In this case, it reads the customer file as input and predicts customer churn.
I can set a time limit for the run and a maximum number of models. A fold value of 5 means the data set will get split 5 ways. You can also pick the engine here. For this example, we have selected the H2O engine.
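To make the fold setting concrete, here is a minimal sketch of a 5-way split, assuming that "fold" refers to standard k-fold cross-validation, where each part takes one turn as the validation set. The function names are hypothetical, and the AutoML Snap may handle shuffling and splitting differently.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous, nearly equal folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_validation_splits(n_rows, k):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = k_fold_indices(n_rows, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# A toy data set of 10 rows split 5 ways.
splits = list(train_validation_splits(n_rows=10, k=5))
```

With 10 rows and a fold value of 5, every candidate model would be trained five times on 8 rows and scored on the remaining 2, and the scores averaged.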
We can also view a leaderboard that provides statistics on each of the algorithms that were run as part of AutoML.
Here is another example where we have used the Weka open source engine instead of H2O.
Let’s view the leaderboard for this Weka run.
The Principal Component Analysis Snap helps Data Engineers perform principal component analysis for dimensionality reduction. Principal Component Analysis or PCA is a dimension reduction tool that’s used to reduce a large set of variables to a smaller set that still contains most of the information from the original. By reducing the number of dimensions, you significantly reduce the amount of data that the downstream Snap must manage, making it faster.
Let’s look at the settings for the PCA Snap. The dimension specifies the maximum number of dimensions, or columns, that you want in the output. The default is 10. Variance specifies the minimum variance that you want to keep in the output document. The default is 0.95, and the maximum value that can be specified is 1. The passthrough checkbox can be selected to include all the categorical input fields in the output.
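One plausible way the dimension and variance settings could interact is sketched below. It assumes the component variances (eigenvalues of the covariance matrix) are already computed, and that components are kept in decreasing order of variance until either limit is reached. The function name and the sample values are hypothetical, and the Snap's actual rule may differ.

```python
def select_components(eigenvalues, max_dimension=10, min_variance=0.95):
    """Pick how many principal components to keep.

    Keeps the largest components until the retained share of total
    variance reaches min_variance or max_dimension is hit, whichever
    comes first (an assumed rule, for illustration only).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    kept, cumulative = 0, 0.0
    for value in ordered[:max_dimension]:
        cumulative += value
        kept += 1
        if cumulative / total >= min_variance:
            break
    return kept, cumulative / total

# Hypothetical variances for five components.
variances = [5.0, 3.0, 1.8, 0.1, 0.1]
kept_default, share_default = select_components(variances)
kept_capped, share_capped = select_components(variances, max_dimension=2)
```

With the defaults, three of the five components are enough to pass the 0.95 threshold. Capping the dimension at two keeps only about 80% of the variance, which is the trade-off you accept when you force a smaller output.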
This pipeline applies a PCA to reduce the number of dimensions to two, and we can see those dimensions here in the class column of the Mapper Snap.
To summarize, I showed how Adhoc Integrators can use SnapLogic Data Science to train a machine learning model to predict customer sentiment using Natural Language Processing Snaps. We also used the AutoML Snap for automatic machine learning, and the Principal Component Analysis Snap to achieve dimensionality reduction. This results in higher productivity for Data Scientists and Data Engineers as they leverage a visual, no-code paradigm for machine learning development.