In this video, learn how you can build better machine learning models for unseen data with the SnapLogic Intelligent Integration Platform.
In our 4.17 product release, we made several enhancements to SnapLogic Data Science, an extension of the SnapLogic Intelligent Integration Platform that accelerates the development and deployment of machine learning models. In this video, we’ll show you how each new feature works and how you can use it in your data engineering and machine learning projects.
Profiling Output: First, let’s look at the data profiling improvements we made, using a Kickstarter dataset as an example. Kickstarter is one of the most popular online crowdfunding platforms. In this example, we have used the Profile Snap to get quick insights into the data. When we preview the data, we see a field called ‘Main Category’ that is used to categorize these projects. Let’s expand the panel to look at a Pie Chart with ‘Main Category’ as the visualization key. You can see that ‘Film and Video’ projects make up about 20% of the total and ‘Music’ projects about 18%. Now let’s look at the profile output report and open the HTML file in a new tab. You can see that there are 15 unique values for the ‘Main Category’ field, with ‘Film and Video’ being the most popular. You can also download this file to your desktop to view it later or to include it in another document to share with another user.
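To make the profile numbers concrete, here is a minimal Python sketch of the kind of summary the report shows for a categorical field: the number of unique values, the most common value, and each value's share of the total. This is not how the Profile Snap is implemented; the function name `profile_field`, the field name `main_category`, and the sample rows are all made up for illustration.

```python
from collections import Counter

def profile_field(records, field):
    """Summarise one categorical field: unique count, top value, and shares."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    total = sum(counts.values())
    top_value, _ = counts.most_common(1)[0]
    return {
        "unique": len(counts),
        "most_common": top_value,
        "share": {value: n / total for value, n in counts.items()},
    }

# Made-up sample rows, not the real Kickstarter data.
projects = [
    {"main_category": "Film & Video"},
    {"main_category": "Music"},
    {"main_category": "Film & Video"},
    {"main_category": "Games"},
]
summary = profile_field(projects, "main_category")
print(summary["unique"], summary["most_common"])
```

The `share` dictionary holds the same proportions that the Pie Chart visualizes.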
Cluster Snap: Clustering is one of the most common unsupervised machine learning techniques and is used to find hidden patterns or groupings in data. Clustering divides a set of observations into groups such that an observation in a given group, or ‘cluster’, is more similar to other observations in that same cluster than to observations in other clusters.
In this training pipeline, let’s look at the configuration details for the ‘Cluster Snap’. We have picked the K-Means algorithm, but you can also pick another algorithm such as X-Means or G-Means. The maximum number of clusters is 3. Let’s preview the Cluster Snap output. Next, let’s create a Pie Chart for the 3 clusters with prediction as the Visualization Key. The first cluster has about 50% of the records. The key objective here is to support cluster analysis applications such as gene sequence analysis, market research, and object recognition.
For example, if a cell phone company wants to optimize the placement of their cell phone towers, they can use machine learning to estimate the clusters of people relying on their towers. A phone can only talk to one tower at a time, so the team uses clustering algorithms to design the best placement of cell towers to improve signal reception for groups, or clusters, of their customers.
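As a rough illustration of the K-Means idea, here is a small pure-Python sketch that groups hypothetical customer locations into three clusters and treats each cluster center as a candidate tower site. It is a teaching sketch, not the Cluster Snap's implementation: the function `k_means`, the deterministic choice of the first `k` points as starting centroids, and the sample coordinates are all assumptions.

```python
import math

def k_means(points, k, iterations=20):
    """Plain K-Means on 2-D points; returns (centroids, labels)."""
    # Deterministic start (an assumption): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical customer locations (x, y) forming three loose groups.
customers = [(1, 1), (9, 9), (1, 9), (2, 1), (8, 9), (1, 8), (1, 2), (9, 8), (2, 9)]
towers, assignment = k_means(customers, k=3)
print(towers)
print(assignment)
```

Each entry in `assignment` plays the role of the prediction field in the Snap output: it says which cluster a record belongs to.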
In this prediction testing pipeline, we can use the sample input values to exercise the model that we created and arrive at predictions. Here is the sample file and here are the predictions.
Fuzzy Matching: Fuzzy matching is a technique for finding values that match approximately rather than exactly. It is used in computer-assisted translation and is a special case of record linkage. The ‘Match Snap,’ which is included in the ML Data Preparation Snap Pack, performs record linkage to identify documents from different data sources, or input views, that may represent the same entity without relying on a common key. Here, the ‘Match Snap’ is used to match data from two country data files. Let’s look at the ‘Match Snap’ configuration. Threshold is the minimum confidence required for documents to be considered matched. You can select the Confidence checkbox if you want to include the confidence level of each match in the output.
For the Matching Criteria, you can specify the Left Field as the field in the first dataset you want to use for matching, and the Right Field as the field in the second dataset you want to use for matching.
A cleaner makes the comparison easier by removing variations from data, which are not likely to indicate genuine differences. For example, a cleaner might strip everything except digits from a ZIP code. Or, it might normalize the text and make it lowercase.
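Here is a minimal Python sketch of the two cleaners just described. The function names `digits_only` and `lowercase_normalize` are made up for this example and are not the names used in the Match Snap; the Unicode normalization form (NFKC) is also an assumption.

```python
import re
import unicodedata

def digits_only(value):
    """Keep only the digits, for example in a ZIP code."""
    return re.sub(r"\D", "", value)

def lowercase_normalize(value):
    """Unicode-normalize, collapse whitespace, and lowercase the text."""
    text = unicodedata.normalize("NFKC", value)
    return " ".join(text.split()).lower()

print(digits_only("ZIP: 94402-1234"))             # 944021234
print(lowercase_normalize("  United   STATES "))  # united states
```

After cleaning, two values that differ only in formatting compare as equal, so the comparator only has to deal with genuine differences.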
A comparator compares two values and produces a similarity indicator, on a scale from 0 to 1 – zero, being “completely different,” and one being “exactly equal.” You can choose from the following comparators – Levenshtein, Weighted Levenshtein, Longest Common Subsequence, Numeric, and Q-Grams. The default is the popular Levenshtein algorithm.
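To show how a comparator can turn an edit distance into a 0-to-1 score, here is a small Python sketch based on the Levenshtein distance. Dividing by the length of the longer string is an assumption made for this example; the Match Snap may scale its scores differently.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to 0..1 using the longer length (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
print(similarity("abc", "abc"))          # 1.0
print(similarity("abc", "xyz"))          # 0.0
```

With this scaling, identical strings score one and strings with no characters in common at any position score zero, which lines up with the scale described above.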
For Low, you enter a decimal value representing the lowest level of confidence, at which the records will be deemed not to match. If this value is left empty, a value of 0.3 is applied automatically.
For High, you enter a decimal value representing the highest level of confidence at which the records will be deemed to be a perfect match. If this value is left empty, a value of 0.95 is applied automatically.
The first output shows the matched documents and, optionally, the confidence level associated with each match. Let’s look at the breakdown by country. The second output shows the unmatched documents from the first dataset, and the third output shows the unmatched documents from the second dataset.
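The three outputs can be sketched in a few lines of Python. This is only an illustration of the idea, not the Match Snap's algorithm: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in comparator, pairs each left record with its best remaining right record, and the function name `match_records`, the 0.8 threshold, and the sample records are all assumptions.

```python
from difflib import SequenceMatcher

def match_records(left, right, field_left, field_right, threshold=0.8):
    """Return (matched, unmatched_left, unmatched_right) for two record lists."""
    matched, unmatched_left = [], []
    used_right = set()
    for left_rec in left:
        best_score, best_index = 0.0, None
        for i, right_rec in enumerate(right):
            if i in used_right:
                continue
            # Stand-in comparator: similarity ratio of the lowercased values.
            score = SequenceMatcher(
                None, left_rec[field_left].lower(), right_rec[field_right].lower()
            ).ratio()
            if score > best_score:
                best_score, best_index = score, i
        if best_index is not None and best_score >= threshold:
            used_right.add(best_index)
            matched.append({
                "left": left_rec,
                "right": right[best_index],
                "confidence": round(best_score, 2),
            })
        else:
            unmatched_left.append(left_rec)
    unmatched_right = [r for i, r in enumerate(right) if i not in used_right]
    return matched, unmatched_left, unmatched_right

# Made-up sample records from two hypothetical country files.
file_a = [{"name": "United States"}, {"name": "Brasil"}, {"name": "Atlantis"}]
file_b = [{"country": "united states"}, {"country": "Brazil"}, {"country": "Japan"}]
matched, only_a, only_b = match_records(file_a, file_b, "name", "country")
print(len(matched), only_a, only_b)
```

Here `matched` corresponds to the first output, `only_a` to the second, and `only_b` to the third.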
Feature Synthesis: The ‘Feature Synthesis Snap’ generates features from the selected fields in multiple datasets and adds them to the base dataset. The datasets need not all be related to each other, but each has to be related to at least one of the other input datasets. Features that are generated include Mean, Min, Max, Mode, Unique, Count, Sum, and Standard Deviation, among others. The first input is the base dataset. In this case, it is the customer dataset. Subsequent inputs are reference datasets. The expected output is the base dataset containing all the features generated from the reference datasets. Let’s look at the ‘Feature Synthesis Snap’. Here is the policy that joins the customer and transactions datasets based on customer ID. And here is what the feature synthesis output looks like.
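As a rough picture of what this produces, here is a minimal Python sketch that joins a customer dataset to a transactions dataset on customer ID and adds a few aggregate features to each customer. It is not the Snap's implementation; the function name `synthesize_features`, the field names `customer_id` and `amount`, the generated feature names, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def synthesize_features(base, reference, key, value_field):
    """Add per-key aggregates of `value_field` from `reference` to each base row."""
    groups = defaultdict(list)
    for row in reference:
        groups[row[key]].append(row[value_field])
    enriched = []
    for row in base:
        values = groups.get(row[key], [])
        features = {
            f"{value_field}_count": len(values),
            f"{value_field}_sum": sum(values),
            f"{value_field}_mean": mean(values) if values else None,
            f"{value_field}_min": min(values) if values else None,
            f"{value_field}_max": max(values) if values else None,
        }
        enriched.append({**row, **features})
    return enriched

# Made-up base (customers) and reference (transactions) datasets.
customers = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]
result = synthesize_features(customers, transactions, "customer_id", "amount")
for row in result:
    print(row)
```

Each output row is the original customer record plus the new columns, which mirrors the base-dataset-plus-features shape of the Snap output.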
Auto-ML Improvements: Now let’s talk about some improvements we have made to the Auto-ML Snap.
The first one is the HTML-based leaderboard reporting enhancement, which makes it easier to explain results. Let’s click on “Write Report” to view the HTML report. Here you will see that the XGBoost algorithm provided the highest accuracy, at 77.4%.
And if you open the Auto-ML Snap, you now have the ability to provide a list of algorithm groups so users can decide the type of algorithm they want to use for building the model. In this case, you can pick from Standard, Tree, XGBoost and Neural Network algorithm groups.
Now let me show you another pipeline, where the Auto-ML Snap has a second input view through which the user can feed in the best-so-far model from a previous run.
Thank you for watching this video.