SnapLogic Data Science: Data collection and preparation

Learn how SnapLogic Data Science can help you collect and prepare data.

Hi! In this video, I will show you how SnapLogic Data Science can help you collect and prepare data.

The four steps of the data science lifecycle are data acquisition, data exploration and preparation, model training and testing, and model deployment. Once you have collected data from various sources, as a Data Engineer, you need to prepare and cleanse the data or perform feature engineering. There are a number of Snaps within SnapLogic Data Science that enable us to do that. The Expression language in the Mapper can do some of it, but the Snaps are easier to use and more intuitive. I can find the Snaps by searching for the term ‘ML’ so that I can see all the Machine Learning Snaps together.

Now let’s look at the Shuffle Snap. The Shuffle Snap is a Flow type Snap that enables you to randomize the order of the rows in an incoming dataset. The Snap can be tuned to work with large datasets by configuring the maximum percentage of memory that can be used to buffer the dataset. If that limit is exceeded, the dataset is written to a temporary file in local storage.
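To make the idea concrete outside of the visual designer, here is a minimal Python sketch of what randomizing row order means. It is not SnapLogic code: the function name `shuffle_rows` and the sample rows are hypothetical, and it only covers the in-memory case, not the spill to a temporary file.

```python
import random


def shuffle_rows(rows, seed=None):
    """Return a new list with the same rows in random order.

    Hypothetical helper for illustration only; it buffers every row in
    memory and does not model the Snap's memory limit or temp-file spill.
    """
    rng = random.Random(seed)   # a seed makes the order reproducible
    shuffled = list(rows)       # copy so the input order is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Made-up input rows.
rows = [{"id": i} for i in range(10)]
shuffled = shuffle_rows(rows, seed=42)
```

The output contains exactly the same rows as the input; only their order changes.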

Let’s look at the Sample Snap. The Sample Snap is a Flow type Snap that enables you to generate a sample dataset from the input dataset. The sampling uses one of four available algorithms and a predefined pass-through percentage. The algorithms available are Streamable Sampling, Strict Sampling, Stratified Sampling, and Weighted Stratified Sampling.
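As a rough illustration of two of these ideas, the Python sketch below shows one common interpretation of per-row (streaming) sampling and of stratified sampling. The function names, the `label` field, and the data are assumptions made for the example; the Snap's actual algorithms may differ in their details.

```python
import random
from collections import defaultdict


def bernoulli_sample(rows, pass_through, seed=None):
    """Pass each row independently with probability `pass_through`.

    Works on a stream, but the output size is only approximately
    pass_through * len(rows).
    """
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < pass_through:
            yield row


def stratified_sample(rows, key, pass_through, seed=None):
    """Sample the same fraction from each class so class ratios are kept."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = round(len(members) * pass_through)
        sample.extend(rng.sample(members, k))
    return sample


# Made-up dataset: 80 rows of class "a" and 20 rows of class "b".
rows = [{"id": i, "label": "a" if i < 80 else "b"} for i in range(100)]
streamed = list(bernoulli_sample(rows, 0.5, seed=1))
stratified = stratified_sample(rows, "label", 0.5, seed=1)
```

With a pass-through of 0.5, the stratified sample keeps the 80/20 class split, while the per-row sample only approximates both the size and the split.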

There are also other data preparation capabilities that are important in the ML lifecycle.

I can use the Clean Missing Values Snap to do some simple data prep tasks. Here I have a simple CSV file that has missing values for currency – you will see that in the third row. So let’s exit out of this and let’s define a rule that will drop the row if the currency value is null. And once I go ahead and run it, this is what the cleansed file looks like. You see that row was specifically dropped and the file is now cleansed.
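The rule I just defined, drop the row if the currency value is null, can be sketched in a few lines of Python. This is only an illustration of the rule, not how the Snap is implemented; the CSV content and the helper name `drop_rows_missing` are made up for the example.

```python
import csv
import io

# Made-up CSV in which the third data row has no currency value.
RAW = """id,amount,currency
1,100,USD
2,250,EUR
3,75,
4,310,GBP
"""


def drop_rows_missing(rows, field):
    """Keep only the rows where `field` is present and not blank."""
    return [row for row in rows if (row.get(field) or "").strip()]


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = drop_rows_missing(rows, "currency")
```

Here `cleaned` holds the rows with ids 1, 2, and 4; the row with the missing currency is gone.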

Now let us look at the Profile Snap, which profiles the incoming data and computes statistics on it. Here is what the file that is used to store various kinds of assets looks like. I will parse the CSV data, profile it, and have the profile output written to a JSON output file. Here is what the profile output looks like, with the value distribution and so on. From here I can determine what further data cleansing needs to happen. One more thing to point out is that here I am using the Date Time Extractor Snap. The Date Time Extractor is a Transform type Snap that extracts components from datetime data and adds them to the result field. In this case, I am extracting the epoch value into the result field. You can use the Snap to prepare the data before performing aggregation or analysis.
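For a sense of what these two steps compute, here is a small Python sketch of column profiling and epoch extraction. It is an assumption-laden illustration: the helper names, the field names, and the sample values are hypothetical, the timestamp is treated as UTC, and the Snaps' real output has its own format.

```python
import statistics
from collections import Counter
from datetime import datetime, timezone


def profile_column(values):
    """Compute simple statistics for one column (None means missing)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),  # value distribution
    }
    # Add numeric statistics only when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
    return summary


def add_epoch(row, field, result_field):
    """Copy the row and add the epoch seconds of an ISO datetime field.

    Assumes the timestamp has no offset and should be read as UTC.
    """
    dt = datetime.fromisoformat(row[field]).replace(tzinfo=timezone.utc)
    return {**row, result_field: int(dt.timestamp())}


# Made-up values for illustration.
profile = profile_column([10, 20, None, 30])
asset = add_epoch({"asset": "laptop", "purchased": "2019-01-01T00:00:00"},
                  "purchased", "purchased_epoch")
```

A profile like this is enough to spot missing values or skewed distributions before deciding on further cleansing rules.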

You already know how important it is to prep the data before it gets consumed by the model build, test and deploy phases of the machine learning lifecycle. And you saw how easy it was to do that with SnapLogic. By leveraging a visual, drag-and-drop no-code interface, SnapLogic Data Science is able to significantly increase the productivity of the Data Engineer or the Data Scientist working on the Data Science project.

Thank you for watching this video. For more information, please visit snaplogic.com.
