The ML Data Preparation Snap Pack automates many of the data preparation tasks that arise when developing a machine learning model. It gives data scientists a visual, drag-and-drop alternative to hand-coding data preparation operations, so they can spend less time cleansing data and more time on strategic (and fun) work like actually building a machine learning model.
The ML Data Preparation Snap Pack includes the following Snaps:
- Categorical to Numeric: Convert categorical columns into numeric columns by using integer coding or one hot encoding.
- Clean Missing Values: Replace missing values in datasets by dropping or imputing values.
- Date Time Extractor: Extract components from datetime objects.
- Numeric to Categorical: Convert numeric columns into categorical columns by using custom ranging or binning.
- Sample: Generate sample datasets from an input dataset using sampling algorithms.
- Scale: Scale values in columns to specified ranges or apply statistical transformations.
- Shuffle: Randomize the order of the row data in the dataset.
- Type Converter: Determine types of values in columns. There are four supported types: integer, floating point, text, and datetime.
- Principal Component Analysis: Perform principal component analysis for dimensionality reduction.
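To make two of the operations in this list concrete, here is a minimal plain-Python sketch of imputing missing values and scaling a column to a range. It is only an illustration of the general techniques, not the Snaps' implementation: the function names `impute_mean` and `min_max_scale` are hypothetical, and mean imputation and min-max scaling are assumed here as representative choices.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the present values."""
    present = [x for x in column if x is not None]
    mean = sum(present) / len(present)
    return [mean if x is None else x for x in column]


def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to the lower bound.
        return [low for _ in column]
    return [low + (x - lo) * (high - low) / (hi - lo) for x in column]


raw = [10.0, None, 30.0, 20.0]   # hypothetical column with one missing value
filled = impute_mean(raw)        # the missing value becomes the mean of 10, 30, 20
scaled = min_max_scale(filled)   # values rescaled into the range [0, 1]
print(filled)
print(scaled)
```

In a pipeline, the equivalent steps would be configured visually in the Clean Missing Values and Scale Snaps rather than written by hand.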
Turn your categorical data into numeric data
Training a model on numeric data is often easier than training on categorical data, but in many cases your raw data contains categorical information. This is where the Categorical to Numeric Snap comes in handy. For more information, see the documentation for the Categorical to Numeric Snap.
Below, we’ve generated a CSV file using the CSV Generator Snap. The CSV file contains a column named category. We use the Categorical to Numeric Snap to encode the values in this column, applying both integer encoding and one hot encoding to see the differences.
The table below shows the output of the Categorical to Numeric Snap. The first column is the original data from the CSV Generator Snap. The column in red (category_int) is the result of integer encoding. The columns in blue (category_Comics, category_Crafts, category_Design, category_Film & Video, category_Food, category_Music, and category_Publishing) are the result of one hot encoding. Now that the categorical column is represented as numeric data, data scientists can more easily use it to build their machine learning model.
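For readers who want to see the difference between the two encodings in code, the following plain-Python sketch mimics the shape of the output described above. It is an illustration under assumptions, not the Snap's implementation: the helper names are hypothetical, and assigning integers in sorted category order is an assumption (the Snap may assign codes differently).

```python
def integer_encode(values):
    """Map each distinct category to an integer code (sorted order is an assumption)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values, prefix="category"):
    """Create one 0/1 column per distinct category, named <prefix>_<category>."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(v == cat) for cat in categories}
        for v in values
    ]


sample = ["Music", "Food", "Comics", "Music"]  # hypothetical category column
codes, categories = integer_encode(sample)     # one integer column (like category_int)
one_hot = one_hot_encode(sample)               # one 0/1 column per category

for original, code, row in zip(sample, codes, one_hot):
    print(original, code, row)
```

Integer encoding keeps the data in a single column but implies an ordering among categories, while one hot encoding avoids that implied ordering at the cost of one extra column per category.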
The ML Snap Packs are included in SnapLogic Data Science, an extension of the Intelligent Integration Platform that provides a visual drag-and-drop approach to developing and deploying machine learning models. Check out our other ML Snap Packs: ML Core Snap Pack and ML Analytics Snap Pack.
Learn more about the ML Data Preparation Snap Pack in the blog post, “SnapLogic November 2018 Release: Revolutionize your business with intelligent integration.”