ML Data Preparation Snap Pack

The ML Data Preparation Snap Pack automates various data preparation tasks for a machine learning model.


This comprehensive collection of Snaps handles everything from cleaning missing values and removing duplicates to converting data types, scaling features, and synthesizing new variables from related datasets. Convert categorical data into numeric formats, extract datetime components for time-series work, mask sensitive information for compliance, perform Principal Component Analysis to reduce dimensionality, and systematically sample and shuffle data for training and testing.

Whether you’re dealing with messy real-world data full of gaps and inconsistencies or need to match records across multiple sources, these Snaps give you the tools to shape your data the way your models need it. Everything runs within SnapLogic, so you do not have to write complex preparation scripts or switch between multiple tools.

The ML Data Preparation Snap Pack includes the following Snaps:

  • Categorical to Numeric: Convert categorical columns into numeric columns by using integer encoding or one-hot encoding.
  • Clean Missing Values: Handle missing values in datasets by dropping or imputing them.
  • Date Time Extractor: Extract components from datetime objects.
  • Deduplicate: Identify and remove duplicate records from datasets.
  • Feature Synthesis: Automatically create features out of multiple datasets that share a one-to-one or one-to-many relationship with each other.
  • Mask: Mask sensitive information in your dataset before exporting the dataset for analytics.
  • Match: Match records from different data sources that represent the same entity without relying on a common key.
  • Principal Component Analysis: Perform principal component analysis for dimensionality reduction.
  • Sample: Generate sample datasets from an input dataset using sampling algorithms.
  • Scale: Scale values in columns to specified ranges or apply statistical transformations.
  • Shuffle: Randomize the order of the row data in the dataset.
  • Type Converter: Determine types of values in columns. There are four supported types: integer, floating point, text, and datetime.
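
To make two of the transformations named above more concrete, here is a minimal plain-Python sketch of what integer-coded and one-hot representations of a categorical column look like, and of scaling a numeric column to a range. It is an illustration of the general techniques only, not the Snaps' implementation: the function names (`integer_encode`, `one_hot_encode`, `min_max_scale`), the code-assignment order, and the choice of min-max scaling are all assumptions made for this example.

```python
def integer_encode(values):
    """Map each distinct category to an integer, in order of first appearance.

    Hypothetical helper for illustration; the ordering rule is an assumption.
    """
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes


def one_hot_encode(values):
    """Return one 0/1 indicator row per input value, one position per category."""
    _, codes = integer_encode(values)
    rows = []
    for v in values:
        row = [0] * len(codes)
        row[codes[v]] = 1
        rows.append(row)
    return rows, list(codes)


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale numbers so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to scale; map everything to low.
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


colors = ["red", "green", "red", "blue"]
int_codes, mapping = integer_encode(colors)
one_hot_rows, categories = one_hot_encode(colors)
scaled = min_max_scale([10, 20, 40])

print(int_codes)     # [0, 1, 0, 2]
print(one_hot_rows)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(scaled)
```

Integer codes keep the column count unchanged but imply an ordering between categories, while one-hot rows avoid that at the cost of one column per category, which is the usual trade-off when choosing between the two.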

To learn more, please check out the documentation page.