Big Data Ingestion Patterns: Ingest into the Data Lake

In my previous post I described the various approaches and patterns to consider when ingesting data from a relational data sources into a Hadoop-based data lake. In this post I’ll describe a practical approach on how to utilize these patterns with the SnapLogic Elastic Integration Platform without the need to write code. The big data ingestion patterns described here take into account all the design considerations and best practices for effective ingestion of data into the Hadoop data lake. These patterns are being used by many enterprise organizations today to move large amounts of data.

Let’s consider a scenario in which you want to ingest data from a relational database (like Oracle, mySql, Sql Server, etc.) into Hive. Let’s say we want to use AVRO as the data format for the Hive tables. Let’s now look at how this pattern enables you to automatically generate all the required AVRO schemas and the Hive tables.

The key steps associated with this pattern are as follows:

  1. Provide the ability to select a database type like Oracle, mySql, SqlServer, etc. Then configure the appropriate database connection information (such as username, password, host, port, database name, etc.)
  2. Provide the ability to select a table, a set of tables or all tables from the source database. For example, we want to move all tables that start with or contain “orders” in the table name.
  3. For each table selected from the source relational database:
    • Query the source relational database metadata for information on table columns, column data types, column order, and primary/foreign keys.
    • Generate the AVRO schema for a table. Automatically handle all the required mapping and transformations for the column (column names, primary keys and data types) and generate the AVRO schema.
    • Generate DDL required for the Hive table. Automatically handle all the required mapping and transformations for the columns and generate the DDL for the equivalent Hive table.
    • Save the AVRO schemas and Hive DDL to HDFS and other target repositories.
    • Create the Hive table using the DDL

The SnapLogic pattern to automatically generate the schema and Hive DDL is below. It can be used as part of your Ingest data flow pipeline to move data from a relation data source to Hive.snaplogic data ingest pipeline

 

For each table, Avro schemas and Hive table DDLs are also stored in HDFS. snaplogic big data integration pipeline

 

For each table, create the Hive table, using the Hive DDL and AVRO schema.
snaplogic big data integration pipelineThe SnapLogic integration Patterns catalog provides prebuilt, reusable integration data flow pipelines that enable reusability and improve productivity.
snaplogic patterns

 

In my next post I will write about another practical integration pattern for big data ingestion with SnapLogic without the need to do any coding.

Next steps: