How to get valuable insights on data stored in Azure Data Lake Store

In a previous blog post, I discussed major trends in the data integration space and customers moving from on-premises systems to the cloud. I’d like to focus on one trend in particular: moving data from on-premises or cloud data analytics platforms to a Data Lake technology such as Azure Data Lake.

What is a Data Lake?

The term “Data Lake” refers to storing large amounts of data in its raw, native form – structured and unstructured – in one location. This data can come from various sources, and the Data Lake can act as a single source of truth for an organization. From an architecture standpoint, data first lands in a data acquisition zone (sometimes called the data swamp), is then cleansed and transformed in the data transformation zone, and is finally published so the business can gain insights from it.

[Diagram: Data Lake architecture]

As seen in the diagram above, enterprises have multiple systems such as ERP, CRM, RDBMS, NoSQL, IoT sensors, and more. This disparate data, stored across different systems, is difficult to pull together. A Data Lake brings all the data under one roof (data acquisition) using one of the following services:

  • Azure Blob
  • Azure Data Lake Store
  • Amazon S3
  • HDFS
  • Others

Data stored in one of these services can then be transformed in the following ways:

  • Aggregate
  • Sort
  • Join
  • Merge
  • Other

The transformed data then moves to the data publish/data access zone (which can be backed by the same services as data acquisition), where users can query it with tools such as the following (a brief example appears after the list):

  • Microsoft’s U-SQL
  • Amazon Athena
  • Hive
  • Presto
  • Others
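
To make the publish/access zone concrete, here is a minimal sketch of querying an already-published table with Hive from Python via the PyHive library. The host, database, table, and column names are hypothetical, and any of the engines listed above could serve the same role.

```python
# Minimal sketch: querying a published Data Lake table with Hive via PyHive.
# Host, port, database, table, and column names are hypothetical placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="sales")
cursor = conn.cursor()

# Aggregate purchases per customer from a published (already transformed) table.
cursor.execute("""
    SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total_spend
    FROM purchases
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")

for customer_id, orders, total_spend in cursor.fetchall():
    print(customer_id, orders, total_spend)

cursor.close()
conn.close()
```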

The bottom line is that a Data Lake can serve as a platform for running analytics that provide a better customer experience, recommendations, and more.

Azure Data Lake is Microsoft’s Data Lake offering, and the repository used to store all the data is Azure Data Lake Store. Users can run Azure Data Lake Analytics or HDInsight, or use U-SQL – a big data query language – on top of this data store to gain better business insights.

[Diagram: Azure Data Lake Store. Source: Microsoft]

Azure Data Lake Store (ADLS) can store any data in its native format. One of the goals of this data store is to bring together data from disparate sources. The SnapLogic Enterprise Integration Cloud, with its pre-built connectors called Snaps, helps by moving data from different systems into the data store quickly.

ADLS exposes its own API that applications must use to store data. SnapLogic abstracts these complexities behind Snaps, so users can easily move data from various systems into ADLS without needing to know anything about the underlying API.
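
For a sense of what the Snaps abstract away, here is a minimal sketch that writes a file directly to ADLS using the azure-datalake-store Python SDK. The tenant, client credentials, store name, and paths are hypothetical placeholders.

```python
# Minimal sketch: writing a file directly to Azure Data Lake Store with the
# azure-datalake-store Python SDK. Credentials, store name, and target path
# are hypothetical placeholders.
from azure.datalake.store import core, lib

# Authenticate with an Azure AD service principal (values are placeholders).
token = lib.auth(tenant_id="my-tenant-id",
                 client_id="my-client-id",
                 client_secret="my-client-secret")

# Connect to the ADLS account (store_name is a placeholder).
adls = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Write raw data in its native form into the acquisition zone.
with adls.open("/acquisition/leads/leads-2017-01.csv", "wb") as f:
    f.write(b"id,name,source\n1,Acme Corp,twitter\n")

# List what landed.
print(adls.ls("/acquisition/leads"))
```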

Use case

A business needs to track and analyze content to better recommend products or services to its customers. Its data – from various sources such as Oracle, files, Twitter, etc. – needs to be stored in a central repository such as ADLS so that business users can run analytics on top of it to measure customer buying behavior, interests, and products purchased.

Here’s a sample pipeline that can address this use case using Snaps:

Using the File Writer Snap and choosing the Azure Data Lake account as shown below, one can store the data merged from various systems into Azure Data Lake with ease.

All in all, the Data Lake can be a one-stop shop for storing any data, giving users more ways to derive insights from multiple data sources. And SnapLogic makes it quick and easy for users to move their data into the Data Lake (in this case, Azure Data Lake Store).

Pavan Venkatesh is Senior Product Manager at SnapLogic. Follow him on Twitter @pavankv.

Deep Dive into SnapLogic Winter 2017 Snaps Release

By Pavan Venkatesh

Data streams with Confluent and migration to Hadoop: In my previous blog post, I explained what future data movement trends will look like. In this post, I’ll dig into some of the exciting things we announced as part of the Winter 2017 (4.8) Snaps release. It also addresses future data movement trends for customers who want to move data to the cloud from different systems or migrate to Hadoop.

Major highlights in the Winter 2017 (4.8) release include:

  • Support for Confluent Kafka – a distributed messaging system for streaming data
  • Teradata to Hadoop – a quick and easy way to migrate data
  • Enhancements to the Teradata Snap Pack – on the TPT front, customers can quickly load, update, or delete data in Teradata
  • The Redshift Multi-Execute Snap – allows multiple statements to be executed sequentially, so customers can maintain business logic
  • Enhancements to the MongoDB Snap Pack (Delete and Update) and the DynamoDB Snap Pack (Delete and Delete-item)
  • Workday Read output enhancements – now it’s easier for downstream systems to consume
  • NetSuite Snap Pack improvements – users can now submit asynchronous operations
  • Security feature enhancements – including SSL for the MongoDB Snap Pack and invalidating database connection pools when account properties are modified
  • Major performance improvement when writing to an S3 bucket using the S3 File Writer – users can now configure a buffer size in the Snap so larger blocks are sent to S3 quickly (the sketch after this list illustrates the idea)
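
On that last point, the speed-up comes from sending larger blocks per request rather than many small writes. As a rough illustration of the same idea outside of Snaps (not the Snap’s implementation), here is a sketch using boto3’s TransferConfig; the bucket, key, file name, and sizes are hypothetical.

```python
# Rough illustration: larger upload parts mean fewer round trips to S3, which
# is the idea behind a configurable buffer size. Bucket, key, file name, and
# sizes are hypothetical placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

MB = 1024 * 1024
config = TransferConfig(
    multipart_threshold=64 * MB,  # switch to multipart uploads above this size
    multipart_chunksize=64 * MB,  # size of each part sent to S3
    max_concurrency=8,            # number of parts uploaded in parallel
)

s3.upload_file("merged-output.csv", "my-data-bucket",
               "exports/merged-output.csv", Config=config)
```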

Confluent Kafka Snap Pack

Kafka is a distributed messaging system based on a publish/subscribe model, with high throughput and scalability. It is mainly used to ingest data from multiple sources and deliver it to multiple downstream systems. Use cases include website activity tracking, fraud analytics, log aggregation, sales analytics, and others. Confluent is the company that provides an enterprise offering of open-source Kafka.

Here at SnapLogic, we have built Kafka Producer and Consumer Snaps as part of the Confluent Snap Pack. A quick dive into Kafka’s architecture and how it works is a good segue before getting into the Snap Pack and pipeline details.

[Diagram: Kafka cluster]

Kafka consists of one or more Producers that publish messages from one or more upstream systems, and one or more Consumers that consume messages on behalf of downstream systems. A Kafka cluster is made up of one or more servers called Brokers. Messages (a key and value, or just a value) are fed into a higher-level abstraction called a Topic. Each Topic can hold messages from different Producers, and users can define new Topics for new categories of messages. Producers write messages to Topics, and Consumers consume from one or more Topics. Topics are partitioned, replicated, and persisted across Brokers. Messages within a partition are ordered, and each has a sequential ID number called an offset. ZooKeeper usually maintains these offsets; Confluent refers to it as the coordination kernel.

Kafka also allows Consumers to be organized into a consumer group, so that multiple Consumers share the work of consuming from a Topic.
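
To make the Producer, Consumer, Topic, offset, and consumer-group concepts concrete, here is a minimal sketch using Confluent’s Python client (confluent-kafka). The broker address, topic name, and group ID are hypothetical, and the Snaps expose these same ideas as configuration rather than code.

```python
# Minimal sketch of the Kafka concepts above using the confluent-kafka Python
# client. Broker address, topic, and group ID are hypothetical placeholders.
from confluent_kafka import Producer, Consumer

# Producer: publish a keyed message to a Topic.
producer = Producer({"bootstrap.servers": "broker1:9092"})
producer.produce("sales-leads", key="lead-42", value='{"name": "Acme Corp"}')
producer.flush()  # block until the message is delivered to a Broker

# Consumer: join a consumer group; the Topic's partitions are divided among
# all Consumers that share the same group.id.
consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "redshift-loader",
    "auto.offset.reset": "earliest",  # start from the beginning if no stored offset
})
consumer.subscribe(["sales-leads"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    # The offset identifies the message's position within its partition.
    print(msg.partition(), msg.offset(), msg.key(), msg.value())

consumer.close()
```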

With over 400 Snaps supporting various on-premises systems (relational databases, files, NoSQL databases, and others) and cloud products (NetSuite, Salesforce, Workday, Redshift, Anaplan, and others), the SnapLogic Elastic Integration Cloud combined with the Confluent Kafka Snap Pack is a powerful way to move data between systems quickly and in a streaming manner, so customers can realize benefits and generate business outcomes fast.

With respect to the Confluent Kafka Snap Pack, we support Confluent version 3.0.1 (Kafka v0.9). These Snaps abstract away the complexities; users only have to provide configuration details to build a pipeline that moves data easily. One thing to note: when multiple Consumer Snaps in a pipeline are configured with the same consumer group, each Consumer Snap is assigned a different subset of the partitions in the Topic.

[Screenshot: Confluent Kafka Producer Snap settings]

[Screenshot: Confluent Kafka Consumer Snap settings]

[Screenshot: sample pipeline]

In the example above, I built a pipeline where sales leads (messages) stored in local files and MySQL are sent to a Topic in Confluent Kafka via Confluent Kafka Producer Snaps. The downstream system, Redshift, consumes these messages from that Topic via the Confluent Kafka Consumer Snap, which bulk loads them into Redshift for historical or auditing needs. The messages are also sent to Tableau, as another Consumer, to run analytics on how many leads were generated this year so customers can compare against last year.

Easy migrations from Teradata to Hadoop

There has been a major shift of customers moving from expensive Teradata solutions to Hadoop or other data warehouses. Until now, there has not been an easy way to transfer large amounts of data from Teradata to Hadoop. With this release we have developed a Teradata Export to HDFS Snap with two goals in mind: 1) ease of use and 2) high performance. This Snap uses the Teradata Connector for Hadoop (TDCH v1.5.1). Customers just have to download this connector from the Teradata website, in addition to the regular JDBC JARs. No installation is required on either the Teradata or Hadoop nodes.

TDCH uses MapReduce (MR) as its execution engine: queries are submitted to the MapReduce framework, and the distributed processes it launches make JDBC connections to the Teradata database. The fetched data is loaded directly into the defined HDFS location. The degree of parallelism for these TDCH jobs is determined by the number of mappers (a Snap configuration) used by the MapReduce job; the number of mappers also determines the number of files created in the HDFS location.
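
Under the hood, a TDCH export is just a Hadoop job. The sketch below shows roughly what the Snap’s settings map to, launched via Python’s subprocess purely for illustration; the JDBC URL, credentials, table, target path, and mapper count are placeholders, and the exact TDCH tool class and flags should be verified against the TDCH 1.5.1 documentation.

```python
# Rough sketch of what a TDCH export amounts to: a MapReduce job launched with
# "hadoop jar". URL, credentials, table, path, and mapper count are
# placeholders; verify the tool class and flags against your TDCH 1.5.1 docs.
import subprocess

num_mappers = 8  # degree of parallelism; also the number of files created in HDFS

subprocess.run([
    "hadoop", "jar", "teradata-connector-1.5.1.jar",
    "com.teradata.connector.common.tool.ConnectorImportTool",
    "-url", "jdbc:teradata://teradata-host/DATABASE=sales",
    "-username", "dbuser",
    "-password", "secret",
    "-jobtype", "hdfs",
    "-fileformat", "textfile",
    "-sourcetable", "transactions",
    "-targetpaths", "/data/teradata/transactions",
    "-nummappers", str(num_mappers),
], check=True)
```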

The Snap account details, along with a sample query to extract data from Teradata and load it into HDFS, are shown below.

[Screenshot: Snap account settings]

[Screenshot: Teradata Export to HDFS Snap settings]

The pipeline to this effect is as follows:

[Screenshot: Teradata-to-HDFS pipeline]

As you can see above, just one Snap is needed to export data from Teradata and load it into HDFS. Customers can later use the HDFS Reader Snap to read the exported files.

The Winter 2017 release equips customers with many benefits, from data streaming and easy migrations to enhanced security and better performance. More information on the SnapLogic Winter 2017 (4.8) release can be found in the release notes.

Pavan Venkatesh is Senior Product Manager at SnapLogic. Follow him on Twitter @pavankv.

Future Data Movement Trends with SnapLogic

Data volumes are increasing exponentially, and many organizations are starting to realize the complexity of their growing data movement and data management solutions. Data exists in various systems, and getting meaningful value out of it has become a major challenge for many companies. Most of this data is stored in relational systems like MySQL, PostgreSQL, and Oracle, the mainstream databases primarily used for OLTP purposes. NoSQL systems like Cassandra, MongoDB, and DynamoDB have also emerged, with tunable consistency models, to store some of this mission-critical data. Customers then typically move this data to much bigger OLAP systems like Teradata and Hadoop that can store large amounts of data, so they can run analytics, reporting, or complex queries against it. There is also a recent trend of moving some of this data to the cloud, especially to Amazon Redshift or Snowflake, and also to HDInsight or Azure Data Warehouse.
