Get your game plan on: Data warehouse migration to the cloud

By Ravi Dharnikota

You’ve decided to move your data warehouse to the cloud, and want to get started. Great! It’s easy to see why – in addition to the core benefits I wrote about in my last blog post, there are many more benefits associated with cloud data warehousing: incredibly fast processing, speedy deployment, built-in fault tolerance and disaster recovery, and, depending on your cloud provider, strong security and governance.

A six-step reality check

But before you get too excited, it’s time for a reality check: moving an existing data warehouse to the cloud is not quick, and it isn’t easy. It is definitely not as simple as exporting data from one platform and loading it into another. Data is only one of the six warehouse components to be migrated.

Tactically and technically, data warehouse migration is an iterative process with many steps needed to migrate all of the components, as illustrated below. Here’s everything you need to consider when migrating your data warehouse to the cloud.

1) Migrating schema: Before moving warehouse data, you’ll need to migrate table structures and specifications. You may need to make structural changes as part of the migration: do indexing and partitioning schemes need to be rethought for the new platform?

Data Warehouse Migration Process
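To make the schema step concrete, here is a minimal sketch of the kind of type translation involved, assuming a hypothetical move from an on-premises warehouse to a cloud platform. The type map and table definition below are illustrative only, not drawn from any particular migration tool:

```python
# Hypothetical sketch: translating legacy column types to a cloud
# warehouse's types during schema migration. The mappings below are
# illustrative; a real migration would follow the target platform's docs.
LEGACY_TO_CLOUD_TYPES = {
    "NUMBER(10)": "INTEGER",
    "NUMBER(18,2)": "NUMERIC(18,2)",
    "VARCHAR2(255)": "VARCHAR(255)",
    "DATE": "TIMESTAMP",
}

def translate_column(name: str, legacy_type: str) -> str:
    """Map one legacy column definition to the target platform's type."""
    cloud_type = LEGACY_TO_CLOUD_TYPES.get(legacy_type, legacy_type)
    return f"{name} {cloud_type}"

columns = [("order_id", "NUMBER(10)"), ("order_total", "NUMBER(18,2)"),
           ("customer_name", "VARCHAR2(255)"), ("order_date", "DATE")]

ddl = "CREATE TABLE orders (\n    " + ",\n    ".join(
    translate_column(n, t) for n, t in columns) + "\n);"
print(ddl)
```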

2) Migrating data: Moving very large volumes of data is process intensive, network intensive, and time-consuming. You’ll need to map out how long the migration will take and whether you can accelerate the process. If you restructure as part of schema migration, will you need to transform data as part of the data migration? Alternatively, can you transform in-stream, or should you pre-process the data and then migrate?
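As a back-of-the-envelope illustration of that mapping exercise, here is a small sketch that estimates raw transfer time from data volume and effective bandwidth. All figures are assumptions for illustration; real migrations also contend with network overhead, retries, and pre-processing:

```python
# Rough estimate of raw transfer time for a bulk data migration.
# Figures are illustrative assumptions, not benchmarks.
data_volume_tb = 50                 # total warehouse data to move
effective_bandwidth_gbps = 1.0      # sustained throughput after overhead
compression_ratio = 3.0             # assume data compresses 3:1 in transit

bytes_to_move = data_volume_tb * 1e12 / compression_ratio
seconds = bytes_to_move * 8 / (effective_bandwidth_gbps * 1e9)
print(f"Estimated transfer time: {seconds / 3600:.1f} hours")
# ~37 hours under these assumptions -- long enough that incremental
# or pre-processed migration strategies are worth considering.
```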

3) Migrating ETL: Moving data may be the easy part when compared to migrating ETL processes. You may need to change the code base to optimize for platform performance and change data transformations to sync with data restructuring. You’ll need to determine if data flows should remain intact or be reorganized. As part of the migration, you may need to reduce data latency and deliver near real-time data. If that’s the case, would it make sense to migrate ETL processing to the cloud, as well? Is there a utility to convert your ETL code?

4) Rebuilding data pipelines: With any substantive change to data flow or data transformation, rebuilding data pipelines may be a better choice than migrating existing ETL. You may be able to isolate individual data transformations and package them as executable modules. You’ll need to understand the dependencies among data transformations to construct an optimal workflow, and the advantages you may gain – performance, agility, reusability, and maintainability – by rebuilding ETL as modular data pipelines using modern, cloud-friendly technology.
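As an illustration of the modular approach, here is a minimal sketch of data transformations packaged as small, composable functions. The transformation names and record shape are hypothetical:

```python
# Hypothetical sketch: ETL logic rebuilt as small, reusable
# transformation modules that compose into a pipeline.
from functools import reduce

def normalize_country(record: dict) -> dict:
    record["country"] = record.get("country", "").strip().upper()
    return record

def add_revenue(record: dict) -> dict:
    record["revenue"] = record["units"] * record["unit_price"]
    return record

def compose(*steps):
    """Chain transformation modules into a single pipeline function."""
    return lambda record: reduce(lambda r, step: step(r), steps, record)

pipeline = compose(normalize_country, add_revenue)
print(pipeline({"country": " us ", "units": 3, "unit_price": 20.0}))
# {'country': 'US', 'units': 3, 'unit_price': 20.0, 'revenue': 60.0}
```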

5) Migrating metadata: Source-to-target metadata is a crucial part of managing a data warehouse; knowing data lineage is critical for tracing and troubleshooting when problems occur. How readily will this metadata transfer to a new cloud platform? Are all of the mappings, transform logic, dataflows, and workflows locked in proprietary tools or buried in SQL code? You’ll need to determine whether you can export and import the metadata, reverse engineer it, or rebuild it from scratch.

6) Migrating users and applications: The final step in the process is migrating users and applications to the new cloud data warehouse without interrupting business operations. Security and access authorizations may need to be created or changed, and BI and analytics tools will need to be connected. Determine what communication this requires, and with whom.

Don’t try to do everything at once

A typical enterprise data warehouse contains a large amount of data describing many business subject areas. Migrating an entire data warehouse in a single pass is usually not realistic. Incremental migration is the smart approach when “big bang” migration isn’t practical. Migrating incrementally is a must when undertaking significant design changes as part of the effort.

However, incremental migration brings new considerations. Data location should be transparent from the user’s point of view throughout the period when some data resides in the legacy data warehouse and some in the new cloud data warehouse. Consider a virtual layer as a point of access to decouple queries from data storage location.
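To illustrate decoupling queries from storage location, here is a hedged sketch of a thin routing layer. The table-to-location map and connection objects are hypothetical stand-ins for what a real data virtualization product provides:

```python
# Hypothetical sketch of location transparency during incremental
# migration: each query is dispatched to whichever system currently
# owns the table. A real deployment would use a data virtualization
# layer rather than hand-rolled routing like this.

class DummyConnection:
    """Stand-in for a real database connection."""
    def __init__(self, name: str):
        self.name = name

    def execute(self, sql: str) -> str:
        return f"[{self.name}] {sql}"

# Which system currently owns each table (updated as migration proceeds).
TABLE_LOCATION = {"orders": "cloud", "customers": "legacy"}

def run_query(table: str, sql: str, legacy_conn, cloud_conn):
    conn = cloud_conn if TABLE_LOCATION.get(table) == "cloud" else legacy_conn
    return conn.execute(sql)

legacy = DummyConnection("legacy-dw")
cloud = DummyConnection("cloud-dw")
print(run_query("orders", "SELECT count(*) FROM orders", legacy, cloud))
```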

A hybrid strategy is another viable option. With a hybrid approach, your on-premises data warehouse can remain operating as the cloud data warehouse comes online. During this transition phase, you’ll need to synchronize the data between the old on-premises data warehouse and the new one that’s in the cloud.

Cloud migration tools to the rescue

The good news is, there are many tools and services that can be invaluable when migrating your legacy data warehouse to the cloud. In my next post, the third and final in this series, I’ll explore the tools for data integration, data warehouse automation, and data virtualization, and system integrator resources that can speed and de-risk the process.

Learn more at SnapLogic’s upcoming webcast, “Traditional Data Warehousing is Dead: How digital enterprises are scaling their data to infinity and beyond in the Cloud,” on Wednesday, August 16 at 9:00am PT. I’ll be presenting with Dave Wells, Data Management Practice Lead, Eckerson Group, and highlighting tangible business benefits that your organization can achieve by moving your data to the cloud. You’ll learn:

      • Best practices, key technologies to consider, and case studies to get you started
      • The potential pitfalls of “cloud-washed” legacy data integration solutions
      • Cloud data warehousing market trends
      • How SnapLogic’s Enterprise Integration Cloud delivers up to a 10X improvement in the speed and ease of data integration

Sign up today!

Ravi Dharnikota is Chief Enterprise Architect at SnapLogic. Follow him on Twitter @rdharn1

How to set up Stream processing for Twitter using Snaps

By Sharath Punreddy

As you probably know, SnapLogic data pipelines use Streams, a continuous flow of data from a source to a target. By processing and extracting valuable insights from streaming data, a user or system can make decisions more quickly than with traditional batch processing. Streaming analytics now deliver near real-time, if not real-time, insights.

In this data-driven age, the timing of data analytics and insights has become a key differentiator. In some cases, data becomes less relevant, if not obsolete, as it ages. Analyzing data as it flows in is crucial for use cases such as sentiment analysis for new product launches in retail, fraudulent transaction detection in the financial industry, preventing machine failures in manufacturing, sensor data processing for weather forecasts, detecting disease outbreaks in healthcare, and so on. Stream processing lets the user or system draw insights from the very latest data in near real time, if not real time. Along with traditional APIs, companies now provide Streaming APIs for rendering data in real time as it is generated. Unlike traditional REST/SOAP APIs, a Streaming API establishes a connection to the server and continuously streams data for the desired amount of time; once the time has elapsed, the connection is terminated. Apache Spark, with Apache Kafka as the streaming platform, has become a de facto industry standard for stream processing.
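To show what “establish a connection and stream for a desired amount of time” looks like in code, here is a minimal sketch using Python’s requests library against a hypothetical streaming endpoint. The URL and token are placeholders:

```python
# Minimal sketch of consuming a streaming HTTP API: the connection
# stays open and records arrive line by line until we stop reading.
# The endpoint and token below are hypothetical placeholders.
import json
import time
import requests

STREAM_URL = "https://stream.example.com/v1/events"   # placeholder
DURATION_SECONDS = 60                                  # how long to stream

deadline = time.time() + DURATION_SECONDS
with requests.get(STREAM_URL,
                  headers={"Authorization": "Bearer <token>"},
                  stream=True, timeout=30) as resp:
    for line in resp.iter_lines():
        if time.time() > deadline:
            break                      # close the connection after the window
        if line:                       # skip keep-alive blank lines
            event = json.loads(line)
            print(event)
```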

In this blog post, I’ll walk through the steps for building a simple pipeline to retrieve and process Tweets. You can also jump to the how-to video here.

Twitter Streams
Twitter has become a primary data source for sentiment analysis. The Twitter Streaming APIs provide access to global Tweets in real time, as people are tweeting. SnapLogic’s “Twitter Streaming Query” Snap enables users to retrieve Tweets based on a keyword in the text of the Tweet. The Tweets can then be processed using Snaps such as the Filter, Mapper, and Aggregate Snaps for filtering, transforming, and aggregating, respectively. SnapLogic also provides a “Spark Script” Snap, which can execute an existing Python program on incoming Tweets. Tweets can also be routed to different destinations based on a condition, or copied to multiple destinations (RDBMS, HDFS, S3, etc.) for storage and further analysis.
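To give a flavor of the kind of standalone Python logic you might run over incoming Tweets (shown here outside SnapLogic; the Spark Script Snap’s own interface differs), here is a small sketch that extracts the fields most useful for downstream analysis. The field names follow Twitter’s classic JSON payload:

```python
# Sketch of standalone logic applied to a Tweet's JSON payload,
# pulling out fields commonly used for sentiment analysis.
def summarize_tweet(tweet: dict) -> dict:
    """Extract the fields most useful for downstream analysis."""
    return {
        "user": tweet.get("user", {}).get("screen_name"),
        "lang": tweet.get("lang"),
        "text": tweet.get("text", ""),
        "hashtags": [h["text"] for h in
                     tweet.get("entities", {}).get("hashtags", [])],
    }

sample = {"user": {"screen_name": "snaplogic"}, "lang": "en",
          "text": "Streaming with #Kafka and #Spark",
          "entities": {"hashtags": [{"text": "Kafka"}, {"text": "Spark"}]}}
print(summarize_tweet(sample))
```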

Getting Started
Below is a simple pipeline for retrieving Tweets, filtering them based on language, and publishing them to a Kafka cluster.

  1. Using the Snaps tab in the left frame, search for the “Twitter Streaming Query” Snap. Drag and drop the Snap onto the Designer canvas (the white space on the right).

a. Click on the Snap to open the Snap Settings form.

Note: The “Twitter Streaming Query” Snap requires a Twitter account, which can be created in Designer while building the pipeline or in Manager prior to building the pipeline.

b. Click on the “Account” tab.

c. Click on the “Add Account” button.

Note: Twitter provides a couple of ways to authenticate applications to a Twitter account. “Twitter Dynamic OAuth1” is for application-only authentication, and “Twitter OAuth1” is for user authentication, where the user authorizes the application by signing in to Twitter. In this case, we are using the user authentication mechanism.

d. Choose the appropriate options based on who should be able to access the account:
i. For the location of the account: “Shared” makes the account accessible to the entire organization, “projects/shared” makes it accessible to all users in the project, and “project/” makes it accessible only to the individual user.
ii. For the account type: choose the “Twitter OAuth1” option to grant access to the Twitter account of the individual user.
iii. Click “OK.”

e. Enter meaningful text for the “Label” such as [Twitter_of_] and click the “Authorize” button.

Note: If the user is logged in to Twitter with an active session, they will be taken to the “Authorize” page of the Twitter website to grant the application access. If the user is not logged in or does not have an active session, they will be taken to the Twitter sign-in page to sign in first.

f. Click on the “Authorize app” button.

Note: The “OAuth token” and “OAuth token secret” values shown above are not active and are for example only.

g. At this point, the “OAuth token” and the “OAuth token secret” should have been populated. Click “Apply.”

2. Once the account is successfully set up, click on the “Settings” tab to provide the search keyword and time duration.

Note: The Twitter Snap retrieves Tweets for a designated time duration. For continuous retrieval, set “Timeout in seconds” to “0.”

a. Enter a keyword and a time duration in seconds.


3. Save by clicking the disk icon at the top right. This triggers validation, and the icon should become a check mark if validation is successful.


4. Click on the preview icon to preview the data.

5. This confirms that the “Twitter Streaming Query” Snap has successfully established a connection to the Twitter account and is fetching Tweets.

6. The “Filter” Snap is used for filtering Tweets. Search for “Filter” using the Snaps tab in the left frame, then drag and drop the “Filter” Snap onto the canvas.

a. Click on the “Filter” Snap to open the Settings form.

b. Provide a meaningful name for the “Label,” such as “Filter By Language,” and a filter condition for the “Filter Expression” (for example, $lang == "en" to keep only English Tweets). You can use the drop-down to choose the filter attribute.

7. Click on the disk icon to save, which again triggers validation. You’ve now successfully configured the “Filter” Snap.

8. Search for the “Confluent Kafka Producer” Snap using the Snaps tab in the left frame. Drag and drop the Snap onto the canvas.

Note: Confluent is an Apache Kafka distribution geared toward enterprises.

a. The “Confluent Kafka Producer” requires an account to connect to the Kafka cluster. Choose appropriate values based on the location and type of the account.

b. Provide meaningful text for the “Label” and enter the bootstrap server(s). For multiple bootstrap servers, separate the host and port entries with commas.

c. The “Schema registry URL” is optional, but it is required if Kafka must parse messages against a schema.

d. Other optional Kafka properties can be passed to Kafka using the “Advanced Kafka Properties.” Click on “Validate.”

e. If the validation is successful, you should see the message “Account validation successful” at the top. Click “Apply.”

9. Once the account is set up and selected, click on the “Settings” tab to provide the Kafka topic and message.



a. You can choose from the list of available topics by clicking the bubble icon next to the “Topic” field; leave the other fields at their defaults. The other required field is “Message value”: enter “$” to send the entire Tweet and its metadata. Save by clicking the disk icon.

10. The above is a fully validated pipeline that fetches Tweets and loads them into Kafka.
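For comparison, here is roughly what the publish step would look like hand-coded with the kafka-python client. The broker address and topic name are placeholders, and this is a sketch of equivalent logic, not what the Snap does internally:

```python
# Hand-coded equivalent of the publish step, sketched with the
# kafka-python client. Broker and topic names are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

tweet = {"lang": "en", "text": "hello from the stream"}  # sample record
producer.send("tweets", value=tweet)   # "tweets" is a placeholder topic
producer.flush()                       # block until the send completes
```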

11. At this point, the pipeline is all set to receive Tweets and push them into the Kafka topic. Run the pipeline by clicking the play button in the top-right corner, and view its progress by clicking the display button.

As you can see, the pipeline can be built in less than 15 minutes without requiring any deep technical knowledge. This tutorial and the video provide a basic example of what can be achieved with these Snaps. There are several other Snaps that can act on the data – filtering, copying, aggregating, triggering events, sending out emails, and more. SnapLogic takes pride in bringing complex technology to citizen integrators. I hope you found this useful!

Sharath Punreddy is Enterprise Solution Architect at SnapLogic. Follow him on Twitter @srpunreddy.

Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform

The SnapLogic Elastic Integration Platform connects your enterprise data, applications, and APIs by building drag-and-drop data pipelines. Each pipeline is made up of Snaps, intelligent connectors that users drag onto a canvas and “snap” together like puzzle pieces.

A SnapLogic pipeline being built and configured

These pipelines are executed on a Snaplex, an application that runs on a multitude of platforms: on a customer’s infrastructure, on the SnapLogic cloud, and most recently on Hadoop. A Snaplex that runs on Hadoop can execute pipelines natively in Spark.

The SnapLogic data management platform is known for its easy-to-use, self-service interface, made possible by our team of dedicated engineers (we’re hiring!). We work to apply the industry’s best practices so that our clients get the best possible end product — and testing is fundamental. Continue reading “Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform”

New Podcast Series: SnapTalk

We are pleased to announce our new podcast series called SnapTalk. The series will feature short, 10-15 min. episodes on topics relevant to big data, data management and app and data integration. Our host for the series is Ravi Dharnikota, SnapLogic’s head of enterprise architecture. Each episode features a special guest in conversation with Ravi, such as SnapLogic’s chief scientist, Greg Benson.

This project grew out of the great conversations we have at Snappy Hour. Eating lunch as a group at least a couple of times a week and our weekly happy hour (called Snappy Hour) are big parts of the SnapLogic culture. And, invariably, the conversations at these gatherings range from the lightweight, such as the latest episode of Game of Thrones, to the complex, such as the future of Spark and what makes streaming data streaming. This podcast series is intended to capture the essence of those ad hoc discussions, get people thinking, and hopefully inspire additional discussions.

The first episodes are posted now and cover topics such as Spark, streaming data, and Kafka. Stay tuned to this space for the next episode. The SnapTalk playlist is here and our new SoundCloud channel is here – I hope you’ll subscribe, and we welcome your feedback.

Tips and Tricks for Workday Integration with the Enterprise

This post illustrates two of our commonly encountered customer scenarios:

a) An example of complex XML processing, and
b) A real-world example of what HR On-Boarding/Off-boarding might look like with Workday data

Below is a screenshot of the pipeline and a detailed walkthrough of what it attempts to achieve.

SnapLogic HR on-boarding pipeline for Workday
This pipeline shows complex XML processing and HR on-boarding for Workday.

Let’s review this pipeline. Continue reading “Tips and Tricks for Workday Integration with the Enterprise”

Puzzle Pieces: Snaplex Names Explained

Welcome to Puzzle Pieces, a periodic series exploring the “Why?” of SnapLogic’s platform. To kick things off, let’s talk Snaplexes, which have sometimes proved puzzling. (Editor’s note: future installments of Puzzle Pieces will be rigorously scrubbed for alliterative excesses).

The SnapLogic Elastic Integration Platform is divided into two main parts: the Control Plane and the Data Plane. As a customer, you come into contact with the Control Plane through the SnapLogic web interface. Behind the scenes, the Control Plane also handles talking to the Data Plane and coordinating the flow of data in pipelines.

The pipelines actually run in the Data Plane. The container that handles running a particular pipeline is called a Snaplex. A Snaplex (or Plex) is a collection of computing resources – perhaps one virtual machine, perhaps an entire server rack. These are the Snaplex types you may come across:

Continue reading “Puzzle Pieces: Snaplex Names Explained”

Two-way SSL with SnapLogic’s REST Snap

There are lots of ways for a client to authenticate itself against a server, including basic authentication, form-based authentication, and OAuth.

In these cases, the client communicates with the server over HTTPS, and the server’s identity is confirmed by validating its public certificate. The server doesn’t care who the client is, just as long as the client has the correct credentials. Continue reading “Two-way SSL with SnapLogic’s REST Snap”
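For a sense of what the two-way case looks like from the client side, here is a minimal sketch using Python’s requests library, where the client presents its own certificate in addition to validating the server’s. The endpoint and certificate paths are placeholders:

```python
# Minimal sketch of a mutually authenticated (two-way) TLS request:
# the client presents its own certificate and key, and validates the
# server against a trusted CA bundle. Paths and URL are placeholders.
import requests

response = requests.get(
    "https://api.example.com/resource",        # placeholder endpoint
    cert=("client.crt", "client.key"),         # client's cert and key
    verify="ca-bundle.pem",                    # CA used to verify the server
)
print(response.status_code)
```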