Azure Data Platform: Reading and writing data to Azure Blob Storage and Azure Data Lake Store

By Prasad Kona

Organizations have been increasingly adopting cloud data and cloud analytics platforms like Microsoft Azure. In this first of a series of Azure Data Platform blog posts, I’ll get you on your way to making your adoption of cloud platforms and data integration easier.

In this post, I focus on ingesting data into the Azure Cloud Data Platform and demonstrate how to read and write data to Microsoft Azure Storage using SnapLogic.

For those who want to dive right in, my 4-minute step-by-step video “Building a simple pipeline to read and write data to Azure Blob storage” shows how to do what you want, without writing any code.

What is Azure Storage?

Azure Storage enables you to store terabytes of data to support small to big data use cases. It is highly scalable, highly available, and can handle millions of requests per second on average. Azure Blob Storage is one of the types of services provided by Azure Storage.

Azure provides two key types of storage for unstructured data: Azure Blob Storage and Azure Data Lake Store.

Azure Blob Storage

Azure Blob Storage stores unstructured object data. A blob can be any type of text or binary data, such as a document or media file. Blob storage is also referred to as object storage.

Azure Data Lake Store

Azure Data Lake Store provides what enterprises look for in storage today. It:

  • Provides additional enterprise-grade security features like encryption and uses Azure Active Directory for authentication and authorization.
  • Is compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem including Azure HDInsight.
  • Supports Azure HDInsight clusters, which can be provisioned and configured to directly access data stored in Data Lake Store.
  • Allows data stored in Data Lake Store to be easily analyzed using Hadoop analytic frameworks such as MapReduce, Spark, or Hive.

How do I move my data to the Azure Data Platform?

Let’s look at how you can read and write to Azure Data Platform using SnapLogic.

For SnapLogic Snaps that support Azure Accounts, you can choose either an Azure Storage Account or an Azure Data Lake Store account:

Azure Data Platform 1

Configuring the Azure Storage Account in SnapLogic, as shown below, uses the Azure storage account name and access key you get from the Azure Portal:

Azure Data Platform 2
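For comparison, the same storage account name and access key could also be used programmatically. Below is a minimal sketch using the Azure SDK for Python (the azure-storage-blob package); the account, container, and blob names are placeholders, not values from the screenshots:

# Sketch: reading and writing a blob with the Azure SDK for Python
# (azure-storage-blob). Account name, key, container, and blob name are
# placeholders -- substitute the values from your own Azure Portal.
from azure.storage.blob import BlobServiceClient

ACCOUNT_NAME = "mystorageaccount"                    # placeholder
ACCOUNT_KEY = "<access key from the Azure Portal>"   # placeholder

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)
blob = service.get_blob_client(container="demo-container", blob="customers.csv")

# Write (upload) a blob, then read (download) it back
blob.upload_blob(b"id,name\n1,Acme\n", overwrite=True)
print(blob.download_blob().readall().decode("utf-8"))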

Configuring the Azure Data Lake Store Account in SnapLogic, as shown below, uses the Azure Tenant ID, Access ID, and Secret Key that you get from the Azure Portal:

Azure Data Platform 3
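The same service-principal credentials can also be used outside SnapLogic. Here is a minimal sketch with the azure-datalake-store Python package; the tenant ID, client ID, secret, store name, and file path are all placeholders:

# Sketch: connecting to Azure Data Lake Store with a service principal using
# the azure-datalake-store package. All credential values are placeholders.
from azure.datalake.store import core, lib

token = lib.auth(
    tenant_id="<tenant id>",
    client_id="<access id / application id>",
    client_secret="<secret key>",
)
adls = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Write a file, then read it back
with adls.open("/demo/customers.csv", "wb") as f:
    f.write(b"id,name\n1,Acme\n")
with adls.open("/demo/customers.csv", "rb") as f:
    print(f.read().decode("utf-8"))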

Put together, you’ve got a simple pipeline that illustrates how to read and write to Azure Blob Storage:

Azure Data Platform 4

Here’s the step-by-step video again: Building a simple pipeline to read and write data to Azure Blob storage

In my next blog post, I will describe the approaches to move data from your on-prem databases to Azure SQL Database.

Prasad Kona is an Enterprise Architect at SnapLogic. You can follow him on LinkedIn or Twitter @prasadkona.

 

Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform

The SnapLogic Elastic Integration Platform connects your enterprise data, applications, and APIs by building drag-and-drop data pipelines. Each pipeline is made up of Snaps, which are intelligent connectors that users drag onto a canvas and “snap” together like puzzle pieces.

A SnapLogic pipeline being built and configured

These pipelines are executed on a Snaplex, an application that runs on a multitude of platforms: on a customer’s infrastructure, on the SnapLogic cloud, and most recently on Hadoop. A Snaplex that runs on Hadoop can execute pipelines natively in Spark.

The SnapLogic platform is known for its easy-to-use, self-service interface, made possible by our team of dedicated engineers (we’re hiring!). We work to apply the industry’s best practices so that our clients get the best possible end product — and testing is fundamental. Continue reading “Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform”

Tips and Tricks for Workday Integration with the Enterprise

This post illustrates two of our commonly encountered customer scenarios:

a) An example of complex XML processing, and
b) A real-world example of what HR On-Boarding/Off-boarding might look like with Workday data

Below is a screenshot of the pipeline and a detailed walkthrough of what it attempts to achieve.

SnapLogic HR on-boarding pipeline for Workday
This pipeline shows complex XML processing and HR on-boarding for Workday.

Let’s review this pipeline. Continue reading “Tips and Tricks for Workday Integration with the Enterprise”

Two-way SSL with SnapLogic’s REST Snap

SnapLogic_word_cloud

There are lots of ways for a client to authenticate itself against a server, including basic authentication, form-based authentication, and OAuth.

In these cases, the client communicates with the server over HTTPS, and the server’s identity is confirmed by validating its public certificate. The server doesn’t care who the client is, just as long as they have the correct credentials. Continue reading “Two-way SSL with SnapLogic’s REST Snap”

Connecting SaaS providers with SnapLogic’s OAuth-enabled REST Snaps

OAuth is an open standard for authorization. OAuth provides client applications a ‘secure delegated access’ to server resources on behalf of a resource owner. It specifies a process for resource owners to authorize third-party access to their server resources without sharing their credentials.

Wikipedia

SnapLogic has many Snaps that utilize OAuth, including Box, Concur, Eloqua, LinkedIn, Facebook, and Google Analytics. We also support it in a generic way with our REST Snaps that can be used to connect with providers we have yet to build a Snap for, so it’s useful to understand what OAuth is and how it works.

While it is not necessary to have any prior knowledge of OAuth to continue reading, if you wish to understand the OAuth standard at a deeper level, oauth.net provides a good starting point.

Let’s dive in with a common use case: you (the user) wish to use SnapLogic (the app) to connect to your Google Drive (the server). In this example, your Google Account is the Owner, the Server is Google’s Identity Platform, and the Client is SnapLogic’s REST Snap.

We will use SnapLogic’s REST Snaps to send data to and receive data from Google’s Drive API, but the Snap needs to be configured first. Since we need to access content from Google, the Snap needs a way of proving to Google that it has been authorized by the user to interact with their Google Drive, while also allowing the user to revoke that access directly from their account (Google provides an “Apps connected to your account” settings page where users can easily review and remove apps).

Our first step is to log into the Google Developers Console and create a new Project:

Create SnapLogic Google Drive Project

Once the Project has been created, we must enable Drive API integration:

Enable Drive API integration

Next, we customize the OAuth consent screen by providing a Product name and, optionally, a logo:

Provide product name and logo to the OAuth consent screen

Finally, we configure a new “OAuth 2.0 client ID” credential to identify our Snap to Google when we ask the user for authorization. We use “https://elastic.snaplogic.com/api/1/rest/admin/oauth2callback/rest” as the authorized redirect URI.

Create OAuth 2.0 Client ID for Web Application App

Take note of the generated client ID and secret:

Client ID and Client Secret

We can now create a pipeline, add the REST Get Snap, and configure it to request authorization from the user to list their Google Drive files:

Create new pipeline with REST Get Snap, add new OAuth2 account

When creating the REST OAuth2 Account, we use the client ID and secret provided earlier, and configure the remaining fields with the values specified by the Google OAuth for Web Server Apps documentation:

Configure OAuth2 account

The “Header authenticated” checkbox instructs the REST Snap to include an “Authorization” HTTP header with every request, whose value is the soon-to-be-acquired access token sent as a Bearer token. Alternatively, you may leave this setting disabled and instead include an “access_token” query parameter in each request, whose value is the expression “$account.access_token”, which becomes available after a successful authorization.
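In raw HTTP terms, the two options amount to something like the following sketch (Python with the requests library; the access token is a placeholder):

# Sketch: the two ways the Snap can present the access token to the API.
import requests

access_token = "<acquired access token>"   # placeholder
url = "https://www.googleapis.com/drive/v2/files"

# Option 1: "Header authenticated" -- send an Authorization header with a Bearer token
r1 = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})

# Option 2: pass the token as an access_token query parameter instead
r2 = requests.get(url, params={"access_token": access_token})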

The “redirect_uri” parameter must be provided in both the auth and token endpoint configs, and the value must match the authorized redirect URI configured for the OAuth 2.0 client ID credential created previously. The “response_type” authentication parameter must have a value of “code” (defined by the OAuth specification), and the “scope” parameter defines the Google Drive capabilities being requested (you may wish to modify the scope to what is appropriate for your use case).

The Google-specific “access_type” and “approval_prompt” parameters are also included in the auth endpoint config. An “access_type” value of “offline” requests Google to return a refresh token after the user’s first successful authorization. This allows the Snap to refresh access to the user’s Google Drive without the user being present. An “approval_prompt” value of “auto” instructs Google to provide the refresh token only on the very first occasion the user gives offline consent, while a value of “force” will prompt the user to re-consent to offline access and acquire a new refresh token.
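To make these settings concrete, here is a rough sketch of the kind of authorization URL assembled from them. The parameter names follow the OAuth 2.0 spec and the Google documentation referenced above; the client ID and scope values are placeholders, not values taken from the screenshots:

# Sketch: the style of authorization URL built from the account settings above.
# Client ID and scope are placeholders; adjust the scope to your use case.
from urllib.parse import urlencode

auth_endpoint = "https://accounts.google.com/o/oauth2/auth"
params = {
    "client_id": "<client id from the Developers Console>",
    "redirect_uri": "https://elastic.snaplogic.com/api/1/rest/admin/oauth2callback/rest",
    "response_type": "code",        # required by the OAuth specification
    "scope": "https://www.googleapis.com/auth/drive",   # placeholder Drive scope
    "access_type": "offline",       # Google-specific: request a refresh token
    "approval_prompt": "auto",      # only return the refresh token on first consent
}
print(auth_endpoint + "?" + urlencode(params))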

Clicking the “Authorize” button will start the OAuth Dance. Depending on whether the User is already logged into their Google Account, or is logged into multiple Google Accounts, they may need to log in or choose which account to use. Either way, as long as the user has not already authorized the app, the user will eventually be prompted to allow the REST Snap to access their Google Drive data:

Snap Authorization consent window

These permissions correspond to the “scopes” that were defined previously. You’ll notice that this is a google.com website and the URL address (https://accounts.google.com/o/oauth2/auth) starts with the same value as the one entered for the “OAuth2 Endpoint” field above. The Snap has also appended some of the other fields, plus some extra security parameters have been added by the SnapLogic Platform.

Assuming the User gives consent by clicking the Allow button, the next couple of steps happen behind the scenes within the SnapLogic Platform and are mostly concerned with checking that neither SnapLogic nor Google is being tricked by the other party.

Google will return an Authorization Code to the SnapLogic Platform, which will immediately send a request to the “OAuth2 Token” URL (also entered above) with the authorization code, client ID, client secret, and redirect URI parameters. On successful receipt of that request, Google will respond with an access token, an access expiration timestamp, plus a refresh token.
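For reference, that exchange of the authorization code for tokens looks roughly like the sketch below; the token endpoint URL and credentials are placeholders based on Google’s documentation rather than values copied from the account settings:

# Sketch: exchanging the authorization code for tokens, as the SnapLogic
# Platform does behind the scenes. Angle-bracketed values are placeholders.
import requests

resp = requests.post(
    "https://accounts.google.com/o/oauth2/token",   # assumed token endpoint
    data={
        "grant_type": "authorization_code",
        "code": "<authorization code returned by Google>",
        "client_id": "<client id>",
        "client_secret": "<client secret>",
        "redirect_uri": "https://elastic.snaplogic.com/api/1/rest/admin/oauth2callback/rest",
    },
)
tokens = resp.json()
# Typical keys: access_token, expires_in, and (on first offline consent) refresh_token
print(tokens)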

If all goes well, the User will be shown the SnapLogic Designer view with the REST OAuth Account form again visible, except now with values for the access and refresh tokens:

OAuth2 Account with acquired access and refresh tokens

The “Refresh” button is now also visible (due to a refresh token having been acquired), allowing the user to manually acquire a new access token when the existing one expires. The user may also choose to enable the “Auto-refresh token” setting to permit the SnapLogic Platform to automatically refresh any expiring access tokens, enabling a true offline mode.

Automatically refresh access tokens by enabling the Auto-refresh token setting
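Under the hood, a refresh follows the standard OAuth 2.0 refresh-token grant. A minimal sketch, again with a placeholder endpoint and credentials:

# Sketch: using the refresh token to obtain a new access token.
import requests

resp = requests.post(
    "https://accounts.google.com/o/oauth2/token",   # assumed token endpoint
    data={
        "grant_type": "refresh_token",
        "refresh_token": "<refresh token acquired earlier>",
        "client_id": "<client id>",
        "client_secret": "<client secret>",
    },
)
print(resp.json())   # contains a fresh access_token and its expires_in value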

We can click the “Apply” button to associate the authorized OAuth2 Account with the REST Snap. The user can now begin querying the Google Drive API to list their Google Drive files.

The Google Drive API Reference details the full capabilities our integration can interact with. For example, we could list the files whose title contains “Target Customers”. To do this, the “Service URL” is updated to https://www.googleapis.com/drive/v2/files, and we add a “q” query parameter with the search value “title contains 'Target Customers'”:

REST Get search and list GDrive files
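For comparison, the same search issued directly against the Drive API outside SnapLogic might look like this sketch, where the access token is a placeholder:

# Sketch: the equivalent Drive v2 files.list call with the "q" search parameter.
import requests

resp = requests.get(
    "https://www.googleapis.com/drive/v2/files",
    headers={"Authorization": "Bearer <access token>"},
    params={"q": "title contains 'Target Customers'"},
)
for item in resp.json().get("items", []):
    print(item["title"])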

Save and close the settings dialog to validate the pipeline and preview the results:

REST Get preview GDrive API results

Et voilà, we have successfully completed an OAuth 2.0 Authorization Dance and used the acquired access token to connect with Google Drive! The full power of the Google Drive API is now accessible within SnapLogic.

SnapLogic Tips and Tricks: XML Generator Snap Overview (Part 4)

In the earlier parts of this SnapLogic tips and tricks series, we demonstrated how the XML Generator Snap generates XML data, either directly in the Snap using an XML template or an XSD, or dynamically using data from the input view.

In this final part of the series, we will cover how the XML Generator Snap creates one serialized XML string for every input document.

Example 4: POSTing the Generated Content
In this last example, we will POST the generated content to a REST endpoint using the REST POST Snap. In the screenshot below, the POST Snap has its entity set to $xml, so it takes the XML content generated by the upstream XML Generator Snap and POSTs it as the request body to the endpoint. You might also want to set the Content-Type and Accept headers as shown below.

xml-gen-6

The POST will be executed for every document on the input view. Since there are a total of two documents, we will execute two POST requests.
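Outside SnapLogic, the equivalent behavior, one POST per input document with explicit headers, might look like this sketch; the endpoint URL and XML payloads are placeholders:

# Sketch: POSTing each generated XML document to a REST endpoint with
# Content-Type and Accept headers. Endpoint and payloads are placeholders.
import requests

generated_docs = [
    "<order><id>1</id></order>",
    "<order><id>2</id></order>",
]

for xml in generated_docs:                  # two documents -> two POST requests
    resp = requests.post(
        "https://example.com/api/orders",   # placeholder endpoint
        data=xml.encode("utf-8"),
        headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    )
    print(resp.status_code)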

Series Summary
In summary, the XML Generator Snap enables you to generate XML data, either directly in the Snap using the XML template or dynamically by using data from the input view. It lets you generate the XML by providing an XSD, and it can validate the created XML against that XSD at runtime.


SnapLogic Tips and Tricks: XML Generator Snap Overview (Part 3)

In part two of this series, we covered how to map to the JSON schema upstream. In this post, we will cover how to validate the generated XML against the XSD.

Example 3: Writing the Generated Content to File
Sometimes you want to write the generated XML to a file. For that use case, we provide the DocumentToBinary Snap, which takes the content and converts it to a binary data object that can then be written to a file, e.g., using a File Writer Snap.

xml-gen-5

Above, we map the XML to the content field of the DocumentToBinary Snap and set its Encode or Decode option to NONE.

This then outputs one binary document for each order, which we can write to a directory. (Be careful here: you would want to use the append option, since you could be writing two files to the same destination (append will be supported soon for SnapLogic’s file system), or you can use an expression such as Date.now() to write an individual file for each incoming binary data object.)
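Expressed as a plain Python sketch, writing each document to its own timestamped file (analogous to using Date.now() in the pipeline; the XML payloads are placeholders):

# Sketch: writing each generated XML document to its own file, using a
# millisecond timestamp in the file name so the files do not overwrite each other.
import time

generated_docs = [
    "<order><id>1</id></order>",
    "<order><id>2</id></order>",
]

for xml in generated_docs:
    filename = f"order-{int(time.time() * 1000)}.xml"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(xml)
    time.sleep(0.002)   # keep the timestamps unique in this simple sketch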

In our final part of this series, we will demonstrate how the XML Generator Snap creates one serialized XML string for every input document.
