How to set up Stream processing for Twitter using Snaps

By Sharath Punreddy

As you probably know, SnapLogic data pipelines use streams: continuous flows of data from a source to a target. By processing streaming data and extracting valuable insights from it, a user or system can make decisions more quickly than with traditional batch processing. Streaming data analytics now provide near real-time, if not real-time, results.

In this data-driven age, the timing of data analytics and insights has become a key differentiator. In some cases, data becomes less relevant, if not obsolete, as it ages. Analyzing data as it flows in is crucial for use cases such as sentiment analysis for new product launches in retail, fraudulent transaction detection in the financial industry, preventing machine failures in manufacturing, processing sensor data for weather forecasts, detecting disease outbreaks in healthcare, and so on. Stream processing enables analysis in near real-time, if not real-time, allowing the user or system to draw insights from the very latest data. Along with traditional APIs, companies are providing streaming APIs that render data in real-time as it is being generated. Unlike traditional REST/SOAP APIs, a streaming API establishes a connection to the server and continuously streams data for the desired amount of time; once that time has elapsed, the connection is terminated. Apache Spark, with Apache Kafka as the streaming platform, has become a de facto industry standard for stream processing.
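
To make the streaming model concrete, here is a minimal Python sketch (an illustration, not SnapLogic or Twitter code) of how a client typically consumes a streaming API: open a long-lived connection, read records as they arrive, and terminate once the desired time has elapsed. The endpoint URL and duration are placeholders.

```python
import time

import requests

# Placeholder endpoint -- substitute a real streaming API URL and credentials.
STREAM_URL = "https://stream.example.com/tweets"
DURATION_SECONDS = 60  # how long to keep the connection open


def consume_stream(url, duration):
    """Read newline-delimited records from a streaming API until time runs out."""
    deadline = time.monotonic() + duration
    # stream=True keeps the connection open instead of buffering the whole body.
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if time.monotonic() >= deadline:
                break  # desired time has elapsed; drop the connection
            if line:  # skip keep-alive blank lines
                print(line.decode("utf-8"))


if __name__ == "__main__":
    consume_stream(STREAM_URL, DURATION_SECONDS)
```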

In this blog post, I’ll walk through the steps for building a simple pipeline to retrieve and process Tweets. You can also jump to the how-to video here.

Twitter Streams
Twitter has become a primary data source for sentiment analysis. The Twitter Streaming APIs provide access to global Tweets in real-time, as people are tweeting. SnapLogic’s “Twitter Streaming Query” Snap enables users to retrieve Tweets based on a keyword in the text of the Tweet. The Tweets can then be processed using Snaps such as the Filter, Mapper, or Aggregate Snap for filtering, transforming, and aggregating, respectively. SnapLogic also provides a “Spark Script” Snap that can execute an existing Python program on incoming Tweets. Tweets can also be routed to different destinations based on a condition, or copied to multiple destinations (RDBMS, HDFS, S3, etc.) for storage and further analysis.
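
To illustrate the kind of per-Tweet Python logic you might run via the “Spark Script” Snap, here is a toy transformation that tags each Tweet with a naive sentiment label. This is a standalone, hypothetical sketch; the Snap’s actual script interface may differ.

```python
# Illustrative only: a toy per-Tweet transformation. The "Spark Script"
# Snap's actual script interface may differ from plain functions like this.
POSITIVE = {"love", "great", "awesome", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "broken"}


def tag_sentiment(tweet):
    """Attach a naive sentiment label based on word matches in the Tweet text."""
    words = set(tweet.get("text", "").lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        tweet["sentiment"] = "positive"
    elif score < 0:
        tweet["sentiment"] = "negative"
    else:
        tweet["sentiment"] = "neutral"
    return tweet


print(tag_sentiment({"text": "I love this awesome product", "lang": "en"}))
```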

Getting Started
Below is a simple pipeline for retrieving Tweets, filtering them based on the language, and publishing to a Kafka cluster.

1. Using the Snaps tab in the left frame, search for the “Twitter Streaming Query” Snap. Drag and drop the Snap onto the Designer canvas (the white space on the right).

a. Click on the Snap to open the Snap Settings form.

Note: The “Twitter Streaming Query” Snap requires a Twitter account, which can be created through Designer while building the pipeline, or through Manager prior to building the pipeline.

b. Click on the “Account” tab.

c. Click on the “Add Account” button.

Note: Twitter provides a couple of ways to authenticate an application to a Twitter account. “Twitter Dynamic OAuth1” is for application-only authentication, while “Twitter OAuth1” is for user authentication, where the user grants the application access by signing into Twitter. In this case, we are using the user authentication mechanism.

d. Choose an appropriate option based on the accessibility of the Account:
i. For Location of the Account: “shared” makes the account accessible to the entire Organization, “projects/shared” makes it accessible to all users in the project, and “project/” makes it accessible only to the user.
ii. For Account Type: Choose the “Twitter OAuth1” option to grant access to the Twitter account of the individual user.
iii. Click “OK.”

e. Enter meaningful text for the “Label,” such as [Twitter_of_], and click the “Authorize” button.

Note: If the user is logged into Twitter with an active session, they will be taken to the “Authorize” page of the Twitter website to grant the application access. If the user is not logged in or does not have an active session, they will be taken to the Twitter sign-in page to sign in first.

f. Click on the “Authorize app” button.

Note: The “OAuth token” and “OAuth token secret” values shown above are not active and are for example only.

g. At this point, the “OAuth token” and the “OAuth token secret” should have been populated. Click “Apply.”
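
The resulting token pair can also be used outside of SnapLogic. As a minimal sketch (assuming the requests-oauthlib library, with all four credential values as placeholders), this is how an OAuth1-signed request to Twitter’s REST API looks in Python:

```python
import requests
from requests_oauthlib import OAuth1

# All four values are placeholders: the consumer key/secret come from your
# Twitter app, and the token/secret from the authorization step above.
auth = OAuth1(
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "OAUTH_TOKEN",
    "OAUTH_TOKEN_SECRET",
)

# Verify the credentials with a simple signed request.
resp = requests.get(
    "https://api.twitter.com/1.1/account/verify_credentials.json",
    auth=auth,
)
print(resp.status_code)
```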

2. Once the account is successfully set up, click on the “Settings” tab to provide the search keyword and time duration.

Note: The Twitter Snap retrieves Tweets for a designated time duration. For continuous retrieval, provide a value of “0” for “Timeout in seconds.”

a. Enter a keyword and a time duration in seconds.


3. Save by clicking the disk icon at the top right. This triggers validation, and the icon should become a check mark if validation is successful.


4. Click on the list icon to preview the data.

5. This confirms that the “Twitter Streaming Query” Snap has successfully established a connection to the Twitter account and is fetching Tweets.

6. The “Filter” Snap is used to filter Tweets. Search for “Filter” using the Snaps tab in the left frame, then drag and drop the “Filter” Snap onto the canvas.

a. Click on the “Filter” Snap to open the Settings form.

b. Provide a meaningful name for the “Label,” such as “Filter By Language,” and a filter condition for the “Filter Expression.” You can use the drop-down to choose the filter attribute.
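
For example, assuming the Tweet document exposes a top-level lang field (check the field names in your data preview), a filter expression along these lines would keep only English-language Tweets:

```
$lang == "en"
```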

7. Click on the disk icon to save, which again triggers validation. You’ve now successfully configured the “Filter” Snap.

8. Search for the “Confluent Kafka Producer” Snap using the Snaps tab in the left frame. Drag and drop the Snap onto the canvas.

Note: Confluent is an Apache Kafka distribution geared toward enterprises.

a. The “Confluent Kafka Producer” Snap requires an account to connect to the Kafka cluster. Choose appropriate values based on the location and type of the account.

b. Provide meaningful text for the “Label” and enter the bootstrap server(s). For multiple bootstrap servers, separate the host:port pairs with commas.

c. The “Schema registry URL” is optional, but it is required if Kafka must parse messages based on a schema.

d. Other optional Kafka properties can be passed to Kafka using the “Advanced Kafka Properties.” Click on “Validate.”

e. If validation is successful, you should see the message “Account validation successful” at the top. Click “Apply.”

9. Once the account is set up and chosen, click on the “Settings” tab to provide the Kafka topic and message.



a. You can choose from the list of available topics by clicking the bubble icon next to the “Topic” field. Leave the other fields at their defaults. The other required field is “Message value”; enter “$” to send the entire Tweet along with its metadata. Save by clicking the disk icon.
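
For comparison, here is a rough sketch of what the Snap does under the hood, written with Confluent’s Python client (confluent-kafka). The broker addresses and topic name are placeholders:

```python
import json

from confluent_kafka import Producer

# Placeholder brokers and topic -- match these to your cluster.
producer = Producer({"bootstrap.servers": "broker1:9092,broker2:9092"})

tweet = {"text": "Sample tweet", "lang": "en"}  # stands in for "$", the whole document


def on_delivery(err, msg):
    """Report the delivery result for each produced message."""
    if err is not None:
        print("Delivery failed: %s" % err)
    else:
        print("Delivered to %s [partition %d]" % (msg.topic(), msg.partition()))


producer.produce("tweets", value=json.dumps(tweet), callback=on_delivery)
producer.flush()  # block until outstanding messages are delivered
```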

10. The above is a fully validated pipeline that fetches Tweets and loads them into Kafka.

11. At this point, the pipeline is all set to receive Tweets and push them into the Kafka topic. Run the pipeline by clicking the play button in the top-right corner. View the progress by clicking the display button.

As you can see, the pipeline can be built in less than 15 minutes without requiring any deep technical knowledge. This tutorial and video provide a basic example of what can be achieved with these Snaps. There are several other Snaps that can act on the data: filtering, copying, aggregating, triggering events, sending out emails, and more. SnapLogic takes pride in bringing complex technology to citizen integrators. I hope you found this useful!

Sharath Punreddy is Enterprise Solution Architect at SnapLogic. Follow him on Twitter @srpunreddy.

Gartner Names SnapLogic a Leader in the 2017 Enterprise iPaaS Magic Quadrant

For the second year in a row, SnapLogic has been named a Leader in Gartner’s Magic Quadrant for Enterprise Integration Platform as a Service (iPaaS).

Gartner evaluated iPaaS vendors on “completeness of vision” and “ability to execute.” Those named to the Leaders quadrant, as Gartner noted in the report, “have a solid reputation, with notable market presence and a proven track record in enabling … their platforms are well-proven and functionally rich, with regular releases to rapidly address this fast-evolving market.”

In a press release issued today, SnapLogic CTO James Markarian said of the recognition: “Since our inception, we have been laser-focused on delivering a modern enterprise integration platform that is specifically designed to manage the data and application integration demands of today’s hybrid enterprise technology environments. Our Enterprise Integration Cloud eliminates the complexity of legacy integrations, providing a platform that supports fast and easy self-service integration.”

The Enterprise iPaaS Magic Quadrant is embedded below. We’d encourage you to download the complete report as it provides a comprehensive review of all the vendors and the growing market.

Gartner 2017 iPaaS MQ

Thanks to all of SnapLogic’s customers, partners, and employees for the ongoing support and for making SnapLogic’s Enterprise Integration Cloud a leading self-service integration platform connecting applications, data, and things.

Podcast: James Markarian and David Linthicum on New Approaches to Cloud Integration

SnapLogic CTO James Markarian recently joined cloud expert David Linthicum as a guest on the Doppler Cloud Podcast. The two discussed the mass movement to the cloud and how this is changing how companies approach both application and data integration.

In this 20-minute podcast, “Data Integration from Different Perspectives,” the pair discuss how to navigate the new realities of hybrid app integration, data and analytics moving to the cloud, user demand for self-service technologies, the emerging impact of AI and ML, and more.

You can listen to the full podcast here, and below:

 

Workday integrations “on sale”: How to save up to 90%

By Nada daVeiga

Macy’s. NastyGal. The Limited. BCBG. No, that’s not a shopping itinerary. These are just a few of the major retailers that will soon be closing hundreds of stores. Some are declaring bankruptcy. Why? In conjunction with the massive shift to online shopping, retailers have trained consumers to shop only when merchandise is “on sale.”

Why pay retail for Workday integration platforms?

It’s still a bit of a mystery, then, why many enterprises will pay “full retail” to integrate their applications. Let’s take Workday applications, for example. Workday HCM and Workday Financial Management are rapidly gaining traction as enterprise systems of record; thousands of companies are choosing these apps to help drive digital transformation, and to move with more speed and agility.

However, enterprises are often challenged to implement Workday quickly and cost-effectively. Associated migration, integration, and implementation services typically cost up to two and a half times the Workday software cost* due to:

  • Customization: Most enterprises require at least 12 weeks to tailor core Workday software-as-a-service (SaaS) applications to their business processes and needs; integration with other enterprise applications is a separate, additional implementation phase.
  • Complexity of Workday integration offerings: In addition to third-party products such as Informatica PowerCenter, multiple integration solutions are available from Workday. Depending on requirements, enterprises need to work with one or more Workday integration tools:
    • Workday Integration Cloud Connect provides pre-built integrations to common applications and service providers that extend Workday’s functionality. These include Human Capital Management, Payroll, Payroll interface, Financial Management and Spend Management.

      While Workday Cloud Connect has “pre-built” integrations, mapping, changing and customizing them is still labor- and consulting-intensive.

    • Workday Enterprise Interface Builders (EIB) enable simple integrations with Workday, e.g., importing data into Workday from a Microsoft Excel spreadsheet for tasks such as hiring a group of employees or requesting mass compensation changes. Users can also create outbound exports of data from Workday to an Excel spreadsheet.

      However, this feature does not have native integration with Microsoft Active Directory, so employee on- and off-boarding can only be accomplished via manually intensive EIBs.

    • Workday Studio is a desktop-based integrated development environment used to create complex hosted integrations, including Workday Studio Custom Integrations, the most advanced.

      The Workday Studio development environment, while powerful, is complex, consulting-heavy and costly to use.

    • Workday Web Services gives customers a programmatic public API for On-Demand Workday Business Management Services.
  • Reliance on external resources: Across industries and geographies, Workday consultants and programmers are scarce and expensive.
  • Time-intensive manual integrations: Many Workday integrations are built manually and must be individually maintained, incurring “technical debt” that robs resources from future IT initiatives.

SnapLogic can reduce the cost of Workday integrations by 90%

The SnapLogic Enterprise Integration Cloud uniquely enhances Workday HCM and Workday Financials with a built-for-the-cloud, easy-to-use solution optimized for both IT and business users. The building blocks of the Enterprise Integration Cloud, SnapLogic Workday Snaps, are pre-built connectors that abstract the entire Workday API visually, allowing data and processes to be quickly integrated using pre-built patterns. With SnapLogic’s Enterprise Integration Cloud, companies can use a visual tool to automate HR and financial processes between Workday applications, cloud point solutions and legacy systems.

The SnapLogic Enterprise Integration Cloud greatly increases the flexibility of HR and Financial processes, and eases the pain of adding or retiring applications, thus enabling teams to focus on more strategic business priorities. These benefits allow both IT users and business “citizen integrators” to execute Workday integrations, the latter without reliance on IT.

Attention shoppers: SnapLogic delivers

  • Faster time to value: Workday integrations can be done in days, not months.

  • Dramatically lower cost: Using the SnapLogic Enterprise Integration Cloud can reduce the time and cost of Workday integrations by up to 90%.*
  • No programming or maintenance required: SnapLogic’s visual orientation doesn’t require specialized consultants or Java/XML programmers to build or maintain labor-intensive, manual integrations.

Find out how SnapLogic can drive down the cost of your organization’s Workday integrations, while enabling new agility. Download the new white paper, “Cause and effect: How rapid Workday integration drives digital transformation.”

Nada daVeiga is VP Worldwide Pre-Sales, Customer Success, and Professional Services at SnapLogic. Follow her on Twitter @nrdaveiga.

 

*Savings calculated on the average cost of Workday integrations: service fees of 2.5x the Workday license fee. Source: Workday Analyst Day, 2016.

We Left Informatica. Now You Can, Too | Webinar

 

You can run a modern company on a mainframe. You can also ride a horse to the office. But would it really make sense to do this? Join us on Wednesday, March 22 for a discussion with Informatica’s former CEO Gaurav Dhillon and CTO James Markarian about reinventing data integration for the modern enterprise.

Infa Webinar Banner

Does your business still run on Informatica? It might make more sense to switch to a more modern platform. Join the conversation, hosted by industry analyst David Linthicum, as our distinguished panel discusses the key business reasons and technology factors driving modern enterprises to embrace data integration built for the cloud.

They will also cover:

  • The evolution of data integration – from the pre-internet, mainframe days of Informatica – to today’s modern cloud solutions
  • How they have re-invented application and data integration in the cloud
  • The changing role of IT – from “helicopter” to enabler
  • The cost to modern enterprises of inaction
  • Why sticking to the status quo is not an option

Register for this exclusive webinar here and be sure to join the conversation on Wednesday at 11am PT/ 2pm ET.

VIDEO: SnapLogic Discusses Big Data on #theCUBE from Strata+Hadoop World San Jose

It’s Big Data Week here in Silicon Valley with data experts from around the globe convening at Strata+Hadoop World San Jose for a packed week of keynotes, education, networking and more - and SnapLogic was front-and-center for all the action.

SnapLogic stopped by theCUBE, the popular video-interview show that live-streams from top tech events, and joined hosts Jeff Frick and George Gilbert for a spirited and wide-ranging discussion of all things Big Data.

First up was SnapLogic CEO Gaurav Dhillon, who discussed SnapLogic’s record-growth year in 2016, the acceleration of Big Data moving to the cloud, SnapLogic’s strong momentum working with AWS Redshift and Microsoft Azure platforms, the emerging applications and benefits of ML and AI, customers increasingly ditching legacy technology in favor of modern, cloud-first, self-service solutions, and more. You can watch Gaurav’s full video below, and here:

Next up was SnapLogic Chief Enterprise Architect Ravi Dharnikota, together with our customer, Katharine Matsumoto, Data Scientist at eero. A fast-growing Silicon Valley startup, eero makes a smart wireless networking system that intelligently routes data traffic on your wireless network in a way that reduces buffering and gets rid of dead zones in your home. Katharine leads a small data and analytics team and discussed how, with SnapLogic’s self-service cloud integration platform, she’s able to easily connect a myriad of ever-growing apps and systems and make important data accessible to as many as 15 different line-of-business teams, thereby empowering business users and enabling faster business outcomes. The pair also discussed ML and IoT integration, which is helping eero consistently deliver an increasingly smart and powerful product to customers. You can watch Ravi and Katharine’s full video below, and here:

 

Azure Data Platform: Reading and writing data to Azure Blob Storage and Azure Data Lake Store

By Prasad Kona

Organizations are increasingly moving toward and adopting cloud data and cloud analytics platforms like Microsoft Azure. In this first of a series of Azure Data Platform blog posts, I’ll get you on your way to making your adoption of cloud platforms and data integration easier.

In this post, I focus on ingesting data into the Azure Cloud Data Platform and demonstrate how to read and write data to Microsoft Azure Storage using SnapLogic.

For those who want to dive right in, my 4-minute step-by-step video “Building a simple pipeline to read and write data to Azure Blob storage” shows how to do what you want, without writing any code.

What is Azure Storage?

Azure Storage enables you to store terabytes of data to support small to big data use cases. It is highly scalable, highly available, and can handle millions of requests per second on average. Azure Blob Storage is one of the types of services provided by Azure Storage.

Azure provides two key types of storage for unstructured data: Azure Blob Storage and Azure Data Lake Store.

Azure Blob Storage

Azure Blob Storage stores unstructured object data. A blob can be any type of text or binary data, such as a document or media file. Blob storage is also referred to as object storage.

Azure Data Lake Store

Azure Data Lake Store provides what enterprises look for in storage today. It:

  • Provides additional enterprise-grade security features like encryption and uses Azure Active Directory for authentication and authorization.
  • Is compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem including Azure HDInsight.
  • Works with Azure HDInsight clusters, which can be provisioned and configured to directly access data stored in Data Lake Store.
  • Allows data stored in Data Lake Store to be easily analyzed using Hadoop analytic frameworks such as MapReduce, Spark, or Hive.

How do I move my data to the Azure Data Platform?

Let’s look at how you can read and write to Azure Data Platform using SnapLogic.

For SnapLogic Snaps that support Azure accounts, there is an option to choose either an Azure Storage Account or an Azure Data Lake Store account:

Azure Data Platform 1

Configuring the Azure Storage Account in SnapLogic can be done as shown below using the Azure storage account name and access key you get from the Azure Portal:

Azure Data Platform 2
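
The same storage account name and access key also work programmatically. Below is a minimal sketch using the current azure-storage-blob Python SDK (not the Snap’s internals); the account, container, and blob names are placeholders, and the container is assumed to exist:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: use the storage account name and access key from the Azure Portal.
ACCOUNT_NAME = "mystorageaccount"
ACCOUNT_KEY = "REPLACE_WITH_ACCESS_KEY"

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)

# Write a small blob into an existing container, then read it back.
blob = service.get_blob_client(container="demo", blob="hello.txt")
blob.upload_blob(b"hello from Azure Blob Storage", overwrite=True)
print(blob.download_blob().readall())
```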

Configuring the Azure Data Lake Store account in SnapLogic, as shown below, uses the Azure Tenant ID, Access ID, and Secret Key that you get from the Azure Portal:

Azure Data Platform 3
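
Credentials of this kind (Tenant ID, Access/client ID, Secret Key) are standard Azure service-principal credentials, which can also be used programmatically. The sketch below uses the current Python SDK for Data Lake Storage Gen2; the original Data Lake Store (Gen1) covered in this post had its own SDK, azure-datalake-store. All names are placeholders:

```python
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: the Tenant ID, Access (client) ID, and Secret Key from the Azure Portal.
credential = ClientSecretCredential(
    tenant_id="TENANT_ID",
    client_id="ACCESS_ID",
    client_secret="SECRET_KEY",
)

service = DataLakeServiceClient(
    account_url="https://myadlsaccount.dfs.core.windows.net",
    credential=credential,
)

# Write a small file into an existing filesystem (container).
fs = service.get_file_system_client("demo")
file_client = fs.get_file_client("example.txt")
file_client.upload_data(b"hello data lake", overwrite=True)
```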

Put together, you’ve got a simple pipeline that illustrates how to read and write to Azure Blob Storage:

Azure Data Platform 4

Here’s the step-by-step video again: Building a simple pipeline to read and write data to Azure Blob storage

In my next blog post, I will describe the approaches to move data from your on-prem databases to Azure SQL Database.

Prasad Kona is an Enterprise Architect at SnapLogic. You can follow him on LinkedIn or Twitter @prasadkona.