How to set up Stream processing for Twitter using Snaps

By Sharath Punreddy

As you probably know, SnapLogic data pipelines use Streams, a continuous flow of data from a source to a target. By processing and extracting valuable insights out of Streaming data, a user or system can make decisions more quickly than with traditional batch processing. Streaming data now makes near real-time, if not real-time, analytics possible.

In this data-driven age, the timing of data analytics and insights has become a key differentiator. In some cases, the data becomes less relevant - if not obsolete - as it ages. Analyzing the data as it flows in is crucial for use cases such as sentiment analysis for new product launches in retail, fraudulent transaction detection in the financial industry, preventing machine failures in manufacturing, sensor data processing for weather forecasts, disease outbreak tracking in healthcare, etc. Stream processing enables processing in near real-time, if not real-time, allowing the user or system to draw insights from the very latest data. Along with traditional APIs, companies are providing Streaming APIs that deliver data in real-time as it is being generated. Unlike traditional REST/SOAP APIs, Streaming APIs establish a connection to the server and continuously stream the data for the desired amount of time. Once the time has elapsed, the connection is terminated. Apache Spark with Apache Kafka as a Streaming platform has become a de facto industry standard for stream processing.
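To make the difference from a request/response API concrete, here is a minimal Python sketch of the "stream for a desired amount of time, then disconnect" behavior described above. It is only an illustration: `fake_stream` is a hypothetical stand-in for a real Streaming API connection, and `consume_for` and its `clock` parameter are names invented for this example, not part of any SnapLogic or Twitter library.

```python
import itertools
import time
from typing import Callable, Iterable, Iterator


def consume_for(stream: Iterable[dict], duration_s: float,
                clock: Callable[[], float] = time.monotonic) -> Iterator[dict]:
    """Yield records from an open stream until duration_s seconds have elapsed."""
    deadline = clock() + duration_s
    for record in stream:
        if clock() >= deadline:
            break  # time is up: stop reading, i.e. close the connection
        yield record


def fake_stream() -> Iterator[dict]:
    """Hypothetical endless source standing in for a streaming endpoint."""
    for i in itertools.count():
        yield {"id": i, "text": f"event {i}"}


if __name__ == "__main__":
    received = list(consume_for(fake_stream(), duration_s=0.05))
    print(f"received {len(received)} records before the connection closed")
```

The clock is passed in as a parameter so the time limit can be exercised without actually waiting; a real client would also need to handle reconnects and network errors, which this sketch leaves out.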

In this blog post, I’ll walk through the steps for building a simple pipeline to retrieve and process Tweets. You can also jump to the how-to video here.

Twitter Streams
Twitter has become a primary data source for sentiment analysis. The Twitter Streaming APIs provide access to global Tweets and can be accessed in real-time as people are tweeting. SnapLogic’s “Twitter Streaming Query” Snap enables users to retrieve Tweets based on a keyword in the text of the Tweet. The Tweets can then be processed using Snaps such as the Filter Snap, Mapper Snap, or Aggregate Snap, for filtering, transforming, and aggregating, respectively. SnapLogic also provides a “Spark Script” Snap where an existing Python program can be executed on incoming Tweets. Tweets can also be routed to different destinations based on a condition, or copied to multiple destinations (RDBMS, HDFS, S3, etc.) for storage and further analysis.

Getting Started
Below is a simple pipeline for retrieving Tweets, filtering them based on the language, and publishing to a Kafka cluster.

1. Using the Snaps tab on the left frame, search for the “Twitter Streaming Query” Snap. Drag and drop the Snap onto the Designer canvas (white space on the right).

    a. Click on the Snap to open the Snap Settings form.

Note: The “Twitter Streaming Query” Snap requires a Twitter account, which can be created through Designer while building the pipeline or using Manager prior to building the pipeline.

    b. Click on the “Account” tab.

    c. Click on the “Add Account” button.

Note: Twitter provides a couple of ways to authenticate applications to a Twitter account. “Twitter Dynamic OAuth1” is for Application-Only authentication, and “Twitter OAuth1” is for User Authentication, where the user is required to authenticate the application by signing into Twitter. In this case, we are using the User Authentication mechanism.

    d. Choose an appropriate option based on the accessibility of the Account:
        i. For Location of the Account: Shared makes this account accessible by the entire Organization, “projects/shared” would make the account accessible by all the users in the project, and “project/” would make the account accessible by only the user.
        ii. For Account Type: Choose the “Twitter OAuth1” option to grant access to the Twitter account of the individual user.
        iii. Click “OK.”

    e. Enter meaningful text for the “Label” such as [Twitter_of_] and click the “Authorize” button.

Note: If the user is logged into Twitter with an active session, they are taken to the “Authorize” page of the Twitter website to grant access to the application. If the user is not logged in or does not have an active session, they are taken to the Twitter sign-in page to sign in.

    f. Click on the “Authorize app” button.

Note: The “OAuth token” and “OAuth token secret” values shown in the screenshot are not active and are for example only.

    g. At this point, the “OAuth token” and the “OAuth token secret” should have been populated. Click “Apply.”

2. Once the account is successfully set up, click on the “Settings” tab to provide the search keyword and time.

Note: The Twitter Snap retrieves Tweets for a designated time duration. For continuous retrieval, provide a value of “0” for “Timeout in seconds.”

    a. Enter a keyword and a time duration in seconds.
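The two settings in this step (a keyword and a timeout, with “0” meaning "run continuously") can be modeled in a few lines of Python. This is a rough sketch of the behavior as described in the note above, not the Snap's actual implementation: `streaming_query` is a hypothetical function, the case-insensitive match is an assumption, and Tweets are represented as plain dictionaries with a `text` field.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def streaming_query(tweets: Iterable[dict], keyword: str, timeout_seconds: float,
                    clock: Callable[[], float] = time.monotonic) -> Iterator[dict]:
    """Yield Tweets whose text contains keyword; a timeout of 0 means no time limit."""
    # A timeout of 0 disables the deadline entirely (continuous retrieval).
    deadline: Optional[float] = None if timeout_seconds == 0 else clock() + timeout_seconds
    needle = keyword.lower()  # assumption: keyword matching ignores case
    for tweet in tweets:
        if deadline is not None and clock() >= deadline:
            break
        if needle in tweet.get("text", "").lower():
            yield tweet


sample = [
    {"text": "Kafka rocks"},
    {"text": "hello"},
    {"text": "I love kafka"},
]
# With timeout 0 the query only ends when the source itself ends.
matches = list(streaming_query(sample, "kafka", timeout_seconds=0))
print([t["text"] for t in matches])
```

With a non-zero timeout the loop stops as soon as the deadline passes, even if the source still has Tweets to deliver.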

3. Save by clicking the disk icon on the top right. This triggers validation, and the icon should turn into a check mark if validation is successful.

4. Click on the list icon to preview the data.

5. This confirms that the “Twitter Streaming Query” Snap has successfully established a connection to the Twitter account and is fetching the Tweets.

6. The “Filter” Snap is used for filtering Tweets. Search for “Filter” using the Snaps tab on the left frame. Drag and drop the “Filter” Snap onto the canvas.

    a. Click on the “Filter” Snap to open the Settings form.

    b. Provide a meaningful name such as “Filter By Language” for the “Label” and a filter condition for “Filter Expression.” You can use the drop-down to choose the filter attribute.
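For readers who think in code, the “Filter By Language” condition behaves like the Python sketch below. It assumes each Tweet document carries a `lang` field holding a language code such as `en`, and that the filter expression is something along the lines of `$lang == "en"`; both the exact expression and the sample Tweets are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def filter_by_language(tweets: Iterable[dict], lang: str = "en") -> Iterator[dict]:
    """Pass through only the Tweets whose "lang" field equals the given code."""
    for tweet in tweets:
        # .get() treats Tweets without a "lang" field as non-matching.
        if tweet.get("lang") == lang:
            yield tweet


sample = [
    {"text": "Hello world", "lang": "en"},
    {"text": "Bonjour le monde", "lang": "fr"},
    {"text": "No language tag"},
]
english = list(filter_by_language(sample))
print([t["text"] for t in english])
```

Documents that fail the condition are simply dropped, so only matching Tweets reach the next Snap.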

7. Click on the disk icon to save, which again triggers validation. You’ve now successfully configured a “Filter” Snap.

8. Search for the “Confluent Kafka Producer” Snap using the Snaps tab on the left frame. Drag and drop the Snap onto the canvas.

Note: Confluent is an Apache Kafka distribution geared for enterprises.

    a. The “Confluent Kafka Producer” Snap requires an account to connect to the Kafka cluster. Choose appropriate values based on the location and type of the account.

    b. Provide meaningful text for the “Label” and enter the bootstrap server(s). If there are multiple bootstrap servers, separate them with commas and include the port for each.

    c. The “Schema registry URL” is optional, but it is needed if Kafka is required to parse the message based on the schema.

    d. Other optional Kafka properties can be passed to Kafka using the “Advanced Kafka Properties.” Click on “Validate.”

    e. If the validation is successful, you should see the message “Account validation successful” at the top. Click “Apply.”
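The bootstrap server value in step 8b is a comma-separated list of `host:port` entries. As a quick sanity check before pasting a value into the account form, a small Python helper like the one below can parse and validate it. The helper name and the broker hostnames are hypothetical; 9092 is used only because it is Kafka's customary default port.

```python
from typing import List, Tuple


def parse_bootstrap_servers(value: str) -> List[Tuple[str, int]]:
    """Split a comma-separated "host:port" list into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas and surrounding whitespace
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


# Hypothetical broker names, for illustration only.
print(parse_bootstrap_servers("broker1.example.com:9092, broker2.example.com:9092"))
```

An entry without a port raises a `ValueError`, which mirrors the requirement that every server in the list includes its port.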

9. Once the account is set up and chosen, click on the “Settings” tab to provide the Kafka topic and message.

    a. You can choose from the list of available topics by clicking the bubble icon next to the “Topic” field. Leave the other fields at their defaults. Another required field is “Message value.” Enter “$” to send the entire Tweet and its metadata. Save by clicking the disk icon.
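The meaning of “$” in the “Message value” field can be pictured with a short Python sketch: “$” selects the whole incoming document, while a path such as `$text` would select a single field. This is a simplified model under stated assumptions, not the Snap's real behavior: `message_value` is a hypothetical helper, only top-level fields are supported, and JSON is assumed as the wire format.

```python
import json


def message_value(document: dict, expression: str) -> bytes:
    """Resolve a "$"-style expression against a document and serialize the result."""
    if expression == "$":
        selected = document  # "$" refers to the entire document
    elif expression.startswith("$"):
        selected = document[expression[1:]]  # e.g. "$text" picks one top-level field
    else:
        raise ValueError(f"unsupported expression: {expression!r}")
    # Assumption: the payload is sent as UTF-8 encoded JSON.
    return json.dumps(selected, sort_keys=True).encode("utf-8")


tweet = {"text": "hi", "lang": "en"}
print(message_value(tweet, "$"))
print(message_value(tweet, "$text"))
```

Sending the whole document keeps the metadata (language, user, timestamps) available to downstream consumers, at the cost of larger messages.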

10. You now have a fully validated pipeline that fetches the Tweets and loads them into Kafka.

11. At this point, the pipeline is all set to receive the Tweets and push them into the Kafka topic. Run the pipeline by clicking the play button in the top right-hand corner. View the progress by clicking the display button.

As you can see, the pipeline can be built in less than 15 minutes without requiring any deep technical knowledge. This tutorial and video provide a basic example of what can be achieved when using these Snaps. There are several other Snaps that can act on the data, such as filtering, copying, aggregating, triggering events, sending out emails, and others. SnapLogic takes pride in bringing complex technology to the citizen integrator. I hope you found this useful!

Sharath Punreddy is an Enterprise Solution Architect at SnapLogic. Follow him on Twitter @srpunreddy.

Executing Spark Pipelines on HDInsight

Microsoft Azure HDInsight is an Apache Hadoop distribution powered by the cloud. Internally, HDInsight leverages the Hortonworks Data Platform. HDInsight supports a large set of Apache big data projects like Spark, Hive, HBase, Storm, Tez, Sqoop, Oozie and many more. The suite of HDInsight projects can be administered via Apache Ambari.

This post lists the steps involved in spinning up an HDInsight cluster, setting up SnapLogic’s Hadooplex on HDInsight, and building and executing a Spark data flow pipeline on HDInsight. We start with spinning up an HDInsight cluster from the MS Azure Portal.

Continue reading “Executing Spark Pipelines on HDInsight”

Puzzle Pieces: Snaplex Names Explained

Welcome to Puzzle Pieces, a periodic series exploring the “Why?” of SnapLogic’s platform. To kick things off, let’s talk Snaplexes, which have sometimes proved puzzling. (Editor’s note: future installments of Puzzle Pieces will be rigorously scrubbed for alliterative excesses).

The SnapLogic Elastic Integration Platform is divided into two main parts: the Control Plane and the Data Plane. As a customer, you come into contact with the Control Plane through the SnapLogic web interface. Behind the scenes, the Control Plane also handles talking to the Data Plane and coordinating the flow of data in pipelines.

The pipelines actually run in the Data Plane. The container that handles running a particular pipeline is called a Snaplex. A Snaplex (or Plex) is a collection of computing resources – perhaps one virtual machine, perhaps an entire server rack. These are the Snaplex types you may come across:

Continue reading “Puzzle Pieces: Snaplex Names Explained”

Training Videos: UX Updates and Data Mapper

We recently added a series of new training videos to highlight some features and enhancements of the SnapLogic Elastic Integration Platform. Check out the videos below to learn more about some of the user interface updates from our Summer 2014 release, and how to automatically map known fields with the SnapLogic Data Mapper.

SnapLogic Summer 2014 User Interface Updates

This video features some of the new enhancements we recently made to the SnapLogic Elastic Integration Platform user experience including new features in the pipeline and dashboard tabs.

The SnapLogic Data Mapper

In this video, see how you can use fields and data types with the SmartLink button to automatically map known fields with the SnapLogic Data Mapper. In coming springs, there will be additional learning that will pick up other data that has been mapped. This video also covers Expression Builder which gives access to more comprehensive information about capabilities to manipulate data.

Check out our full video site for additional trainings and demonstrations.

The SnapLogic Integration Cloud: Using the Monitoring Dashboard

Next in the series of our training videos is an overview of the SnapLogic Integration Cloud Monitoring Dashboard. The Dashboard provides visibility into the health of your integrations with system performance graphs found in various tabs.

The tabs you will learn about in this training video are:

  • Health tab – provides a visual view of the overall health of your Snaplex
  • Pipeline tab – displays your pipeline run history including run status, run-time and duration
  • Snaplex tab – displays graphs for active pipelines, executed pipelines, active nodes and pipeline distribution

This video also shows how you can mouse over graphs for specific information at a given point in time, and drag the slider bars to expand the timeframe being viewed. Stay tuned for more training videos next week!

Using the SnapLogic Integration Cloud Manager as an Administrator

Yesterday we showed how to use the SnapLogic Integration Cloud Manager as an Integrator for projects and tasks. Today’s training video will cover use of the Manager as an administrator, including the ability to access groups. A group is a collection of users, which makes it easier to assign users to projects.

In this video, administrators using the SnapLogic Integration Cloud learn how to:

  • Create new users
  • Access and manage groups, which includes assigning users to specific projects

Stay tuned for the rest of our series of training videos and in the meantime, download our technical whitepaper for additional details of the SnapLogic Integration Cloud.

Using the SnapLogic Integration Cloud Manager for Projects & Tasks

This is the second training video for the SnapLogic Integration Cloud user interface, specifically covering project access and management using the SnapLogic Manager. Projects are logical groupings of pipelines, files, accounts and tasks; tasks are an alternative way to execute your pipelines.

In this video, integrators using the SnapLogic Integration Cloud learn how to:

  • Create a new project
  • Delete a pipeline, move a pipeline to a different project, or make a copy of a pipeline
  • Schedule tasks, configuring when and how often they will run
  • Set up a notification for when a task has started, completed or failed

Stay tuned for more on the administration of users, groups and organizations which will be covered in additional training videos. And download our technical whitepaper for additional details of the SnapLogic Integration Cloud.