How to set up Stream processing for Twitter using Snaps

By Sharath Punreddy

As you probably know, SnapLogic data pipelines use Streams: a continuous flow of data from a source to a target. By processing Streaming data and extracting valuable insights from it as it arrives, a user or system can make decisions more quickly than with traditional batch processing, achieving near real-time, if not real-time, analytics.

In this data-driven age, the timing of data analytics and insights has become a key differentiator. In some cases, data becomes less relevant, if not obsolete, as it ages. Analyzing data as it flows in is crucial for use cases such as sentiment analysis for new product launches in retail, fraudulent transaction detection in the financial industry, preventing machine failures in manufacturing, sensor data processing for weather forecasts, detecting disease outbreaks in healthcare, and so on. Stream processing enables processing in near real-time, if not real-time, allowing the user or system to draw insights from the very latest data. Along with traditional APIs, companies now provide Streaming APIs for rendering data in real-time as it is generated. Unlike traditional REST/SOAP APIs, Streaming APIs establish a connection to the server and continuously stream data for the desired amount of time; once the time has elapsed, the connection is terminated. Apache Spark with Apache Kafka as a Streaming platform has become a de facto industry standard for stream processing.
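To make the contrast with traditional request/response APIs concrete, here is a minimal Python sketch of the streaming pattern (the record format and handler are hypothetical stand-ins, not an actual Twitter client): a single connection stays open, records are processed as they arrive, and the connection is closed once the desired time has elapsed.

```python
import time
from typing import Callable, Iterable

def consume_stream(records: Iterable[dict],
                   handle: Callable[[dict], None],
                   timeout_seconds: float) -> int:
    """Process records from one open stream until the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    processed = 0
    for record in records:  # in a real client, this blocks until data arrives
        handle(record)
        processed += 1
        if time.monotonic() >= deadline:
            break  # desired time elapsed: stop and close the connection
    return processed

# Stand-in for a live feed; a real client would wrap an open HTTP response.
fake_feed = ({"text": f"tweet {i}"} for i in range(3))
seen = []
consume_stream(fake_feed, seen.append, timeout_seconds=5.0)
```

The key difference from a REST call is that the caller never re-polls: the loop simply keeps consuming from the same connection until the deadline.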

In this blog post, I’ll walk through the steps for building a simple pipeline to retrieve and process Tweets. You can also jump to the how-to video here.

Twitter Streams
Twitter has become a primary data source for sentiment analysis. The Twitter Streaming APIs provide access to global Tweets and can be consumed in real-time as people are tweeting. SnapLogic's "Twitter Streaming Query" Snap enables users to retrieve Tweets based on a keyword in the text of the Tweet. The Tweets can then be processed using Snaps such as the Filter, Mapper, or Aggregate Snap for filtering, transforming, and aggregating, respectively. SnapLogic also provides a "Spark Script" Snap, which can execute an existing Python program on incoming Tweets. Tweets can also be routed to different destinations based on a condition, or copied to multiple destinations (RDBMS, HDFS, S3, etc.) for storage and further analysis.

Getting Started
Below is a simple pipeline for retrieving Tweets, filtering them based on the language, and publishing to a Kafka cluster.

  1. Using the Snaps tab on the left frame, search for the "Twitter Streaming Query" Snap. Drag and drop the Snap onto the Designer canvas (the white space on the right).

a. Click on the Snap to open the Snap Settings form.

Note: The "Twitter Streaming Query" Snap requires a Twitter account, which can be created through Designer while building the pipeline or using Manager prior to building the pipeline.

b. Click on the “Account” tab.

c. Click on the "Add Account" button.

Note: Twitter provides a couple of ways to authenticate applications to a Twitter account. "Twitter Dynamic OAuth1" is for application-only authentication, and "Twitter OAuth1" is for user authentication, where the user authorizes the application by signing into Twitter. In this case, we are using the user authentication mechanism.

d. Choose an appropriate option based on the accessibility of the Account:
i. For the Location of the Account: "shared" makes the account accessible to the entire organization, "projects/shared" makes it accessible to all users in the project, and "project/" makes it accessible to only the user.
ii. For Account Type: Choose the “Twitter OAuth1” option to grant access to the Twitter account of the individual user.
iii. Click “OK.”

e. Enter meaningful text for the "Label," such as [Twitter_of_], and click the "Authorize" button.

Note: If the user is logged into Twitter with an active session, they will be taken to the "Authorize" page of the Twitter website to grant access to the application. If the user is not logged in or does not have an active session, they will be taken to the Twitter sign-in page to sign in first.

f. Click on the “Authorize app” button.

Note: The above "OAuth token" and "OAuth token secret" values are not active and are shown for example only.

g. At this point, the “OAuth token” and the “OAuth token secret” should have been populated. Click “Apply.”

2. Once the account is successfully set up, click on the "Settings" tab to provide the search keyword and time.

Note: The Twitter Snap retrieves Tweets for a designated time duration. For continuous retrieval, provide a value of "0" for "Timeout in seconds."

a. Enter a keyword and a time duration in seconds.


3. Save the pipeline by clicking the disk icon at the top right. This triggers validation; the icon becomes a check mark if validation is successful.


4. Click on the preview icon on the Snap's output to view the data.

5. This confirms that the "Twitter Streaming Query" Snap has successfully established a connection to the Twitter account and is fetching Tweets.

6. The "Filter" Snap is used for filtering Tweets. Search for "Filter" using the Snaps tab on the left frame, then drag and drop the "Filter" Snap onto the canvas.

a. Click on the "Filter" Snap to open the Settings form.

b. Provide a meaningful name, such as "Filter By Language," for the "Label," and a filter condition for the "Filter Expression." You can use the drop-down to choose the filter attribute.

7. Click on the disk icon to save, which again triggers validation. You've now successfully configured the "Filter" Snap.
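Conceptually, the filter expression is just a predicate evaluated against each incoming Tweet document. As a rough stdlib-Python sketch of the same "filter by language" idea (the `lang` field name follows Twitter's Tweet JSON; everything else here is illustrative, not the Snap's implementation):

```python
def filter_by_language(tweets, lang="en"):
    """Pass through only Tweets whose language matches, analogous to a
    Filter Snap whose expression compares the Tweet's lang attribute."""
    return [t for t in tweets if t.get("lang") == lang]

tweets = [
    {"text": "hello", "lang": "en"},
    {"text": "hola", "lang": "es"},
    {"text": "hi again", "lang": "en"},
]
english = filter_by_language(tweets)  # keeps only the two English Tweets
```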

8. Search for the "Confluent Kafka Producer" Snap using the Snaps tab on the left frame. Drag and drop the Snap onto the canvas.

Note: Confluent is an Apache Kafka distribution geared toward enterprises.

a. The “Confluent Kafka Producer” requires an account to connect to the Kafka cluster. Choose appropriate values based on the location and type of the account.

b. Provide meaningful text for the "Label," and enter the bootstrap server(s). In the case of multiple bootstrap servers, separate them with commas, including the port for each.

c. The "Schema registry URL" is optional, but is required if Kafka needs to parse messages based on a schema.

d. Other optional Kafka properties can be passed to Kafka using the "Advanced Kafka Properties." Click on "Validate."

e. If the validation is successful, you should see the message "Account validation successful" at the top. Click "Apply."
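As a small illustrative aside (this is not part of the Snap itself), the comma-separated "host:port" format expected in the bootstrap servers field can be sketched like this:

```python
def parse_bootstrap_servers(value):
    """Split a comma-separated "host:port" list, e.g. the hypothetical
    entry "kafka1:9092,kafka2:9092", into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        host, _, port = entry.strip().partition(":")
        servers.append((host, int(port)))
    return servers

servers = parse_bootstrap_servers("kafka1:9092, kafka2:9092")
```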

9. Once the account is set up and chosen, click on the "Settings" tab to provide the Kafka topic and message.



a. You can choose from the list of available topics by clicking the bubble icon next to the "Topic" field. Leave the other fields at their defaults. Another required field is "Message value"; enter "$" to send the entire Tweet and its metadata. Save by clicking the disk icon.
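To illustrate what the "$" expression selects (the helper and the "$field" variant below are invented for illustration; the Snap handles this internally), "$" serializes the whole Tweet document as the message value:

```python
import json

def to_message_value(document: dict, expression: str = "$") -> str:
    """Mimic the "Message value" setting: "$" selects the entire document,
    while a hypothetical "$field" would select one top-level field."""
    if expression == "$":
        return json.dumps(document)
    return json.dumps(document.get(expression.lstrip("$")))

tweet = {"text": "hello", "lang": "en", "user": {"name": "alice"}}
value = to_message_value(tweet)  # whole Tweet plus metadata as JSON
```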

10. The above is a fully validated pipeline to fetch Tweets and load them into Kafka.

11. At this point, the pipeline is all set to receive Tweets and push them into the Kafka topic. Run the pipeline by clicking the play button in the top right corner, and view progress by clicking the display button.

As you can see, the pipeline can be built in less than 15 minutes without requiring any deep technical knowledge. This tutorial and video provide a basic example of what can be achieved with these Snaps. There are several other Snaps that can act on the data: filtering, copying, aggregating, triggering events, sending out emails, and more. SnapLogic takes pride in bringing complex technology to citizen integrators. I hope you found this useful!

Sharath Punreddy is Enterprise Solution Architect at SnapLogic. Follow him on Twitter @srpunreddy.

SnapLogic Kafka Integration Snaps in Action

Apache Kafka

In today's business world, big data is generating a big buzz. Beyond searching, storing, and scaling, one thing clearly stands out: stream processing. That's where Apache Kafka comes in.

At a high level, Kafka can be described as a publish-subscribe messaging system. Like any other messaging system, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read data from those topics. For the sake of simplicity, I have linked to the Kafka documentation here.
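The publish/subscribe model is easy to picture with a toy in-memory sketch (this shows only the idea, not the real Kafka client API): producers append to a topic's ordered log, and each consumer reads from its own offset.

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory stand-in for the publish/subscribe model."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered message log

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # Each consumer tracks its own offset into the topic's log.
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.produce("tweets", {"text": "hello"})
broker.produce("tweets", {"text": "world"})
messages = broker.consume("tweets")         # both messages, in order
later = broker.consume("tweets", offset=1)  # only the second message
```

Because messages are retained in the log rather than deleted on read, many consumers can read the same topic independently, which is the property that makes Kafka suitable for stream processing.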

In this blog post, I will demonstrate a simple use case where Twitter feeds into a Kafka topic and the data is written to Hadoop. Below are detailed instructions for how users can build pipelines using the SnapLogic Elastic Integration Platform.

June 2014 Snap Release for the SnapLogic Elastic Integration Platform

We are pleased to announce the addition of the following Snaps for the SnapLogic Elastic Integration Platform:

Google Directory Snap Pack

With this Snap Pack, you can add and modify users, user photos, groups and org units in your Google Directory. For example:

  • Query users to find all email addresses in use
  • List the group membership of a user
  • Create a new user, group, or org unit
  • Update an existing user’s name and add them to an org unit
  • Delete user photos or an org unit

LinkedIn Snap Pack

This Snap Pack lets you gather information from LinkedIn and provide or modify updates, such as:

  • Fetch a LinkedIn user’s profile
  • Search for people
  • Update or share a status message, and optionally post it to Twitter
  • Join or leave groups
  • Post, like, and follow group updates

The latest Snap to join the Twitter Snap Pack, the Twitter Streaming Search Snap, streams tweets based on a keyword.

Additionally, the following Snaps are available as a Beta release:

  • Data Validator Snap (Beta release): This Snap validates incoming documents and their attributes against constraints you define.
  • Hadoop Snap Pack: Reader, Writer (Beta release): This Snap Pack lets you read data from or write data to a Hadoop File System.
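To give a feel for what constraint-based validation means, here is an illustrative Python sketch (the constraint format is invented for this example and is not the Data Validator Snap's actual configuration):

```python
def validate(document, constraints):
    """Check a document against per-field constraints; return a list of
    violation messages (an empty list means the document is valid)."""
    errors = []
    for field, check in constraints.items():
        if field not in document:
            errors.append(f"missing field: {field}")
        elif not check(document[field]):
            errors.append(f"constraint failed: {field}")
    return errors

# Hypothetical constraints: email must contain "@", age must be a
# non-negative integer.
constraints = {
    "email": lambda v: "@" in v,
    "age": lambda v: isinstance(v, int) and v >= 0,
}
ok = validate({"email": "a@b.com", "age": 30}, constraints)
bad = validate({"email": "nope", "age": -1}, constraints)
```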

With the June 2014 Snap Release, we are also delivering minor updates and fixes for the following Snaps: Google DFA Reports, JIRA Search, JMS, and Salesforce.com. See the Release Notes for more information on these Snaps. The Snap update will occur this evening PDT, with no required downtime.

For more information about SnapLogic Snaps, be sure to also check out our documentation as well as this post.

Getting Ready to Rock Dreamforce!


We are relaying more exciting stories from the field on how SnapLogic is being successfully leveraged in the cloud ecosystem. This week is a particularly exciting one, since we are live at Dreamforce, where the cloud and its fantastic possibilities are on display in San Francisco. Needless to say, being a cloud integration platform, SnapLogic is in the middle of the action when it comes to realizing those possibilities and uniting the growing list of applications that companies adopt.

Apart from being a Gold Sponsor (Booth 509) again this year at Dreamforce, SnapLogic will be demonstrating live some of the valuable solutions that our ecosystem has brought forth. There is certainly a common Salesforce element, since it is the most broadly adopted cloud application, but we also have solutions showing real business processes supported via integrations with adjacent applications such as SAP, Coupa, FinancialForce, DocuSign, Eloqua, Drupal, Birst, Twitter, Facebook, and Google Analytics.

The solutions being demonstrated by SnapLogic's partners (Coupa, DocuSign, TopCurve, and Cervello) provide tremendous value in the areas of marketing automation, customer master data management, approval workflow with e-signature, e-procurement to financials, and social media analytics.

BTW, if you are attending Dreamforce, drop by Booth 509 to sketch your integration fantasy (like the one above).

To see a schedule of demonstrations, click here.

To learn more about the solutions and Snaps built by our ecosystem, please visit the SnapStore.


SnapLogic Application Connection Survey: Barricades around Connections


During the past month, we asked over 100 IT executives about their application connection priorities now and in the future. I've been fascinated as I dig into all this data to see how much has changed in recent years, and what hasn't. Two key areas were especially enlightening: what companies saw as their biggest roadblocks to fully harnessing data, and the types of applications they're focused on integrating.

When we asked what's in the way of gaining more business value from business applications and data, the number one answer was lack of integration (45% of respondents). I highly doubt this number would have been so high 10 or 15 years ago, when massive application stacks were king in enterprise IT. But now that companies are turning to a wider variety of best-of-breed technology solutions to build a "collection of services" that's custom-fit for their business, it's clear that the integration challenge is mounting.

During a recent webinar we held with InformationWeek, we found that 40% of attendees were using at least three SaaS or Web-based applications in their company. And the research we released today found that the number of companies who will implement at least four SaaS or cloud applications will double in the next two years. Of course, that's all in addition to the legacy systems these companies still use, as well as the increasing use of external data sources like social media.

Which brings us to the second and third biggest roadblocks: data quality (40%) and performance (35%). It's not surprising that with more data and more types of data comes more work in cleansing and parsing that information. These challenges were relatively minor when data resided in on-premise relational databases, but now that IT folks are dealing with such diverse data ecosystems, they're facing an uphill battle.

But it's not all doom and gloom. We've actually seen many companies successfully break through these barricades by abstracting away the complexity of integration using intuitive application connectors accessed from a drag-and-drop designer. Technical and business users alike can now do complex data filtering and data cleansing without any coding whatsoever. They can even incorporate data enrichment as part of their integration workflows by utilizing third-party services such as Trillium. This approach makes it easy for companies to address data quality at the earliest point of entry into the IT ecosystem. And companies that exploit Web standards like REST and HTTPS benefit from location transparency, which helps them automatically route connections across the fastest paths, manage bandwidth shrewdly, and mitigate the impact of network outages.

Now, to what hasn't changed so much: not surprisingly, Business Intelligence/Analytics was at the top of the list of applications companies want to integrate, both over the coming year and in the 1-3 years that follow. Interestingly, Productivity/Collaboration tools like Google Apps, Box.net, and SlideShare were the next most popular priority for application integration, while traditionally important Sales and Financial applications were only the fourth and sixth (respectively) most popular categories in the 1-3 year time horizon.

To me, this indicates an interesting shift beyond the major enterprise data sources of the last century to emerging sources of valuable data. In fact, our research found that approximately a quarter of all companies want to integrate Social Media or Mobile/Logistics/Location data in 1-3 years (28% and 23%, respectively). And looking down the road, over twice as many companies plan to integrate Offers/Advertising applications (Groupon, anyone?) or Big Data services (e.g., Hadoop) in 1-3 years versus the next 12 months.

That tells me that when you look beyond the obvious candidates for application integration, things are changing.  I’d love to know what you think – do your company’s priorities for application integration reflect or differ from these findings?


What We Learn from the SnapStore: Hottest Snap Downloads


The great thing about having an online marketplace is that you learn a lot about what people are into. In our case, a quick glance through SnapStore downloads can shed light on our customers’ integration needs as well as point to some bigger trends in enterprise software.

As we continue to build out the SnapStore with popular connections from a growing list of partners and developers, we thought we’d share some data on the most in-demand downloads from our pool of over 100 Snaps. Here’s a quick look at what we saw during the first half of 2011:

5 Most Downloaded Snaps

1.    Twitter

2.    Box.net

3.    GoodData

4.    SugarCRM

5.    Google Apps

Top 4 Popular Downloaded Snap Categories

1.    Social Media

2.    File Sharing/Collaboration

3.    Customer/Marketing Tools

4.    Business Intelligence

Not surprisingly, social media, business intelligence, and collaboration tools top the list. Our survey on companies' top app connection priorities also showed very similar trends this year. Most businesses realize that these tools can provide valuable insight for sales and marketing strategies and customer retention, and can aid in better overall business decision-making, and they're desperately seeking an efficient way to get this data flowing throughout their organizations.

Honorable Mentions

We also saw some Snap downloads that, while they don’t make our ‘Top’ list, are worth calling attention to, simply because they show just how easy it is to get creative with your application connections once you get started using our Snaps:

  • Kiva: Kiva, as most know, is a popular online lending institution. We saw a number of downloads for this Snap, which enables corporations or micro-finance institutions to browse lender data.
  • Yelp: Businesses are using this Snap to search Yelp for business reviews using the Yelp API.
  • Google Geocode: Another one of our free Snaps that deserves an honorable mention. We’ve seen a handful of customers leverage this connection to integrate and analyze geo-location info from all kinds of sources, in particular GIS systems.

Keep an eye out for more honorable mentions and regular download rankings from the SnapStore throughout the year as we stay on top of the latest application connection trends!
