SnapLogic connects enterprise applications in the cloud, helping you get from big interactions to big insights more quickly and easily than any other integration solution.
By Rich Dill
Welcome to this demonstration of SnapLogic Big Data-as-a-Service. In this demonstration, we'll be using SnapLogic, Hadoop, Twitter, Birst, and R. The scenario for this demonstration is straightforward. The question we're interested in answering is, what Twitter users will have the most influence at the Strata 2013 Conference?
The solution to this question is a SnapLogic demonstration that will capture the Twitter feed, combine it with their profile, and display the output in a graph. We'll use both R and Birst for visualization. The formula is simple. The number of tweets times the number of followers equals their influence.
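As a minimal sketch of the formula just described, the calculation can be written as a single function. The function name and the sample numbers are illustrative assumptions, not values from the demo.

```python
def influence(tweet_count: int, follower_count: int) -> int:
    """Influence as defined in the demo: number of tweets times number of followers."""
    return tweet_count * follower_count


# Hypothetical numbers, not taken from the demo data.
example = influence(12, 500)
print(example)  # 6000
```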
This is the SnapLogic Designer, a browser-based platform that allows the user to connect to a SnapLogic server, wherever it is. In this case, I'm running an instance with the user-friendly name HBase_demo up on AWS. And you can see I'm connected as Admin.
The Designer is basically broken into three sections. The Canvas area, which is the work area where you drag and drop your components and build the pipelines. As you can see here, that's going to connect to something, do something, and then create output. I mean, that's what we're doing.
So these objects over here are components that have values associated with them. I've instantiated them. I've given them values. And I can reuse them over and over and over again, which is good for productivity.
Down here in the Foundry, I have the raw materials. These are the templates. These are things like the DB Wizard and the compute and the connection and the aggregate ones. There are over 150 objects that will do everything from converting types, which I'm going to be using in one of mine, to getting status and lookups and things like that. So there's a wide palette of components that the user can use to basically do what they need to do to be able to get data wherever it is and put it wherever they want it to be. That's what we do. And we do it very well.
So we've got three pipelines. I've got a pipeline that's going to connect to Twitter. It's going to pull down a block of tweets. And then because the tweets don't have all the data I'm interested in, I'm going to sort those tweets. I'm going to find the unique tweet user IDs. And then I'm going to look up the profile information, put those two pieces of information together, and then write that out to HBase. That's, in essence, what this pipeline does.
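The steps of this first pipeline (sort the tweets, find the unique user IDs, look up each profile, and join the two) can be sketched in plain Python. This is only an illustration of the data flow, not how the SnapLogic components are implemented. The field names, the sample records, and the `lookup_profile` stand-in are all assumptions; the write to HBase is omitted.

```python
from operator import itemgetter

# Hypothetical sample tweets; the field names are assumptions.
tweets = [
    {"tweet_id": 3, "user_id": "u2", "text": "See you at #Strata"},
    {"tweet_id": 1, "user_id": "u1", "text": "Heading to #Strata"},
    {"tweet_id": 2, "user_id": "u1", "text": "#Strata keynote"},
]


def lookup_profile(user_id):
    """Stand-in for the profile lookup; a real pipeline would query Twitter."""
    profiles = {
        "u1": {"screen_name": "alice", "followers": 500},
        "u2": {"screen_name": "bob", "followers": 40},
    }
    return profiles[user_id]


def enrich(tweets):
    # Sort so that tweets from the same user sit next to each other.
    ordered = sorted(tweets, key=itemgetter("user_id", "tweet_id"))
    # Look up each distinct user only once.
    unique_ids = sorted({t["user_id"] for t in ordered})
    profiles = {uid: lookup_profile(uid) for uid in unique_ids}
    # Join every tweet with the profile of its author.
    return [{**t, **profiles[t["user_id"]]} for t in ordered]


rows = enrich(tweets)
```

Looking up each unique user ID once, rather than once per tweet, keeps the number of profile requests small when the same user tweets many times.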
Then I'm going to read the data out of HBase. And because the data comes into HBase from Twitter in a format that isn't always conducive to what I want to do with it-- for example, I get a created_at field that comes to me as a string but in fact contains a date-time value-- what I'm going to do is convert that from a string to a date-time value. And then I'm going to truncate it by removing the time value because what I'm really interested in is the date. And then I'm going to aggregate that because, again, I have a user who will tweet 12 times. And what I want to do is I want to say, OK, you tweeted 12 times. So 12 times the number of followers is your total influence.
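The convert, truncate, and aggregate steps can be sketched as follows. The timestamp format string is an assumption (the demo does not show the raw field value), and the sample rows and field names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: the timestamp string looks like "Tue Feb 26 09:15:00 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Hypothetical rows as they might come back from HBase.
sample_rows = [
    {"screen_name": "alice", "followers": 500, "created_at": "Tue Feb 26 09:15:00 +0000 2013"},
    {"screen_name": "alice", "followers": 500, "created_at": "Tue Feb 26 17:40:12 +0000 2013"},
    {"screen_name": "bob", "followers": 40, "created_at": "Wed Feb 27 08:00:00 +0000 2013"},
]


def daily_influence(rows):
    """Return {(screen_name, date): tweets that day * followers}."""
    counts = defaultdict(int)
    followers = {}
    for row in rows:
        # Convert the string to a date-time, then truncate it to the date.
        day = datetime.strptime(row["created_at"], CREATED_AT_FORMAT).date()
        key = (row["screen_name"], day)
        counts[key] += 1
        followers[key] = row["followers"]
    # Aggregate: number of tweets per user per day times the follower count.
    return {key: counts[key] * followers[key] for key in counts}


result = daily_influence(sample_rows)
```

With these sample rows, the user who tweeted twice on the same day gets twice their follower count for that date.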
Then I'm going to take that. I'm going to write that out to a CSV file, which then will be uploaded to Birst. Or I'm going to then just use it in a motion chart in R, so we can see the impact that a user is having. It's all about identifying the influence.
The Birst upload pipeline is the simplest of the three. I got this file that I've created. I'm just going to upload it to Birst. And at the same time, I'm going to write a simple little file down here that basically says, did you, in fact, upload the stuff to Birst? Was there any kind of problems or issues or anything else? Is everything copacetic?
You can see, if you look down here at the bottom, previous run, the output of that field is basically upload status, publish status, upload complete, everything's copacetic. We're cool. So let's go ahead and run the first one. And again, this is about user productivity, ease of use, being able to do what you want to do, not worry about the mechanics.
So I'm going to just go ahead and run this puppy. And when I do that, I click on the Run icon. It's going to go ahead and run. Can you tell I've been practicing? So last time it ran, it took eight seconds. Let's see how long it takes this time. Nine seconds. So we're a little slower. Maybe we need another cup of coffee.
So we're going to view the Log statistics. And let's take a look at the stats. So 908 rows came in. 766 were output. And if I scroll down here, I can get some detailed information about the pipeline.
So this is one of the advantages of using this technology over a simple script or hand coding because we create logs. We know when a pipeline was executed, what its parameters were. So you have that audit trail capability of knowing what's going on, where the data is coming from, going to. All of that is documented. Again, promoting efficiency and understanding of what your data is and what its value is and how to use it.
So I grab this data from Twitter, combined it with the profile information, dropped it into HBase. So now, what I'm going to do is I want to pull it out of HBase. And I want to put it into a file format that then can be either uploaded to Birst or consumed by the R programming language. I have a neat, little motion chart that I'll be showing in a few minutes to show you the output of what we're doing. So, of course, I'm going to read the data out of HBase.
And again, I'm going to do some type conversion. I'm going to convert that from a string value to a date-time. I'm going to truncate it. And then I'm going to aggregate it. Again, the same user ID may tweet 3, 5, 12, 15 times in a single day. I'm going to aggregate those. And then I'm going to create user X tweeted 12 times. 12 times the number of followers is the value. And I'm putting that out into an output file, which then I'm going to upload to Birst. Let's go back here real quick. And let's run this puppy.
This one took three seconds last time and three seconds this time. The run stats look the same as the previous ones, so we're not going to bother you with that. Now, the Birst upload one is the simplest.
We've got this file that we've created. And what we're going to do is we're going to take that file, and we're going to push it up to Birst. And Birst is then going to do its thing to it. And at the same time, we're checking to make sure that everything was copacetic, so we've got this output file. But this one takes a little bit longer.
So I'm just going to go ahead and run this puppy. You can see that the last time, it took 21 seconds. So give or take a few seconds either side depending on the network and whatnot. It should take about 21 seconds for this thing to complete. 23 seconds, this time. So the data is up there. In the next section, what we'll do is we'll take a look at the output by looking at the motion chart and the Birst upload.
Before we look at the charts, let's look at the data file. Both the HBase Reader pipeline and the Birst Upload are using the same file. It's called R Chart Data CSV. So that's the input for the Birst upload. And that's the output of the data file of the pipeline that consumes it off of HBase. So let's take a look at that. So I'm going to go to that directory.
You can see here ls -la gave me a list of the things. And I'm just going to cat that file. You can see here that what I have done is I've extracted out the user ID or their screen name and the total number of impressions-- again, the number of tweets times the number of followers-- and the date of that value.
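A file with those three values per row could be produced along these lines. The column headers, their order, and the sample records are assumptions, since the transcript does not show the file's exact layout; the sketch writes to an in-memory buffer rather than to the demo's actual file.

```python
import csv
import io

# Hypothetical aggregated records: (screen name, impressions, date).
records = [("alice", 1000, "2013-02-26"), ("bob", 40, "2013-02-27")]


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["screen_name", "impressions", "date"])
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```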
Let's go to the motion chart first. And I've already set it up with unique colors. And I checked everybody off. You didn't need to see me sit here and check everybody off.
What's interesting when I do this is that you can see the values over time as people start to tweet. Again, because this is a conference that is on a specific date, people were doing it very early. And then as the date became closer, all of a sudden, there was an explosion. And just within the last couple of hours, Kevin Marks showed up with a rather large 61,746. So he's got a lot of followers. And he tweets a lot.
At the other end, we have Frederick, who has 19, probably closer to what I am doing these days. Over here, we've got Jeremy with 715. And then it's a little busy. But you see the values are all there, so 5,823.
So this is R. As you can see here, we're just executing this locally. You got a little R program that generates this motion chart-- very, very slick technology. And a lot of our customers are using this for obvious reasons. Another usage is to use Birst.
So I've already logged in to Birst. And I'm using the Strata Conference. I'm going to go into Admin. And if I go to the Dashboard, because the data has been uploaded and populated, now what I'm looking for, looking at is a pie chart. You can see here, Kevin just took over within the last couple of hours. And you can see all of the users by their slices of what their presence is. And then if I scroll down a little bit, we get the raw data by tweet ID. Now, some of these are names and some of these are aliases and whatnot-- Eve, Factual Team.
So this is the output of the pipelines that we've worked with. So we've got the motion chart in R, or we've uploaded this into Birst. So this is one of the most popular ways that people are using Big Data. They're doing analytics against real-time data to identify the impressions, the influence, the impact of sales, of tweets, of web postings, of traffic-- all very, very important in presenting your business in the best possible light and understanding your opportunities, your market, and your potential customer base.
Finally, let's take a closer look at some of these pipeline components. The two most interesting ones, I think, are the Twitter component and the HBase-Writer. So let's start off, how do we connect to Twitter? Well, I have a connection object which encapsulates the security token and the user ID that's used to connect to Twitter. We're using mine, of course, for the purposes of this demo. And you can see here the search string that we're using is #Strata.
This is the conference that we're looking for. So I pop these parameters in. In most cases, these parameters are going to be defaulted by SnapLogic. And if there is a requirement for you to put something in there, there will be a little, "this text or this value is required" in red text to help the user put in data where necessary and accept options or accept defaults when convenient. So again, very easy to use, very developer-friendly.
So I'm connecting to Twitter using my ID. And then I'm putting this into the Sort component. And this is a standard URI component. Everything in SnapLogic is based on URIs. Now, I'm curious to see what the rough, raw data feed looks like. So I'm going to go click on the Preview button. And when I preview the data, it's going to go out and execute the Twitter search and the Sort component, and just those two. Can you tell I practiced?
So if I bring this up a little bit, you can see here's the raw data that's coming in from the tweets. So this is all of the data associated with the tweets that I'm interested in. So I've debugged or confirmed the functionality of the pipeline from here to here just by using the Preview capability. Now, I can do the same thing for each step down the line, but that would be redundant. Let's go take a look at the HBase-Writer one because that's the important one.
So I've got a Date field for the HBase version. I want to make sure that I am connecting to the same HBase version that I'm expecting to. I've got a Column Family Name. We've got a core-site.xml file, which is a requirement for all of the connections to the applications that we're talking to in the HBase cluster. And we just paste that in. We know what that file name is. The key fields that we're using for this table, for this functionality, are the user's screen name and the tweet ID. And we've got a table name called Twitter-- big surprise on that.
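To illustrate what using the screen name and the tweet ID as key fields can mean in practice, here is a small sketch that builds a composite row key and column-family-qualified cells for one record. How the HBase-Writer actually combines the key fields is not stated in the demo; the separator, the column family name `d`, and the function names are all assumptions, and no HBase client is called.

```python
def make_row_key(screen_name: str, tweet_id: int, sep: str = "|") -> bytes:
    """Join the two key fields into one row key (separator is an assumption)."""
    return f"{screen_name}{sep}{tweet_id}".encode("utf-8")


def to_put(row: dict, column_family: str = "d"):
    """Return (row_key, cells) where each cell is keyed as b'family:qualifier'."""
    key = make_row_key(row["screen_name"], row["tweet_id"])
    cells = {
        f"{column_family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in row.items()
    }
    return key, cells


# Hypothetical record, not taken from the demo data.
row_key, cells = to_put({"screen_name": "alice", "tweet_id": 42, "followers": 500})
print(row_key)  # b'alice|42'
```

Including the tweet ID in the key keeps each tweet as its own row, so several tweets from the same user do not overwrite one another.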
So once I put these values in, I want to confirm that they are, in fact, correct. So I click on the Validate button. And what happens is this Snap will go out and ping HBase and say, is this correct? And it will come back and say, yes, it is. Or no, it's not. So if I get the validation successful, I know that I have a valid connection and valid parameters to talk to HBase. That's as simple as it is. So now, when I click on the Run icon, it's going to go out and perform the tasks that I requested in the sequence and then come back and allow me to look at the information that has now been put into HBase.
SnapLogic is about enabling your developers to be able to get the job done; helping them with default values, validating functionality; and giving them a high-level tool that allows them to focus on what they want to do, not the low-level tasks of how to do it. For more information about SnapLogic's unique technology, please contact us at snaplogic.com. Thanks for your attention.