“Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it.” – Ferris Bueller’s Day Off, 1986.

There seems to be both more uncertainty and more mainstream adoption in the Big Data market than there was a year ago at this time. So, a month before the 2015 Strata/Hadoop World event in New York, it feels like the right time to stop and look around at Big Data then and now.

Industry analysts have an interesting vantage point on the issue, speaking regularly to both technology vendors and buyers. So let's look at milestones from the last year through Gartner's lens. [A couple of these reports require a Gartner subscription, but I have referenced publicly accessible blog posts where possible.]

August 4, 2014: Gartner releases its annual “Hype Cycle” report on Big Data technologies. (“Hype Cycle for Big Data 2014,” Frank Buytendijk, Gartner.) [subscription required]

The hype cycle is a proprietary Gartner maturity model for emerging technologies within a specific category. These regularly-issued reports show where a specific technology is in its lifecycle and how long it will take to reach the next phase of its development. The stages are:

Innovation trigger -> Peak of inflated expectations -> Trough of disillusionment -> Slope of enlightenment -> Plateau of productivity.

The notion here (my interpretation) is that every new technology has a hype stage where the buzz is greater than actual adoption or business value achieved. This phase is generally followed by a trough where the buzz wears off, skepticism creeps in, and adopters must weigh business needs and realities before applying the technology to specific use cases and proceeding to the plateau, where it contributes to measurable business productivity.

According to the 2014 report, Big Data technologies were either at the peak of hype or well into the trough:

“Big data, as a whole, has crossed the Peak of Inflated Expectations, and is sliding into the Trough of Disillusionment. Once adoption increases, and reality sets in, based on first successes and failures, the peak of hype has passed. Innovation will continue, the innovation trigger is full, but the Trough of Disillusionment will be fast and brutal. Viable technologies will grow quickly, combined with a shakeout of all the vendors that simply jumped on the bandwagon. This is essentially good news. More robust and enterprise-ready solutions will appear; and big data implementations will move from being systems of innovation to mission-critical systems of differentiation.”

Further, these technologies are seen as a long way from adding business value.

“In many cases, transformation is at least two to five years away — or more. In addition, many technologies indicate that they will become obsolete before reaching the Plateau of Productivity.”

Fast-forward to August 2015 and the updated Big Data Hype Cycle report…

But there isn’t a new report. According to an August 20 blog post from Gartner analyst Nick Heudecker, “Big Data Isn’t Obsolete. It’s Normal.” [no subscription required]

“First, the big data technology profile dropped off a few Hype Cycles, but advanced into the Trough of Disillusionment in others. Second, we retired the very popular Hype Cycle for Big Data [emphasis mine]. The reason for both is simple: big data is no longer a topic unto itself. Instead, the various topics formerly encompassing big data evolved into other areas. What other areas?

  • Advanced Analytics and Data Science
  • Business Intelligence and Analytics
  • Enterprise Information Management
  • In-Memory Computing Technology
  • Information Infrastructure

The characteristics that defined big data, those pesky 3 Vs [volume, variety, velocity], are no longer exotic. They’re common.”

So, in one year, Big Data has gone from a hyped set of technologies still in the realm of the early adopter to something common, simply part of long-standing tech sectors such as analytics, business intelligence, and information infrastructure. Why is that? And what does it mean?

In part 2, I’ll delve into adoption drivers and barriers using Gartner’s annual Hadoop adoption surveys from 2014 and 2015.

Recently, I worked with a customer to reverse engineer a Pig Script running a MapReduce job in Hadoop and then orchestrated it as a SnapReduce pipeline with SnapLogic’s Elastic Integration Platform. SnapLogic’s HTML5 cloud-based Designer user interface and collection of pre-built components called Snaps made it possible to create a visual and functional representation of the data analytics workflow without knowing the intricacies of Pig and MapReduce. Here’s a quick writeup:

About Pig: Pig is a high-level scripting language used with Apache Hadoop to build complex data processing applications for business problems. Pig supports both interactive and batch jobs, with MapReduce as the default execution mode. Here’s a tutorial.

About SnapReduce and the Hadooplex: SnapReduce and our Hadooplex enable SnapLogic’s iPaaS to run natively on Hadoop as a YARN application that elastically scales out to power big data analytics. SnapLogic gives Hadoop users an HTML5-based drag-and-drop user interface, a breadth of connectivity (called Snaps), and a modern architecture. Learn more here.

Overall Use Case (Product Usage Analytics)

Raw product usage data from consumer apps is loaded into Hadoop HCatalog tables and stored in RCFile format. The job reads the relevant data fields (product name, user, and usage history with date and time), cleanses the data, and eliminates duplicate records by grouping on timestamp, keeping the unique, most recent record for each user. It then writes the results to HDFS partitions based on date/time. Product analysts create an external table in Hive on top of the already-partitioned data to query it and build product usage and trend reports, which they write to a file or export to a visual analytics tool like Tableau.
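
The Hive step at the end of this workflow isn't shown in the Pig script below, so here is a minimal, hedged sketch of what it might look like, driven from Python via beeline. The database, table, and column names are assumptions inferred from the script, not the customer's actual DDL:

# Hypothetical sketch: register an external Hive table over the partitioned HDFS
# output so analysts can query it. Names, paths, and the HiveServer2 URL are assumptions.
import subprocess

HIVE_SQL = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS marketing.sc_survey_results (
  user_guid STRING,
  product STRING,
  create_date STRING,
  survey_results STRING
)
PARTITIONED BY (epoch BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/warehouse/marketing/sc_survey_results';

-- Pick up the epoch=<ts> directories the Pig job has already written.
MSCK REPAIR TABLE marketing.sc_survey_results;
"""

subprocess.run(
    ["beeline", "-u", "jdbc:hive2://hiveserver2.example.com:10000/default", "-e", HIVE_SQL],
    check=True,
)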

Here’s the Pig Script portion for the above use case (Cleansing data):

-- Register the Hive exec jar so HCatLoader is available, and set the default parallelism.
REGISTER /apps/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar;
SET default_parallel 24;
DEFINE HCatLoader org.apache.hcatalog.pig.HCatLoader();

-- Load the raw survey history from the HCatalog table and project the fields we need.
raw = LOAD 'sourcedata.sc_survey_results_history' USING HCatLoader();
in = FOREACH raw GENERATE user_guid, survey_results, date_time, product AS product;

-- Group by user and product, keeping only the most recent record in each group.
grp_in = GROUP in BY (user_guid, product);
grp_data = FOREACH grp_in {
    order_date_time = ORDER in BY date_time DESC;
    max_grp_data = LIMIT order_date_time 1;
    GENERATE FLATTEN(max_grp_data);
};

-- Reshape the deduplicated records and store them in a date-partitioned HDFS path.
grp_out_data = FOREACH grp_data GENERATE
    max_grp_data::user_guid AS user_guid,
    max_grp_data::product AS product,
    '$create_date' AS create_date,
    CONCAT('-"product"="', CONCAT(max_grp_data::product, CONCAT('",', max_grp_data::survey_results))) AS survey_results;
STORE grp_out_data INTO 'hdfs://nameservice1/warehouse/marketing/sc_survey_results/epoch=$epoch_ts' USING PigStorage('\u0001');

SnapReduce Pipeline equivalent for the Pig script

This SnapReduce pipeline is translated to run as a MapReduce job in Hadoop. It can be scheduled or triggered to automate the integration, and it can even be turned into a reusable integration pattern. As you will see, it is pretty easy and intuitive to create a pipeline that replaces a Pig script using SnapLogic's HTML5 GUI and Snaps.
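
For the "triggered" option, a pipeline exposed as a triggered task can be invoked over HTTPS from a scheduler or another application. The sketch below is illustrative only; the URL, token, and parameter names are placeholders rather than SnapLogic's actual endpoint format:

# Illustrative only: kick off a pipeline exposed as a triggered task over HTTPS.
# The task URL, bearer token, and parameter names are placeholders.
import requests

TASK_URL = "https://integration.example.com/triggered-tasks/cleanse-survey-results"
TOKEN = "replace-with-task-token"

response = requests.post(
    TASK_URL,
    headers={"Authorization": "Bearer " + TOKEN},
    params={"epoch_ts": "1441065600", "create_date": "2015-09-01"},
    timeout=60,
)
response.raise_for_status()
print("Pipeline triggered, status:", response.status_code)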

The complete data analytics use case above was created in SnapLogic. I have only covered the Pig script portion here and plan to write about the rest of the use case later. Hope this helps! Here’s a demonstration of our big data integration solution in action. Contact Us to learn more.

[Image: SnapReduce pipeline in the SnapLogic Designer]

The SnapLogic team is going on the road with the data management and integration specialists from PricewaterhouseCoopers. Building on our alliance partnership, which PWC’s Michael Pearl talked about in this post, we’re kicking off a Big Data Summit Series in New York on September 9th with a networking and informational lunch designed for data architects.

  • Is your IT organization moving more business applications to the cloud?
  • Are you researching Hadoop and establishing a vision for the data lake?
  • Are more of your integration workloads and analytics running in the cloud, on Hadoop, or both?

SnapLogic’s Frank Samuelian will be moderating the Summit Series, which is designed to be an interactive session for data management professionals featuring presentations from:

  • John Simmons, Principal, PWC
  • Kenneth Kryst, Director, PWC
  • Ravi Dharnikota, Sr. Advisor, SnapLogic

The speakers will review what’s new in the world of data and application integration and modern data architecture best practices.

Space is limited in New York. The registration details are here, but please connect with Frank Samuelian directly if you’d like to send representatives from your enterprise IT organization.

We’ll also be coming to Boston and Washington, DC in September. Stay tuned for more details.

I recently had the pleasure of chatting with SnapLogic customer Yelp. Given the nature of their business, Yelp had a lot of customer data that they needed to process and act upon quickly in order to optimize their revenue streams. They decided to adopt the Amazon Redshift data warehouse service to give them the analytics speed and flexibility they needed. So the next question was: how to get the data into Redshift efficiently.

Once they discovered that SnapLogic had the out-of-the-box connectors they needed (not only for Redshift but also for their data sources, Salesforce and Workday), it came down to build versus buy. They could build the integrations using in-house resources, but a DIY approach carried an opportunity cost and a speed penalty. In the end, they chose SnapLogic and estimate that they cut development time in half.
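
For context on what the "build" path typically involves, here is a minimal, hedged sketch of a hand-rolled Redshift load. This is my illustration rather than anything Yelp built, and the cluster endpoint, credentials, table, S3 bucket, and IAM role are all placeholders:

# Hypothetical DIY load: stage extracted records in S3, then COPY them into Redshift.
# Connection details, table, bucket, and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="replace-me",
)

copy_sql = """
    COPY analytics.salesforce_opportunities
    FROM 's3://example-bucket/exports/salesforce/opportunities/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
    FORMAT AS JSON 'auto'
    TIMEFORMAT 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift ingests the staged files in one bulk operation.

Multiply that by every object to extract, plus scheduling, retries, and schema changes, and the opportunity cost of the DIY approach becomes clear.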

And they’re not done – they are connecting Workday with Redshift next. Yelp told me, “Looking ahead, we’re planning to deploy the Workday Snap to connect our human resources data to Redshift. SnapLogic has proven to be a tremendous asset.” Sounds like a 5-star review. Read more here.

SnapLogic recently introduced our Summer 2015 release. Last weekend we updated our library of pre-built connectors, called Snaps. Today, Enterprise Management Associates (EMA) published a review of our latest Elastic Integration Platform as a service (iPaaS) innovation. Their conclusion:

“By providing better support for big data and cloud data sources and improving governance capabilities, SnapLogic is focusing on the framework required to implement strong data management in addition to integration. Enhancing self-service components through better task management and overall Snap use also shows a strong commitment to providing customers with a way to manage the full data acquisition lifecycle through a reusable framework.”

I’ve embedded the review below. You can also check out the recorded webinar and be sure to sign up for our new bi-weekly SnapLogic Live demonstrations, where our technical experts will dive into hybrid cloud and big data integration topics.

With 300+ Snaps now available, we’re regularly updating and enhancing our intelligent connector library. Building on our recent Summer 2015 release, this weekend all SnapLogic customers will be updated with our August Snap update. Here’s a summary – from A to Z.

Updated Snaps include:

  • Active Directory
  • AWS Redshift
  • Anaplan
  • Binary
  • Concur
  • DynamoDB 
  • Email
  • Flow
  • JDBC
  • LDAP
  • MongoDB
  • MySQL
  • Oracle RDBMS
  • Oracle E-Business Suite
  • SQL Server
  • SOAP
  • Transform
  • Zuora

New Snaps include:

  • Google SpreadSheet Snap Pack contains Snaps for browsing Google SpreadSheets, reading worksheets, and writing to worksheets.
  • In the Binary Snap Pack there is a new File Poller Snap that polls a directory looking for files matching the specified pattern (a generic sketch of this polling pattern appears after this list).
  • There are many new Snaps for AWS DynamoDB. Check out our recent AWS partner webinar with Earth Networks for a great customer overview.
  • The Flow Snap Pack contains a new Exit Snap, which forces a pipeline to stop with a failed status if it receives more records than the user-defined threshold.
  • The Transform Snap Pack contains a new Transcoder Snap, enabling preview when the data being processed contains special characters.
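
As a generic analogy for the File Poller behavior described above (not the Snap's actual implementation), a directory poller boils down to something like this, with the directory and pattern as made-up examples:

# Generic analogy only: poll a directory for files whose names match a glob
# pattern, yielding each newly arrived file once.
import glob
import os
import time

def poll_directory(directory, pattern="*.csv", interval_seconds=30):
    seen = set()
    while True:
        for path in glob.glob(os.path.join(directory, pattern)):
            if path not in seen:
                seen.add(path)
                yield path  # hand the new file to the next processing step
        time.sleep(interval_seconds)

# Example: react to each new CSV that lands in /data/incoming.
# for new_file in poll_directory("/data/incoming"):
#     print("New file:", new_file)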

As always, be sure to contact our Support Team if you have any questions. If you’re new to SnapLogic, you can learn more about our Snaps here.  (Yes, that’s me in the video!)

A few months ago we published a series about the new hybrid cloud and big data integration requirements. Here’s an update:

Traditional approaches to data and application integration are being re-imagined thanks to:

  1. Enterprise “cloudification”: Cloud expansion has hit a tipping point and most IT organizations are either running to keep up or trying to get ahead of the pace of this transformation; and
  2. The need for speed: Cloud adoption and big data proliferation have led to business expectations for self-service and real-time data delivery.

As a result, the concept of integration platform as a service (iPaaS) has gained momentum with enterprise IT organizations that need to connect data, applications, and APIs faster. Typical iPaaS requirements include: an easier-to-use user experience, metadata-driven integrations, pre-built connectivity without coding, data transformation and other ETL operations, and support for hybrid deployments. Here are four additional iPaaS requirements that cannot be ignored.

  1. Loose Coupling to Manage Change: Businesses now expect to respond to changing requirements immediately, and those changes result in data changes that impact the integration layer. For example, a new column is added to a table, or a field to an API, to record or deliver additional information. Last-generation ETL tools are strongly typed, requiring the developer to define the exact data structures that will pass through integrations while designing them. Any departure from this structure breaks the integration because additional fields are not recognized. This brittle approach can bring today's agile enterprise to its knees. The right iPaaS solution must be resilient enough to handle frequent updates and variations in stride. Look for “loose coupling” and a JSON-centric approach that doesn't require rigid dependency on a pre-defined schema (a minimal sketch of this schema tolerance follows this list). The result is maximum re-use and the flexibility you need for integrations to continue to run even as endpoint data definitions change over time.
  2. Platform Architecture Matters: Your integration layer must seamlessly transition from connecting on-premises systems to cloud systems (and vice versa) while still ensuring a high degree of business continuity. Many legacy data integration vendors “cloud wash” their solutions by simply hosting their software, or by providing only some aspects of their solution as a multi-tenant cloud service. Some require on-premises ETL or ESB technologies for advanced integration development and administration. When looking at a hybrid cloud integration solution, look under the hood to ensure there’s more than a legacy “agent” running behind the firewall. Look for elastic scale and the ability to handle modern big (and small) data volume, variety, and velocity. And ensure that your iPaaS “respects data gravity” by running as close to the data as necessary, regardless of where it resides.
  3. Integration Innovation: Many enterprise IT organizations are still running old, unsupported versions of integration software because of the fear of upgrades and the mindset of “if it ain’t broke, don’t fix it.” Cumbersome manual upgrades of on-premises installations are error-prone and result in significant re-development, testing cycles, and downtime. The bigger the implementation, the bigger the upgrade challenge—and connector libraries can be equally painful. Modern iPaaS customers expect the vendor to shield them from as much upgrade complexity as possible. They are increasingly moving away from developer-centric desktop IDEs. Instead, they want self service—browser-based designers for building integrations, and automatic access to the latest and greatest functionality.
  4. Future Proofing: Many IT organizations are facing the Integrator's Dilemma, where their legacy data and application integration technologies were built for last decade's requirements and can no longer keep up. To handle the new social, mobile, analytics, cloud, and Internet of Things (SMACT) requirements, a modern iPaaS must deliver elastic scale that expands and contracts its compute capacity to handle variable workloads. A hybrid cloud integration platform should move data in a lightweight format and add minimal overhead; JSON is regarded as the compact format of choice when compared to XML. A modern iPaaS should also be able to handle REST-based streaming APIs that continuously feed an analytics infrastructure, whether it's Hadoop or a cloud-based or traditional data warehouse environment. With iPaaS, data and application integration technologies are being re-imagined, so don't let legacy, segregated approaches be a barrier to enterprise cloud and big data success. Cloud applications like Salesforce and Workday continue to fuel worldwide software growth, while infrastructure as a service (IaaS) and platform as a service (PaaS) providers offer customers the flexibility to build up systems and tear them down in short cycles.
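
To make the loose-coupling point in item 1 concrete, here is a minimal sketch in Python, with made-up field names, contrasting a rigidly typed mapping with a schema-tolerant, JSON-centric one that passes unknown fields through unchanged:

# Minimal sketch of schema-tolerant record handling; field names are made up.
import json

def rigid_transform(record):
    # Strongly typed style: only the fields declared at design time survive,
    # and a missing or renamed field raises an error.
    return {"id": record["id"], "email": record["email"]}

def loose_transform(record):
    # JSON-centric style: normalize the fields we know about and pass any
    # new or unexpected fields through untouched.
    out = dict(record)  # keep everything the endpoint sent
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    return out

incoming = json.loads('{"id": 42, "email": " Ada@Example.com ", "loyalty_tier": "gold"}')
print(loose_transform(incoming))
# {'id': 42, 'email': 'ada@example.com', 'loyalty_tier': 'gold'}
# The new "loyalty_tier" field flows through without breaking the integration.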