Recently, I worked with a customer to reverse engineer a Pig Script running a MapReduce job in Hadoop and then orchestrated it as a SnapReduce pipeline with SnapLogic’s Elastic Integration Platform. SnapLogic’s HTML5 cloud-based Designer user interface and collection of pre-built components called Snaps made it possible to create a visual and functional representation of the data analytics workflow without knowing the intricacies of Pig and MapReduce. Here’s a quick writeup:

About Pig: Pig is a high-level scripting language used with Apache Hadoop for building complex applications that tackle business problems. Pig supports both interactive and batch jobs, with MapReduce as the default execution mode. Here’s a tutorial.

About SnapReduce and the Hadooplex: SnapReduce and our Hadooplex enable SnapLogic’s iPaaS to run natively on Hadoop as a YARN application that elastically scales out to power big data analytics. SnapLogic allows Hadoop users to take advantage of an HTML5-based drag-and-drop user interface, a breadth of connectivity (called Snaps), and a modern architecture. Learn more here.

Overall Use Case (Product Usage Analytics)

Raw product usage data from consumer apps is loaded into Hadoop HCatalog tables and stored in RCFile format. The program reads the data fields (product name, user, and usage history with date and time), cleanses the data, and eliminates duplicate records by grouping on timestamp. It finds the unique latest record for each user and writes the results to HDFS partitions based on date/time. Product analysts then create an external Hive table on top of the partitioned data to query and build product usage and trend reports, which they write to a file or export to a visual analytics tool like Tableau.

Here’s the Pig Script portion for the above use case (Cleansing data):

REGISTER /apps/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
SET default_parallel 24;
DEFINE HCatLoader org.apache.hcatalog.pig.HCatLoader();
raw = LOAD 'sourcedata.sc_survey_results_history' USING HCatLoader();
in = FOREACH raw GENERATE user_guid, survey_results, date_time, product AS product;
grp_in = GROUP in BY (user_guid, product);
grp_data = FOREACH grp_in {
    order_date_time = ORDER in BY date_time DESC;
    max_grp_data = LIMIT order_date_time 1;
    GENERATE FLATTEN(max_grp_data);
};
grp_out_data = FOREACH grp_data GENERATE max_grp_data::user_guid AS user_guid, max_grp_data::product AS product, '$create_date' AS create_date, CONCAT('-"product"="', CONCAT(max_grp_data::product, CONCAT('",', max_grp_data::survey_results))) AS survey_results;
STORE grp_out_data INTO 'hdfs://nameservice1/warehouse/marketing/sc_survey_results/epoch=$epoch_ts' USING PigStorage('\u0001');
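For readers who don’t speak Pig, the core of the script (group by user_guid and product, order by date_time descending, keep the first row) can be sketched in plain Python. This is an illustrative stand-in with made-up sample data, not what SnapReduce actually generates:

```python
from itertools import groupby
from operator import itemgetter

def latest_per_user_product(records):
    """Keep only the most recent record for each (user_guid, product) pair,
    mirroring the GROUP ... ORDER ... DESC ... LIMIT 1 pattern in the Pig script."""
    keyfunc = itemgetter("user_guid", "product")
    deduped = []
    for _, group in groupby(sorted(records, key=keyfunc), key=keyfunc):
        # Within each group, pick the row with the greatest date_time.
        deduped.append(max(group, key=itemgetter("date_time")))
    return deduped

rows = [
    {"user_guid": "u1", "product": "app", "date_time": "2015-01-02", "survey_results": "b"},
    {"user_guid": "u1", "product": "app", "date_time": "2015-01-01", "survey_results": "a"},
    {"user_guid": "u2", "product": "app", "date_time": "2015-01-03", "survey_results": "c"},
]
print(latest_per_user_product(rows))
```

In a real MapReduce job the grouping key becomes the shuffle key and the per-group `max` runs in the reducer, but the logic is the same.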

SnapReduce Pipeline equivalent for the Pig script

This SnapReduce pipeline is translated to run as a MapReduce job in Hadoop. It can be scheduled or triggered to automate the integration, and it can even be turned into a re-usable integration pattern. As you will see, it is easy and intuitive to create a pipeline that replaces a Pig script using the SnapLogic HTML5 GUI and Snaps.

The complete data analytics use case above was created in SnapLogic. I have only covered the Pig script portion here and plan to write up about the rest of the use case sometime later. Hope this helps! Here’s a demonstration of our big data integration solution in action. Contact Us to learn more.


The SnapLogic team is going on the road with the data management and integration specialists from PricewaterhouseCoopers. Building on our alliance partnership, which PWC’s Michael Pearl talked about in this post, we’re kicking off a Big Data Summit Series in New York on September 9th with a networking and informational lunch designed for data architects.

  • Is your IT organization moving more business applications to the cloud?
  • Are you researching Hadoop and establishing a vision for the data lake?
  • Are more of your integration workloads and analytics running in the cloud, on Hadoop, or both?

SnapLogic’s Frank Samuelian will be moderating the Summit Series, which is designed to be an interactive session for data management professionals featuring presentations from:

  • John Simmons, Principal, PWC
  • Kenneth Kryst, Director, PWC
  • Ravi Dharnikota, Sr. Advisor, SnapLogic

The speakers will review what’s new in the world of data and application integration and modern data architecture best practices.

Space is limited in New York. The registration details are here, but please connect with Frank Samuelian directly if you’d like to send representatives from your enterprise IT organization.

We’ll also be coming to Boston and Washington, DC in September. Stay tuned for more details.

I recently had the pleasure of chatting with SnapLogic customer Yelp. Given the nature of their business, Yelp had a lot of customer data that they needed to process and act upon quickly in order to optimize their revenue streams. They decided to adopt the Amazon Redshift data warehouse service to give them the analytics speed and flexibility they needed. So the next question was: how to get the data into Redshift efficiently.

Once they discovered that SnapLogic had the out-of-the-box connectors they needed — not only for Redshift but also for data sources like Salesforce and Workday — it came down to build versus buy. They could build the integrations using in-house resources, but there was an opportunity cost and speed penalty that came with a DIY approach. In the end, they chose SnapLogic and estimate that they cut development time in half.

And they’re not done – they are connecting Workday with Redshift next. Yelp told me, “Looking ahead, we’re planning to deploy the Workday Snap to connect our human resources data to Redshift. SnapLogic has proven to be a tremendous asset.” Sounds like a 5-star review. Read more here.

SnapLogic recently introduced our Summer 2015 release. Last weekend we updated our library of pre-built connectors, called Snaps. Today Enterprise Management Associates published a review on our latest Elastic Integration Platform as a service (iPaaS) innovation. Their conclusion:

“By providing better support for big data and cloud data sources and improving governance capabilities, SnapLogic is focusing on the framework required to implement strong data management in addition to integration. Enhancing self-service components through better task management and overall Snap use also shows a strong commitment to providing customers with a way to manage the full data acquisition lifecycle through a reusable framework.”

I’ve embedded the review below. You can also check the recorded webinar and be sure to sign up for our new bi-weekly SnapLogic Live demonstrations, where our technical experts will dive into hybrid cloud and big data integration topics.

With 300+ Snaps now available, we’re regularly updating and enhancing our intelligent connector library. Building on our recent Summer 2015 release, this weekend all SnapLogic customers will be updated with our August Snap update. Here’s a summary – from A to Z.

Updated Snaps include:

  • Active Directory
  • AWS Redshift
  • Anaplan
  • Binary
  • Concur
  • DynamoDB 
  • Email
  • Flow
  • JDBC
  • LDAP
  • MongoDB
  • MySQL
  • Oracle RDBMS
  • Oracle E-Business Suite
  • SQL Server
  • SOAP
  • Transform
  • Zuora

New Snaps include:

  • Google SpreadSheet Snap Pack contains Snaps for browsing Google SpreadSheets, reading worksheets, and writing to worksheets.
  • In the Binary Snap Pack there is a new File Poller Snap that polls a directory looking for files matching the specified pattern.
  • There are many new Snaps for AWS DynamoDB. Check out our recent AWS partner webinar with Earth Networks for a great customer overview.
  • The Flow Snap Pack contains a new Exit Snap, which forces a pipeline to stop with a failed status if it receives more records than the user-defined threshold.
  • The Transform Snap Pack contains a new Transcoder Snap, enabling a preview if a Snap contains special characters.
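To make the File Poller idea concrete, here’s a rough stand-alone sketch of directory polling in Python. This is an illustration of the general pattern only, not the Snap’s actual implementation (parameter names are made up):

```python
import glob
import os
import time

def poll_directory(directory, pattern, interval_s=5.0, max_polls=1):
    """Repeatedly scan a directory for files matching a glob pattern,
    yielding each new match exactly once -- the general shape of a file poller."""
    seen = set()
    for poll in range(max_polls):
        if poll:  # sleep between passes, not before the first scan
            time.sleep(interval_s)
        for path in glob.glob(os.path.join(directory, pattern)):
            if path not in seen:
                seen.add(path)
                yield path
```

A production poller would also handle files still being written (e.g., by checking that size/mtime has stabilized), which this sketch omits.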

As always, be sure to contact our Support Team if you have any questions. If you’re new to SnapLogic, you can learn more about our Snaps here.  (Yes, that’s me in the video!)

A few months ago we published a series about the new hybrid cloud and big data integration requirements. Here’s an update:

Traditional approaches to data and application integration are being re-imagined thanks to:

  1. Enterprise “cloudification”: Cloud expansion has hit a tipping point and most IT organizations are either running to keep up or trying to get ahead of the pace of this transformation; and
  2. The need for speed: Cloud adoption and big data proliferation have led to business expectations for self-service and real-time data delivery.

As a result, the concept of integration platform as a service (iPaaS) has gained momentum with enterprise IT organizations who need to connect data, applications, and APIs faster. Typical iPaaS requirements include an easier-to-use user experience, metadata-driven integrations, pre-built connectivity without coding, data transformation and other ETL operations, and support for hybrid deployments. Here are four additional iPaaS requirements that cannot be ignored.

  1. Loose Coupling to Manage Change: Businesses now expect to respond to changing requirements immediately. These changes result in data changes that impact the integration layer. For example, a new column is added to a table, or a field to an API, to record or deliver additional information. Last-generation ETL tools are strongly typed, requiring the developer to define the exact data structures that will pass through integrations while designing them. Any departure from this structure breaks the integration because additional fields are not recognized. This brittle approach can bring today’s agile enterprise to its knees. The right iPaaS solution must be resilient enough to handle frequent updates and variations in stride. Look for “loose coupling” and a JSON-centric approach that doesn’t require rigid dependency on a pre-defined schema. The result is maximum re-use and the flexibility you need for integrations to continue to run even as endpoint data definitions change over time.
  2. Platform Architecture Matters: Your integration layer must seamlessly transition from connecting on-premises systems to cloud systems (and vice versa) while still ensuring a high degree of business continuity. Many legacy data integration vendors “cloud wash” their solutions by simply hosting their software, or by providing only some aspects of their solution as a multi-tenant cloud service. Some require on-premises ETL or ESB technologies for advanced integration development and administration. When looking at a hybrid cloud integration solution, look under the hood to ensure there’s more than a legacy “agent” running behind the firewall. Look for elastic scale and the ability to handle modern big (and small) data volume, variety, and velocity. And ensure that your iPaaS “respects data gravity” by running as close to the data as necessary, regardless of where it resides.
  3. Integration Innovation: Many enterprise IT organizations are still running old, unsupported versions of integration software because of the fear of upgrades and the mindset of “if it ain’t broke, don’t fix it.” Cumbersome manual upgrades of on-premises installations are error-prone and result in significant re-development, testing cycles, and downtime. The bigger the implementation, the bigger the upgrade challenge—and connector libraries can be equally painful. Modern iPaaS customers expect the vendor to shield them from as much upgrade complexity as possible. They are increasingly moving away from developer-centric desktop IDEs. Instead, they want self service—browser-based designers for building integrations, and automatic access to the latest and greatest functionality.
  4. Future Proofing: Many IT organizations are facing the Integrator’s Dilemma: their legacy data and application integration technologies were built for last decade’s requirements and can no longer keep up. To handle the new social, mobile, analytics, cloud, and Internet of Things (SMACT) requirements, a modern iPaaS must deliver elastic scale that expands and contracts its compute capacity to handle variable workloads. A hybrid cloud integration platform should move data in a lightweight format and add minimal overhead; JSON is regarded as the compact format of choice when compared to XML. A modern iPaaS should also be able to handle REST-based streaming APIs that continuously feed an analytics infrastructure, whether it’s Hadoop, a cloud-based, or a traditional data warehouse environment. With iPaaS, data and application integration technologies are being re-imagined, so don’t let legacy, segregated approaches be a barrier to enterprise cloud and big data success. Cloud applications like Salesforce and Workday continue to fuel worldwide software growth, while infrastructure as a service (IaaS) and platform as a service (PaaS) providers offer customers the flexibility to build up systems and tear them down in short cycles.
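The “loose coupling” requirement above is easy to show in miniature. In this hypothetical Python sketch (field names are invented, and this is not SnapLogic code), the transform reads only the fields it needs and passes unknown fields through untouched, so an added upstream column flows through without breaking anything:

```python
import json

def enrich(record):
    """A loosely coupled transform: it touches only the fields it needs
    and preserves everything else, so a new upstream field does not
    break the integration."""
    out = dict(record)  # pass unknown fields through unchanged
    out["full_name"] = f"{record.get('first', '')} {record.get('last', '')}".strip()
    return out

# The same transform handles both the old shape and a new shape
# with an extra column added upstream.
old = json.loads('{"first": "Ada", "last": "Lovelace"}')
new = json.loads('{"first": "Ada", "last": "Lovelace", "region": "EMEA"}')
print(enrich(old)["full_name"], enrich(new)["region"])
```

A strongly typed ETL mapping would have rejected the second record until its schema was redefined; the JSON-centric version simply carries the new field along.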

This post originally appeared on Glenn Donovon’s blog.

I’ve recently decided to take a hard look at cloud iPaaS (integration platform as a service), and in particular SnapLogic, due to a good friend of mine joining them to help build their named account program. It’s an interesting platform which I think has the potential to help IT with the “last mile” of cloud build-out in the enterprise, not just due to its features, but because of a shift in software engineering and design that started in places like Google, Amazon, and Netflix (and in startups that couldn’t afford an “enterprise technology stack”) and is now making its way into the enterprise.

However, while discussing this with my friend, it became clear that one has to understand the history of integration servers, EAI, SOA, ESB, and WS standards to put into context the lessons that have been learned along the way regarding actual IT agility. But let me qualify myself before we jump in. My POV is based on being an enterprise tech sales rep who sold low latency middleware, and standards based middleware, EAI, a SOA grid-messaging bus as well as applications like CRM, BPM and firm-wide market/credit risk management, which have massive system integration requirements. While I did some university and goofing off coding in my life (did some small biz database consulting for a while), I’m not an architect, coder or even a systems analyst. I saw up close and personal why these technologies were purchased, how they evolved over time, what clients got out of them and how all our plans and aspirations played out. So, I’m trying to put together a narrative here that connects the dots for people other than middleware developers and CTO/Architect types. Okay, that said, buckle up.

The Terrain
Let’s define some terms. “Middleware” is a generic term; I’m using it to refer to message buses/ESBs and integration servers (EAI). The history of those domains led us to our current state, and helps make clear where we need to go next, and why the old way of doing systems integration and building SOA is falling away in favor of RESTful web services and micro-services-based design in general.

Integration servers – whether from IBM or challengers in those days like SeeBeyond (who I worked for), the point of an integration server was to stop the practice of hand writing code for each system/project to access data/systems. These were often referred to as “point to point” integrations in those days, and when building complex systems in large enterprises before integration servers, the data flows between systems often looked like a plate of spaghetti. One enterprise market risk project I worked on at Bank of New York always comes to mind when I think of this. The data flows of over 100 integration points from which the risk system consumed data had been laid out in one diagram and we actually laminated it on 11×17 paper, making a sort of placemat out of it. It became symbolic of this type of problem for me. It looked kind of like this:
(Image attribution to John Schmidt, a recognized authority in the integration governance field and author of books on the Integration Competency Centre and Lean Integration)

So, along came the “integration server”. The purpose was to provide a common interface and tools to connect systems without writing code from scratch, so integrations between systems would be easy to build and maintain, while also being secure and performing well. Another major motivation was to loosely couple target systems and data-consuming systems, isolating both from changes in the underlying code of either. The resulting service was available essentially as an API; these were proprietary systems, and in a sense ultimately black boxes from which you consumed or contributed data. They did the transforms, managed traffic, loaded the data efficiently, and so on. Of course, there were also servers dedicated to bulk/batch data, such as Informatica and later Ab Initio, but it’s funny: these two very related worlds never successfully came together into a unified offering.

This approach didn’t change how software systems were designed, developed, or deployed in a radical way, though. Rather, the point was to isolate integration requirements and do them better via a dedicated server and toolkit. This approach delivered a lot of value to enterprises in terms of building robust systems integrations, and also helped large firms adopt packaged software with a lot less code writing, but in the end it offered only marginal improvements in development efficiency while injecting yet another system dependency (new tool, specialized staff, ongoing operations costs) into the “stack”. Sure, you could build competency centers and “factories”, but to this day such approaches end up creating more bureaucracy, more dependencies, and more complexity while adding less and less value compared to what a developer can build him/herself with RESTful services/micro-services. And most things one wants to integrate with today already have well-defined APIs, so it’s often much easier to connect and share data anyway. Add innovative ideas like “eventually consistent data”, the incredible cost and computing advantages of open source versus proprietary technologies, and public IaaS welcoming such workloads at the container level, and, well, let’s just say there is much more changing than just integration. Smart people I talk to tell me that using an ESB as the backbone of a micro-services based system is contrary to the architecture.

Messaging bus/ESB – This is a system for managing messaging traffic on a general purpose bus from which a developer can publish, subscribe, broadcast and multi-cast messages from target and source services/systems. This platform predates the web, fyi, and was present in the late ’80s and ’90s in specialized high speed messaging systems for trading floors, as well as in manufacturing and other industrial settings where low latency was crucial and huge traffic spikes were the norm. Tibco’s early success in trading rooms was due to the fact they provided a services based architecture for consuming market data which scaled and also allowed them to build systems using services/message bus design. These allowed such systems to approach the near-real time requirements of such environments, but they were proprietary, hugely expensive and not at all applicable for general purpose computing. However, using a service and bus architecture with pub/sub, broadcast, multi-point and point to point communications available as services for developers was terrific for isolating apps from data dependencies and dealing with the traffic spikes of such data.
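The publish/subscribe pattern described above can be reduced to a few lines. Here is a minimal in-memory sketch in Python (purely illustrative; real buses like Tibco’s add reliability guarantees, low-latency transports, and wire protocols):

```python
from collections import defaultdict

class Bus:
    """A toy message bus illustrating publish/subscribe: publishers and
    subscribers know only topic names, never each other."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # Register a callback to receive every message on a topic.
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Fan the message out to all current subscribers of the topic.
        for handler in self.subscribers[topic]:
            handler(message)

bus = Bus()
received = []
bus.subscribe("market.data", received.append)
bus.publish("market.data", {"symbol": "IBM", "price": 160.25})
```

The decoupling is the point: a new consumer subscribes to the topic without the publisher changing at all, which is exactly how trading-floor systems isolated apps from data dependencies.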

Over time, it became clear that building systems out of services had much general promise (inheriting basic ideas from object oriented design actually), and that’s when the idea of Web Services began to emerge. Entire standards bodies were formed and many open source and other projects spun off to support the standards based approach to web services system design and deployment. Service Oriented Architectures became all the rage – I worked on one project for Travelers Insurance while at SeeBeyond where we exposed many of their mainframe applications as web services. This approach was supposed to unleash agility in IT development.

But along the way, a problem became obvious. The cost, expertise, complexity and time involved in building such elegantly designed and governed systems frameworks ran counter to building systems fast. A good developer could get something done that worked and was high quality without resorting to using all those WS standardized services and conforming to its structure. Central problems included using XML to represent data because the processing power necessary for transforming these structures was always way too expensive for the payoff. SOAP was also a highly specialized protocol, injecting yet another layer of systems/knowledge/dependency/complexity to building systems with a high degree of integration.

The entire WS framework was too formalized, so enterprising developers asked, “How could I use a service-based architecture to my advantage when building a system without relying on all of that nonsense?” This is where RESTful web services come in, and then JSON, which changes everything. Suddenly, new, complex, and sophisticated web services became callable just over HTTP. Creating and using such services became trivially easy in comparison to the old way. Performance and security could be managed in other ways. What was lost was the idea of operating with “open systems” standards from a systems design standpoint, but it turns out that was less valuable in practice given all the overhead.
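The contrast is stark in code. With REST and JSON, a response parses into native data structures in one line of the standard library; no SOAP envelope, no WSDL, no code generation. The sketch below uses a canned JSON body with hypothetical fields standing in for what an HTTP GET would return:

```python
import json

# A canned JSON body standing in for the response of a hypothetical
# RESTful service; a live call would simply be an HTTP GET returning
# this text.
body = '{"user": "u1", "orders": [{"id": 1, "total": 9.99}]}'

data = json.loads(body)  # one line turns the payload into dicts and lists
print(data["orders"][0]["total"])
```

The equivalent SOAP interaction would require an envelope schema, namespace handling, and usually generated client stubs, which is exactly the overhead developers walked away from.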

The IT View
IT leadership already suspects that the messaging bus frameworks did not give them the most important benefit they were seeking in the first place – agility – yet they have all this messaging infrastructure and all these incredibly well behaved web services running. In a way, you can see this all as part of how IT organizations are learning what real agility looks like when using services based architectures. I think many are ready to dump all that highly complex and expensive overhead which came along with messaging buses when an enterprise class platform comes along that enables them to do so.

But IT still loves integration servers. I think they are eager for a legit, hybrid cloud integration server (iPaaS) that gives them an easy way to build and maintain interfaces for various projects at lower cost than ESB-based solutions, while running in the cloud and on-prem. It will need to provide the benefits of a service-based architecture (contractual-level relationships) without the complexity and overhead of messaging buses. It needs to leverage the flexibility of JSON while providing metadata-level control over the services, along with a comprehensive operational governance and management framework. It also needs to scale massively and be distributed in design in order to even be considered for this central role in enterprise systems.

The Real Driver of the Cloud
What gets lost sometimes with all of our technical focus is that economics are what is mostly driving cloud adoption. With the massive cost differentials opening up in public computing versus on-prem or even outsourced data centers due to the tens of billions IBM, Google, MSFT, AWS and others are pouring into their services, the public cloud will become the platform of choice for all computing going forward. All that’s up for debate inside any good IT organization is how long this transition is going to take.

Many large IT organizations clearly see that the tipping point cost-wise is already at hand with regard to IaaS and PaaS. This has happened in lots of cloud markets already: Salesforce didn’t really crush Siebel until its annual subscription fees were less than Siebel’s on-prem maintenance fees. Remember, there was no functional advantage of Salesforce over Siebel. Hint: since the economics are the driver, it’s not about having the most elegant solution, but about being able to actually do the job somehow, even if it’s not pretty or involves tradeoffs.

Deeper hint: Give a good systems engineer a choice between navigating a nest of tools, standards, and services, with all the performance penalties, overhead, functional compromises, and dependencies they bring along, versus writing a bit more code and being mindful of systems management, and she/he will choose the latter every time. Another way of saying this is that complexity and abstraction are very expensive when building systems, and the benefits have to be really large for the tradeoff to make sense. I don’t believe that, in the end, ESBs and WS standards as the backbone of SOA ever paid off in the way they needed to. And they never will. The industry paid a lot to learn that lesson, yet there seems to be a big reluctance to discuss it openly. The lesson? Simplicity = IT Agility.

The most important question is what core enterprise workloads can feasibly move to the public cloud, and that is still a work in progress. Platforms like Apprenda and other hybrid cloud solutions are a crucial part of that mix, as one has to be able to assign data and process to appropriate resources from a security/privacy and performance/cost POV. A future single “cloud” app built by an enterprise may have some of its data in servers on Azure in three different data centers to manage data residency, other data in on-prem data stores, be accessing data from SaaS apps, and be accessing still other data in super-cheap simple data stores on AWS. Essentially, this is a “policy” issue, and I’d say that if a hybrid iPaaS like SnapLogic can fit into those policies, behave well with them, and facilitate connecting these systems, it has a great shot at being a central part of the great rebuild of core IT systems we are going to watch happen over the next 10 years. This is because it provides the enterprise with what it needs, an integration server, while leveraging RESTful services, micro-services based architectures, and JSON, without an ESB.

This is all coming together now, so you will see growing interest in throwing out the old integration server/message bus architectures in organizations focused on transformation and agility as core values. I think the leadership of most IT organizations are ready to leave the old WS standards stuff behind, along with bus/message/service architectures, but their own technical organizations are still built around them so this will take some time. It’s also true that message buses will not be eliminated entirely, as there is a place for messaging services in systems development – just not as the glue for SOA. And of course, low latency buses will still apply to building near-real time apps in say engineering settings or on trading floors, but using message buses as a general purpose design pattern will just fade from view.

The bottom line is that IT leaders are just as frustrated they didn’t get the agility out of building systems using messaging bus/SOA patterns as business sponsors are about the costs and latency of these systems. All the constituencies are eager for the next paradigm and now is the time to engage in dialog about this new vision. To my thinking, this is one of the last crucial parts of IT infrastructure which needs to be modernized and cloud enabled to move the enterprise computing workloads to the cloud, and I think SnapLogic has a great shot at helping enterprises realize this vision.

______

About the Author

Glenn Donovon has advised, consulted for, or been an employee of 23 startups, and he has extensive experience working with enterprise B2B IT organizations. To learn more about Glenn, please visit his website.