Interested in seeing a live demo of SnapLogic integration in action? We’re kickstarting this year’s bi-weekly SnapLogic Live demo sessions, starting this week with a focus on big data and powering the data lake.

As we did with our previous SnapLogic Live series, each week we will focus on one particular type of integration and, via a live demo, show how customers are able to solve their varying integration challenges using the SnapLogic Elastic Integration Platform.

In addition to the demo, we also take time to address questions we typically hear from customers in a given use case. You can check out previously recorded SnapLogic Live sessions here, and register for this week’s Big Data demo here.

The calendar for the next few months is below. Registration for each is coming soon!

  • Thursday, February 25th – An overview of the SnapLogic Winter 2016 Release
  • Thursday, March 10th – Integrating Workday Financials
  • Thursday, March 24th – Using the SnapLogic Spark Snap
  • Thursday, April 7th – Integration for Amazon Web Services (AWS)
  • Thursday, April 21st – Big Data Integration and Powering the Data Lake

This post originally appeared on Data Informed.

As organizations look to increase their agility, IT and lines of business need to connect faster. Companies need to adopt cloud applications more quickly and they need to be able to access and analyze all their data, whether from a legacy data warehouse, a new SaaS application, or an unstructured data source such as social media. In short, a unified integration platform has become a critical requirement for most enterprises.

According to Gartner, “unnecessarily segregated application and data integration efforts lead to counterproductive practices and escalating deployment costs.”

Don’t let your organization get caught in that trap. Whether you are evaluating what you already have or shopping for something completely new, you should measure any platform by how well it addresses the “three A’s” of integration: Anything, Anytime, Anywhere.

Anything Goes

For today’s enterprise, the spectrum of what needs to be integrated is broader than ever. Most companies are dealing with many different data sources and targets, from software-as-a-service applications to on-premises ERP/CRM, databases and data warehouses, Internet of Things (IoT) sensors, clickstreams, logs, and social media data, just to name a few. Some older sources are being retired, but new sources are being added, so don’t expect simplicity any time soon. Instead, focus on making your enterprise ready for “anything.”

Beyond point-to-point. You may have managed integration before on a point-to-point basis. This approach is labor intensive, requiring hand-coding to get up and running, and additional coding any time there’s a change to either “point.” Integration of your endpoints could run into trouble when this happens, and then you would have to wait for your IT department to get around to fixing the issues. But the more serious problem is that this inflexible approach simply doesn’t scale to support enterprise-wide integration in a time of constant change.

Some modern concepts, when applied to integration, provide this flexibility and scale.

Microservices. An architectural approach in which IT builds a single application as a suite of small services that communicate with each other through lightweight REST APIs, microservices have become, over the past year or so, the standard architecture for developing enterprise applications.

When applied to integration, microservices open up tremendous opportunities for achieving large-scale integration at a very low cost. Instead of one big execution engine running all integrations, smaller execution engines each run a subset of the integrations. This way, you can supply more compute power to the integrations that need it, when they need it. You can also distribute the integrations between nodes on a cluster based on volume variations for horizontal scaling.

The document data model. Today’s modern applications produce more than just row and column data. So how do you achieve loose coupling while simultaneously supporting semi-structured and unstructured data, all without sacrificing performance? You can group data together more naturally and logically, and loosen the restrictions on database schema, by using a document data model to store data. Document-based data models help with loose coupling, brevity in expression, and overall reuse.

In this approach, each record and its associated data are thought of as a “document,” an independent unit that improves performance and makes it easier to distribute data across multiple servers while preserving its locality. You can turn object hierarchical data into a document. But this is not a seamless solution. Documents are a superset of row-column based records, so while you can put rows and columns into a document, it doesn’t work the other way around.
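To make the contrast concrete, here is a small illustrative sketch in Python (the record and field names are hypothetical, not drawn from any particular product) showing the same order first as normalized rows and then as a single self-contained document:

# A hypothetical order, first as normalized rows (two "tables"),
# then as one document that keeps the order and its line items together.
order_row = {"order_id": 1001, "customer_id": 7, "status": "shipped"}
order_line_rows = [
    {"order_id": 1001, "sku": "SKU-1", "qty": 2},
    {"order_id": 1001, "sku": "SKU-2", "qty": 1},
]

# The document form preserves locality: one record can be read, moved,
# or distributed across servers as a unit, and the schema can evolve freely.
order_document = {
    "order_id": 1001,
    "customer": {"id": 7, "name": "Acme Corp"},
    "status": "shipped",
    "lines": [
        {"sku": "SKU-1", "qty": 2},
        {"sku": "SKU-2", "qty": 1},
    ],
}

Going from the rows to the document is a simple grouping step; recovering clean rows and columns from an arbitrary document is the harder direction, which is the asymmetry described above.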

Anytime is Real Time, and It’s Happening Now

Today’s use cases like recommendation engines, predictive analytics, and fraud detection increasingly demand real-time, “anytime” capture and processing of data from applications. A modern integration platform needs a streaming layer that can handle real-time use cases as well as batch processing.

Many organizations are used to choosing tools based on data momentum: ESB platforms for event-based, low latency application integrations; and ETL tools for high-volume batch jobs. Now, though, enterprises have to look for the simplicity and flexibility of a framework that can support both batch and real-time, “anytime” processing, and architectures like the Lambda architecture are a result of that need.

The Lambda architecture is designed to balance latency and throughput in handling batch and real-time use cases. The batch layer provides comprehensive and accurate views. It can reprocess the entire data set available to it in case of any errors. However, it has a high latency, so to compensate, it also has a speed layer that provides real-time processing of streaming data. The serving layer of this architecture consists of an appropriate database for the speed and batch layers, which can be combined and queried to get answers from the data.
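As a toy illustration of that division of labor (a sketch of the idea only, not any vendor’s implementation), the serving layer can be thought of as merging a complete-but-stale batch view with a small, fresh speed-layer view at query time:

# Lambda-style query: combine a precomputed batch view (complete, but hours old)
# with a speed-layer view covering only the events since the last batch run.
batch_view = {"page_a": 10_000, "page_b": 4_200}   # counts up to the last batch run
speed_view = {"page_a": 37, "page_c": 5}           # counts since the last batch run

def page_views(page: str) -> int:
    # The serving layer answers queries by merging both views.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(page_views("page_a"))  # 10037: accurate history plus real-time updates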

Because of these real-time use cases, streaming platforms have become very desirable.

Anywhere Should Look Like Everywhere

With today’s hybrid data and deployment architecture, your data can be anywhere, in any format, and might need a different processing treatment based on the particular use case. So, for example:

  • If all your applications and data are in the cloud, you would need a cloud-first approach for all the other parts of the application ecosystem, including integration.
  • If you have a hybrid architecture comprising both on-premises data and cloud applications, you may need to restrict data from leaving the premises. Consider an on-premises data plane for processing within the firewall.
  • If you have a big data ecosystem, you probably need the flexibility to run natively on Hadoop using the YARN resource manager and to use MapReduce for processing any integration or transformation jobs.

Meanwhile, Spark has been gaining a lot of traction for processing low-latency jobs. For certain use cases that require analysis in real time, such as fraud detection, log processing, and processing data from IoT, Spark is an ideal processing engine.

Integration is at the heart of every successful social, mobile, analytics (big data), cloud, and IoT initiative. It’s no longer possible to scale up to success while having to choose between multiple tools and teams for application, process, and data integration. Successful enterprises today need to access and stream resources instantly through a single platform. When you have the right platform – one that provides anything, anytime, anywhere it’s needed – your users will never need to stop and ask what resources are being used, whether information is current, or where it’s located. Whatever they need will be available to them when they need it and wherever they are.

I also posted this on LinkedIn. Comments are welcome on Data Informed, LinkedIn or here.

In this new customer video, Mark Patton, VP of enterprise architecture at GameStop, speaks about the importance of speed and agility when it comes to modern enterprise data and application integration. He notes that with SnapLogic:

“We’ve been able to decrease the time it takes to implement a well-defined integration by 83%.”

Mark goes on to speak to the power of self-service integration in the video, which I’ve embedded below:

Next steps:

  • Visit the SnapLogic video site for more customer testimonials, recorded webinars and demonstrations
  • Let us know if you’re interested in a custom SnapLogic Elastic Integration Platform demonstration

Alan Leung is the Sr. Enterprise Systems Program Manager at Box. Prior to Box he worked at Appirio, where he had hands-on experience with many application and data integration technologies. After evaluating a number of vendors in the market, here’s what he had to say about SnapLogic:

“I was able to see that SnapLogic can not only do all of the integrations that we need, but more importantly, it’s far more advanced than all of the other players out there.”

Here is a video of Alan discussing his SnapLogic deployment at Box and his experiences compared to legacy integration tools:

Thanks for the positive review, Alan!

Visit our video site to hear from other SnapLogic customers and watch a demonstration around specific cloud and big data integration use cases.

As a part of a wider analytics project I’m working on, analyzing runtime information from the SnapLogic platform, I chose to use functionality exposed to all customers: the public Pipeline Monitoring API and the REST GET Snap, which are combined in this post. I started by reading the documentation (of course), which shows the format of the request and response. Then I created a new pipeline and dropped a REST GET Snap on the canvas:
snaplogic_REST_pipeline

I wanted to get the runtime data out of the snaplogic org on the SnapLogic platform, so in the URL I specified snaplogic as the org. The other thing to remember is that the API requires authentication, so I created a basic auth account with my credentials. This works well and retrieves the info I wanted, as follows:

[
  {
    "headers": {
      "x-frame-options": "DENY",
      "connection": "keep-alive",
      "x-sl-userid": "cstewart@snaplogic.com",
      "access-control-max-age": "17600",
      "content-type": "application/json",
      "date": "Sun, 31 Jan 2016 01:29:59 GMT",
      "access-control-allow-credentials": "true",
      "access-control-allow-methods": "GET, POST, OPTIONS, PUT, DELETE",
      "x-sl-statuscode": "200",
      "content-length": "5175",
      "content-security-policy": "frame-ancestors 'none'",
      "access-control-allow-origin": "*",
      "server": "nginx/1.6.2",
      "access-control-allow-headers": "authorization, x-date, content-type, if-none-match"
    },
    "statusLine": {
      "reasonPhrase": "OK",
      "statusCode": 200,
      "protoVersion": "HTTP/1.1"
    },
    "entity": {
      "response_map": {
        "entries": [
          {
            "pipe_id": "1b10f684-24c0-4002-abd3-09b2e87e975f",
            "ccid": "56ab61ed63766e7406c491ba",
            "runtime_path_id": "snaplogic/rt/cloud/dev",
            "subpipes": {},
            "state_timestamp": "2016-01-31T01:29:43.758000+00:00",
            "parent_ruuid": null,
            "create_time": "2016-01-31T01:29:43.484000+00:00",
            "id": "0aa7943d-d6c3-408c-9490-1d2b6b281227",
            "runtime_label": "cloud-dev",
            "cc_label": "prodxl-jcc1",
            "documents": 9,
            "user_id": "kterada@snaplogic.com",
            "label": "REPORT – NGM FSM Ultra Task Failure Audit",
            "state": "Completed",
            "invoker": "scheduled"
          },
          …another 9 of these…
        ],
        "total": 154,
        "limit": 10,
        "offset": 0
      },
      "http_status_code": 200
    }
  }
]

Note the small section at the end, with the total, limit, and offset, indicating that I had retrieved only the first 10 of the runtimes available. Retrieving more runtimes calls for the “pagination” feature that was added to the REST GET Snap last year (the documentation has examples of using pagination with Eloqua, Marketo, and HubSpot). The two relevant fields in the infobox for the Snap are “Has Next”, a boolean which indicates whether there is a further iteration to be done, and “Next URL”; both are expressions, so the condition can be dynamic.

In the case of the SnapLogic runtime API, I had to figure out from those three fields (total, limit, offset) how to determine whether there is a next page, and what the next URL should be. So, in order to work out my logic, I put a Mapper Snap next in the pipeline so I could iterate until I had the right expressions.
To give you an idea of the way I went about working it out, here is the state of my Mapper:
snaplogic_mapper

The important one here is the expression I used to indicate whether there is more to fetch, hasNext:

$entity.response_map.offset + $entity.response_map.limit <  $entity.response_map.total

This I can now apply to the REST GET Snap, but first I need to work out how to create the Next URL. The Next URL will be the same as the Service URL, simply appending “?offset=n”, where n is the number of runtimes I have already fetched; since offsets start at 0 in true CS style, the record at index n is the first one not yet retrieved. So my URL ends up being the expression:

'https://elastic.snaplogic.com/api/1/rest/public/runtime/snaplogic?offset=' + ($entity.response_map.offset + $entity.response_map.limit)

You do have to ensure that both fields are toggled to expression mode.
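If you want to sanity-check the same pagination logic outside the platform, here is a minimal Python sketch, assuming the requests library and your own org and credentials, that applies the same offset + limit < total condition to the endpoint shown above (a direct HTTP call returns the body that the Snap surfaces as the entity):

import requests

BASE_URL = "https://elastic.snaplogic.com/api/1/rest/public/runtime/snaplogic"
AUTH = ("you@example.com", "your-password")  # basic auth, placeholder credentials

runtimes = []
offset = 0
while True:
    resp = requests.get(BASE_URL, params={"offset": offset}, auth=AUTH)
    resp.raise_for_status()
    rmap = resp.json()["response_map"]
    runtimes.extend(rmap["entries"])
    # Same condition as the hasNext expression in the Mapper
    if rmap["offset"] + rmap["limit"] >= rmap["total"]:
        break
    offset = rmap["offset"] + rmap["limit"]

print(len(runtimes), "pipeline runtimes fetched")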

With the Snap configured this way, we can test to ensure the full data set is retrieved. When I save this time, the preview on the REST GET Snap is as follows:

snaplogic_rest_runtimes

At the time I ran this, the count of runtimes was as follows:

snaplogic_rest_example

Hence the seventeen iterations.

snaplogic_rest_view

Therein lies another issue: the result comes out as seventeen documents, one for each iteration, each with up to 10 (the default limit size) runtimes.
This led me to add a JSON Splitter Snap to expand them out:

json_splitter_snaplogic

Note that I selected $entity.response_map.entries as the JSONPath to split on. This is selectable from the drop-down (note that I cut off the bottom of the list; it goes on for all the keys in the document):

snaplogic_json_path

I also included the scalar parent, so that each output document includes its parent’s scalar fields:

snaplogic_pipeline
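Conceptually, the split-on-entries-with-scalar-parents step behaves roughly like this Python sketch (my own approximation of the behavior, not the Snap’s actual implementation):

# One output document per entry, with the parent's scalar fields (total,
# limit, offset) copied into each one.
def split_entries(document):
    rmap = document["entity"]["response_map"]
    parent_scalars = {k: v for k, v in rmap.items() if not isinstance(v, (list, dict))}
    for entry in rmap["entries"]:
        yield {**entry, **parent_scalars}

Fed the seventeen paginated documents from the previous step, this yields one document per pipeline runtime, which is exactly what I want downstream.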

Now when I run it, I get the full set of pipeline runtimes, each as a separate document:

pipeline_runtime_Snaplogic

Note: The number of runtimes in the requests varies over the course of this article; this is because a different set of data was available over the time it took me to write it. You could use the API’s options to be specific about the period of the request.

Next Steps:

If you’re interested in enterprise IT architecture, chances are you’ve heard of Jason Bloomberg. The president of Intellyx, which is “the first and only industry analysis, advisory, and training firm focused on agile digital transformation,” Jason is a globally recognized expert on agile digital transformation who writes and speaks on how today’s disruptive enterprise technology trends support the digital professional’s business transformation goals. He is a prolific writer who contributes regularly to Forbes, publishes a biweekly newsletter called the Cortex, and writes several contributed blogs. His latest book is The Agile Architecture Revolution (Wiley, 2013).

Recently Jason has published a series of articles that are directed towards today’s enterprise architect (EA), focusing on what’s new and what’s different in the era of social, mobile, analytics, cloud and the Internet of Things (SMACT). Here are the four posts he’s written so far:

Supporting the ‘Citizen Integrator’ with Enterprise Architecture

“Developing strategies for accelerating and automating governance that maintains consistency across the organization is essential to the success of any self-service effort, including self-service integration. Who better than the enterprise architects to develop such strategies?”

Data Lake Considerations for the Enterprise Architect

“The EA’s role has always been to maintain an end-to-end perspective on the organization, and how it leverages technology to meet business needs. With the rise of digital transformation, this end-to-end perspective is especially critical, and EAs should apply that perspective to their organization’s data lake initiatives.”

Avoiding Enterprise Web Scale Pitfalls

“In the final analysis, enterprise web scale requires more than simply adding new technology. It requires both modern integration approaches as well as an end-to-end organizational context that enterprise architects are well-suited to lead.”

How EAs Should Support the Chief Digital Officer

“If you find that in spite of your EA title, nothing on your list of duties bears much resemblance to architecting an enterprise in transformation – then don’t wait for permission. Take the initiative to gain the digital skills you require to make a difference in your organization, and find a way to provide value to the CDO. You will be more valuable to your organization, your skills will be more current, and you’ll have more fun as well. What do you have to lose?”

Some solid advice for today’s forward-thinking enterprise architect. Jason is working on his final post in this series. What topic would you like to see him cover?

In this final post in the series on Mark Madsen’s whitepaper, Will the Data Lake Drown the Data Warehouse?, I’ll summarize SnapLogic’s role in the enterprise data lake.

SnapLogic is the only unified data and application integration platform as a service (iPaaS). The SnapLogic Elastic Integration Platform has 350+ pre-built intelligent connectors – called Snaps – to connect everything from AWS Redshift to Zuora, and a streaming architecture that supports real-time, event-based, and low-latency enterprise integration requirements as well as the high volume, variety, and velocity of big data integration, all in the same easy-to-use, self-service interface.

SnapLogic’s distributed, web-oriented architecture is a natural fit for consuming and moving large data sets residing on premises, in the cloud, or both and delivering them to and from the data lake. The SnapLogic Elastic Integration Platform provides many of the core services of a data lake, including workflow management, dataflow, data movement, and metadata.

SnapLogic_data_lake

More specifically, SnapLogic accelerates development of a modern data lake through:

  • Data acquisition: collecting and integrating data from multiple sources. SnapLogic goes beyond developer tools such as Sqoop and Flume with a cloud-based visual pipeline designer, and pre-built connectors for 350+ structured and unstructured data sources, enterprise applications and APIs.
  • Data transformation: adding information and transforming data. Minimize the manual tasks associated with data shaping and make data scientists and analysts more efficient. SnapLogic includes Snaps for tasks such as transformations, joins and unions without scripting.
  • Data access: organizing and preparing data for delivery and visualization. Make data processed on Hadoop or Spark easily available to off-cluster applications and data stores such as statistical packages and business intelligence tools.

SnapLogic’s platform-agnostic approach decouples data processing specification from execution. As data volume or latency requirements change, the same pipeline can be used just by changing the target data platform. SnapLogic’s SnapReduce enables SnapLogic to run natively on Hadoop as a YARN-managed resource that elastically scales out to power big data analytics, while the Spark Snap helps users create Spark-based data pipelines ideally suited for memory-intensive, iterative processes. Whether the target is MapReduce, Spark, or another big data processing framework, SnapLogic allows customers to adapt to evolving data lake requirements without locking into a specific framework.

We call it “Hadoop for Humans.” 

Next Steps: