Alan Leung is the Sr. Enterprise Systems Program Manager at Box. Prior to Box, he worked at Appirio, where he had hands-on experience with many application and data integration technologies. After evaluating a number of vendors in the market, here’s what he had to say about SnapLogic:

“I was able to see that SnapLogic can not only do all of the integrations that we need, but more importantly, it’s far more advanced than all of the other players out there.”

Here is a video of Alan discussing his SnapLogic deployment at Box and his experiences compared to legacy integration tools:

Thanks for the positive review, Alan!

Visit our video site to hear from other SnapLogic customers and watch demonstrations of specific cloud and big data integration use cases.

As part of a wider analytics project I’m working on, analyzing runtime information from the SnapLogic platform, I chose to use functionality exposed to all customers: the Public API for Pipeline Monitoring, consumed with the REST GET Snap. The two are combined in this post. I started by reading the documentation (of course), which shows the format of the request and response. I then created a new pipeline and dropped a REST GET Snap on the canvas:
snaplogic_REST_pipeline

I wanted to get the runtime data out of the snaplogic org on the SnapLogic platform, so in the URL I specified snaplogic as the org. The other thing to remember is that the API requires authentication, so I created a Basic Auth account with my credentials. This works well and retrieves the info I wanted, as follows:

[
  {
    "headers": {
      "x-frame-options": "DENY",
      "connection": "keep-alive",
      "x-sl-userid": "cstewart@snaplogic.com",
      "access-control-max-age": "17600",
      "content-type": "application/json",
      "date": "Sun, 31 Jan 2016 01:29:59 GMT",
      "access-control-allow-credentials": "true",
      "access-control-allow-methods": "GET, POST, OPTIONS, PUT, DELETE",
      "x-sl-statuscode": "200",
      "content-length": "5175",
      "content-security-policy": "frame-ancestors 'none'",
      "access-control-allow-origin": "*",
      "server": "nginx/1.6.2",
      "access-control-allow-headers": "authorization, x-date, content-type, if-none-match"
    },
    "statusLine": {
      "reasonPhrase": "OK",
      "statusCode": 200,
      "protoVersion": "HTTP/1.1"
    },
    "entity": {
      "response_map": {
        "entries": [
          {
            "pipe_id": "1b10f684-24c0-4002-abd3-09b2e87e975f",
            "ccid": "56ab61ed63766e7406c491ba",
            "runtime_path_id": "snaplogic/rt/cloud/dev",
            "subpipes": {},
            "state_timestamp": "2016-01-31T01:29:43.758000+00:00",
            "parent_ruuid": null,
            "create_time": "2016-01-31T01:29:43.484000+00:00",
            "id": "0aa7943d-d6c3-408c-9490-1d2b6b281227",
            "runtime_label": "cloud-dev",
            "cc_label": "prodxl-jcc1",
            "documents": 9,
            "user_id": "kterada@snaplogic.com",
            "label": "REPORT – NGM FSM Ultra Task Failure Audit",
            "state": "Completed",
            "invoker": "scheduled"
          },
          …another 9 of these..
        ],
        "total": 154,
        "limit": 10,
        "offset": 0
      },
      "http_status_code": 200
    }
  }
]
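
As an aside, the same call can be reproduced outside the platform. Here is a minimal sketch using Python’s requests library; the URL is the one configured in the Snap, and the credentials are placeholders standing in for the Basic Auth account described above:

import requests

# The Service URL configured in the REST GET Snap; credentials are placeholders
# for your own SnapLogic login.
RUNTIME_URL = "https://elastic.snaplogic.com/api/1/rest/public/runtime/snaplogic"

response = requests.get(RUNTIME_URL, auth=("you@example.com", "your-password"))
response.raise_for_status()

# The body carries the runtime entries plus total, limit and offset,
# which matter for pagination below.
print(response.json())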

Note the small section at the end of the response, with the total, limit and offset, indicating that I had only retrieved the first 10 of the available runtimes. Retrieving the rest calls for the “pagination” feature that was added to the REST GET Snap last year (the documentation gives examples of using pagination with Eloqua, Marketo and HubSpot). The two fields in the infobox for the Snap are “Has Next”, a boolean indicating whether there is a further iteration to be done, and “Next URL”; both are expressions, so the condition can be dynamic.

In the case of the SnapLogic runtime API, I had to work out from those three fields (total, limit, offset) both whether there was a next page and what the next URL should be. So, in order to work out my logic, I put a Mapper Snap next in the pipeline so I could iterate until I had the right expressions.
To give you an idea of the way I went about working it out, here is the state of my Mapper:

snaplogic_mapper

The important one here is the expression I used to indicate whether there is more to fetch, hasNext:

$entity.response_map.offset + $entity.response_map.limit <  $entity.response_map.total

This I can now apply to the REST GET Snap, but first I should work out how to create the Next URL. The Next URL will be the same as the Service URL, simply appending “?offset=n”, where n is the number of runtimes already fetched; since we always start counting in true CS style, with 0, that offset points to the first runtime not yet retrieved. So my URL ends up being the expression:

'https://elastic.snaplogic.com/api/1/rest/public/runtime/snaplogic?offset=' + ($entity.response_map.offset + $entity.response_map.limit)

You do have to ensure that both fields are toggled to expression mode.
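
To make the mechanics concrete, here is a rough sketch of the equivalent pagination loop in Python, assuming the raw response body has the shape shown under $entity in the preview above; the credentials are placeholders:

import requests

BASE_URL = "https://elastic.snaplogic.com/api/1/rest/public/runtime/snaplogic"
AUTH = ("you@example.com", "your-password")  # placeholder Basic Auth credentials

entries = []
url = BASE_URL
while True:
    # Assumes the raw body looks like the "entity" shown in the Snap preview.
    response_map = requests.get(url, auth=AUTH).json()["response_map"]
    entries.extend(response_map["entries"])

    # "Has Next": offset + limit < total
    # e.g. for the first page above: 0 + 10 < 154, so keep going.
    if response_map["offset"] + response_map["limit"] >= response_map["total"]:
        break

    # "Next URL": the Service URL with ?offset= the number already fetched.
    url = BASE_URL + "?offset=" + str(response_map["offset"] + response_map["limit"])

print("fetched", len(entries), "of", response_map["total"], "runtimes")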

With the pagination settings in place, we can test to ensure the full data set is retrieved. When I save this time, the preview on the REST GET is as follows:

snaplogic_rest_runtimes

At the time I ran this, the count of runtimes was as follows:

snaplogic_rest_example

Hence the seventeen iterations: one REST call per page of results.

snaplogic_rest_view

Therein lies another issue: the output arrives as seventeen documents, one for each iteration, each containing up to 10 (the default limit size) runtimes.
This led me to add a JSON Splitter Snap to expand them out:

json_splitter_snaplogic

Note that I selected $entity.response_map.entries as the JSONPath to split on. This is selectable from the drop-down (I cut off the bottom of the list in the screenshot; it goes on for all the keys in the document):

snaplogic_json_path

I also included the Scalar Parent option, so that each output document carries the scalar values of its direct parents:

snaplogic_pipeline
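
For reference, here is a rough Python equivalent of that split, under the assumption that each input document has the shape of the preview shown earlier; this is only an illustration of what the Snap does here, not how it is implemented:

def split_entries(document):
    """Split one REST GET output document into one document per runtime entry,
    carrying along the scalar parent values (total, limit, offset,
    http_status_code) - roughly what the JSON Splitter produces here."""
    entity = document["entity"]
    response_map = entity["response_map"]

    # Scalar (non-list, non-object) values of the parents along the split path.
    scalar_parents = {
        key: value
        for key, value in response_map.items()
        if not isinstance(value, (list, dict))
    }
    scalar_parents["http_status_code"] = entity["http_status_code"]

    for entry in response_map["entries"]:
        yield dict(entry, **scalar_parents)

# e.g. for runtime in split_entries(first_page_document):
#          print(runtime["label"], runtime["state"])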

Now when I run the pipeline, I get the full set of pipeline runtimes, each as a separate document:

pipeline_runtime_Snaplogic

Note: The number of runtimes returned varies over the course of this article; this is because a different set of data was available over the time it took me to write it. You could use the options on the API to be specific about the period of the request.

Next Steps:

bloomberg_rest_seminar

If you’re interested in enterprise IT architecture, chances are you’ve heard of Jason Bloomberg. The president of Intellyx, which is “the first and only industry analysis, advisory, and training firm focused on agile digital transformation,” Jason is a globally recognized expert on agile digital transformation who writes and speaks on how today’s disruptive enterprise technology trends support the digital professional’s business transformation goals. He is a prolific writer: a regular contributor to Forbes, with a biweekly newsletter called the Cortex and several contributed blogs. His latest book is The Agile Architecture Revolution (Wiley, 2013).

Recently Jason has published a series of articles that are directed towards today’s enterprise architect (EA), focusing on what’s new and what’s different in the era of social, mobile, analytics, cloud and the Internet of Things (SMACT). Here are the four posts he’s written so far:

Supporting the ‘Citizen Integrator’ with Enterprise Architecture

“Developing strategies for accelerating and automating governance that maintains consistency across the organization is essential to the success of any self-service effort, including self-service integration. Who better than the enterprise architects to develop such strategies?”

Data Lake Considerations for the Enterprise Architect

“The EA’s role has always been to maintain an end-to-end perspective on the organization, and how it leverages technology to meet business needs. With the rise of digital transformation, this end-to-end perspective is especially critical, and EAs should apply that perspective to their organization’s data lake initiatives.”

Avoiding Enterprise Web Scale Pitfalls

“In the final analysis, enterprise web scale requires more than simply adding new technology. It requires both modern integration approaches as well as an end-to-end organizational context that enterprise architects are well-suited to lead.”

How EAs Should Support the Chief Digital Officer

“If you find that in spite of your EA title, nothing on your list of duties bears much resemblance to architecting an enterprise in transformation – then don’t wait for permission. Take the initiative to gain the digital skills you require to make a difference in your organization, and find a way to provide value to the CDO. You will be more valuable to your organization, your skills will be more current, and you’ll have more fun as well. What do you have to lose?”

Some solid advice for today’s forward-thinking enterprise architect. Jason is working on his final post in this series. What topic would you like to see him cover?

In this final post in the series based on Mark Madsen’s whitepaper, Will the Data Lake Drown the Data Warehouse?, I’ll summarize SnapLogic’s role in the enterprise data lake.

SnapLogic is the only unified data and application integration platform as a service (iPaaS). The SnapLogic Elastic Integration Platform has 350+ pre-built intelligent connectors – called Snaps – to connect everything from AWS Redshift to Zuora, and a streaming architecture that supports real-time, event-based and low-latency enterprise integration requirements as well as the high volume, variety and velocity of big data integration, all in the same easy-to-use, self-service interface.

SnapLogic’s distributed, web-oriented architecture is a natural fit for consuming and moving large data sets residing on premises, in the cloud, or both, and for delivering them to and from the data lake. The SnapLogic Elastic Integration Platform provides many of the core services of a data lake, including workflow management, dataflow, data movement, and metadata.

SnapLogic_data_lake

More specifically, SnapLogic accelerates development of a modern data lake through:

  • Data acquisition: collecting and integrating data from multiple sources. SnapLogic goes beyond developer tools such as Sqoop and Flume with a cloud-based visual pipeline designer and pre-built connectors for 350+ structured and unstructured data sources, enterprise applications and APIs.
  • Data transformation: adding information and transforming data. Minimize the manual tasks associated with data shaping and make data scientists and analysts more efficient. SnapLogic includes Snaps for tasks such as transformations, joins and unions without scripting.
  • Data access: organizing and preparing data for delivery and visualization. Make data processed on Hadoop or Spark easily available to off-cluster applications and data stores such as statistical packages and business intelligence tools.

SnapLogic’s platform-agnostic approach decouples the data processing specification from its execution. As data volume or latency requirements change, the same pipeline can be reused simply by changing the target data platform. SnapLogic’s SnapReduce enables SnapLogic to run natively on Hadoop as a YARN-managed resource that elastically scales out to power big data analytics, while the Spark Snap helps users create Spark-based data pipelines ideally suited for memory-intensive, iterative processes. Whether the engine is MapReduce, Spark or another big data processing framework, SnapLogic allows customers to adapt to evolving data lake requirements without locking into a specific framework.

We call it “Hadoop for Humans.” 

Next Steps:

In the whitepaper, How to Build an Enterprise Data Lake: Important Considerations Before You Jump In, industry expert Mark Madsen outlined the principles that must guide the design of your new reference architecture and some of the differences from the traditional data warehouse. In his follow-up paper, “Will the Data Lake Drown the Data Warehouse,” he asks the question, “What does this mean for the tools we’ve been using for the last ten years?”

In this latest post in the series from this paper (see the first post here and the second post here), Mark writes about tackling big data integration using an example:

“The best way to see the challenge faced when building a data lake is to focus on integration in the Hadoop environment. A common starting point is the idea of moving ETL and data processing from traditional tools to Hadoop, then pushing the data from Hadoop to a data warehouse or database like Amazon Redshift so that users can still work with data in a familiar way. If we look at some of the specifics in this scenario, the problem of using a bill of materials as a technology guide becomes apparent. For example, the processing of web event logs is unwieldy and expensive in a database, so many companies shift this workload to Hadoop.”

The following table summarizes the log processing requirements in an ETL offload scenario and the components that could be used to implement them in Hadoop:

data_lake_ETL_offload

He goes on to review the development challenges and tradeoffs of following this type of approach and concludes:

“Building a data lake requires thinking about the capabilities needed in the system. This is a bigger problem than just installing and using open source projects on top of Hadoop. Just as data integration is the foundation of the data warehouse, an end-to-end data processing capability is the core of the data lake. The new environment needs a new workhorse.”

Next steps:

 

Inricity_Quote_PR

Today we announced a new program designed to help innovative IT leaders accelerate cloud and big data adoption within their organizations. Working with our partner Intricity, a leading data management consultancy, we’ve developed a 3-day Integration Modernization Assessment that will outline the steps for transitioning from legacy data and application integration technologies to a modern integration platform as a service (iPaaS), paving the way for faster, easier integrations, reduced costs and increased return on cloud and big data technology investments.

Here are some highlights from today’s announcement:

  • According to Gartner, “Integration can be a source of competitive differentiation and an enabler for bimodal IT, but most CIOs have yet to recognize that their traditional, established integration strategies cannot cope with digitalization’s fast technology innovation and accelerated pace of business.”
  • Jack Kudale, SnapLogic’s SVP of Field Operations had this to say: “Many IT organizations today are still relying on legacy technologies that were built for the rows-and-columns world, causing them to struggle with modern data requirements such as REST, JSON, and Hadoop. Over the past 12 months, we’ve been talking to a growing number of Informatica customers, for example, who are facing difficult upgrades and end-of-life for older versions of their data integration software. To assist with digital transformation initiatives, we’ve worked with Intricity to introduce a 3-day Integration Modernization Assessment that helps companies understand their options for developing a more cost-effective and productive integration strategy.”

You can learn more about how to get to the cloud and big data faster in this week’s webinar, which will feature Arkady Kleyner from Intricity and Kai Thapa from SnapLogic. We’ll also dive into a demonstration of the SnapLogic Elastic Integration Platform and outline some of the key differences between our enterprise iPaaS solution and legacy ETL and application integration technologies.

Get to the Cloud and Big Data Integration Faster with Modern Data Integration
Thursday, January 21, 2016
10:00am PST / 1:00pm EST

snaplogic_webinar_intricity

Mark Madsen

This is the second post in the series from Mark Madsen’s whitepaper, Will the Data Lake Drown the Data Warehouse? In the first post, Mark outlined the differences between the data lake and the traditional data warehouse, concluding: “The core capability of a data lake, and the source of much of its value, is the ability to process arbitrary data.”

In this post, Mark reviews the new environment and new requirements:

“The pre-Hadoop environments, including integration tools that were built to handle structured rows and columns, limit the type of data that can be processed. Requirements in the new ecosystem that tend to cause the most trouble for traditional environments are variable data structure, streaming data and nonrelational datasets.

JSON

Data Structures: JSON is the New CSV

The most common interchange format between applications is not database connectors but flat files in comma-separated value (CSV) format, often exchanged via FTP. One of the big shifts in application design over the past ten years was a move to REST APIs with payloads formatted in JSON, an increasingly common data format. When combined with streaming infrastructure, this design shift reduces the need for old style file integration. JSON and APIs are becoming the new CSV and FTP.

Most enterprise data integration tools were built assuming use of a relational database. This works well for data coming from transactional applications. It works less well for logs, event streams and human-authored data. These do not have the same regular structure of rows, columns and tables that databases and integration tools require. These tools have difficulty working with JSON and must do extra work to process and store it.

The reverse is not true. Newer data integration tools can easily represent tables in JSON, whereas nested structures in JSON are difficult to represent in tables. Flexible representation of data enables late binding for data structures and data types.

This is a key advantage of JSON when compared to the early binding and static typing used by older data integration tools. One simple field change upstream can break a dataflow in the older tools, where the more flexible new environment may be able to continue uninterrupted.

JSON is not the best format for storing data, however. This means tools are needed to translate data from JSON to more efficient storage formats in Hadoop, and from those formats back to JSON for applications. Much of the web and non-transactional data is sent today as JSON messages. The more flexible Hadoop and streaming technologies are a better match for transporting and processing this data than conventional data integration tools.

Streams are the new batch
Often, the initial sources of data in a data lake come from event streams and can be processed continuously rather than in batch. As a rule, a data warehouse is a poor place to process data that must be available in less than a few minutes. The architecture was designed for periodic incremental loads, not for a continuous stream of data. A data lake should support multiple speeds from near real-time to high latency batch.

Batch processing is actually a subset of stream processing. It is easy to persist data for a time and then run a job to process it as a batch. It is not as easy to take a batch system and make it efficiently process data one message at a time. A batch engine can’t keep up with streaming requirements, but tools that have been designed to process streaming data can behave like a batch engine.

Streaming data also implies that data volume can fluctuate, from a small trickle during one hour to a flood in the next. The fixed server models and capacity planning of a traditional architecture do not translate well to dynamic server scaling as message volume grows and shrinks. This requires rethinking how one scales data collection and integration.”

——

The paper Will the Data Lake Drown the Data Warehouse? goes on to note that, “different datasets drive new engines.” In the next post in this series, Mark will describe the new data lake architecture, diving into some of the concepts he covered in the companion data lake whitepaper: How to Build an Enterprise Data Lake: Important Considerations Before You Jump In. Be sure to also check out the recent webinar presentation and recording with SnapLogic here and learn more about SnapLogic for big data integration here.