Applying machine learning tools to data integration

By Gregory D. Benson

Few tasks are more personally rewarding than working with brilliant graduate students on research problems that have practical applications. This is exactly what I get to do as both a Professor of Computer Science at the University of San Francisco and as Chief Scientist at SnapLogic. Each semester, SnapLogic sponsors student research and development projects for USF CS project classes, and I am given the freedom to work with these students on new technology and exploratory projects that we believe will eventually impact the SnapLogic Enterprise Integration Cloud Platform. Iris and the Integration Assistant, which applies machine learning to the creation of data integration pipelines, represents one of these research projects that pushes the boundaries of self-service data integration.

For the past seven years, these research projects have provided SnapLogic Labs with bright minds and at the same time given USF students exposure to problems found in real-world commercial software. I have been able to leverage my past 19 years of research and teaching at USF in parallel and distributed computing to help formulate research areas that enable students to bridge their academic experience with problems found in large-scale software that runs in the cloud. Project successes include Predictive Field Linking, the first SnapLogic MapReduce implementation called SnapReduce, and the Document Model for data integration. It is a mutually beneficial relationship.

During the research phase of Labs projects, the students have access to the SnapLogic engineering team, and can ask questions and get feedback. This collaboration allows the students to ramp up quickly on our codebase and gets the engineering team familiar with the students. Once we have prototyped and demonstrated the potential of a research project, we transition the code to production. But the relationship doesn’t end there – students who did the research work are usually hired on to help with transitioning the prototype to production code.

The SnapLogic Philosophy
Iris technology was born to help a growing number of business users design and implement data integration tasks that previously required extensive programming skills. Most companies must manage an increasing number of data sources and cloud applications, as well as ever-larger data volumes, and it is data integration platforms that help businesses connect and transform all of this disparate data. The SnapLogic philosophy has always been to provide truly self-service integration through visual programming. Iris and the Integration Assistant further advance this philosophy by learning from the successes and failures of thousands of pipelines and billions of executions on the SnapLogic platform.

The Project
Two years ago, I led a project that refined our metadata architecture, and last year I proposed a machine learning project for USF students. At the time, I had only a rough idea of what we could achieve. The plan was to spend the first part of the project doing data science on the SnapLogic metadata to see what patterns we could find and where we could apply machine learning.

One of the USF graduate students working on the project, Thanawut “Jump” Anapiriyakul, discovered that we could learn from past pipeline definitions in our metadata to help recommend likely next Snaps during pipeline creation. Jump experimented with several machine learning algorithms to find the ones that give the best recommendation accuracy. We later combined the pipeline definition with Snap execution history to further improve recommendation accuracy. The end result: Pipeline creation is now much faster with the Integration Assistant.
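The models behind Iris are more sophisticated than this, but a minimal sketch of the core idea, recommending the next Snap from transition frequencies observed in historical pipeline definitions, might look like the following Python. The training data and Snap names here are made up for illustration.

```python
from collections import Counter, defaultdict

class NextSnapRecommender:
    """Toy next-Snap recommender: counts Snap-to-Snap transitions observed
    in historical pipeline definitions and suggests the most frequent
    successors of the Snap currently on the canvas."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, pipelines):
        # Each pipeline is an ordered list of Snap names.
        for snaps in pipelines:
            for current, following in zip(snaps, snaps[1:]):
                self.transitions[current][following] += 1

    def recommend(self, current_snap, k=3):
        # Return the k Snaps that most often follow current_snap historically.
        return [snap for snap, _ in self.transitions[current_snap].most_common(k)]

# Hypothetical training data drawn from past pipeline definitions.
history = [
    ["File Reader", "CSV Parser", "Mapper", "JSON Formatter", "File Writer"],
    ["File Reader", "CSV Parser", "Filter", "Mapper", "REST Post"],
    ["REST Get", "JSON Parser", "Mapper", "File Writer"],
]

recommender = NextSnapRecommender()
recommender.train(history)
print(recommender.recommend("CSV Parser"))  # e.g. ['Mapper', 'Filter']
```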

The exciting thing about the Iris technology is that we have created an internal metadata architecture that supports not only the Integration Assistant but also the data science needed to further leverage historical user activity and pipeline executions to power future applications of machine learning in the SnapLogic Enterprise Cloud. In my view, true self-service in data integration will only be possible through the application of machine learning and artificial intelligence as we are doing at SnapLogic.

As for the students who work on SnapLogic projects, most are usually offered internships and many eventually become full-time software engineers at SnapLogic. It is very rewarding to continue to work with my students after they graduate. After ceremonies this May at USF, Jump will join SnapLogic full-time this summer, working with the team on extending Iris and its capabilities.

I look forward to writing more about Iris and our recent technology advances in the weeks to come. In the meantime, you can check out my past posts on JSON-centric iPaaS and Hybrid Batch and Streaming Architecture for Data Integration.

Gregory D. Benson is a Professor in the Department of Computer Science at the University of San Francisco and Chief Scientist at SnapLogic. Follow him on Twitter @gregorydbenson.

Testing… Testing… 1, 2, 3: How SnapLogic tests Snaps on the Apache Spark Platform

The SnapLogic Elastic Integration Platform connects your enterprise data, applications, and APIs by building drag-and-drop data pipelines. Each pipeline is made up of Snaps, intelligent connectors that users drag onto a canvas and “snap” together like puzzle pieces.

A SnapLogic pipeline being built and configured

These pipelines are executed on a Snaplex, an application that runs on a multitude of platforms: on a customer’s infrastructure, on the SnapLogic cloud, and most recently on Hadoop. A Snaplex that runs on Hadoop can execute pipelines natively in Spark.

The SnapLogic data management platform is known for its easy-to-use, self-service interface, made possible by our team of dedicated engineers (we’re hiring!). We work to apply the industry’s best practices so that our clients get the best possible end product — and testing is fundamental.

Snaplex Thresholds and Pipeline Queuing

As the integration market continues to mature, there is constant demand to support and process more complex data and process flows. When applications process large volumes of data, they often run out of resources and become unresponsive, leaving users confused and unhappy. Gauging resource usage and alerting users with appropriate messages are hallmarks of well-designed software. In the Winter 2016 release of the SnapLogic Elastic Integration Platform, we introduced pipeline queuing, which allows users to define thresholds for their Snaplexes; when a threshold is reached, further execution requests are queued until resources become available.
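As a rough illustration of the idea (not the actual Snaplex implementation), a threshold-based queue can be sketched in Python; the threshold value and pipeline names below are purely hypothetical.

```python
import queue
import threading

class SnaplexQueueSketch:
    """Rough illustration of pipeline queuing: at most `threshold` pipelines
    run concurrently; additional requests wait in a FIFO queue until a
    running pipeline finishes and frees a slot."""

    def __init__(self, threshold):
        self.slots = threading.Semaphore(threshold)
        self.waiting = queue.Queue()

    def submit(self, pipeline_name):
        if self.slots.acquire(blocking=False):
            self._run(pipeline_name)
        else:
            print(f"{pipeline_name}: threshold reached, queued")
            self.waiting.put(pipeline_name)

    def _run(self, pipeline_name):
        print(f"{pipeline_name}: executing")
        # ... pipeline execution would happen here ...

    def finish(self, pipeline_name):
        print(f"{pipeline_name}: finished")
        # Hand the freed slot to the next queued request, if any.
        if not self.waiting.empty():
            self._run(self.waiting.get())
        else:
            self.slots.release()

# Hypothetical usage with a threshold of 2 concurrent pipelines.
plex = SnaplexQueueSketch(threshold=2)
for name in ["orders-load", "customers-sync", "audit-report"]:
    plex.submit(name)          # the third request is queued
plex.finish("orders-load")     # the queued request now starts
```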

March 2016 Snap Update

This weekend brings our latest SnapLogic Snap update, delivering many new and updated Snaps. Here’s a brief overview. Be sure to contact our Customer Success team if you have any questions about the release.

This update includes a new RabbitMQ Snap Pack, which contains a Consumer Snap and a Producer Snap, along with updates to a number of existing Snap Packs.
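For readers less familiar with the messaging pattern these Snaps wrap, here is a minimal producer and consumer written against a local RabbitMQ broker using the pika Python client. It illustrates only the underlying RabbitMQ concepts, not how the Snaps themselves are configured; the queue name and message are made up.

```python
import pika  # assumes a RabbitMQ broker reachable on localhost

QUEUE = "demo-events"  # hypothetical queue name

# Producer: publish a message to the queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)
channel.basic_publish(exchange="", routing_key=QUEUE,
                      body=b'{"event": "order_created"}')
connection.close()

# Consumer: read messages off the same queue and acknowledge them.
def handle(ch, method, properties, body):
    print("received:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)
channel.basic_consume(queue=QUEUE, on_message_callback=handle)
channel.start_consuming()
```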

SnapLogic Best Practices: Deploying Projects Between Phases

[update – check out what’s new in our Spring 2016 release – the Metadata Snaps are also useful for Lifecycle Management requirements]

One of the areas where our integrated data services team and partners spend time with customers early in a SnapLogic Elastic Integration Platform deployment is moving projects from one phase to the next (Dev -> QA -> Prod). There are a number of different configuration options; in this post, I’ll describe one. First, a few assumptions:

  • The enterprise Lifecycle Management feature is not implemented in this example
  • The phases that are in use are Development, QA and Production
  • Each phase in use is managed at the project level as a separate project within a single Organization setup
  • The users have the necessary permissions to perform the operations described in this post
  • The enhanced account encryption feature is not in use in the current SnapLogic Org


LNAV: The Log File Navigator

Today I want to throw in a shameless plug and highlight a side project from one of our very own engineers, Tim Stack.

For developers, ops engineers, and anyone who lives in the terminal, lnav is a log file viewer that helps you analyze and navigate your logs. Simply point lnav at a file or a directory and it will decompress the files, detect their format, and index any log messages they contain. The log messages from all files are then displayed in a single view, ordered by time, and several hotkeys let you jump to messages by time or log level. Semantic information is highlighted, and filters can be applied to focus your investigation. Additional features include pretty-printing of structured data such as XML or JSON, a SQL interface for querying logs, and more. For the complete list of features and downloads for Linux or Mac OS X, head over to http://lnav.org/


REST GET and the SnapLogic Public APIs for Pipeline Executions

As part of a wider analytics project I’m working on, analyzing runtime information from the SnapLogic platform, I chose to use functionality exposed to all customers: the public Pipeline Monitoring API and the REST GET Snap, and this post combines the two. I started by reading the documentation (of course), which shows the format of the request and response. Then I created a new pipeline and dropped a REST GET Snap onto the canvas.
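Outside of a pipeline, the same kind of request the REST GET Snap issues can be sketched with Python’s requests library. This is only an illustration: the endpoint path, organization name, query parameter, and credentials below are placeholders, and the real values should come from the Pipeline Monitoring API documentation.

```python
import requests

# Placeholder values; consult the SnapLogic Pipeline Monitoring API
# documentation for the exact endpoint, parameters, and authentication.
BASE_URL = "https://elastic.snaplogic.com"      # control-plane URL (assumed)
ENDPOINT = "/api/1/rest/public/runtime/MyOrg"   # hypothetical path and org name
AUTH = ("user@example.com", "secret")           # basic-auth credentials

response = requests.get(
    BASE_URL + ENDPOINT,
    params={"last_hours": 24},  # hypothetical filter on recent executions
    auth=AUTH,
)
response.raise_for_status()

# The response body is JSON describing recent pipeline executions;
# its exact shape is documented alongside the API.
print(response.json())
```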
