Machine learning has a data integration problem: The need for self-service

When we built the Iris Integration Assistant, an AI-powered recommendation engine, it was SnapLogic’s first foray into machine learning (ML). While the experience left us with many useful insights, one stood out above the rest: machine learning, we discovered, is full of data integration challenges.

Of course, going into the process, we understood that developing an ML model entails integrating data. But we didn’t appreciate how severe and widespread the integration challenges would be.

Integration hurdles are the norm

Indeed, we’re not the only ones who have run into a host of integration hurdles when venturing into machine learning. A survey of nearly 200 data scientists revealed that 53 percent of respondents devoted most of their time to collecting, labeling, cleaning, and organizing data – all integration tasks.

Unfortunately, in machine learning, you can’t escape the need to clean and prep your data. If you train a model with bad data, you’ll get a bad model in return. “Dirty data” remains the biggest problem facing data scientists today.[1]

This leads us to conclude that the need to integrate data throughout the machine learning lifecycle isn’t going away. But prevailing code-first approaches to these data integration problems have got to change. Manual integration tasks guzzle valuable time that data scientists ought to be spending on strategic, high-impact work. In the worst cases, they thwart your machine learning projects entirely, keeping you from seeing the promised return on your AI investments.

Machine learning development and deployment is in desperate need of self-service integration.

What are the main integration challenges in machine learning?

At the very outset, the data scientist runs into integration challenges. They must acquire data from various sources with the goal of creating a large, high-quality training dataset.

The data scientist may need to extract POS data from a cloud data lake like Amazon S3, pull log files from a web server, or collect inventory data from an Oracle ERP system. Typically, they’ll ask IT for access to this data in the form of a one-time data dump. Or, they’ll write custom scripts in, say, Python. Both options are slow and difficult to repeat reliably. Should the data scientist want access to other tables within a given data source, they must take the same cumbersome steps, further delaying their time-to-value.
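
To make this concrete, here’s a minimal Python sketch of what one of those one-off acquisition scripts might look like – pulling a CSV of POS data out of Amazon S3 with boto3 and into pandas. The bucket, key, and file names here are purely hypothetical.

```python
# Minimal sketch of one-off data acquisition from S3 (bucket and key names are hypothetical).
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Download a single object; every new table or file means another ad hoc script like this.
obj = s3.get_object(Bucket="example-data-lake", Key="pos/transactions_2019.csv")
pos_df = pd.read_csv(BytesIO(obj["Body"].read()))

print(pos_df.shape)
```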

More integration challenges beset the data scientist as they prepare the raw data they’ve acquired. They must filter out irrelevant details, scrub sensitive information, detect and remove errors, alter data types, handle missing values, and plod through other data cleansing chores. Traditionally, data scientists will prepare data by coding in Python – or another programming language – within Jupyter Notebooks. To be sure, coding offers flexibility in customizing data, but it eats up valuable time for the sake of non-strategic, humdrum work.
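
A hedged sketch of the cleanup code that tends to pile up in those notebooks might look like the following; every file and column name is an assumption for illustration, not a field from any real dataset.

```python
# Typical hand-written notebook cleanup (all file and column names are hypothetical).
import pandas as pd

raw = pd.read_csv("pos_transactions.csv")

# Filter out irrelevant details and scrub sensitive information.
df = raw.drop(columns=["internal_notes", "customer_email"])

# Alter data types and remove obvious errors.
df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")
df = df[df["amount"] > 0]

# Handle missing values.
df["store_id"] = df["store_id"].fillna("unknown")
df = df.dropna(subset=["transaction_date"])

df.to_csv("pos_transactions_clean.csv", index=False)
```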

The integration burden doesn’t stop there. Once the data scientist has chosen an algorithm (e.g., a logistic regression), they must feed the model the training data they’ve toiled so arduously to prepare. This, again, requires more coding. After training, the model must undergo testing and cross-validation to ensure its predictions are accurate. More integrations, more coding.
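
As a rough illustration – again with hypothetical feature and label names – that training and cross-validation step in scikit-learn might look like this:

```python
# Training and cross-validating a logistic regression (feature and label names are hypothetical).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("pos_transactions_clean.csv")
X = df[["amount", "items_per_basket", "days_since_last_visit"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)

# Cross-validate on the training set, then check accuracy on the held-out test set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)
print("CV accuracy:", cv_scores.mean(), "Test accuracy:", model.score(X_test, y_test))
```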

When the model is finally ready for real-world use, the data scientist often has to hand the model over to a software development team (DevOps) for operationalization. In many cases, DevOps must convert the model’s code into a different language or format. What’s more, they must host the model in a web service to fulfill API requests. Such activities deal heavily with integration and require manual scripting.
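
A bare-bones sketch of that hosting step – assuming a pickled scikit-learn model served from a Flask app, with file and feature names invented for illustration – might look like this:

```python
# Minimal sketch of hosting a trained model behind a web API (file and feature names are hypothetical).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model the data science team handed over.
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [[payload["amount"], payload["items_per_basket"], payload["days_since_last_visit"]]]
    prediction = model.predict(features)[0]
    return jsonify({"churn_prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```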

Ideally, you will iterate your model to keep improving its prediction accuracy after it’s gone live. But you can only do so if you continuously train it with new data. This means you have to go through the whole rigmarole of acquiring new source data, cleansing and preparing the data, enlisting developers to put your model back into production, and so on.

The excessive coding, redundancy, and manual trial and error in the traditional approach to machine learning cannot be sustained. It’s time to bring self-service integration to the machine learning process. 

Imagining a self-service future for machine learning

A self-service solution for the machine learning lifecycle should automate routine – but still important – work like shuffling data. It should also stamp out redundancies. For example, when creating an initial training dataset, you should be able to integrate source data once and then reuse that pipeline for continuous training in the future.
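
One way to picture that reuse – as a sketch of the general idea, not of any particular product – is a single scikit-learn Pipeline that bundles preparation and modeling, so continuous training on fresh data is just another call to fit:

```python
# Sketch of a reusable prep-plus-model pipeline that can be refit on new data (names are hypothetical).
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_cols = ["amount", "items_per_basket", "days_since_last_visit"]

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

def train(csv_path):
    """Fit (or refit) the same pipeline on whatever data is currently available."""
    df = pd.read_csv(csv_path)
    pipeline.fit(df[feature_cols], df["churned"])
    return pipeline

train("pos_transactions_2019.csv")   # initial training
train("pos_transactions_2020.csv")   # continuous training later, with no new integration code
```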

In a self-service environment, data scientists will no longer cram their schedules with integration activities like gathering, cleansing, and organizing data. Instead, they’ll employ critical thinking, solve crucial business problems, build extraordinary machine learning models, dream up other use cases for AI, and find new ways to add value.

Self-service ML should also make machine learning accessible to those who have less specialized skills but are attuned to line-of-business priorities (e.g., business analysts and citizen data scientists). Much as with the “democratization” of tasks in other areas, this will cut down on bottlenecks and empower more people within organizations to deliver results with machine learning.

The impact of self-service ML will be massive

Ultimately, a self-service ML solution should accelerate machine learning development and deployment. This will enable companies to explore far more areas where the technology can be applied than they otherwise could. As a result, they’ll produce a greater number of effective models that deliver value to the business. They’ll not only have models that, say, streamline operations, but also ones that improve product safety, increase sales for existing products, and forge new revenue channels.

What we’re describing is a self-service solution that handles both data integration and machine learning development and deployment. The impact of such a solution could be profound. At SnapLogic, we have a hunch that self-service machine learning very well may be right around the bend. Stay tuned.


[1] This is based on a survey of thousands of data scientists. This particular survey question received 7,376 responses from data scientists and other data-centric professionals, such as analysts, data engineers, and programmers.
https://www.kaggle.com/surveys/2017

Former Chief Data Officer at SnapLogic
