Will the Cloud Save Big Data?

This article was originally published on ITProPortal.

Employees up and down the value chain are eager to dive into big data, hunting for golden nuggets of intelligence to help them make smarter decisions, grow customer relationships and improve business efficiency. To do this, they’ve been faced with a dizzying array of technologies – from open source projects to commercial software products – as they try to wrestle big data to the ground.

Today, most of the headlines and momentum center on some combination of Hadoop, Spark and Redshift – all of which can be springboards for big data work. It’s important to step back, though, and look at where we are in big data’s evolution.

In many ways, big data is in the midst of a transition. Hadoop is hitting its pre-teen years, having launched in April 2006 as an official Apache project – and then taking the software world by storm as a framework for distributed storage and processing of data on commodity hardware. Apache Spark is now hitting its stride as a “lightning fast” engine for large-scale data processing, including streaming. And various cloud data warehousing, analytics and streaming platforms are emerging, from big names (Amazon Redshift, Microsoft Azure HDInsight and Google BigQuery) to upstart players like Snowflake, Qubole and Confluent.

The challenge is that most big data progress over the past decade has been limited to big companies with big engineering and data science teams. The systems are often complex, immature, hard to manage and change frequently – which might be fine if you’re in Silicon Valley, but doesn’t play well in the rest of the world. What if you’re a consumer goods company like Clorox, or a midsize bank in the Midwest, or a large telco in Australia? Can this be done without deploying 100 Java engineers who know the technology inside and out?

At the end of the day, most companies just want better data and faster answers – they don’t want the technology headaches that come along with it. Fortunately, the “mega trend” of big data is now colliding with another mega trend: cloud computing. While Hadoop and other big data platforms have been maturing slowly, the cloud ecosystem has been maturing more quickly – and the cloud can now help fix a lot of what has hindered big data’s progress.

The problems customers have encountered with on-premises Hadoop are often the same problems that were faced with on-premises legacy systems: there simply aren’t enough of the right people to get everything done. Companies want cutting-edge capabilities, but they don’t want to deal with bugs and broken integrations and rapidly changing versions. Plus, consumption models are changing – we want to consume data, storage and compute on demand. We don’t want to overbuy. We want access to infrastructure when and how we want it, with just as much as we need, but no more.

Big Data’s Tipping Point is in the Cloud

In short, the tipping point for big data is about to happen – and it will happen via the cloud. The first wave of “big data via the cloud” was simple: companies like Cloudera put their software on Amazon. But what’s “truly cloud” is not having to manage Hadoop or Spark – moving the complexity back into a hosted infrastructure, so someone else manages it for you. To that end, Amazon, Microsoft and Google now deliver “managed Hadoop” and “managed Spark” – you just worry about the data you have, the questions you have and the answers you want. No need to spin up a cluster, research new products or worry about version management. Just load your data and start processing.

There are three significant and not always obvious benefits to managing big data via the cloud:

  • Predictability – the infrastructure and management burden shifts to cloud providers, and you simply consume services that you can scale up or down as needed.
  • Economics – unlike on-premises Hadoop, where compute and storage were intermingled, the cloud separates compute and storage so you can provision each independently and benefit from commodity economics.
  • Innovation – new software, infrastructure and best practices are deployed continuously by cloud providers, so you can take full advantage without the upfront time and cost.

Of course, there’s still plenty of hard work to do, but it’s more focused on the data and the business, and not the infrastructure. The great news for mainstream customers (well beyond Silicon Valley) is that another mega-trend is kicking in to revolutionize data integration and data consumption – and that’s the move to self-service. Thanks to new tools and platforms, “self-service integration” is making it fast and easy to create automated data pipelines with no coding, and “self-service analytics” is making it easy for analysts and business users to manipulate data without IT intervention.

All told, these trends are driving a democratization of data that’s very exciting – and will drive significant impact across horizontal functions and vertical industries. Data is thus becoming a more fluid, dynamic and accessible resource for all organizations. IT no longer holds the keys to the kingdom – and developers no longer control the workflow. Just in the nick of time, too, as the volume and velocity of data from digital and social media, mobile tools and edge devices threaten to overwhelm us all. Once the full promise of the Internet of Things, Artificial Intelligence and Machine Learning takes hold, the resulting flood of data will be truly overwhelming.

The only remaining question: What do you want to do with your data?

Ravi Dharnikota is the Chief Enterprise Architect at SnapLogic. 

Gaurav Dhillon on Nathan Latka’s “The Top” Podcast

Popular podcast host Nathan Latka has built a large following by getting top CEOs, founders, and entrepreneurs to share strategies and tactics that set them up for business success. A data industry veteran and self-described “company-builder,” SnapLogic founder and CEO Gaurav Dhillon was recently invited by Nathan to appear as a featured guest on “The Top.”

Nathan is known for his rapid-fire, straight-to-the-point questioning, and Gaurav was more than up to the challenge. In this episode, the two looked back at Gaurav’s founding of Informatica in the ’90s; how he took that company public and helped it grow to become a billion-plus dollar business; why he stepped away from Informatica and decided to start SnapLogic; how integration fuels digital business and why customers are demanding modern solutions like SnapLogic’s that are easy to use and built for the cloud; and how he’s building a fast-growing, innovative business that also has its feet on the ground.

The two also kept it fun, with Gaurav fielding Nathan’s “Famous Five” show-closing questions, including favorite book, most admired CEO, advice to your 20-year-old self, and more.

You can listen to the full podcast above or via the following links:

Igniting data discovery: SnapLogic delivers a “quanta” in improvement over Informatica

In my previous blog post, I talked about how a pharmaceutical company uses SnapLogic and Amazon Redshift to capitalize on market and environmental fluctuations, driving sales for its asthma relief medication. In this post, I’ll tell you the path the company took to get there. Hint: It wasn’t a straight one.

An IT organization abandons Informatica

Several months prior to launching its current environment, with data flows powered by SnapLogic, the pharmaceutical company tried, unsuccessfully, to build its Redshift data warehouse integrations using Informatica PowerCenter and Informatica Cloud. The IT team’s original plan was to move data from Salesforce, Veeva, and third-party sources into Amazon Simple Storage Service (S3), and then integrate the data into Redshift for sales and marketing analytics.

However, the project stalled due to difficulty with Informatica PowerCenter, the IT team’s initial choice for data integration. PowerCenter, which Informatica describes as a “metadata-driven integration platform,” is an extract, transform, and load (ETL) product rooted in mid-1990s enterprise architecture. The team found PowerCenter complicated to use and slow to deliver the urgently needed integrations.

Looking for faster results, the pharmaceutical company then attempted to use Informatica Cloud, Informatica’s cloud-based integration solution. The data integration initiative was again derailed, this time by the solution’s lack of maturity and functionality. The pharmaceutical company’s data was forced back on-premises, jeopardizing the entire cloud data warehouse initiative.

Data integration aligned with the cloud

But the IT team kept searching for the right data integration solution. “Cloud was instrumental to our plans, and we needed data integration that aligned with where we were headed,” said the senior business capability manager in charge of the integration project. The pharmaceutical company chose the SnapLogic Enterprise Integration Cloud.

After evaluating the platform on its own, the IT team was able to quickly build data integrations with SnapLogic; no specialized resources or consultants were required. To build the integrations into Redshift, the pharmaceutical company used the following Snaps (the underlying load pattern is sketched after the list):

  • Salesforce Snap
  • Redshift Snap
  • Various RDBMS Snaps
  • REST/SOAP Snaps
  • Transformation Snaps
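
SnapLogic pipelines are assembled visually from Snaps rather than written as code, but the underlying pattern the team implemented – extract records from a source such as Salesforce, stage them in Amazon S3, then bulk-load them into Redshift with a COPY command – can be sketched in a few lines of Python. The snippet below is an illustrative sketch only, not SnapLogic code; the bucket, table, role, and connection values are hypothetical, and boto3 and psycopg2 are simply common library choices for this pattern.

    # Illustrative sketch of the Salesforce -> S3 -> Redshift load pattern.
    # All names and credentials below are placeholders; in practice this flow
    # is configured visually with Snaps rather than hand-coded.
    import csv
    import io

    import boto3      # AWS SDK for Python
    import psycopg2   # PostgreSQL driver; Redshift speaks the Postgres wire protocol

    def stage_records_to_s3(records, bucket, key):
        """Serialize extracted records (a list of dicts) as CSV and stage them in S3."""
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

    def copy_into_redshift(conn_params, table, bucket, key, iam_role):
        """Bulk-load the staged file into Redshift using the COPY command."""
        copy_sql = f"""
            COPY {table}
            FROM 's3://{bucket}/{key}'
            IAM_ROLE '{iam_role}'
            FORMAT AS CSV
            IGNOREHEADER 1;
        """
        # The connection context manager commits the transaction on success.
        with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
            cur.execute(copy_sql)

    # Example usage (all identifiers are placeholders):
    # records = fetch_salesforce_accounts()   # e.g., pulled via the Salesforce REST API
    # stage_records_to_s3(records, "pharma-analytics-staging", "salesforce/accounts.csv")
    # copy_into_redshift(
    #     {"host": "cluster.example.redshift.amazonaws.com", "port": 5439,
    #      "dbname": "analytics", "user": "loader", "password": "..."},
    #     "sales.accounts", "pharma-analytics-staging", "salesforce/accounts.csv",
    #     "arn:aws:iam::123456789012:role/RedshiftCopyRole")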

With the data integration accomplished in a matter of days, the IT organization was assured that current skill sets could support the company’s future global BI architecture. In addition, the IT team found the SnapLogic Enterprise Integration Cloud easy enough for business users, such as the marketing team, to integrate new data into Redshift.

With Redshift’s vast pool of low-cost storage and compute resources, the analytic possibilities are nearly limitless – igniting the marketing team’s discovery of new strategies to drive new insights, revenues, and operational efficiencies.

SnapLogic delivers a “quanta” in improvement 

What is a “quanta”? Strictly speaking, it’s the plural of “quantum,” a physics term that describes “a discrete quantity of energy proportional in magnitude to the frequency of the radiation it represents.” If you’re not a physicist, your closest association is probably “quantum leap” – basically a gigantic leap forward.

Which is exactly what SnapLogic delivers. With regard to Informatica, Gaurav Dhillon, founder and CEO of SnapLogic, says:

“Fundamentally, I believe that SnapLogic is 10 times better than Informatica. That’s a design goal, and it’s also a necessary and sufficient condition for success. If a startup is going to survive, it’s got to have some 10x factor, some quanta of a value proposition.

“The quanta over the state of the art – the best-of-the-best of the incumbents – is vital. SnapLogic can fluently solve enterprise data problems almost as they are happening. That has a ‘wow’ factor people experience when they harness the power of our data integration technology.”

The SnapLogic Enterprise Integration Cloud is a mature, full-featured Integration Platform-as-a-Service (iPaaS) built in the cloud, for the cloud. Through its visual, automated approach to integration, the SnapLogic Enterprise Integration Cloud uniquely empowers both business and IT users, accelerating cloud data warehouse and analytics initiatives on Redshift and other cloud data warehouses.

Unlike on-premises ETL or immature cloud tools, SnapLogic combines ease of use, streaming scalability, on-premises and cloud integration, and managed connectors. Together, these capabilities present an improvement of up to 10 times over legacy ETL solutions such as Informatica or other “cloud-washed” solutions originally designed for on-premises use, accelerating cloud data warehouse integrations from months to days.

To learn more about how SnapLogic allows citizen data scientists to be productive with Amazon Redshift in days, not months, register for the webcast “Supercharge your Cloud Data Warehouse: 7 ways to achieve 10x improvement in speed and ease of Redshift integration.”

Craig Stewart is Vice President, Product Management at SnapLogic.

Discovery in overdrive: SnapLogic and Amazon Redshift power today’s pharma marketing

At its most fundamental, pharmaceutical marketing is based on straightforward market sizing and analytic baselines:

“The global market is composed of many submarkets [aka therapeutic categories] (TCs), whose number is given and equal to nTC. Each TC has a different number of patients (PatTC) in need of treatment for a specific disease, which determines the potential demand for drugs in each submarket. This number is set at the beginning of each simulation by drawing from a normal distribution [PatTC~N(μp,σp)] truncated in 0 to avoid negative values, and it is known by firms. Patients of each TC are grouped according to their willingness to buy drugs characterised by different qualities.”*
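
To make the quoted setup concrete, the sketch below draws the number of patients in each therapeutic category from a normal distribution truncated at zero, as the passage describes. It is a minimal illustration only; the parameter values are invented and nothing here comes from the cited paper.

    # Minimal sketch of the quoted market-sizing setup: draw patient counts per
    # therapeutic category (TC) from a normal distribution truncated at 0 so no
    # submarket ends up with a negative number of patients. Parameter values are
    # hypothetical, not taken from the cited paper.
    import numpy as np

    rng = np.random.default_rng(seed=42)

    def draw_patients_per_tc(n_tc, mu_p, sigma_p):
        """Return n_tc patient counts, redrawing any negative values (truncation at 0)."""
        patients = rng.normal(mu_p, sigma_p, size=n_tc)
        while (patients < 0).any():
            negatives = patients < 0
            patients[negatives] = rng.normal(mu_p, sigma_p, size=negatives.sum())
        return patients

    print(draw_patients_per_tc(n_tc=10, mu_p=50_000, sigma_p=20_000))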

Yet capturing market share in today’s competitive environment is anything but easy. In the recent past, an army of sales reps would market directly to doctors, their efforts loosely coupled with consumer advertising placed across demographically compatible digital and traditional media.

This “spray and pray” approach with promotional spending, while extremely common, made it difficult to pinpoint specific tactics that drove individual product revenues. Projections and sales data factored heavily into the campaign planning stage, and in reports that summarized weekly, monthly, and quarterly results, but the insights gleaned were nearly always backward-looking and without a predictive element.

A pharmaceutical company pinpoints opportunity

Today, sophisticated pharmaceutical marketers have a much firmer grasp of how to use data to drive sales in a predictive manner – by deploying resources with pinpoint precision. A case in point: To maximize the market share of a prescription asthma medication, a leading pharmaceutical company uses SnapLogic and Amazon Redshift to analyze and correlate enormous volumes of data on a daily basis, capitalizing on even the smallest market and environmental fluctuations.

  • Each night, the marketing team takes in pharmacy data from around the US to monitor sales in each region and learn how many units of the asthma medication were sold the previous day. These numbers are processed, analyzed, and reported back to the sales team the following morning, allowing reps to closely monitor progress against their sales objectives.
  • With this data, the pharmaceutical marketing team can monitor, at aggregate and territory levels, the gross impact of many variables including:
    • Consumer advertising campaigns
    • Rep incentive programs
    • News coverage of air quality and asthma
  • However, the pharmaceutical marketing team takes its exploration much deeper. Layered on top of the core sales data, the marketing team correlates weather data from the National Weather Service (NWS) and multiple data sets from the US Environmental Protection Agency (EPA), such as current air quality, historic air quality, and air quality over time. Like the sales data, the weather and EPA data cover the entire US.

By correlating these multiple data sets, the marketing team can extract extraordinary insights that improve tactical decisions and inform longer-term strategy (a simplified sketch of this kind of correlation follows the list below). At a very granular, local level, the team can see:

  • How optimal timing and placement of advertising across digital and traditional media drives demand
  • Which regional weather conditions stimulate the most sales in specific locales
  • The impact of rep incentive programs on sales
  • How news coverage of air quality and asthma influences demand
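
As a rough illustration of the analysis behind these insights, the sketch below joins daily sales by territory with NWS weather and EPA air-quality readings and computes simple per-territory correlations. The file names, column names, and choice of pandas are assumptions for illustration; they are not the pharmaceutical company’s actual schemas or tools.

    # Illustrative sketch: correlate daily prescription sales with weather and
    # EPA air-quality data at the territory/day level. All schemas are hypothetical.
    import pandas as pd

    # Nightly extracts, one row per territory per day (hypothetical columns noted).
    sales = pd.read_csv("daily_sales.csv", parse_dates=["date"])        # territory, date, units_sold
    weather = pd.read_csv("nws_weather.csv", parse_dates=["date"])      # territory, date, pollen_index, humidity
    air = pd.read_csv("epa_air_quality.csv", parse_dates=["date"])      # territory, date, aqi

    # Join the three sources on territory and day.
    combined = (
        sales.merge(weather, on=["territory", "date"])
             .merge(air, on=["territory", "date"])
    )

    # First pass: how strongly do local conditions move with daily sales in each territory?
    correlations = (
        combined.groupby("territory")[["units_sold", "aqi", "pollen_index", "humidity"]]
                .corr()
                .xs("units_sold", level=1)[["aqi", "pollen_index", "humidity"]]
    )
    print(correlations.sort_values("aqi", ascending=False).head())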

Ultimately, the pharmaceutical marketing team can identify, with uncanny precision, the markets in which to concentrate spending on local and regional media – targets that can change constantly. In this way, prospective consumers are targeted with laser-like accuracy, raising their awareness of the pharmaceutical company’s asthma medication at the time they need it most.

The results of the targeted marketing strategy are clear: The pharmaceutical company has enjoyed significant market share growth with its asthma relief medication, while reducing advertising costs due to more effective targeting.

Tools to empower business users

The pharmaceutical industry example illustrates perhaps the biggest trend in recent business history: massive demand for massive amounts of data, to provide insight and drive informed decision-making. But five years after data scientist was named “the sexiest job of the 21st century,” it’s not data scientists who are gathering, correlating, and analyzing all this data; at the most advanced companies, it’s business users. At the pharmaceutical company and countless others like it, the analytics explosion is being ignited by “citizen data scientists” using SnapLogic and Redshift.

In my next blog post, the second of this two-part series, I’ll talk about how SnapLogic turned around a failing initial integration effort at the pharmaceutical company, replacing Informatica PowerCenter and Informatica Cloud.

To find out more on how to use SnapLogic with Amazon Redshift to ignite discovery within your organization, register for the webcast “Supercharge your Cloud Data Warehouse: 7 ways to achieve 10x improvement in speed and ease of Redshift integration.”

Craig Stewart is Vice President, Product Management at SnapLogic.

* JASSS, A Simulation Model of the Evolution of the Pharmaceutical Industry: A History-Friendly Model, October 2013

Applying machine learning tools to data integration

By Gregory D. Benson

Few tasks are more personally rewarding than working with brilliant graduate students on research problems that have practical applications. This is exactly what I get to do as both a Professor of Computer Science at the University of San Francisco and as Chief Scientist at SnapLogic. Each semester, SnapLogic sponsors student research and development projects for USF CS project classes, and I am given the freedom to work with these students on new technology and exploratory projects that we believe will eventually impact the SnapLogic Enterprise Integration Cloud Platform. Iris and the Integration Assistant, which applies machine learning to the creation of data integration pipelines, represents one of these research projects that pushes the boundaries of self-service data integration.

For the past seven years, these research projects have provided SnapLogic Labs with bright minds and at the same time given USF students exposure to problems found in real-world commercial software. I have been able to leverage my past 19 years of research and teaching at USF in parallel and distributed computing to help formulate research areas that enable students to bridge their academic experience with problems found in large-scale software that runs in the cloud. Project successes include Predictive Field Linking, the first SnapLogic MapReduce implementation called SnapReduce, and the Document Model for data integration. It is a mutually beneficial relationship.

During the research phase of Labs projects, the students have access to the SnapLogic engineering team, and can ask questions and get feedback. This collaboration allows the students to ramp up quickly with our codebase and gets the engineering team familiar with the students. Once we have prototyped and demonstrated the potential for a research project, we transition the code to production. But the relationship doesn’t end there – students who did the research work are usually hired on to help with transitioning the prototype to production code.

The SnapLogic Philosophy
Iris technology was born to help an increasing number of business users design and implement data integration tasks that previously required extensive programming skills. Most companies must manage an increasing number of data sources and cloud applications, as well as ever-growing data volumes. And it’s data integration platforms that help businesses connect and transform all of this disparate data. The SnapLogic philosophy has always been to truly provide self-service integration through visual programming. Iris and the Integration Assistant further advances this philosophy by learning from the successes and failures of thousands of pipelines and billions of executions on the SnapLogic platform.

The Project
Two years ago, I led a project that refined our metadata architecture, and last year I proposed a machine learning project for USF students. At the time, I offered only some vague ideas about what we could achieve. The plan was to spend the first part of the project doing data science on the SnapLogic metadata to see what patterns we could find and what opportunities there were for applying machine learning.

One of the USF graduate students working on the project, Thanawut “Jump” Anapiriyakul, discovered that we could learn from past pipeline definitions in our metadata to help recommend likely next Snaps during pipeline creation. Jump experimented with several machine learning algorithms to find the ones that give the best recommendation accuracy. We later combined the pipeline definition with Snap execution history to further improve recommendation accuracy. The end result: Pipeline creation is now much faster with the Integration Assistant.
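
As a rough illustration of the general idea – not SnapLogic’s actual Iris implementation – a next-Snap recommender can be bootstrapped from nothing more than the Snap-to-Snap transition frequencies observed in historical pipeline definitions. The pipelines and Snap names below are made up for the example.

    # Toy sketch of next-Snap recommendation learned from historical pipelines.
    # This illustrates the general idea only; it is not the Iris implementation,
    # and the pipelines and Snap names below are invented.
    from collections import Counter, defaultdict

    # Each historical pipeline is an ordered list of Snap types.
    historical_pipelines = [
        ["Salesforce Read", "Mapper", "Redshift Bulk Load"],
        ["Salesforce Read", "Filter", "Mapper", "Redshift Bulk Load"],
        ["REST Get", "JSON Parser", "Mapper", "Redshift Bulk Load"],
        ["Salesforce Read", "Mapper", "S3 File Writer"],
    ]

    # Count how often each Snap is immediately followed by each other Snap.
    transitions = defaultdict(Counter)
    for pipeline in historical_pipelines:
        for current_snap, next_snap in zip(pipeline, pipeline[1:]):
            transitions[current_snap][next_snap] += 1

    def recommend_next(current_snap, k=3):
        """Return the k Snaps that most often followed current_snap historically."""
        return [snap for snap, _ in transitions[current_snap].most_common(k)]

    print(recommend_next("Salesforce Read"))   # ['Mapper', 'Filter']
    print(recommend_next("Mapper"))            # ['Redshift Bulk Load', 'S3 File Writer']

In practice, richer signals can feed proper classification or ranking models – which is the direction the project took by combining pipeline definitions with Snap execution history to improve recommendation accuracy.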

The exciting thing about the Iris technology is that we have created an internal metadata architecture that supports not only the Integration Assistant but also the data science needed to further leverage historical user activity and pipeline executions to power future applications of machine learning in the SnapLogic Enterprise Cloud. In my view, true self-service in data integration will only be possible through the application of machine learning and artificial intelligence as we are doing at SnapLogic.

As for the students who work on SnapLogic projects, most are offered internships and many eventually become full-time software engineers at SnapLogic. It is very rewarding to continue to work with my students after they graduate. After ceremonies this May at USF, Jump will join SnapLogic full-time this summer, working with the team on extending Iris and its capabilities.

I look forward to writing more about Iris and our recent technology advances in the weeks to come. In the meantime, you can check out my past posts on JSON-centric iPaaS and Hybrid Batch and Streaming Architecture for Data Integration.

Gregory D. Benson is a Professor in the Department of Computer Science at the University of San Francisco and Chief Scientist at SnapLogic. Follow him on Twitter @gregorydbenson.

Podcast: James Markarian and David Linthicum on New Approaches to Cloud Integration

SnapLogic CTO James Markarian recently joined cloud expert David Linthicum as a guest on the Doppler Cloud Podcast. The two discussed the mass movement to the cloud and how this is changing how companies approach both application and data integration.

In this 20-minute podcast, “Data Integration from Different Perspectives,” the pair discuss how to navigate the new realities of hybrid app integration, data and analytics moving to the cloud, user demand for self-service technologies, the emerging impact of AI and ML, and more.

You can listen to the full podcast here, and below:


VIDEO: SnapLogic Discusses Big Data on #theCUBE from Strata+Hadoop World San Jose

It’s Big Data Week here in Silicon Valley with data experts from around the globe convening at Strata+Hadoop World San Jose for a packed week of keynotes, education, networking and more – and SnapLogic was front-and-center for all the action.

SnapLogic stopped by theCUBE, the popular video-interview show that live-streams from top tech events, and joined hosts Jeff Frick and George Gilbert for a spirited and wide-ranging discussion of all things Big Data.

First up was SnapLogic CEO Gaurav Dhillon, who discussed SnapLogic’s record-growth year in 2016, the acceleration of Big Data moving to the cloud, SnapLogic’s strong momentum working with AWS Redshift and Microsoft Azure platforms, the emerging applications and benefits of ML and AI, customers increasingly ditching legacy technology in favor of modern, cloud-first, self-service solutions, and more. You can watch Gaurav’s full video below, and here:

Next up was SnapLogic Chief Enterprise Architect Ravi Dharnikota, together with our customer, Katharine Matsumoto, Data Scientist at eero. A fast-growing Silicon Valley startup, eero makes a smart wireless networking system that intelligently routes data traffic on your wireless network in a way that reduces buffering and gets rid of dead zones in your home. Katharine leads a small data and analytics team and discussed how, with SnapLogic’s self-service cloud integration platform, she’s able to easily connect a myriad of ever-growing apps and systems and make important data accessible to as many as 15 different line-of-business teams, thereby empowering business users and enabling faster business outcomes. The pair also discussed ML and IoT integration, which are helping eero consistently deliver an increasingly smart and powerful product to customers. You can watch Ravi and Katharine’s full video below, and here: