Moving your data warehouse to the cloud: Look before you jump

By Ravi Dharnikota

Where’s your data warehouse? Is it still on-premises? If so, you’re not alone. Way back in 2011, in its IT predictions for 2012 and beyond, Gartner said, “At year-end 2016, more than 50 percent of Global 1000 companies will have stored customer-sensitive data in the public cloud.”

While it’s hard to find an exact statistic on how many enterprise data warehouses have migrated, cloud warehousing is increasingly popular as companies struggle with growing data volumes, service-level expectations, and the need to integrate structured warehouse data with unstructured data in a data lake.

Cloud data warehousing provides many benefits, but getting there isn’t easy. Migrating an existing data warehouse to the cloud is a complex process of moving schema, data, and ETL. The complexity increases when the database schema must be restructured or data pipelines rebuilt.

This post is the first in a “look before you leap” three-part series on how to jump-start your migration of an existing data warehouse to the cloud. As part of that, I’ll also cover how cloud-based data integration solutions can significantly speed your time to value.

Beyond basic: The benefits of cloud data warehousing

Cloud data warehousing is a Data Warehouse as a Service (DWaaS) approach that simplifies the time-consuming and costly management, administration, and tuning activities typical of on-premises data warehouses. But beyond the obvious (the data warehouse being stored in the cloud), there’s more. Processing is also cloud-based, and major solution providers typically charge separately for storage and compute resources, both of which are highly scalable.

All of which leads us to a more detailed list of key advantages:

  • Scale up (and down): The volume of data in a warehouse typically grows at a steady pace as time passes and history is collected. Sudden upticks in data volume occur with events such as mergers and acquisitions, and when new subjects are added. The inherent scalability of a cloud data warehouse allows you to adapt to growth, adding resources incrementally (via automated or manual processes) as data and workload increase. The elasticity of cloud resources allows the data warehouse to quickly expand and contract data and processing capacity as needed, with no impact to infrastructure availability, stability, performance, and security.
  • Scale out: Adding more concurrent users requires the cloud data warehouse to scale out. As the number of concurrent users rises, you can add more resources (either more nodes to an existing cluster or an entirely new cluster, depending on the situation), allowing more users to access the same data without degrading query performance; a sketch of programmatic resizing follows this list.
  • Managed infrastructure: Eliminating the overhead of data center management and operations for the data warehouse frees up resources to focus where value is produced: using the data warehouse to deliver information and insight.
  • Cost savings: On-premises data centers themselves are extremely expensive to build and operate, requiring staff, servers, and hardware, networking, floor space, power, and cooling. (This comparison site provides hard dollar data on many data center elements.) When your data warehouse lives in the cloud, the operating expense in each of these areas is eliminated or substantially reduced.
  • Simplicity: Cloud data warehouse resources can be accessed through a browser and activated with a payment card. Fast self-service removes IT middlemen and democratizes access to enterprise data.
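To make the scale-out point above concrete, here is a minimal sketch of resizing a warehouse cluster programmatically. It assumes an Amazon Redshift cluster (hypothetically named "analytics-dw") and the boto3 SDK with AWS credentials already configured; other cloud warehouses expose equivalent controls through their consoles, SQL commands, or APIs.

```python
# Minimal sketch: resizing a cloud data warehouse cluster on demand.
# Assumes a hypothetical Amazon Redshift cluster named "analytics-dw"
# and AWS credentials already configured for boto3.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

# Inspect the cluster's current size and status.
cluster = redshift.describe_clusters(ClusterIdentifier="analytics-dw")["Clusters"][0]
print("Current nodes:", cluster["NumberOfNodes"], "status:", cluster["ClusterStatus"])

# Scale out to handle more concurrent users; scale back in the same way later.
redshift.modify_cluster(ClusterIdentifier="analytics-dw", NumberOfNodes=8)
```

The same elasticity applies in reverse: when the workload drops, the node count can be reduced so you stop paying for idle capacity.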

In my next post, I’ll do a quick review of additional benefits and then dive into data migration. If you’d like to read all the details about the benefits, techniques, and challenges of migrating your data warehouse to the cloud, download the Eckerson Group white paper, “Jump-Start Your Cloud Data Warehouse: Meeting the Challenges of Migrating to the Cloud.”

Ravi Dharnikota is Chief Enterprise Architect at SnapLogic. Follow him on Twitter @rdharn1

How to get valuable insights on data stored in Azure Data Lake Store

In a previous blog post, I discussed major trends in the data integration space, including customers moving from on-premises to the cloud. Here I’d like to focus on one of those trends: moving data from on-premises or cloud data analytics platforms into a Data Lake technology such as Azure Data Lake.

What is a Data Lake?

A Data Lake is a repository that stores large amounts of data in its raw, native form, both structured and unstructured, in one location. This data can come from various sources, and the Data Lake can act as a single source of truth for an organization. From an architecture standpoint, data first lands in the data swamp/data acquisition zone, is then cleansed and transformed in the data transformation zone, and is finally published to gain business insights.

[Diagram: Data Lake architecture showing data acquisition, data transformation, and data publish zones]

As seen in the diagram above, enterprises have multiple systems such as ERP, CRM, RDBMS, NoSQL, IoT sensors, and so on. Because this data is scattered across disparate systems, it is difficult to pull together. A Data Lake brings all the data under one roof (data acquisition) using one of the following services:

  • Azure Blob
  • Azure Data Lake Store
  • Amazon S3
  • HDFS
  • Others

Data stored in one of these services can then be transformed in the following ways (a short example follows the list):

  • Aggregate
  • Sort
  • Join
  • Merge
  • Others
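As an illustration of these transformations, here is a minimal sketch using pandas on a few rows of made-up data; at data-lake scale the same join, aggregation, and sort would typically run in an engine such as Spark, Hive, or U-SQL.

```python
# Minimal sketch of the transformation step on small, made-up data.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [120.0, 80.0, 45.5, 210.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["West", "East", "West"],
})

# Join (merge) the raw datasets, then aggregate and sort.
enriched = orders.merge(customers, on="customer_id", how="left")
by_region = (
    enriched.groupby("region", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
)
print(by_region)
```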

The transformed data is then moved to the data publish/data access zone (which may use the same storage services as data acquisition), where users can query the data with tools such as the following (a short example follows the list):

  • Microsoft’s U-SQL
  • Amazon Athena
  • Hive
  • Presto
  • Others
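For example, once the published data is exposed through a Hive table, an analyst could query it from Python. This is only a sketch; it assumes the PyHive package, a reachable HiveServer2 endpoint, and a hypothetical table named purchases.

```python
# Minimal sketch of the data access step: querying published data with Hive.
# The host, username, and table name ("purchases") are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute(
    "SELECT product, COUNT(*) AS num_purchases FROM purchases GROUP BY product"
)
for product, num_purchases in cursor.fetchall():
    print(product, num_purchases)
```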

The bottom line is that a Data Lake can serve as a platform to run analytics in order to provide better customer experience, recommendations, and more.

Azure Data Lake is one such Data Lake from Microsoft, and the repository used to store all the data is Azure Data Lake Store. On top of this data store, users can run Azure Data Lake Analytics or HDInsight, or use U-SQL, a big data query language, to gain better business insights.

[Diagram: Azure Data Lake Store. Source: Microsoft]

Azure Data Lake Store (ADLS) can store any data in its native format, and one of its goals is to bring together data from disparate sources. The SnapLogic Enterprise Integration Cloud, with its pre-built connectors called Snaps, helps by quickly moving data from different systems into the data store.

ADLS exposes a fairly complex API that applications use to store data. SnapLogic abstracts these complexities via Snaps, so users can easily move data from various systems to ADLS without needing to know anything about the underlying APIs.
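For a sense of what a Snap hides, here is a minimal sketch of writing a file to ADLS directly through Microsoft’s azure-datalake-store Python SDK (for ADLS Gen1); the tenant, client, store, and path names are hypothetical.

```python
# Minimal sketch: uploading a local extract to Azure Data Lake Store via the
# azure-datalake-store Python SDK (ADLS Gen1). All identifiers are hypothetical.
from azure.datalake.store import core, lib, multithread

# Authenticate with an Azure AD service principal.
token = lib.auth(tenant_id="my-tenant-id",
                 client_id="my-client-id",
                 client_secret="my-client-secret")

adls = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Upload the file into the raw/acquisition zone of the lake.
multithread.ADLUploader(adls,
                        rpath="/raw/orders/orders.csv",
                        lpath="orders.csv",
                        nthreads=4,
                        overwrite=True)
```

A Snap wraps these authentication and upload details behind configuration, which is what makes the no-code approach practical.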

Use case

A business needs to track and analyze content to better recommend products or services to its customers. Its data – from various sources such as Oracle, files, Twitter, etc. – needs to be stored in a central repository such as ADLS so that business users can run analytics on top of it to measure customers’ buying behavior, interests, and purchased products.

Here’s a sample pipeline that can address this use case using Snaps:

Using the File Writer Snap and selecting the Azure Data Lake account, one can store the data merged from various systems into Azure Data Lake with ease.

All in all, the Data Lake can be a one-stop shop for storing any data, giving users more ways to derive insights from multiple data sources. And SnapLogic makes it quick and easy for users to move their data into the Data Lake (in this case, an Azure Data Lake Store).

Pavan Venkatesh is Senior Product Manager at SnapLogic. Follow him on Twitter @pavankv.

Big Data Ingestion Patterns: Ingesting Data from Cloud & Ground Sources into Hive

What is Apache Hive? Hive provides a mechanism to query, create, and manage large datasets stored on Hadoop, using SQL-like statements. It also enables adding structure to existing data that resides on HDFS. In this post, I’ll describe a practical approach to ingesting data into Hive with the SnapLogic Elastic Integration Platform, without the need to write code.
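To make the “structure over existing HDFS data” idea concrete, here is a minimal sketch (not the no-code SnapLogic approach the post describes) that issues HiveQL from Python via PyHive; the host, path, and table name are hypothetical.

```python
# Minimal sketch: defining a Hive table over files already stored on HDFS,
# then querying it with a SQL-like statement. All names and paths are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000, username="etl")
cursor = conn.cursor()
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_events (
        event_time STRING,
        user_id    STRING,
        url        STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/web_events'
""")
cursor.execute("SELECT COUNT(*) FROM web_events")
print(cursor.fetchone())
```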

Continue reading “Big Data Ingestion Patterns: Ingesting Data from Cloud & Ground Sources into Hive”

SnapLogic CTO James Markarian on DisrupTV

SnapLogic CTO James Markarian recently appeared as a guest on DisrupTV, a weekly live-interview web-series produced by analyst firm Constellation Research and hosted by R “Ray” Wang and Vala Afshar. The trio discussed a variety of enterprise topics including modern data management, data lake strategy considerations and big data analytics.

Continue reading “SnapLogic CTO James Markarian on DisrupTV”

SnapLogic CEO Gaurav Dhillon on Andreessen Horowitz Podcast

SnapLogic co-founder and CEO Gaurav Dhillon sat down recently with Scott Kupor, managing partner at Andreessen Horowitz, for a wide-ranging podcast discussion of all-things-data.

The two discussed how the data management landscape has changed in recent years, the rise of advanced analytics, the move from data warehouses to data lakes, and other changes which are enabling organizations to “take back their enterprise.”

Continue reading “SnapLogic CEO Gaurav Dhillon on Andreessen Horowitz Podcast”

SnapLogic CTO James Markarian Discusses the Evolving Big Data Landscape on theCUBE

SnapLogic was in New York this week for Strata + Hadoop World NYC, and our CTO James Markarian took the opportunity to sit down with Dave Vellante and George Gilbert, hosts of theCUBE, for a wide-ranging discussion on the shifting big data landscape.

Continue reading “SnapLogic CTO James Markarian Discusses the Evolving Big Data Landscape on theCUBE”

SnapLogic Introduces Intelligent Connectors for Microsoft Azure Data Lake Store

SnapLogic announced the availability of new pre-built intelligent connectors – called Snaps – for Microsoft Azure Data Lake Store. The new Snaps provide fast, self-service data ingestion and transformation from virtually any source – whether on-premises, in the cloud or in hybrid environments – to Microsoft’s highly-scalable, cloud-based repository for big data analytics workloads. This latest integration between SnapLogic and Microsoft Azure helps enterprise customers gain new insights and unlock business value from their cloud-based big data initiatives.

Continue reading “SnapLogic Introduces Intelligent Connectors for Microsoft Azure Data Lake Store”