Moving your data warehouse to the cloud: Look before you jump

By Ravi Dharnikota

Where’s your data warehouse? Is it still on-premises? If so, you’re not alone. Way back in 2011, in its IT predictions for 2012 and beyond, Gartner said, “At year-end 2016, more than 50 percent of Global 1000 companies will have stored customer-sensitive data in the public cloud.”

While it’s hard to find an exact statistic on how many enterprise data warehouses have migrated, cloud warehousing is increasingly popular as companies struggle with growing data volumes, service-level expectations, and the need to integrate structured warehouse data with unstructured data in a data lake.

Cloud data warehousing provides many benefits, but getting there isn’t easy. Migrating an existing data warehouse to the cloud is a complex process of moving schema, data, and ETL. The complexity increases when the database schema must be restructured or data pipelines rebuilt.

This post is the first in a “look before you leap” three-part series on how to jump-start your migration of an existing data warehouse to the cloud. As part of that, I’ll also cover how cloud-based data integration solutions can significantly speed your time to value.

Beyond basic: The benefits of cloud data warehousing

Cloud data warehousing is a Data Warehouse as a Service (DWaaS) approach that simplifies the time-consuming and costly management, administration, and tuning activities typical of on-premises data warehouses. But beyond the obvious fact that the data warehouse is stored in the cloud, there’s more. Processing is also cloud-based, and all major solution providers charge separately for storage and compute resources, both of which are highly scalable.

All of which leads us to a more detailed list of key advantages:

  • Scale up (and down): The volume of data in a warehouse typically grows at a steady pace as time passes and history accumulates. Sudden upticks in data volume occur with events such as mergers and acquisitions, and when new subjects are added. The inherent scalability of a cloud data warehouse allows you to adapt to growth, adding resources incrementally (via automated or manual processes) as data and workload increase. The elasticity of cloud resources allows the data warehouse to quickly expand and contract data and processing capacity as needed, with no impact on infrastructure availability, stability, performance, or security.
  • Scale out: Adding more concurrent users requires the cloud data warehouse to scale out. You will be able to add more resources – either more nodes to an existing cluster or an entirely new cluster, depending on the situation – as the number of concurrent users rises, allowing more users to access the same data without query performance degradation (see the sketch after this list).
  • Managed infrastructure: Eliminating the overhead of data center management and operations for the data warehouse frees up resources to focus where value is produced: using the data warehouse to deliver information and insight.
  • Cost savings: On-premises data centers are extremely expensive to build and operate, requiring staff, server hardware, networking, floor space, power, and cooling. (This comparison site provides hard dollar data on many data center elements.) When your data warehouse lives in the cloud, the operating expense in each of these areas is eliminated or substantially reduced.
  • Simplicity: Cloud data warehouse resources can be accessed through a browser and activated with a payment card. Fast self-service removes IT middlemen and democratizes access to enterprise data.
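
To make the scale-out point concrete, here is a minimal sketch, assuming an Amazon Redshift cluster managed through the boto3 SDK. The cluster identifier, region, and node count are hypothetical placeholders, and production code would add error handling and poll for the resize to complete:

```python
# A minimal sketch of cloud warehouse "scale out", assuming an Amazon
# Redshift cluster managed via boto3. The cluster identifier, region,
# and node count below are hypothetical placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

def scale_out(cluster_id: str, target_nodes: int) -> None:
    """Request additional nodes for an existing cluster; Redshift
    redistributes the data while the cluster remains available."""
    cluster = redshift.describe_clusters(ClusterIdentifier=cluster_id)["Clusters"][0]
    if target_nodes > cluster["NumberOfNodes"]:
        redshift.resize_cluster(
            ClusterIdentifier=cluster_id,
            NumberOfNodes=target_nodes,
            Classic=False,  # elastic resize: minutes rather than hours
        )

scale_out("analytics-dw", target_nodes=8)  # hypothetical cluster name
```

Snowflake, BigQuery, and the other major platforms expose equivalent controls; the point is that adding or removing capacity is an API call rather than a hardware procurement cycle.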

In my next post, I’ll do a quick review of additional benefits and then dive into data migration. If you’d like to read all the details about the benefits, techniques, and challenges of migrating your data warehouse to the cloud, download the Eckerson Group white paper, “Jump-Start Your Cloud Data Warehouse: Meeting the Challenges of Migrating to the Cloud.”

Ravi Dharnikota is Chief Enterprise Architect at SnapLogic. Follow him on Twitter at @rdharn1.

Will the Cloud Save Big Data?

This article was originally published on ITProPortal.

Employees up and down the value chain are eager to dive into big data solutions, hunting for golden nuggets of intelligence to help them make smarter decisions, grow customer relationships and improve business efficiency. To do this, they’ve been faced with a dizzying array of technologies – from open source projects to commercial software products – as they try to wrestle big data to the ground.

Today, a lot of the headlines and momentum center on some combination of Hadoop, Spark and Redshift – all of which can be springboards for big data work. It’s important to step back, though, and look at where we are in big data’s evolution.

In many ways, big data is in the midst of transition. Hadoop is hitting its pre-teen years, having launched in April 2006 as an official Apache project – and then taking the software world by storm as a framework for distributed storage and processing of data, based on commodity hardware. Apache Spark is now hitting its stride as a “lightning fast” streaming engine for large-scale data processing. And various cloud data warehousing and analytics platforms are emerging, from big names (Amazon Redshift, Microsoft Azure HDInsight and Google BigQuery) to upstart players like Snowflake, Qubole and Confluent.

The challenge is that most big data progress over the past decade has been limited to big companies with big engineering and data science teams. The systems are often complex, immature, hard to manage, and frequently changing – which might be fine if you’re in Silicon Valley, but doesn’t play well in the rest of the world. What if you’re a consumer goods company like Clorox, or a midsize bank in the Midwest, or a large telco in Australia? Can this be done without deploying 100 Java engineers who know the technology inside and out?

At the end of the day, most companies just want better data and faster answers – they don’t want the technology headaches that come along with it. Fortunately, the “mega trend” of big data is now colliding with another mega trend: cloud computing. While Hadoop and other big data platforms have been maturing slowly, the cloud ecosystem has been maturing more quickly – and the cloud can now help fix a lot of what has hindered big data’s progress.

The problems customers have encountered with on-premises Hadoop are often the same problems that were faced with on-premises legacy systems: there simply aren’t enough of the right people to get everything done. Companies want cutting-edge capabilities, but they don’t want to deal with bugs and broken integrations and rapidly changing versions. Plus, consumption models are changing – we want to consume data, storage and compute on demand. We don’t want to overbuy. We want access to infrastructure when and how we want it, with just as much as we need but no more.

Big Data’s Tipping Point is in the Cloud

In short, the tipping point for big data is about to happen – and it will happen via the cloud. The first wave of “big data via the cloud” was simple: companies like Cloudera put their software on Amazon. But what’s “truly cloud” is not having to manage Hadoop or Spark – moving the complexity back into a hosted infrastructure, so someone else manages it for you. To that end, Amazon, Microsoft and Google now deliver “managed Hadoop” and “managed Spark” – you just worry about the data you have, the questions you have and the answers you want. No need to spin up a cluster, research new products or worry about version management. Just load your data and start processing.
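
To illustrate the “just load your data and start processing” model, here is a hedged sketch, assuming Amazon EMR (one of the managed-Hadoop services mentioned above) and boto3. The bucket, script path, and IAM role names are placeholders, not anything from this post:

```python
# A sketch of "managed Spark": ask the cloud provider (here Amazon EMR)
# to run a Spark job on a transient cluster, with no cluster for us to
# operate. All names and paths below are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="managed-spark-demo",
    ReleaseLabel="emr-6.3.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down afterwards
    },
    Steps=[{
        "Name": "process-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder job
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles, assumed to exist
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

No cluster sizing spreadsheets, no version management: the provider runs Spark, and the cluster exists only for the life of the job.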

There are three significant and not always obvious benefits to managing big data via the cloud:

  1. Predictability: The infrastructure and management burden shifts to cloud providers, and you simply consume services that you can scale up or down as needed.
  2. Economics: Unlike on-premises Hadoop, where compute and storage were intermingled, the cloud separates compute and storage so you can provision each accordingly and benefit from commodity economics.
  3. Innovation: New software, infrastructure, and best practices are deployed continuously by cloud providers, so you can take full advantage without all the upfront time and cost.

Of course, there’s still plenty of hard work to do, but it’s more focused on the data and the business, and not the infrastructure. The great news for mainstream customers (well beyond Silicon Valley) is that another mega-trend is kicking in to revolutionize data integration and data consumption – and that’s the move to self-service. Thanks to new tools and platforms, “self-service integration” is making it fast and easy to create automated data pipelines with no coding, and “self-service analytics” is making it easy for analysts and business users to manipulate data without IT intervention.

All told, these trends are driving a democratization of data that’s very exciting – and will drive significant impact across horizontal functions and vertical industries. Data is thus becoming a more fluid, dynamic and accessible resource for all organizations. IT no longer holds the keys to the kingdom – and developers no longer control the workflow. Just in the nick of time, too, as the volume and velocity of data from digital and social media, mobile tools and edge devices threaten to overwhelm us all. Once the full promise of the Internet of Things, Artificial Intelligence and Machine Learning begins to take hold, the data overflow will be truly inundating.

The only remaining question: What do you want to do with your data?

Ravi Dharnikota is the Chief Enterprise Architect at SnapLogic. 

The 3 A’s of Enterprise Integration

This post originally appeared on Data Informed.

As organizations look to increase their agility, IT and lines of business need to connect faster. Companies need to adopt cloud applications more quickly and they need to be able to access and analyze all their data, whether from a legacy data warehouse, a new SaaS application, or an unstructured data source such as social media. In short, a unified integration platform has become a critical requirement for most enterprises.

According to Gartner, “unnecessarily segregated application and data integration efforts lead to counterproductive practices and escalating deployment costs.”

Don’t let your organization get caught in that trap. Whether you are evaluating what you already have or shopping for something completely new, you should measure any platform by how well it addresses the “three A’s” of integration: Anything, Anytime, Anywhere.

Collaborations in Building Hybrid Cloud Computing and Data Integrations

Post first published by Ravi Dharnikota on LinkedIn.

It’s one thing to create application and data integrations; it’s an even bigger challenge to collaborate with other teams in the enterprise to reuse and repurpose and standardize on what has already been built.

The need for seamless content collaboration is a key ingredient for overall success in app and data integrations, just as it is in app development and delivery. A platform that allows for easy sharing of information between employees is the difference between adoption throughout the enterprise and becoming shelf-ware.

The SnapLogic Hadooplex in Action

I recently wrote about how SnapLogic’s Hadooplex achieves elastic scalability, running as a native YARN application in a Hadoop cluster. As I noted in the post:

“As the workload increases, the application master requests the YARN ResourceManager to spin up more Hadooplex nodes one at a time, as shown in the diagram below. This scale out occurs dynamically until either the workload starts decreasing or the maximum number of Hadooplex nodes allowed has been reached.

As the workload decreases, the nodes start spinning down. This is how SnapLogic achieves elastic scaling based on workload volumes within a Hadoop cluster, utilizing the YARN ResourceManager. This is possible only if an application is a native YARN application.”

I wanted to take this further by showing what this looks like in a SnapLogic Elastic Integration Platform demonstration. In this demo, you can see how the Hadooplex, which is the run-time execution engine, elastically scales depending on the workload.

You can read more about SnapLogic big data processing platforms in this paper and check out more SnapLogic demonstrations here. Be sure to also check out our upcoming webinar with Mark Madsen, which will focus on the new reference architecture for the enterprise data lake.

The SnapLogic Hadooplex: Achieving Elastic Scalability Using YARN

YARN, a major advancement in Hadoop 2.0, is a resource manager that separates out the execution and processing management from the resource management capabilities of MapReduce. Like an operating system on a server, YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform.

Developers are no longer limited to writing multi-pass MapReduce programs, with disadvantages like high latency, when a better option can be modeled using a directed acyclic graph (DAG) approach.

Any application, including the likes of Spark, can be deployed onto an existing Hadoop cluster and take advantage of YARN for scheduling and resource allocation. This is also the basic ingredient of the Hadooplex in SnapLogic: using YARN to achieve elastic scale out and scale in for integration jobs.

The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks.

SnapLogic’s application master is responsible for negotiating resources with the ResourceManager. The control plane in SnapLogic is the brain (read this post on software defined integration), which holds all critical information and helps make logical decisions for scale out and scale in. The Hadooplex is the actual application itself that runs the workload.

In this diagram you can see that the Hadooplex reports its workload information to the control plane at regular intervals. The application master gets the load information from the control plane, also at regular intervals.

[Diagram: the Hadooplex reporting workload information to the control plane]

As the workload increases, the application master requests the YARN ResourceManager to spin up more Hadooplex nodes one at a time, as shown in the diagram below. This scale out occurs dynamically until either the workload starts decreasing or the maximum number of Hadooplex nodes allowed has been reached.

[Diagram: Hadooplex nodes scaling out one at a time]

As the workload decreases, the nodes start spinning down. This is how SnapLogic achieves elastic scaling based on workload volumes within a Hadoop cluster, utilizing the YARN ResourceManager. This is possible only if an application is a native YARN application. (Read about the importance of YARN-native here.)
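
To make that loop concrete, below is an illustrative sketch of the scale-out/scale-in decision just described. This is not SnapLogic code: the thresholds, per-node capacity, and load figures are hypothetical stand-ins for what the application master actually negotiates with the YARN ResourceManager.

```python
# Illustrative sketch of the elastic scaling decision described above.
# Thresholds and capacities are hypothetical; in the real system the
# application master requests and releases YARN containers.
MAX_NODES = 10               # cap on Hadooplex nodes (assumption)
SCALE_OUT_THRESHOLD = 0.8    # utilization above this adds a node
SCALE_IN_THRESHOLD = 0.3     # utilization below this removes a node

def rebalance(active_nodes: int, workload: float, capacity_per_node: float) -> int:
    """Return the desired node count for the next interval, moving
    one node at a time, as the post describes."""
    utilization = workload / (active_nodes * capacity_per_node)
    if utilization > SCALE_OUT_THRESHOLD and active_nodes < MAX_NODES:
        return active_nodes + 1  # ask the ResourceManager for one more node
    if utilization < SCALE_IN_THRESHOLD and active_nodes > 1:
        return active_nodes - 1  # let one node spin down
    return active_nodes

# Workload reported by the control plane at regular intervals (made up).
nodes = 1
for load in [5.0, 9.0, 14.0, 14.0, 6.0, 2.0]:
    nodes = rebalance(nodes, load, capacity_per_node=4.0)
    print(f"load={load:>5} -> nodes={nodes}")
```

Note the properties the post calls out: nodes are added one at a time under load, capped at a maximum, and released again as the workload falls.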


This post originally appeared on LinkedIn.

Various Flavors of YARN Integrations

YARN is the prerequisite for Enterprise Hadoop. It provides resource management across Hadoop clusters and extends the power of Hadoop to new technologies so that they can take advantage of cost-effective, linear-scale storage and processing. It provides ISVs and developers with a consistent framework for writing data access applications that run in Hadoop.

Customers building a data lake expect to operate on the data without moving it to other systems, leveraging the processing resources of the data lake. Applications that use YARN fulfill that promise, lowering operational costs while improving quality and time-to-insight.

Integration with YARN

To harness the power of YARN, a third-party application can either use YARN natively or use a YARN framework (Apache Tez, Apache Slider, etc.); if it does neither, it most likely reads directly from HDFS.

There are three broad options for integrating with YARN.

  1. Full control, or YARN native: Fine-grained control of cluster resources, which allows elastic scaling.
  2. Interaction through an existing YARN framework like MapReduce: Limited to one mode – batch, interactive, or real time. No support for elastic scaling using YARN.
  3. Interaction with applications already running on a YARN framework, such as Hive: Limited to very specific applications or use cases. No support for elastic scaling using YARN.

Obviously, any application that has full control and is YARN native has a significant advantage: it can do very advanced things within Hadoop using the capabilities of YARN.

This distinction matters as the space becomes more interesting and more confusing at the same time. Hadoop vendors like Hortonworks offer both YARN Native and YARN Ready certifications. YARN Ready means that an application can work with, and is limited to, any of the YARN-enabled applications like Hive, whereas YARN Native means full control and fine-grained access to cluster resources.

SnapLogic is YARN native. This means that as data volumes or workloads increase, the SnapLogic Elastic Integration Platform can automatically and elastically scale out, leveraging more nodes in the Hadoop cluster on demand, and as workloads decrease, scale back in automatically. In SnapLogic, this capability is called the Hadooplex. This blog post reviews examples of SnapLogic big data integration pipelines.

This post originally appeared on LinkedIn. Ravi Dharnikota is a Sr. Advisor at SnapLogic, working closely with customers on their big data and cloud reference architectures.