Will the Cloud Save Big Data?

This article was originally published on ITProPortal.

Employees up and down the value chain are eager to dive into big data, hunting for golden nuggets of intelligence to help them make smarter decisions, grow customer relationships and improve business efficiency. To do this, they’ve been faced with a dizzying array of technologies – from open source projects to commercial software products – as they try to wrestle big data to the ground.

Today, much of the headline attention and momentum centers on some combination of Hadoop, Spark and Redshift – all of which can be springboards for big data work. It’s important to step back, though, and look at where we are in big data’s evolution.

In many ways, big data is in the midst of transition. Hadoop is hitting its pre-teen years, having launched in April 2006 as an official Apache project – and then taking the software world by storm as a framework for distributed storage and processing of data, based on commodity hardware. Apache Spark is now hitting its stride as a “lightning fast” engine for large-scale data processing. And various cloud data warehousing and analytics platforms are emerging, from big names (Amazon Redshift, Microsoft Azure HDInsight and Google BigQuery) to upstart players like Snowflake, Qubole and Confluent.

The challenge is that most big data progress over the past decade has been limited to big companies with big engineering and data science teams. The systems are often complex, immature, hard to manage and change frequently – which might be fine if you’re in Silicon Valley, but doesn’t play well in the rest of the world. What if you’re a consumer goods company like Clorox, or a midsize bank in the Midwest, or a large telco in Australia? Can this be done without deploying 100 Java engineers who know the technology inside and out?

At the end of the day, most companies just want better data and faster answers – they don’t want the technology headaches that come along with it. Fortunately, the “mega trend” of big data is now colliding with another mega trend: cloud computing. While Hadoop and other big data platforms have been maturing slowly, the cloud ecosystem has been maturing more quickly – and the cloud can now help fix a lot of what has hindered big data’s progress.

The problems customers have encountered with on-premises Hadoop are often the same problems that were faced with on-premises legacy systems: there simply aren’t enough of the right people to get everything done. Companies want cutting-edge capabilities, but they don’t want to deal with bugs and broken integrations and rapidly changing versions. Plus, consumption models are changing – we want to consume data, storage and compute on demand. We don’t want to overbuy. We want access to infrastructure when and how we want it, with just as much as we need but no more.

Big Data’s Tipping Point is in the Cloud

In short, the tipping point for big data is about to happen – and it will happen via the cloud. The first wave of “big data via the cloud” was simple: companies like Cloudera put their software on Amazon. But what’s “truly cloud” is not having to manage Hadoop or Spark – moving the complexity back into a hosted infrastructure, so someone else manages it for you. To that end, Amazon, Microsoft and Google now deliver “managed Hadoop” and “managed Spark” – you just worry about the data you have, the questions you have and the answers you want. No need to spin up a cluster, research new products or worry about version management. Just load your data and start processing.

There are three significant and not always obvious benefits to managing big data via the cloud: 1) Predictability – the infrastructure and management burden shifts to cloud providers, and you simply consume services that you can scale up or down as needed; 2) Economics – unlike on-premises Hadoop, where compute and storage were intermingled, the cloud separates compute and storage so you can provision accordingly and benefit from commodity economics; and 3) Innovation – new software, infrastructure and best practices will be deployed continuously by cloud providers, so you can take full advantage without all the upfront time and cost.

Of course, there’s still plenty of hard work to do, but it’s more focused on the data and the business, and not the infrastructure. The great news for mainstream customers (well beyond Silicon Valley) is that another mega-trend is kicking in to revolutionize data integration and data consumption – and that’s the move to self-service. Thanks to new tools and platforms, “self-service integration” is making it fast and easy to create automated data pipelines with no coding, and “self-service analytics” is making it easy for analysts and business users to manipulate data without IT intervention.

All told, these trends are driving a democratization of data that’s very exciting – one that will have significant impact across horizontal functions and vertical industries. Data is thus becoming a more fluid, dynamic and accessible resource for all organizations. IT no longer holds the keys to the kingdom – and developers no longer control the workflow. Just in the nick of time, too, as the volume and velocity of data from digital and social media, mobile tools and edge devices threaten to overwhelm us all. Once the full promise of the Internet of Things, Artificial Intelligence and Machine Learning begins to take hold, the flood of data will only grow.

The only remaining question: What do you want to do with your data?

Ravi Dharnikota is the Chief Enterprise Architect at SnapLogic. 

The 3 A’s of Enterprise Integration

This post originally appeared on Data Informed.

As organizations look to increase their agility, IT and lines of business need to connect faster. Companies need to adopt cloud applications more quickly and they need to be able to access and analyze all their data, whether from a legacy data warehouse, a new SaaS application, or an unstructured data source such as social media. In short, a unified integration platform has become a critical requirement for most enterprises.

According to Gartner, “unnecessarily segregated application and data integration efforts lead to counterproductive practices and escalating deployment costs.”

Don’t let your organization get caught in that trap. Whether you are evaluating what you already have or shopping for something completely new, you should measure any platform by how well it addresses the “three A’s” of integration: Anything, Anytime, Anywhere.

Collaborations in Building Hybrid Cloud Computing and Data Integrations

Post first published by Ravi Dharnikota on LinkedIn.

It’s one thing to create application and data integrations; it’s an even bigger challenge to collaborate with other teams in the enterprise to reuse, repurpose and standardize on what has already been built.

Seamless content collaboration is a key ingredient for overall success in app and data integrations, just as it is in app development and delivery. A platform that allows for easy sharing of information between employees is the difference between a platform being adopted throughout the enterprise and becoming shelf-ware.

The SnapLogic Hadooplex in Action

I recently wrote about how SnapLogic’s Hadooplex achieves elastic scalability, running as a native YARN application in a Hadoop cluster. As I noted in the post:

“As the workload increases, the application master requests the YARN ResourceManager to spin up more Hadooplex nodes one at a time as shown in the diagram below. This scale out occurs dynamically until either the workload starts decreasing or the maximum number of Hadooplex nodes allowed has been reached.

As the workload decreases, the nodes start spinning down. This is how SnapLogic achieves elastic scaling based on the workload volumes within a Hadoop cluster utilizing the YARN ResourceManager. This is possible only if an application is a native YARN application.”

I wanted to take this further by showing what this looks like in a SnapLogic Elastic Integration Platform demonstration. In this demo, you can see how the Hadooplex, the run-time execution engine, elastically scales depending on the workload.

You can read more about SnapLogic big data processing platforms in this paper and check out more SnapLogic demonstrations here. Be sure to also check out our upcoming webinar with Mark Madsen, which will focus on the new reference architecture for the enterprise data lake.

The SnapLogic Hadooplex: Achieving Elastic Scalability Using YARN

YARN, a major advancement in Hadoop 2.0, is a resource manager that separates out the execution and processing management from the resource management capabilities of MapReduce. Like an operating system on a server, YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform.

Developers are no longer limited to writing multi-pass MapReduce programs with disadvantages like high latency, when a better option can be modeled using a directed acyclic graph (DAG) approach.
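
To make the contrast concrete, here is a small, hypothetical Spark pipeline written in Java: several chained transformations that the engine plans and executes as a single DAG of stages rather than as a series of separate MapReduce passes. The input and output paths and the class name are placeholders for illustration.

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DagPipeline {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("dag-pipeline").getOrCreate();

        // Read, tokenize, aggregate and sort in one logical pipeline.
        // Spark plans these chained transformations as a single DAG of stages,
        // not as separate multi-pass MapReduce jobs.
        Dataset<String> lines = spark.read().textFile("hdfs:///data/events");   // placeholder path
        Dataset<Row> topWords = lines
            .flatMap((FlatMapFunction<String, String>) line ->
                     Arrays.asList(line.split("\\s+")).iterator(), Encoders.STRING())
            .groupBy("value")   // the single column of a Dataset<String> is named "value"
            .count()
            .orderBy(col("count").desc());

        topWords.write().parquet("hdfs:///out/top_words");                      // placeholder path
        spark.stop();
    }
}
```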

Any application, including the likes of Spark, can be deployed onto an existing Hadoop cluster, and take advantage of YARN for scheduling and resource allocation. This is also the basic ingredient of a Hadooplex in SnapLogic – to achieve elastic scale out and scale in for integration jobs.

The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks.
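
As a rough illustration of that division of labor, here is a minimal sketch of an ApplicationMaster built on Hadoop’s standard AMRMClient API: it registers with the ResourceManager, asks for a container, and would normally run a heartbeat loop to receive allocated containers and launch work on them via the NodeManagers. The class name and resource sizes are illustrative only, not taken from any particular product.

```java
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalApplicationMaster {
    public static void main(String[] args) throws Exception {
        // Client the ApplicationMaster uses to talk to the ResourceManager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this ApplicationMaster with the ResourceManager.
        rmClient.registerApplicationMaster("", 0, "");

        // Negotiate resources: ask for one container with 2 GB and 1 vcore.
        Resource capability = Resource.newInstance(2048, 1);
        rmClient.addContainerRequest(
            new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // A real ApplicationMaster would now run a heartbeat loop:
        // rmClient.allocate(progress) returns newly allocated containers,
        // which it launches and monitors through the NodeManagers (NMClient).

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
    }
}
```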

SnapLogic’s application master is responsible for negotiating resources with the ResourceManager. The control plane in SnapLogic is the brain (read this post on software defined integration), which holds all critical information and helps make logical decisions for scale out and scale in. The Hadooplex is the actual application itself that runs the workload.

In this diagram you can see that the Hadooplex reports its workload information to the control plane at regular intervals. The application master gets the load information from the control plane, also at regular intervals.

[Diagram: Hadooplex nodes reporting workload to the SnapLogic control plane]

As the workload increases, the application master requests the YARN ResourceManager to spin up more Hadooplex nodes one at a time as shown in the diagram below. This scale out occurs dynamically until either the workload starts decreasing or the maximum number of Hadooplex nodes allowed has been reached.

[Diagram: Hadooplex nodes scaling out as the workload increases]

As the workload decreases, the nodes start spinning down. This is how SnapLogic achieves elastic scaling based on the workload volumes within a Hadoop cluster utilizing the YARN ResourceManager. This is possible only if an application is a native YARN application. (Read about the importance of YARN-native here.)
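
A hypothetical sketch of that scaling loop, written against the standard AMRMClient API, might look like the following. The ControlPlaneClient interface, the workload metric and the thresholds are assumptions for illustration, not SnapLogic’s actual implementation.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ElasticScalingLoop {
    private static final int MAX_NODES = 10;          // assumed upper bound on Hadooplex nodes

    private final AMRMClient<ContainerRequest> rmClient;
    private final ControlPlaneClient controlPlane;    // hypothetical control-plane API
    private final List<Container> runningNodes;

    ElasticScalingLoop(AMRMClient<ContainerRequest> rmClient,
                       ControlPlaneClient controlPlane,
                       List<Container> runningNodes) {
        this.rmClient = rmClient;
        this.controlPlane = controlPlane;
        this.runningNodes = runningNodes;
    }

    // Called at regular intervals by the application master.
    void heartbeat() throws Exception {
        // The control plane aggregates the workload reported by the Hadooplex nodes.
        int pendingWork = controlPlane.getPendingWorkload();

        if (pendingWork > 0 && runningNodes.size() < MAX_NODES) {
            // Scale out: request one more node from YARN.
            rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(4096, 2), null, null, Priority.newInstance(0)));
        } else if (pendingWork == 0 && !runningNodes.isEmpty()) {
            // Scale in: release an idle node back to the cluster.
            Container idle = runningNodes.remove(runningNodes.size() - 1);
            rmClient.releaseAssignedContainer(idle.getId());
        }

        // Heartbeat to the ResourceManager; newly granted containers come back here.
        AllocateResponse response = rmClient.allocate(0.0f);
        runningNodes.addAll(response.getAllocatedContainers());
    }

    interface ControlPlaneClient {                     // hypothetical
        int getPendingWorkload();
    }
}
```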

This post originally appeared on LinkedIn.

Various Flavors of Yarn Integrations

YARN is the prerequisite for Enterprise Hadoop. It provides resource management across Hadoop clusters and extends the power of Hadoop to new technologies so that they can take advantage of cost-effective, linear-scale storage and processing. It provides ISVs and developers with a consistent framework for writing data access applications that run IN Hadoop.

Customers building a data lake expect to operate on the data without moving it to other systems, leveraging the processing resources of the data lake. Applications that use YARN fulfill that promise, lowering operational costs while improving quality and time-to-insight.

Integration with YARN

To harness the power of YARN, a third-party application can either use YARN natively or use a YARN framework (Apache Tez, Apache Slider, etc.); if it does not use YARN at all, it most likely reads directly from HDFS.

There are three broad options for integrating with YARN:

  1. Full control (YARN-native): Fine-grained control of cluster resources, which allows elastic scaling.
  2. Interaction through an existing YARN framework such as MapReduce: Limited to a single mode (batch, interactive, or real-time), with no support for elastic scaling through YARN.
  3. Interaction with applications already running on a YARN framework, such as Hive: Limited to very specific applications or use cases, with no support for elastic scaling through YARN.

Any application that has full control and is YARN-native clearly has a significant advantage: it can do very advanced things within Hadoop using the full capabilities of YARN.
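
For a sense of what that full control looks like from the client side, here is a minimal, hypothetical sketch that uses Hadoop’s standard YarnClient API to submit a YARN-native application; from that point on, the application’s own ApplicationMaster negotiates containers directly with the ResourceManager. The application name and launch command are placeholders.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitYarnNativeApp {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-yarn-native-app");         // placeholder name

        // Describe how to launch the ApplicationMaster container.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
            "$JAVA_HOME/bin/java MinimalApplicationMaster"            // placeholder command
                + " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"));
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1));        // size of the AM container

        // Submit; the ApplicationMaster then negotiates its own containers with YARN.
        yarnClient.submitApplication(appContext);
    }
}
```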

This distinction matters as the space becomes more interesting and more confusing at the same time. Hadoop vendors like Hortonworks offer both YARN Native and YARN Ready certifications. YARN Ready means that an application can work with, and is limited to, the YARN-enabled applications like Hive, whereas YARN Native means full control and fine-grained access to cluster resources.

SnapLogic is YARN Native. This means that as data volumes or workloads increase, the SnapLogic Elastic Integration Platform can automatically and elastically scale out, leveraging more nodes in the Hadoop cluster on demand, and as those workloads decrease, scale back in automatically. In SnapLogic, this capability is called the Hadooplex. This blog post reviews examples of SnapLogic big data integration pipelines.
[Diagram: SnapLogic Hadooplex]

This post originally appeared on LinkedIn. Ravi Dharnikota is a Sr. Advisor at SnapLogic, working closely with customers on their big data and cloud reference architectures.