The SnapLogic Hadooplex: Achieving Elastic Scalability Using YARN

YARN, a major advancement in Hadoop 2.0, is a resource manager that separates execution and processing management from the resource management capabilities of MapReduce. Like an operating system on a server, YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform.

Developers are no longer limited to writing multi-pass MapReduce programs, with disadvantages such as high latency, when the job can be modeled better using a directed acyclic graph (DAG) approach.

Any application, including the likes of Spark, can be deployed onto an existing Hadoop cluster, and take advantage of YARN for scheduling and resource allocation. This is also the basic ingredient of a Hadooplex in SnapLogic – to achieve elastic scale out and scale in for integration jobs.

The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor tasks.

SnapLogic’s application master is responsible for negotiating resources with the ResourceManager. The control plane in SnapLogic is the brain (read this post on software defined integration), which holds all critical information and helps make logical decisions for scale out and scale in. The Hadooplex is the actual application itself that runs the workload.

In this diagram you can see that the Hadooplex reports its workload information to the control plane at regular intervals. The application master gets the load information from the control plane, also at regular intervals.

[Diagram: Hadooplex]
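The reporting flow described above can be sketched in a few lines of Python. This is only an illustration, not SnapLogic's implementation: the `ControlPlane` class, its `report` and `average_load` methods, the node names, and the idea of averaging the load across nodes are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ControlPlane:
    """Hypothetical stand-in for the control plane: keeps the latest load per node."""
    loads: dict = field(default_factory=dict)

    def report(self, node_id: str, load: float) -> None:
        # Each Hadooplex node would call this at a regular interval.
        self.loads[node_id] = load

    def average_load(self) -> float:
        # The application master would call this at its own regular interval.
        # Averaging is an assumption; the source does not say how load is combined.
        return mean(self.loads.values()) if self.loads else 0.0


control_plane = ControlPlane()
control_plane.report("node-1", 0.90)
control_plane.report("node-2", 0.70)
control_plane.report("node-1", 0.50)  # a later report replaces the earlier one
print(control_plane.average_load())
```

Keeping only the most recent report per node means the application master always acts on current load rather than on a history of readings.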

As the workload increases, the application master asks the YARN ResourceManager to spin up more Hadooplex nodes, one at a time, as shown in the diagram below. This scale-out continues dynamically until either the workload starts decreasing or the maximum allowed number of Hadooplex nodes has been reached.

[Diagram: Hadooplex nodes]

As the workload decreases, the nodes start spinning down. This is how SnapLogic achieves elastic scaling based on the workload volumes within a Hadoop cluster utilizing the YARN ResourceManager. This is possible only if an application is a native YARN application. (Read about the importance of YARN-native here.)
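To make the scale-out and scale-in behavior concrete, here is a minimal simulation of the decision the application master might make at each interval. It is a sketch under stated assumptions: the function names, the `high` and `low` load thresholds, the minimum of one node, and scaling in one node at a time are all hypothetical. The post only states that nodes are added one at a time up to a maximum and spin down as the workload decreases. No real YARN API is called here.

```python
def next_node_count(current: int, load: float, *, min_nodes: int = 1,
                    max_nodes: int = 4, high: float = 0.8, low: float = 0.3) -> int:
    """Return the node count for the next interval, changing by at most one node."""
    if load > high and current < max_nodes:
        return current + 1  # scale out: one more Hadooplex node would be requested
    if load < low and current > min_nodes:
        return current - 1  # scale in: one Hadooplex node would be released
    return current          # load is in the comfortable band, or a limit was hit


def simulate(loads, start=1, **limits):
    """Apply next_node_count to a sequence of load readings; return the node counts."""
    counts, nodes = [], start
    for load in loads:
        nodes = next_node_count(nodes, load, **limits)
        counts.append(nodes)
    return counts


# Load rises, holds, then falls away (made-up readings for illustration).
history = simulate([0.9, 0.95, 0.9, 0.9, 0.9, 0.5, 0.2, 0.1, 0.1, 0.1])
print(history)
```

With these made-up readings the node count climbs to the assumed maximum of four, holds there while the load stays high, and then steps back down to one as the load drops. In a real YARN-native application, each step up or down would correspond to the application master requesting or releasing a container through the ResourceManager.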

This post originally appeared on LinkedIn.
