Executing Spark Pipelines on HDInsight

Microsoft Azure HDInsight is an Apache Hadoop distribution delivered as a cloud service. Internally, HDInsight is built on the Hortonworks Data Platform. HDInsight supports a large set of Apache big data projects, including Spark, Hive, HBase, Storm, Tez, Sqoop, Oozie and many more. The suite of HDInsight projects can be administered via Apache Ambari.

This post lists the steps involved in spinning up an HDInsight cluster, setting up SnapLogic’s Hadooplex on HDInsight, and building and executing a Spark data flow pipeline on HDInsight. We start by spinning up an HDInsight cluster from the Microsoft Azure Portal.

SnapLogic’s Modern Approach to ETL

We all realize by now that corporate data is exploding in volume, velocity and variety. We’ve also witnessed an expansion of new data sources, and the business insights they yield have created demand for data with longer histories. Legacy data transformation systems and the teams behind them are being pushed to their limits in dealing with these challenges.

The concept of extract, load, and transform (ELT) was introduced to alleviate these problems: high volumes of raw data are loaded directly into a data warehouse staging area, where they can be processed and transformed after loading. But now we’re realizing that traditional ETL tools simply aren’t cutting it any longer. Built on point-to-point, row/column architectures, traditional ETL tools are at a crossroads as they struggle with huge volumes of real-time, unstructured and hierarchical data. Let’s face it, folks: traditional ETL solutions are too expensive, they don’t scale, they’re too rigid, and they require too much maintenance.

SnapLogic takes a truly innovative approach to data integration, focused on data streams between applications and disparate data sources, with the flexibility to connect to both cloud and on-premises systems. Delivered as a multi-tenant cloud service with a hybrid data processing engine that scales out, the SnapLogic Elastic Integration Platform combines powerful parallel processing and management capabilities with a simple drag-and-drop Designer. With 300+ prebuilt Snaps, you can build pipelines for data flows from any number of sources to any number of destinations. And because the SnapLogic iPaaS is 100% REST-based, pipelines are abstracted and addressable: usable, consumable, trigger-able and schedule-able as REST calls. This means they can do the job of many traditional static integrations with considerable advantage. Whether the scenario is one-to-one, one-to-many, many-to-one, or many-to-many orchestration, cloud ETL challenges disappear because enterprise-grade scalability, simplicity and reliability promote fast implementations while dramatically lowering costs.
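To make the “trigger-able as REST calls” idea concrete, here is a minimal sketch of invoking a pipeline endpoint over HTTP. The task URL and bearer token below are hypothetical placeholders, not real SnapLogic endpoints; the request is built but not sent.

```python
import urllib.request

# Hypothetical triggered-task URL and token -- substitute the real values
# for your own SnapLogic org and task.
TASK_URL = "https://elastic.snaplogic.com/api/1/rest/slsched/feed/myorg/Demo/OrderSyncTask"
TOKEN = "YOUR_BEARER_TOKEN"

def build_trigger_request(url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that would trigger the pipeline."""
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_trigger_request(TASK_URL, TOKEN)
# urllib.request.urlopen(req)  # would actually fire the pipeline; omitted here
```

Because the pipeline is just an addressable URL, the same call can come from a cron job, another application, or an API gateway.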

But what about real-time analytics and insights? As demand for them grows, it is becoming increasingly vital for data scientists and business analysts to have more control over the entire data lifecycle. Adds, moves and changes in the integration cycle need to be easily orchestrated without eating into the time available for analytics and insights. Accomplishing this requires a modern, agile and simplified approach to ETL/ELT.

Big Data Integration as a Service
Hadoop is a widely adopted distributed processing platform with tools that can extract, transform and load data from high volumes of disparate data sources. But data ingest on Hadoop is a major challenge, typically requiring complex, time-consuming and costly programming and scripting frameworks. Automation, metadata management and ongoing maintenance in Hadoop are equally challenging.

SnapLogic has taken a novel approach to solving the Hadoop ingest challenge by moving the extraction process beyond structured data sets, allowing queries across disparate data types and structures, and then streaming all data as JSON documents. Rather than traditional point-to-point data loading, SnapLogic’s horizontally scalable elastic pipeline provides powerful multi-point, multimodal integration while hiding the underlying complexity of data integration from the user.
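The idea of streaming heterogeneous rows as self-describing JSON documents can be sketched in a few lines. This is a generic illustration of the document model described above, not SnapLogic’s implementation; the function name and sample data are invented for the example.

```python
import csv
import io
import json

def rows_to_json_docs(csv_text):
    """Turn tabular rows into a stream of self-describing JSON documents."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each document carries its own field names, so downstream
        # consumers need no fixed schema.
        yield json.dumps(row)

sample = "id,name,amount\n1,Acme,250\n2,Globex,980\n"
docs = list(rows_to_json_docs(sample))
```

Because every document is self-describing, sources with different shapes can flow through the same pipeline without a rigid row/column contract.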

A big plus in the Hadoop equation is SnapLogic’s version of MapReduce, called SnapReduce. In keeping with SnapLogic’s theme of simplifying everything, SnapReduce lets the data scientist or business analyst run MapReduce operations without special training and without the complexity of YARN-based MapReduce scripts and algorithms. With SnapLogic, the data scientist or business analyst is truly (and fully) empowered. You can think of it as “Hadoop for Humans.”
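For readers unfamiliar with what is being hidden here, this is the map/reduce pattern in miniature: plain Python word counting, not SnapReduce itself, just the kind of per-record mapping and keyed aggregation that MapReduce jobs express.

```python
from collections import Counter
from itertools import chain

def map_phase(doc):
    """Map: emit a (key, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]

def reduce_phase(mapped):
    """Reduce: sum the counts for each key across all mapped pairs."""
    counts = Counter()
    for word, n in mapped:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big pipelines", "data pipelines"]
result = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# result == {"big": 2, "data": 2, "pipelines": 2}
```

On a real cluster the map phase runs in parallel across data blocks and the framework shuffles pairs by key before reducing; tools like SnapReduce aim to spare the analyst from writing this plumbing by hand.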

Randy Hamilton is a Silicon Valley entrepreneur and technologist who writes periodically about industry-related topics including the cloud, big data and IoT. Randy has held a position as Instructor (Open Distributed Systems) at UC Santa Cruz and has enjoyed positions at Basho (Riak NoSQL database), Sun Microsystems, and Outlook Ventures, as well as being one of the founding members and VP of Engineering at Match.com.

Next Steps: