Executing Spark Pipelines on HDInsight

Microsoft Azure HDInsight is an Apache Hadoop distribution powered by the cloud. Internally, HDInsight leverages the Hortonworks Data Platform. HDInsight supports a large set of Apache big data projects, such as Spark, Hive, HBase, Storm, Tez, Sqoop, Oozie and many more. The suite of HDInsight projects can be administered via Apache Ambari.

This post lists the steps involved in spinning up an HDInsight cluster, setting up SnapLogic’s Hadooplex on HDInsight, and building and executing a Spark data flow pipeline on HDInsight. We start by spinning up an HDInsight cluster from the Microsoft Azure Portal.

Machine Learning for the Enterprise, Part 3: Building the Pipeline

In the last post we went into some detail about anomaly detectors, and showed how some simple models would work. Now we are going to build a pipeline to do streaming anomaly detection.

We are going to use a triggered pipeline for this task. A triggered pipeline is instantiated whenever a request comes in. Instantiation can take a couple of seconds, so triggered pipelines are not recommended for low-latency or high-traffic situations. If requests arrive more often than once every few seconds, or we need lower latency, we should use an Ultra pipeline instead. An Ultra pipeline stays running, so its input-to-output latency is significantly lower.
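To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not SnapLogic code: the function names, the 2-second startup figure (our reading of "a couple of seconds"), and the 50 ms per-request processing time are all assumptions chosen for illustration, and requests are assumed to be handled one after another.

```python
# Illustrative latency model only. All numbers are assumptions, not measurements,
# and requests are assumed to be processed sequentially.

def triggered_total_seconds(requests: int, startup_s: float, work_s: float) -> float:
    """A triggered pipeline is instantiated per request, so startup is paid every time."""
    return requests * (startup_s + work_s)


def ultra_total_seconds(requests: int, startup_s: float, work_s: float) -> float:
    """An Ultra pipeline stays running, so startup is paid only once."""
    return startup_s + requests * work_s


if __name__ == "__main__":
    startup_s = 2.0  # assumed instantiation time ("a couple of seconds")
    work_s = 0.05    # assumed processing time per request
    for n in (1, 100, 10_000):
        t = triggered_total_seconds(n, startup_s, work_s)
        u = ultra_total_seconds(n, startup_s, work_s)
        print(f"{n} requests: triggered ~{t:.1f}s, ultra ~{u:.1f}s")
```

For a single occasional request the two are nearly identical under this model, which is why a triggered pipeline is a reasonable choice here. The gap only grows as the request count goes up.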

For the purpose of this post, we’re going to assume we have an Anomaly-Detector-as-a-Service Snap. In the next post, we’ll show how to create that Snap using Azure ML. Our pipeline will look like this:

Final Pipeline
