The Strata Data Conference in New York is where thousands of practitioners from cutting-edge companies dive deep into emerging big data technologies and techniques. From hot topics like AI and machine learning to implementing a data strategy, this conference series, now in its seventh year, is a hotbed for new ideas and strategies to tackle the challenges that have emerged in the data field.
SnapLogic, recognized by Gartner as a leader in enterprise-grade application and data integration, provides a serverless, cloud-based runtime environment for the complex, high-volume data transformations behind a wide range of big data use cases. We are sponsoring the Strata conference and will be in the exhibit hall at booth #1415. Visit our booth to get a demo or to sign up for a free trial and you’ll receive a $10 gift card. You will also be entered to win a Sonos Playbar and Sonos One set.
If you’re an integration architect attending this conference, here are four sessions we recommend:
1. Building a large-scale machine learning application using Amazon SageMaker and Spark
David Arpin (Amazon Web Services)
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1A 12/14 Level: Intermediate
Machine learning’s popularity has grown tremendously in recent years, and the drive to integrate it into every solution has never been more pronounced. The path from investigation to model development to implementation in production can be difficult, but Amazon SageMaker, AWS’s new machine learning platform, seeks to make this process easier.
Machine learning starts with data, and Spark is one of the most popular and flexible solutions for handling large datasets for ETL, ad hoc analysis, and advanced machine learning. However, using Spark for production machine learning use cases can create problems: inconsistencies in how algorithms scale, conflicts over cluster resources, and prediction latencies. Offloading training to Amazon SageMaker’s highly scalable algorithms and its distributed, managed training environment, and deploying with SageMaker’s real-time production endpoints, makes implementing machine learning in production easier and more reliable.
This tutorial walks you through building a machine learning application, from data manipulation to algorithm training to deployment to a real-time prediction endpoint, using Spark and Amazon SageMaker.
2. Running multidisciplinary big data workloads in the cloud
Sudhanshu Arora (Cloudera), Tony Wu (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera)
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1E 14 Level: Intermediate
Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long-running. One of the challenges is keeping the data context consistent across these various workloads.
This tutorial uses the Cloudera Altus PaaS offering, powered by Cloudera Altus SDX, to run a variety of big data workloads while managing a shared data experience that stays consistent across all of them. You will:
– Learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows
– Understand the considerations and best practices for data analytics pipelines in the cloud
– Explore sharing metadata across workloads in a big data PaaS
3. Stream processing with Kafka and KSQL
Tim Berglund (Confluent)
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1E 14 Level: Intermediate
Apache Kafka is the de facto standard streaming data processing platform, widely deployed as a messaging system and offering a robust data integration framework (Kafka Connect) and stream processing API (Kafka Streams) to meet the needs that commonly attend real-time message processing. But there’s more!
Kafka now offers KSQL, a declarative, SQL-like stream processing language that lets you define powerful stream-processing applications easily. What once took some moderately sophisticated Java code can now be done at the command line with a familiar and eminently approachable syntax. Come to this talk for an overview of KSQL with live coding on live streaming data.
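For a flavor of the syntax, here is a small, hypothetical example (the stream, topic, and column names are illustrative) of the kind of continuous query KSQL makes possible:

```sql
-- Declare a stream over an existing Kafka topic (hypothetical topic name)
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- A continuously updating, windowed aggregation -- no Java required
CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY user_id;
```

Each statement is entered at the KSQL command line; the second one runs as a persistent stream-processing job that keeps `views_per_user` up to date as new events arrive.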
4. Architecting a next-generation data platform
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1A 06/07 Level: Advanced
Rapid advancements are driving a dramatic evolution in both the storage and processing capabilities of the open source enterprise data software ecosystem. These advancements include projects such as:
- Apache Kudu, a modern columnar data store that complements HDFS and Apache HBase by offering efficient analytical capabilities and fast inserts and updates with Hadoop;
- Apache Kafka, which provides a high-throughput and highly reliable distributed message transport;
- Apache Spark, which is rapidly replacing parallel processing frameworks such as MapReduce due to its efficient design and optimized use of memory. Spark components such as Spark Streaming and Spark SQL provide powerful near real-time processing;
- Distributed storage systems, such as HDFS and Cassandra;
- Parallel query engines such as Apache Impala and CockroachDB, which provide capabilities for highly parallel and concurrent analysis of datasets.
These storage and processing systems provide a powerful platform to implement data processing applications on batch and streaming data. While these advancements are exciting, they also add a new array of tools that architects and developers need to understand when architecting modern data processing solutions.
Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging these components to reliably integrate multiple data sources, perform real-time and batch data processing, reliably store massive volumes of data, and efficiently query and process large datasets. Along the way, they discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time data architectures.
Topics include:
- Accelerating data processing tasks such as ETL and data analytics by building near real-time data pipelines using modern open source data integration and processing components
- Building reliable and efficient data pipelines, starting with source data and ending with fully processed datasets
- Providing users with fast analytics on data using modern storage and query engines
- Leveraging these capabilities along with other tools to provide sophisticated machine learning and analytical capabilities for users
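The near-real-time aggregation at the heart of such pipelines can be sketched, framework-free, in a few lines of Python. This toy tumbling-window count is only a stand-in for what engines like Spark Streaming or KSQL do at scale, over unbounded, distributed streams:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group (timestamp, key) events into fixed-size windows and count per key.

    A framework-free illustration of windowed stream aggregation;
    real engines do this incrementally over unbounded data.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_size) * window_size  # align to window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Simulated click events: (epoch-second timestamp, page)
events = [(0, "home"), (2, "home"), (4, "cart"), (11, "home"), (13, "cart")]
print(tumbling_window_counts(events, window_size=10))
# {0: {'home': 2, 'cart': 1}, 10: {'home': 1, 'cart': 1}}
```

In a production pipeline the events would arrive from a transport like Kafka rather than a list, and the windows would be emitted continuously instead of computed in one batch.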
Don’t forget to visit booth #1415 to get a SnapLogic Enterprise Integration Cloud or eXtreme demo or to sign up for a free trial (and get a $10 gift card!). You will also be entered to win a Sonos Playbar and Sonos One set! See you there!