The SnapLogic big data team was at the Spark Summit last week in San Francisco. This year's event drew around 2,500 attendees and featured several high-profile speakers, including Matei Zaharia, the creator of Spark; Jeff Dean of Google; Andrew Ng of Baidu; and representatives from influential tech companies such as Amazon, Microsoft, and Intel.
The major buzz at the event was the 2.0 release of Spark, which continues the trend of building a unified engine, improving the high-level APIs, and integrating broadly with data analysis and machine learning libraries. In Spark 2.0, the Structured Streaming engine unifies batch and stream processing. The new engine exposes the same Spark SQL API introduced in earlier versions of Spark and still benefits from the query optimizations developed for it. Overall, Spark 2.0 should reduce the cost of development and improve performance, while maintaining backwards compatibility.
A big push at the conference was the release of a “community edition” that lets people learn and start building Spark applications for free. There were many good demonstrations of it, including one by Databricks. It looks like a great place to get started with Spark, since it removes much of the operational complexity and has many learning resources built in.
One of the more exciting messages of the conference is that several traditionally “hard” artificial intelligence (AI) problems, like speech recognition, image processing, and unstructured problem solving, have had important breakthroughs recently. Andrew Ng of Baidu described the AI challenge as similar to space flight: building a rocket requires the right balance of an engine and fuel, just as success in AI requires the right balance of sophisticated machine-learning models and ample amounts of data. The mood at Spark Summit was optimistic that these advances would kick off an “intelligence revolution” as impactful as the industrial revolution was to the 20th century.
A couple more observations from the event:
All the talks I attended in the use-cases track (Uber, Netflix, Airbnb) described some form of ETL, but no single tool seemed to be preferred. Data ingestion and preparation still seem like a big pain point for data engineers.
Everyone talks about “data pipelines”, which fits nicely with SnapLogic’s terminology.
Parquet is the preferred format for big data storage.
MapReduce is now considered antiquated; even Doug Cutting agreed. But companies have invested in that infrastructure and training, so it will stick around. One hurdle to adopting Spark at Netflix (according to Kurt Brown) was finding developers with Spark experience. This is significant for SnapLogic’s Spark data pipelines and the Hadooplex, which let people start using Spark without prior experience with its APIs while reusing the YARN knowledge they gained through MapReduce.
Overall it was a great event for understanding where Spark is headed and how people are using it. It was also a good sounding board for SnapLogic’s big data integration focus: investing in Parquet, Spark, IoT, and streaming all seem to be in alignment with the community. We look forward to applying this experience to the work we’re doing at SnapLogic.
If you are interested in learning more about how SnapLogic works with Spark or big data, visit our video page to watch engaging webinars and SnapLogic demonstrations. We are also looking for Sr. Big Data Developers, so join our Big Data team!