The SnapLogic big data team was at the Spark Summit last week in San Francisco. Around 2,500 people attended this year, and the event featured several high-profile speakers, including Matei Zaharia, the creator of Spark; Jeff Dean of Google; Andrew Ng of Baidu; and representatives from influential tech companies like Amazon, Microsoft, and Intel.
The major buzz at the event was around the 2.0 release of Spark, which continues the trend of building a unified engine, improving the high-level APIs, and integrating broadly with data analysis and machine learning libraries. In Spark 2.0, the new Structured Streaming engine unifies batch and stream processing. The engine supports the same Spark SQL API introduced in previous versions of Spark, along with the query optimizations developed for it. Overall, Spark 2.0 should reduce the cost of development and improve performance, while maintaining backwards compatibility.
- All the talks I attended in the use cases track (Uber, Netflix, Airbnb) described some form of ETL, but no single tool seemed to be preferred. Data ingestion and preparation still appear to be a big pain point for data engineers.
- Everyone talks about “data pipelines”, which fits nicely with SnapLogic’s terminology.
- Parquet is the preferred storage format for big data.
- MapReduce is now considered antiquated (even Doug Cutting agreed), but companies have invested in that infrastructure and training, so it will be sticking around. One hurdle to adopting Spark at Netflix, according to Kurt Brown, was finding developers with Spark experience. This is significant for SnapLogic’s Spark data pipelines and the Hadooplex, which let people start using Spark without any experience with its APIs while reusing the YARN knowledge they gained through MapReduce.