Big Data Management: Doug Henschen Dives into the Data Lake

doug_henschen_constellationYesterday SnapLogic hosted a webinar featuring Doug Henschen from Constellation Research called Democratizing the Data Lake: The State of Big Data Management in the Enterprise. Doug kicked things off by walking through where we were and where we are today with some compelling examples from The Second Machine Age, by Erik Brynjolfsson and Andrew McAfee. When it comes to the power of modern computing, for instance, in 1996 the U. S ASCI Red at Sandia Labs cost $55M, was 1,600 square feet and had 1.8 Teraflops of computing power. In 2006 the Sony PlayStation 3 sold for $499 at the size of 4 x 12 x 10 inches and could handle 1.8 Teraflops of computing power. Amazing! Doug went on to discuss the impact of distributed computing and how software has evolved (think: Kasparov vs. Big Blue compared to the chess game on your laptop today).

Sure, some of these factoids are often discussed, and there’s no shortage of stats about the impact of big data on every industry as well our everyday lives, but what I really liked about Doug’s message was the importance of focusing on what actually drives business value. Use big data to improve analytical insight, but remember that “big data is only part of the digital disruption trend.”

Doug’s presentation reviewed the Hadoop market today, noting that the fastest growing segment is the shift to the cloud. Hadoop has been accepted as the platform standard with growing adoption in the enterprise, but Spark is definitely the accelerator. On the topic of the data lake, Doug made a number of important points:

  • It doesn’t just consist of all new data types. It’s often data that organizations just couldn’t afford to retrain or practically analyze in the past.
  • It’s not a replacement for an enterprise data warehouse – there is still a need for what he calls “industrialized queries against known data.”
  • It’s about integrated new data, with proactive and predictive analytics as a common driver.
  • A cluster can turn into a swamp without a well-ordered infrastructure.

Before diving into the nuts and bolts of the enterprise data lake and reviewing vendors in each category, the conversation focused on specific big data use cases by industry. Specific examples of case studies he’s worked on were shared - from campaign analysis and optimization in digital marketing and advertising, to archiving, to money laundering in financial services, to supply chain optimization in retail, to customer churn initiatives in telecommunications, to claims fraud analysis in insurance.

I encourage you to watch the entire presentation here. Similar to some of the data lake architecture examples and whitepapers we’ve recently shared on the SnapLogic blog, there are a number of solid conclusions about how to think about the data lake relative to your existing data infrastructure. The bottom line? As Doug wrote about on his blog, Hadoop is 10 years old, and as all parents know, it’s important to spend time with your kids, try to mitigate risks and set appropriate limits as they mature. The same is true with your data: know your data, your users and your risks and set the appropriate limits along your maturity curve.

As an industry, we need to democratize Hadoop and simplify efforts to create data lakes. We’re moving to a more cognitive era and data monetization is a hot trend, but the “journey to digital cannot be accomplished without connectedness.” When it comes to your data integration strategy, ensure that it is cloud, services and data enrichment savvy. But think bigger – how will a data lake drive new businesses and data models?

I’d like to thank Doug Henschen and Constellation Research for a great market overview and discussion. There’s a lot more that I didn’t cover in the full presentation, which is available on the SnapLogic website. I’ll leave you with this slide, which summarizes what Hadoop users are saying today:

Constellation Research: Common Data Lake Challenges. What Hadoop users say…