Empty or Full: What Lies Beneath the Data Lake
With Hadoop Summit this week in San Jose and so many opinions (and survey results) being shared about whether the data lake and Hadoop are half full or half empty, I thought I’d repost an article I wrote that was first published on the Datanami site a few weeks ago. But first, a few of the half full/empty posts I’m referring to:
- The Data Lake – Half Full or Half Empty
- Enterprise Hadoop Adoption: Half Empty or Half Full?
- Hadoop Adoption – Is the Cluster Half Full?
I’ll be at the Hadoop Summit this week with the SnapLogic Team (details here) and would love to explore these and other big data topics. Here’s my Datanami post: What Lies Beneath the Data Lake. Please let me know if you have feedback.
Hadoop and the data lake represent a potential business breakthrough for enterprise big data goals, yet beneath the surface lies the murky reality of data chaos.
In big data circles, the “data lake” is one of the top buzzwords today. The premise: companies can collect and store massive volumes of data from the Web, sensors, devices and traditional systems, and easily ingest it in one place for analysis.
The data lake is a strategy from which business-changing big data projects can begin, revealing potential for new types of real-time analyses which have long been a mere fantasy. From connecting more meaningfully with customers while they’re on your site to optimizing pricing and inventory mix on-the-fly to designing smart products, executives are tapping their feet waiting for IT to deliver on the promise.
Until recently, though, even large companies couldn’t afford to keep scaling traditional data warehouse technologies to match the growing surge of data from across the Web. Maintaining a massive repository that cost-effectively holds terabytes of raw data from machines and websites, alongside traditional structured data, was technologically and economically impractical until Hadoop came along.
Hadoop, in its many iterations, has become a way to at last manage and merge these unlimited data types, unhindered by the rigid confines of relational database technology. The feasibility of an enterprise data lake has swiftly improved, thanks to Hadoop’s massive community of developers and vendor partners that are working valiantly to make it more enterprise friendly and secure.
Yet with the relative affordability and flexibility of this data lake comes a host of other problems: an environment where data is not organized or easily manageable, rife with quality problems and unable to quickly deliver business value. The worst-case scenario is that all that comes from the big data movement is data hoarding – companies will have stored petabytes of data, never to be used, eventually forgotten and someday deleted. This outcome is doubtful, given the growing investment in data discovery, visualization, predictive analytics and data scientists.
For now, there are several issues to be resolved to make the data lake clear and beautiful—rather than a polluted place where no one wants to swim.
Poor Data Quality
This one’s been debated for a while, and of course, it’s not a big data problem alone. Yet it’s one glaring reason why many enterprises are still buying and maintaining Oracle and Teradata systems, even alongside their Hadoop deployments. Relational databases are superb for maintaining data in structures that allow for rapid reporting, protection, and auditing. DBAs can ensure data is in good shape before it gets into the system. And, since such systems typically deal only with structured data in the first place, the challenge for data quality is not as vast.
In Hadoop, however, it’s a free-for-all: typically no one’s monitoring anything in a standard way, and data is being ingested raw and ad hoc from log files, devices, sensors and social media feeds, among other unconventional sources. Duplicate and conflicting data sets are not uncommon in Hadoop. There’s been some effort by new vendors to develop tools that incorporate machine learning for improved filtering and data preparation. Yet companies also need a foundation of people—skilled Hadoop technicians—and process to attack the data quality challenge.
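In practice, even a simple first-occurrence filter applied at ingestion time catches a surprising share of those duplicate records. Here’s a minimal sketch in plain Python (the record fields are illustrative assumptions, not a prescribed schema):

```python
def dedupe_records(records):
    """Keep only the first occurrence of each (timestamp, host, message) tuple.

    Field names here are hypothetical examples of what a raw web log
    record might contain; real pipelines would key on whatever fields
    identify a logical event in their data.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["timestamp"], rec["host"], rec["message"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

raw = [
    {"timestamp": "2015-06-09T10:00:00", "host": "web01", "message": "GET /"},
    {"timestamp": "2015-06-09T10:00:00", "host": "web01", "message": "GET /"},  # duplicate
    {"timestamp": "2015-06-09T10:00:01", "host": "web02", "message": "GET /pricing"},
]
print(len(dedupe_records(raw)))  # 2
```

At big data scale this same idea runs as a distributed job rather than a loop, but the point stands: without even this basic hygiene, duplicates flow straight into the lake.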
Lack of Governance
Closely related to the quality issue is data governance. Hadoop’s flexible file system is also its downside. You can import endless data types into it, but making sense of the data later on isn’t easy. There have also been plenty of concerns about securing data (specifically, controlling access) within Hadoop. Another challenge is that there are no standard toolsets yet for importing data into Hadoop and extracting it later. This is a Wild West environment, which can lead to compliance problems as well as slow business impact.
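In the absence of standard tooling, one lightweight convention many teams adopt is to partition raw data by source and ingest date, so files can at least be located, audited, and expired later. A sketch of such a path-building helper (the layout below is an assumption, not a Hadoop standard):

```python
from datetime import date

def raw_landing_path(source, ingest_date, base="/data/raw"):
    """Build a partitioned, HDFS-style landing path for a raw data set.

    The base directory and source/date layout are hypothetical conventions;
    each organization would define its own.
    """
    return "{base}/{source}/{d:%Y/%m/%d}".format(
        base=base, source=source, d=ingest_date
    )

print(raw_landing_path("clickstream", date(2015, 6, 9)))
# /data/raw/clickstream/2015/06/09
```

A convention like this doesn’t replace real governance, but it makes later questions—what came from where, and when—answerable at all.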
To address the problem, industry initiatives have appeared, including the Hortonworks-sponsored Data Governance Initiative. The goal of DGI is to create a centralized approach to data governance by offering “fast, flexible and powerful metadata services, deep audit store and an advanced policy rules engine.” These efforts among others will help bring maturity to big data platforms and enable companies to experiment with new analytics programs.
Lack of Skills
In a recent survey of enterprise IT leaders conducted by TechValidate and SnapLogic, the top barrier to big data ROI cited by participants was a lack of skills and resources. Even today, specialists skilled in Hadoop remain relatively scarce. This means that while the data lake can be a treasure chest, it’s one that is still somewhat under lock and key. Companies will need to invest in hiring and training individuals who can serve as so-called “data lake administrators.” These data management experts have experience managing and working with Hadoop files and possess in-depth knowledge of the business and its various systems and data sources that will interact with Hadoop.
Transforming the data lake into a business strategy that benefits customers, revenue growth and innovation is going to be a long journey. Aside from adding process and management tools, as discussed above, companies will need to determine how to integrate old and new technologies. More than half of the IT leaders surveyed by TechValidate indicated that they weren’t sure how they were going to integrate big data investments with their existing data management infrastructure in the next few years. Participants also noted that the top big data investments they would be making in the near term are analytics and integration tools.
We’re confident that innovation will continue rapidly for new Big Data-friendly integration and management platforms, but there’s also a need to apply a different lens to the data lake. It’s time to think about how to apply processes, controls and management tools to this new environment, yet without weakening what makes the data lake such a powerful and flexible tool for exploration and delivering novel business insights.
For more information about SnapLogic big data integration visit www.snaplogic.com/bigdata. Please be sure to also take a minute to complete the Hadoop Maturity Survey for a chance to win an Amazon Gift Card.