The data journey: From the data warehouse to data marts to data lakes

By Mark Gibbs

Published September 19, 2018

With data increasingly recognized as the corporate currency of the digital age, new questions are being raised as to how that data should be collected, managed, and leveraged as part of an overall enterprise data architecture.

Data warehouses: Model of choice

For the last few decades, data warehouses have been the model of choice, used by enterprises to extract structured data from operational systems like enterprise resource planning (ERP) and supply chain management (SCM) platforms. Enterprises have consolidated and centralized the data, and have leveraged business intelligence and decision support tools to do in-depth, historical reporting and analysis. While the data warehouse serves as a centralized, multi-purpose repository under the control and care of IT, data marts surfaced as a subset of the technology built to address the specific reporting needs of a particular department or business function. Data warehouses are built with a top-down approach and store detailed, structured data, whereas data marts usually emanate from the bottom up with the purpose of housing a summarized form of select data.
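To make the warehouse/mart relationship concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (warehouse_orders, sales_mart), the columns, and the sample rows are all hypothetical, and an in-memory SQLite database merely stands in for a real warehouse platform. It shows a mart as a summarized, department-specific subset derived from detailed warehouse data.

```python
import sqlite3

# An in-memory database stands in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Detailed, structured data as a warehouse might hold it (hypothetical schema).
conn.execute(
    "CREATE TABLE warehouse_orders ("
    "order_id INTEGER PRIMARY KEY, department TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_orders VALUES (?, ?, ?, ?)",
    [
        (1, "sales", "east", 120.0),
        (2, "sales", "west", 80.0),
        (3, "sales", "east", 30.0),
        (4, "support", "east", 15.0),
    ],
)

# The "data mart": a summarized subset built for one department's reporting needs.
conn.execute(
    """
    CREATE TABLE sales_mart AS
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM warehouse_orders
    WHERE department = 'sales'
    GROUP BY region
    """
)

mart_rows = conn.execute(
    "SELECT region, order_count, total_amount FROM sales_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

The mart keeps only the sales rows and only the aggregates, which is why it is cheaper to query but cannot answer questions that need the underlying order detail.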

Both approaches have co-existed and enjoyed success for years. But the advent of bigger, more varied data – including unstructured information such as weblogs, images, video, direct messages, and the near endless stream of real-time Internet of Things (IoT) data – presents challenges that the traditional data warehouse/data mart architectures simply aren’t equipped to handle. Also, the centralized vision for a single data warehouse repository never fully materialized, leaving most organizations with a smattering of data silos, which can impede effective decision making.

A shift to data lakes

Research by Vanson Bourne found that disconnected data, propagated by legacy systems and outdated data architectures, is costing companies big time. According to the survey of IT leaders and business users, organizations in the United States and the United Kingdom are losing $140 billion annually in wasted time and resources, duplication of effort, and missed opportunities because of disconnected data. More than half of the respondents (56 percent) said that data silos were a barrier to meeting their organization’s business objectives.

Enter the data lake, the latest rendition of a centralized platform for collecting and processing data, this time with a flat, schema-less architecture typically built around Hadoop and tuned for general-purpose data processing. Like a data warehouse, the data lake can store data from varied sources, but unlike a warehouse, it doesn’t require the data to be cleaned and transformed during the acquisition process. The lack of structure and pre-defined schema gives the data lake more versatility, making it well suited for data discovery and a broader array of analytics use cases. Moreover, a data lake is capable of ingesting and processing data in real time, which is more in keeping with the immediacy of today’s digital business applications.
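The idea of storing data without a pre-defined schema and imposing structure only when it is read can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Hadoop or any particular data lake product works: the sample events, the record types, and the read_with_schema helper are all hypothetical.

```python
import json

# Hypothetical raw events as they might land in a data lake: stored as-is,
# with no cleaning and no shared schema enforced at write time.
RAW_EVENTS = [
    '{"type": "weblog", "path": "/pricing", "status": 200}',
    '{"type": "iot", "device_id": "sensor-7", "temp_c": 21.5}',
    '{"type": "weblog", "path": "/missing", "status": 404}',
    '{"type": "iot", "device_id": "sensor-9"}',
]


def read_with_schema(raw_lines, record_type, fields):
    """Apply a schema at read time: keep one record type, project chosen fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("type") != record_type:
            continue
        # Fields absent from a record become None rather than blocking ingestion.
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two consumers read the same raw store with two different schemas.
weblogs = read_with_schema(RAW_EVENTS, "weblog", ["path", "status"])
sensors = read_with_schema(RAW_EVENTS, "iot", ["device_id", "temp_c"])
print(weblogs)
print(sensors)
```

Because nothing is rejected at write time, the incomplete sensor-9 record is still stored; the cost is that each reader has to decide how to handle gaps such as its missing temperature.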

Complementary, not a replacement

While some pitch the data lake as a replacement for the data warehouse, many data management experts don’t see it that way. Rather, they see the two technologies as complementary, each serving its own use case. For example, the data warehouse is well suited for business users who need to work with pre-aggregated and pre-integrated information targeted for historical analytics applications. Data lakes, on the other hand, are good for data scientists and others who want to work with raw data, perhaps to build machine learning-based models, and who need rapid discovery, exploration, and testing – processes related to the new generation of prescriptive and predictive analytics.

When planning for a data lake, one thing is clear: Organizations need to map out a new architecture and invest in tools that will enable integration and support end-to-end processing, including data acquisition, data transformation, and data access. With such an infrastructure in place, organizations can move forward with the next-generation, data-driven applications that will be the engine behind digital business success.