There’s no doubt about it – this has been the summer of Big Data. From Hadoop, to Cloudera and Greenplum distributors, to integration vendors such as ourselves, everyone is focused on improved access, connections, and analysis for the new onslaught of non-traditional data. I recently stumbled across a blog post by Loraine Lawson wherein she interviews IBM’s David Corrigan about what constitutes a Big Data platform for an integration vendor. Her post got me thinking about the evolution of relational databases and how technologies have adapted to suit a world now driven by Big Data/NoSQL.
Although they’re now at the heart of the technology portfolios offered by behemoths such as IBM, Oracle, and Microsoft, once upon a time relational databases were a newfangled invention. Prior to about 1988, everyone knew that mission-critical, enterprise-class data could only be entrusted to mainframes, using serious packages such as IMS and VSAM. But the tandem of relational databases and mini/microcomputers soon took off, and now it’s hard to envision any crucial application using anything but a relational database to hold its data. Yet mainframes and their data storage infrastructure are still very much active today, proof that old technologies continue to supply value.
A similar data management transformation is currently underway. Although relational databases continue their role as the kings of enterprise information storage, the database world has recently been experiencing more change than it has for decades. An entirely new class of post-relational solutions has arisen to address the unique requirements of 21st century applications. These requirements include capturing and managing huge amounts of data, sustaining enormous read/write workloads, and tracking all sorts of fine-grained, sub-transactional metrics. Scientific modeling, click streams, search indexes, and phone logs are just a few examples of where all this new data is coming from.
While relational databases are certainly capable of managing tremendous volumes of information, a host of business and technical realities are spurring architects to look elsewhere. First, when working with these new, non-traditional information sources, some databases have had difficulty delivering results within an acceptable window. Scaling up or out is the traditional way of surmounting these kinds of performance obstacles, but the vendors’ licensing structures often make this cost-prohibitive. Even if money weren’t an issue, it’s becoming clear that the relational model itself may not be the ideal underpinning for managing these non-traditional information sources.
Though there are many diverse examples and implementations of these new, post-relational solutions, collectively they’re known as Big Data/NoSQL. Some are commercial, while others are open source. Despite their relative youth, they’re now supporting some of the largest Web sites, such as Facebook, Yahoo, Google, and eBay. They’re also beginning to gain traction in corporate IT.
Since integration is a key aspect of any enterprise’s portfolio, it’s no surprise that Big Data/NoSQL needs to be part of the equation: these new data sources contain gigantic quantities of valuable information. However, to fully exploit all this new intelligence, it’s critical that it be easily associated with traditional, relational data and then made available to the rest of the organization. For example, a retailer might want to adjust its inventory forecasts (maintained in Oracle) based on what’s learned from e-commerce clickstreams (maintained in a non-relational key-value store such as Amazon SimpleDB). Client/server-era integration software simply won’t work in this new world: it’s too heavy, slow, and tightly coupled to proprietary adapters and APIs.
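To make the retailer example concrete, here is a minimal sketch of what associating the two kinds of data might look like. Everything in it is an assumption for illustration: an in-memory SQLite table stands in for the Oracle forecast data, a plain dictionary stands in for the key-value clickstream store, and the table name, field names, and the flat per-cart-add uplift rule are all hypothetical.

```python
import sqlite3
from collections import Counter

# Stand-in for the relational side (Oracle in the example above).
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forecast (sku TEXT PRIMARY KEY, units INTEGER)")
conn.executemany(
    "INSERT INTO forecast VALUES (?, ?)",
    [("A100", 500), ("B200", 300)],
)

# Stand-in for the key-value side: item name -> attribute dictionary.
# The event shape is an assumption, not an actual clickstream schema.
clickstream = {
    "evt-1": {"sku": "A100", "action": "view"},
    "evt-2": {"sku": "A100", "action": "add_to_cart"},
    "evt-3": {"sku": "B200", "action": "view"},
    "evt-4": {"sku": "A100", "action": "add_to_cart"},
}


def adjusted_forecasts(conn, events, uplift_per_cart_add=10):
    """Raise each SKU's forecast by a fixed amount per add-to-cart event."""
    cart_adds = Counter(
        e["sku"] for e in events.values() if e["action"] == "add_to_cart"
    )
    rows = conn.execute("SELECT sku, units FROM forecast ORDER BY sku")
    # Counter returns 0 for SKUs with no cart adds, so those stay unchanged.
    return {sku: units + uplift_per_cart_add * cart_adds[sku] for sku, units in rows}


result = adjusted_forecasts(conn, clickstream)
```

The point of the sketch is only that the join key (the SKU) lives in both worlds; a real forecast adjustment would use a far richer model than a flat uplift.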
When we designed our architecture, we deliberately avoided hard-wiring our products to relational databases. Instead (as described in Gaurav’s ‘Science Behind Snaps’ blog posting), Snaps abstract away all details about the application or data sources. All Snaps follow the same pattern, use the same APIs, and take advantage of common functions offered by the SnapLogic platform. This approach works equally well with non-relational sources. To prove my point, we offer Snaps for SimpleDB, Google Analytics, and the Hadoop File System. Our strategy lets you create innovative new solutions without having to worry about how to connect disparate information sources, whether they’re relational or post-relational.
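The general idea of hiding source details behind one common pattern can be sketched in a few lines. This is not the SnapLogic API or how Snaps are actually implemented; the class names (Source, SqlSource, KeyValueSource) and the collect helper are hypothetical, and SQLite plus a dictionary again stand in for real relational and key-value systems.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Iterator

Record = Dict[str, object]


class Source(ABC):
    """Hypothetical common interface: every source yields plain records."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class SqlSource(Source):
    """Relational source: runs a query and yields one dict per row."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self.conn = conn
        self.query = query

    def read(self) -> Iterator[Record]:
        cursor = self.conn.execute(self.query)
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))


class KeyValueSource(Source):
    """Key-value source: yields one dict per item, with the key folded in."""

    def __init__(self, items: Dict[str, Record], key_field: str = "id") -> None:
        self.items = items
        self.key_field = key_field

    def read(self) -> Iterator[Record]:
        for key, attrs in self.items.items():
            yield {self.key_field: key, **attrs}


def collect(source: Source) -> list:
    """Downstream code sees only Source, never the underlying store."""
    return list(source.read())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT, name TEXT)")
conn.execute("INSERT INTO product VALUES ('A100', 'Widget')")

sql_rows = collect(SqlSource(conn, "SELECT sku, name FROM product"))
kv_rows = collect(KeyValueSource({"evt-1": {"sku": "A100"}}))
```

Because both sources produce the same record shape, anything written against collect works unchanged whether the data is relational or post-relational.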
Interested in keeping the big data discussion going? Don’t forget to swing by next month’s Hadoop meetup at USF on August 10 from 6-8pm.