Dion Hinchliffe wrote an interesting post on the challenges facing enterprise mashups. Of the 10 items on his list, all very relevant, one stood out to me as far more significant than the others: data quality and accuracy, aka the 'truthiness' problem. Even if every other challenge were solved, I believe the lack of trust in the data alone would prevent widespread adoption of the mashup model in the enterprise.
In the traditional enterprise data warehouse, one of the key characteristics is the concept of trusted data. To make business decisions, data must be accurate and reliable. Accomplishing this goal is one of the reasons a data warehouse is much more than a 'copy' of operational data rearranged into a 'nicer' schema for reporting. And in our (US) corporate world of Sarbanes-Oxley and Gramm-Leach-Bliley, compliance and audits are real concerns for decision makers and executives.
On the warehouse implementation front, almost everyone has stopped writing code by hand and realized they need an ETL tool. Development productivity is one commonly mentioned motivation for using a tool, but let's face it: the replacement of C and COBOL by dynamic languages like Python, Ruby, and PHP has made that particular argument weaker and weaker. But good ETL tools also have another, more important characteristic: they maintain the centralized metadata necessary to validate, control, and audit the data.
Now, switch to the ‘enterprise mashup’ model, applying the principles of the decentralized Web to corporate data. Data from core systems, warehouses, and other trusted sources is being processed through many different layers, in different languages, on different servers. And then, once it’s been processed, it’s republished so it can be re-mixed again! And again!
Nobody wants to make business decisions based on data that has played the telephone game. In order for enterprise mashups to succeed, we must be able to trust and audit the data that is being manipulated.
As data moves further from the core to the edge, it is mixed and remixed, and it becomes increasingly important to maintain its lineage. Access control and security are important, but information about where the data came from and how it was processed is essential.
Solving this problem was one of the core challenges for SnapLogic. We needed to maintain the flexibility and power of the Web’s distributed model, while still maintaining strong metadata to provide lineage and provenance.
Data integration is essentially a data flow problem. In the core of the enterprise, systems are centralized, using a combination of bus and hub architectures. Data flows along buses, and in and out of hubs. Maintaining control of the data flow is a tractable problem.
As you move away from the core, that centralized model no longer exists. Mashups and the Web are a fundamentally distributed model, and the nicely defined data flows turn into a complex graph. The graph itself is defined by whatever mashups have been built, and there will be many mashups. As mashups gain adoption, that graph will increasingly expand to potentially include data from sources that aren’t part of the core at all.
Our solution was to base our core metadata store on a distributed graph using RDF. Every transformation pipeline within SnapLogic, and every field link, is automatically maintained as part of the core metadata store. The user interfaces show pictures, but there's a complex graph hidden behind the scenes.
At the endpoints of pipelines, and for each set of inputs and outputs, this makes it straightforward for us to expose a basic description of the available data set. But it's also possible to drill into an endpoint and find out what it's connected to, where the data came from, and how it was processed.
Even better, since RDF was designed from the beginning for distributed applications, it’s possible to follow the flow across and through SnapLogic servers, avoiding the need for a centralized metadata store; metadata is automatically collected as mashups are built.
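To make the idea concrete, here is a minimal sketch of lineage tracking over a triple graph. This is not SnapLogic's actual schema or API; the pipeline names and the "feeds" predicate are hypothetical, and plain Python tuples stand in for real RDF triples (which a library such as rdflib would manage), but the traversal is the same: walk the graph backwards from a mashup to recover every upstream source.

```python
# Hypothetical lineage graph: (subject, predicate, object) triples,
# the same shape RDF uses. All names here are invented for illustration.
triples = {
    ("crm_extract",      "feeds",            "customer_cleanse"),
    ("customer_cleanse", "feeds",            "sales_mashup"),
    ("orders_feed",      "feeds",            "sales_mashup"),
    ("customer_cleanse", "appliedTransform", "dedupe_by_email"),
}

def upstream(node, graph):
    """Walk 'feeds' edges backwards to find every source of a data set."""
    sources = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p == "feeds" and o == current and s not in sources:
                sources.add(s)
                frontier.append(s)
    return sources

# Full provenance of the mashup: every upstream pipeline and raw source.
print(sorted(upstream("sales_mashup", triples)))
# → ['crm_extract', 'customer_cleanse', 'orders_feed']
```

Because the triples are self-describing, the graph can be split across servers and merged on demand, which is what makes the distributed, no-central-store model workable.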
The rise of RESTful web services and the power of dynamic languages have lowered the barrier to data access and manipulation. But they have also moved a lot of core data processing to distributed locations, with no audit or flow information available. With a model like this, it's not surprising that enterprise IT professionals aren't ready to jump in yet.
Are you playing telephone with your corporate data?