A few days ago, an analyst at the Gartner Group published an interesting take on the “Data Lake” concept in a blog post and concluded that it was essentially a “swamp” (among other murky metaphors).
So, why the hype then?
The idea that you can store all your data without doing anything and retrieve it whenever you want can be appealing, especially to business people who don’t have to deal with data integration, data quality, and data governance in general. They just want to see data in certain formats.
For technologists like us, especially folks who have used the Data Vault, it’s also appealing because you can perpetually store your raw data in the format it came in – something that was previously much more expensive to do.
But it also makes us complacent and prevents us from seeing the forest for the trees.
No matter how much data you have, you eventually have to consolidate it, work on it, apply business rules, and present it to the business users to help them make their business decisions.
The current incarnation of the Data Lake is “dump everything here.” It works very well as a staging area, and lots of corporations are using Apache Hadoop for staging data – a purpose it suits rather well. The Gartner version has a slight variation and appears to point to pulling data from everywhere (including your EDW) into Hadoop for data scientists to analyze. This is ALSO a very good use case and a common way to leverage Hadoop for data scientists (people who have the skills to analyze data using technologies like data mining).
So what’s the problem?
There are a few, especially when you’re looking at process optimization on the flow of data from source to business user.
You can’t get away from transforming data into business-friendly formats. This can be done by the Hadoop team and delivered to the business user, but it suffers from a very old-school paradigm in which the onus of data delivery fell on programmers in a data processing department. There is very little reuse of routines and zero visibility into how things came about. Things like traceability and data lineage were not even an afterthought. This paradigm is quite popular amongst the Hadoop crowd, where people are happy to be “map/reduce programmers” who end up playing requirements-versus-delivery ping-pong with the business users.
It obviously causes a lot of extra churn and duplication of effort compared to a well-designed EDW-style implementation (especially as a Data Vault), where governance and agility play important roles.
To reduce the effort, you can move some of the processing upstream, such as business key evaluation and alignment, data integration, and even change data capture (if you want). Granted, CDC across documents is a completely different beast than CDC across rows in a relational database. By designing an implementation architecture from an EDW standpoint instead of a Data Lake standpoint, you will end up doing some processing upstream as compared to none.
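As a rough illustration of that upstream work, here is a minimal Python sketch of two DV 2.0-style techniques: aligning business keys by normalizing and hashing them, and detecting change across documents by comparing a hash of each document’s full payload. The normalization rules, delimiter, and choice of MD5 follow common Data Vault practice but are not the only valid options, and the function names are ours.

```python
import hashlib
import json

def business_key_hash(*key_parts):
    """Normalize a business key (trim, upper-case, join with a
    delimiter) and hash it, in the spirit of DV 2.0 hash keys."""
    normalized = ";".join(str(p).strip().upper() for p in key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def record_hashdiff(doc):
    """Hash a document's full payload in a canonical form; comparing
    hashdiffs between loads detects change without comparing fields."""
    canonical = json.dumps(doc, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# The same customer arriving from two feeds aligns to one key:
assert business_key_hash(" cust-1001 ") == business_key_hash("CUST-1001")

# A changed attribute shows up as a different hashdiff:
v1 = {"id": "CUST-1001", "city": "Bern"}
v2 = {"id": "CUST-1001", "city": "Basel"}
assert record_hashdiff(v1) != record_hashdiff(v2)
```

Comparing hashdiffs between loads flags changed documents without a field-by-field comparison, which is one pragmatic answer to document-level CDC.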
This upstream work may slow down the storage cycles, but it has inherent benefits because the data is now pre-organized for query (and for storage). To leverage technologies like Hive, you would need to convert data into optimal storage formats anyway. The result is better-organized, more map/reduce-friendly data, which as a byproduct reduces the effort of pulling it back out.
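To make “pre-organized for query” concrete, here is a small sketch of our own (not a Hive API) that flattens heterogeneous documents into one fixed column layout – the kind of consistent shape that columnar formats like ORC and map/reduce jobs handle well:

```python
def to_columnar(docs):
    """Flatten heterogeneous documents into one fixed column layout,
    filling gaps with None -- pre-organizing the data so columnar
    storage and map/reduce jobs see a consistent shape."""
    columns = sorted({key for doc in docs for key in doc})
    rows = [[doc.get(col) for col in columns] for doc in docs]
    return columns, rows

docs = [{"id": 1, "name": "a"}, {"id": 2, "city": "x"}]
cols, rows = to_columnar(docs)
# cols -> ['city', 'id', 'name']
# rows -> [[None, 1, 'a'], ['x', 2, None]]
```

Doing this once, upstream, is cheaper than having every downstream job rediscover the schema on its own.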
Is the Data Vault an optimal structure to store this data?
That is still being decided. Even between the two of us, Sanjay strongly believes it is, while Dan needs more convincing.
The one thing the Data Vault does have going for it is that it is definitely a viable alternative that looks much better than a data lake, and with DV 2.0, you can actually store this data in constructs that don’t really modify the core data too much.
Also, given the glue-like nature of Apache Hive – wherein you can have both managed (Hive-internal) tables and external tables stored directly on HDFS or via HBase or Cassandra – it offers an interesting way of looking at a potential all-encompassing EDW that is distributed not only across nodes, but across technologies, with a feasible common query interface, at least for data delivery.
You can reduce the burden on the downstream work. You can reduce the levels of churn by providing a unifying interface and reducing duplication of effort. You can reduce time to delivery for business on the information mart side.
Note: Hadoop isn’t the only game in town. There are innovations in big data from risk management vendors, telecom companies and more, and we’ll talk about them soon.
At the same time, you also optimize storage. You also optimize performance, since you can leverage new SQL-on-Hadoop technologies like the work done on the ORCFile format in Hive or the Spark/Shark layers. You have data in a more map/reduce-friendly format, which increases the performance of other processes as well.
Dan Linstedt and Sanjay Pande
PS: Presenters at the WWDVC2014 used several terms for disorganized data stores, including those on the Hadoop platform. The phrases used were “Garbage Dump”, “Data Landfill” and “Data Wasteland”.
The WWDVC2014 was the first time we had presentations that specifically addressed big data issues with solutions implemented on Hadoop and NoSQL. There was a DV 2.0 NoSQL case study wherein the customer saved a considerable amount of money. There are still a few seats left for WWDVC 2015.
[ WWDVC2015 ]