… You know it (and the beginning of “pragmatic” solutions).
As 2014 gets underway, we wanted to talk about one of the most hyped topics of last year … big data and, of course, the Hadoop project.
But first, here’s to a very happy and prosperous 2014 for you and everyone around you.
2013 is when big data displaced NoSQL as the buzzword of choice, and Hadoop has been in the news day in and day out with some new innovation on the platform or some vendor promotion. It's definitely an innovative platform: it automates backup and recovery of data spread across thousands of nodes in an MPP style, and it claims a lower TCO thanks to its licensing costs and its ability to run on commodity hardware.
Similar scaling on proprietary platforms would cost an arm and a leg, and then some.
But, is this the solution to all the problems?
It has often been positioned as the “panacea” for scalability and flexibility in storage formats, with the data warehouse being the most common use case.
It’s essentially a storage platform with a built-in MPP-based data processing engine that takes advantage of parallel processing across the data sets.
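To make that parallel-processing idea concrete, here is a minimal, hypothetical sketch of the map/reduce pattern Hadoop popularized: counting words across independent splits of a data set. The thread pool merely stands in for the cluster; on Hadoop, each map call would run on the node that stores that block of data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_phase(split: str) -> Counter:
    # The "map" step: count words in one split of the data set.
    return Counter(split.lower().split())

def reduce_phase(left: Counter, right: Counter) -> Counter:
    # The "reduce" step: merge partial counts from two mappers.
    return left + right

# Two splits standing in for blocks spread across cluster nodes.
splits = ["Big data big hype", "data lake data hub"]

# Each split is mapped independently, in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_phase, splits))

# Partial results are then merged into the final answer.
word_counts = reduce(reduce_phase, partials)
# word_counts["data"] == 3, word_counts["big"] == 2
```

The point is that no mapper needs to see any other mapper's split, which is exactly what lets the work spread across thousands of nodes.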
Yet there are many unresolved issues, starting with the fact that there are at least two camps of people. The first group is the evangelists of the platform, who believe it’s the best thing since sliced bread and tend to be developers rather than experienced business intelligence implementers. They are often very good programmers who understand processing data at scale using programming techniques and languages. Everything for them is a “processing” and/or “programming” problem. The other group is the legacy BI developers, who simply reject the platform as a fad and want nothing to do with it. They cringe when they hear words like Big Data, Hadoop and BI in the same sentence.
There are also people like you and us who believe both of those groups are taking a one-sided view of what really should be a technology and architecture decision.
The reality is that these platforms are now part of the landscape of many organizations and can be leveraged as assets for both data storage and data processing. However, there’s also a significant investment in traditional platforms like RDBMS engines, which still serve the majority of BI needs today, especially for structured data.
Much of the prescribed guidance, and the knowledge transfer happening to big data developers, appears to be misguided, because they’re still trying to push the platform at the business user, with tool vendors jumping on the bandwagon. Some approaches still hold a bit of water, but they don’t leverage the platform in the optimal way that Data Vault 2.0 can. (Ironically, from a business perspective, we’re still solving the same recurring problems, just with some new technological twists and innovations.)
And Data Vault 2.0 isn’t only for platforms like Hadoop. It actually started out on traditional MPP database engines like Teradata and was then extended to include big data platforms. It even helps tie multi-structured data together and tie big data platforms to traditional ones.
The conventional big data platform advice these days is to use the platform as a dumping ground and pull what you want from it later. They call it a “data lake”. Some organizations have even called it a “data hub”, which only adds to the confusion.
It’s a start, but a rather inefficient way to store your data assets.
The disconnect between the two groups continues.
Tomorrow, we’ll talk about how you can inadvertently make your big data problems even bigger with traditional architectures.
Dan Linstedt and Sanjay Pande
PS: The first DV 2.0 templates were introduced on LDV last year, but a full-fledged DV 2.0 course is coming soon. The only way to currently get DV 2.0 certified is directly via Dan, and you can get one or both of us on the unique kick-start package as shown here:
More DV 2.0 information is also forthcoming. Stay tuned.