History of the Data Vault

Today let's talk about what influenced the Data Vault Architecture …

At the time Dan invented the DV (1990), the concepts of 3NF DWs and Dimensional Modeling were already there. However, these only influenced the modeling aspects of the DV. There's so much more to the architecture, and I'd argue that the methodology is far more important than the model. In fact, I don't believe you'll derive as much value from the model if you don't use it within the recommended architecture.

I also noticed other people had questions related to its origins. I've actually asked Dan about this many times before (because I like history), but I thought it's a good day to understand the influences. Understanding history, I find, helps give us a completely different perspective.

Here’s what he said:

On Modeling – I’ve said it many times. The principles behind “Data Vault Modeling” are not new. Data Vault modeling was influenced by

a) A hub and spoke architecture

– and –

b) A hybrid design of 2nd normal form, 3rd normal form and an element borrowed from dimensional modeling.

At the outset, it looks rather simplistic, but it did take time, testing and lots of experimentation to get it to the point where it actually looks simple. Formalization of the standards helped with its adoption as well as consistency in implementation.

The design of satellites, and breaking the context apart from the business key, is something that people take for granted. Interestingly, the influence on my satellite design actually came from a study of inverted index trees (used with linguistic processing and high-volume text data parsing). I may have arrived at the same conclusions in other ways.
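To make the satellite idea concrete, here's a minimal, purely illustrative sketch (the entity and column names are hypothetical, not the formal standards): the hub holds nothing but the business key and its load metadata, while the satellite holds the descriptive context, keyed by the business key plus a load date so changes simply stack up as new rows.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class CustomerHub:
    """Hub: just the business key plus load metadata -- no descriptive context."""
    customer_bk: str        # the business key as the business knows it
    load_dts: datetime      # when the key first arrived in the warehouse
    record_source: str      # where it came from

@dataclass(frozen=True)
class CustomerSatellite:
    """Satellite: descriptive context, versioned by load date against the hub key."""
    customer_bk: str        # ties back to the hub's business key
    load_dts: datetime      # every change to the context adds a new row
    record_source: str
    name: str               # descriptive attributes live here, not in the hub
    email: str
    status: str
```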

The hub and spoke architecture was used so the model could scale on MPP-style hardware if and when needed. In hindsight, there is actually an issue with the current DV 1.0 design: a shared-nothing process can't really scale the key-generation step without dependencies. A DV 2.0 modification is being tested at very large scale to address this (we'll talk more about it when we cover the reasons for the DV 2.0 enhancements).
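As a side note on that key-generation bottleneck: the hashing mentioned in the P.S. below is the widely known DV 2.0 answer, deriving the surrogate key deterministically from the business key itself (historically with MD5) so parallel, shared-nothing loaders compute identical keys independently instead of waiting on a central sequence. A minimal sketch, with the normalization rules assumed for illustration rather than taken from the standard:

```python
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Deterministic surrogate key from the business key (a DV 2.0-style hash key).

    Assumed normalization for illustration: trim whitespace and upper-case each part
    before hashing, so independent loaders produce identical keys for the same input.
    """
    normalized = delimiter.join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Any parallel, shared-nothing loader computes the same key with no shared sequence:
# hash_key("CUST-1001")  ->  the same 32-character hex digest everywhere
```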

On Data Integration (ETL) – I noticed a common problem with ETL. Every EDW would have to have different sets of load designs for history loads and current data loads. This was complicated by the fact that when a business rule that potentially changed the data had to be applied to history, it caused extra ETL work and extra related work such as testing. Subsequent iterations were almost always more expensive despite all the object sharing and shared routines.

This was fixed using two concepts.

a) Dividing the processes using the hard and soft rule definitions and pushing soft rules downstream of the DW load process.

– and –

b) Splitting relationships apart using a mandated many-to-many relationship with links (a sketch follows below).

Doing this made the Data Vault future-proof.
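To picture that mandated many-to-many link, here's a minimal, hypothetical sketch (not a formal template): the relationship lives in its own table keyed by references to the participating hubs, so if the business cardinality changes later – say a customer gaining many accounts – the model doesn't have to change.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class CustomerAccountLink:
    """Link: the relationship itself, always modeled as many-to-many.

    Each row records one observed association between a customer and an account.
    Cardinality (1:1, 1:M, M:N) is never baked into the structure, so a change in
    the business relationship doesn't force a remodel.
    """
    customer_bk: str      # reference to the customer hub (business, surrogate or hash key)
    account_bk: str       # reference to the account hub
    load_dts: datetime
    record_source: str
```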

These were the primary influences for those two things. On the methodology side, I had to actually get into a predictable systems architecture, have predictable and defined load patterns, predictable project plans, predictable business-IT collaboration and many other nuances. I actually went through the trouble of defining all of these in the methodology and refining it as and when something worked well. Agile BI was mostly an accident. We were doing Agile BI long before it was formally defined.

I did have discussions with Scott Ambler about agile. In fact, when he read the Super Charge Your Data Warehouse book, he said, “This book captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from.”

The funny thing is that agile was a happy accident. What we were building was in fact agile, but we just wanted to deliver predictably and quickly to the business. The truth is, so did pretty much all of the others, but it was almost like I'd found a secret key by addressing and eliminating the issues caused by the other architectures.

Now, it's not that I didn't create my own problems. Since relational databases have been the de facto storage for data warehouses, I did run into issues at times with extraction from the DV. Helper tables like the PIT and the Bridge were born from that.
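For anyone who hasn't met those helpers: a PIT (point-in-time) table pre-computes, for each hub key and snapshot date, which satellite row was current at that moment, so downstream queries can use simple equi-joins instead of hunting for the latest load date in every satellite. A rough, hypothetical sketch of that computation (not the Informatica template mentioned below):

```python
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Tuple

def pit_rows(hub_keys: Iterable[str],
             satellite_loads: Dict[str, List[datetime]],
             snapshot_dts: datetime) -> List[Tuple[str, datetime, Optional[datetime]]]:
    """For each hub key, pick the latest satellite load date at or before the snapshot.

    The resulting (hub_key, snapshot_dts, satellite_load_dts) rows are what a query
    equi-joins to, instead of running a correlated 'latest row' lookup per satellite.
    """
    rows = []
    for key in hub_keys:
        eligible = [dts for dts in satellite_loads.get(key, []) if dts <= snapshot_dts]
        rows.append((key, snapshot_dts, max(eligible) if eligible else None))
    return rows
```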

Dan even went into details on what prompted him to create hubs and satellites for the Data Vault, but I think this is already getting quite long, so perhaps another time.

If you want to see the templates of the helper tables like the PIT and the Bridge, then they’re currently available in the Informatica Data Vault 2.0 Template training here:

[ Data Vault 2.0 Implementation with Informatica PowerCenter ]

The funny thing is, I've actually worked as part of a team that was an early adopter of "agile methodologies" at a university, where we used to source data from the application, and it gave me a lot of insight into what agile really means. A lot of people actually get it wrong and simply associate agile with speed; however, that's only one aspect.

One of the most important principles of agile methodologies is continuous refinement using feedback loops. The speed is in the feedback loops with the business. The Data Vault can also get quite misunderstood: it's an extremely flexible model but an extremely rigorous methodology. In fact, I compared it against the complete Agile Manifesto and the agile principles, and the Data Vault Architecture was quite a good fit on all counts.

Regards,

Sanjay Pande (from frozen Canada) with Dan Linstedt (from warm Australia)

P.S. The Informatica course covers quite a bit of the hashing concepts used in DV 2.0 and you can see the DV 2.0 modules with the rest of the curriculum here:

[ Data Vault 2.0 Implementation with Informatica PowerCenter ]