I got a very good question today (and it’s a common one), so let’s address it.
“Isn’t moving work like cleansing and business rules downstream of the Data Warehouse (DV) just shifting the problem? We still need to fix it. So how is the Data Vault any different from any other architecture?”
Great question actually!
(and a common misconception)
As far as I know, no other architecture tries to completely avoid re-engineering efforts in at least one tier/layer of the solution. The Data Vault deliberately addresses this with a goal to avoid re-engineering as much as possible. This is why both the model and the methodology are so important.
When you’re trying to avoid touching any of the data that you’ve already brought into your Data Warehouse, especially when your source systems change or new source systems are added to your Data Warehouse loads, the first thing that will affect you is the “soft” business rules: rules that restate the data in a different way, and whose results are subject to interpretation.
Add modifications to existing constructs in your Data Warehouse on top of that, and it starts to get complicated because …
There is no fallback mechanism (beyond going back to the sources … if they still exist).
These fallback mechanisms are required in case of a corrupted load or some other unforeseen event. Data cleanup in the Data Vault is actually quite simple. And it’s even easier with DV 2.0 because of the hashes we spoke about yesterday. In fact there’s an entire module on bad data and types of cleanup in the Data Vault Implementation course here:
And to top it off, you’re doing everything in a single pass. So, the loading routines can get fairly complicated. For example, if you have the extracts, data joins, cleansing, business rules, data type alignment, normalization and change data capture in a single pass, it makes for a fairly complex data integration routine. This complexity is hard to build, test and maintain. The level of complexity keeps going up with time as more and more applications get added and the routine gets modified to accommodate them.
You also have to reload the data sets sometimes because of structural changes. Alternatively, you sometimes create one-time-use routines for fixing the data, which translates to time, money and effort spent on something that can never be re-used.
With the Data Vault, many of these issues go away because the model was in fact designed from the ground up for change. The separation of the hubs from the satellites (the business key from its context) is not a coincidence. Separating satellites by source system comes from years of testing various formats and from avoiding something called data explosion (a very important topic which we’ll talk about sometime).
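To make the hub/satellite split a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a reference implementation: the class names `Hub` and `Satellite`, the column names, the sample sources (`crm`, `billing`) and the use of SHA-256 for the hash key are all hypothetical choices made for this example.

```python
import hashlib
from dataclasses import dataclass, field


def hash_key(business_key: str) -> str:
    """Hash a normalized business key (SHA-256 is an assumption for this sketch)."""
    normalized = business_key.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class Hub:
    """Holds only business keys, indexed by their hash key."""
    keys: dict = field(default_factory=dict)

    def load(self, business_key: str) -> str:
        hk = hash_key(business_key)
        # Insert the key only if it is new; existing entries are never rewritten.
        self.keys.setdefault(hk, business_key.strip().upper())
        return hk


@dataclass
class Satellite:
    """Holds descriptive context for a hub, one satellite per source system."""
    source: str
    rows: list = field(default_factory=list)

    def load(self, hub_hash_key: str, load_date: str, attributes: dict) -> None:
        # Append-only: context is added as new rows, never updated in place.
        self.rows.append({
            "hub_hash_key": hub_hash_key,
            "load_date": load_date,
            "record_source": self.source,
            **attributes,
        })


hub_customer = Hub()
sat_crm = Satellite(source="crm")
sat_billing = Satellite(source="billing")

# The same customer arrives from two systems with slightly different formatting.
hk = hub_customer.load(" c-100")
sat_crm.load(hk, "2024-01-01", {"name": "Acme Ltd"})
hk_again = hub_customer.load("C-100")
sat_billing.load(hk_again, "2024-01-02", {"credit_limit": "5000"})
```

Both loads resolve to the same hub entry, while each source keeps its own context in its own satellite. That is the point of the separation: the key is stored once, and the descriptive data from each system can change shape without disturbing the other.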
Even after some of your source systems are retired and moved offline, the Data Vault still gives you a fallback to the original data set. And the architecture works by breaking apart different pieces. The engineering components and the business components are separated on purpose so there is “division of labor” and a logical separation of related tasks.
So, yes, you do have to push work downstream, but it’s a lot less than doing it in a single pass (and it actually reduces overall effort).
For the sake of argument, let’s say your complex single-pass routine does 10 different tasks. With a DV, say you split these 10 tasks in two: 5 on the way into the DV (business key alignment, CDC, data type alignment, etc.) and 5 on the way out to the data marts. Now, suppose you have something as simple as a business rule change which requires you to re-interpret the data in a different way. Your impact is automatically lessened, because only the outbound routines that apply that rule need to change; the inbound loads and the data already stored stay as they are. That, however, is the best-case scenario.
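The split can be sketched in a few lines of Python. This is a toy illustration, assuming an in-memory list stands in for the raw Data Vault; the names `load_inbound`, `build_mart`, `rule_v1` and `rule_v2`, the sample row and the thresholds are all made up for the example.

```python
from decimal import Decimal

raw_vault = []  # stands in for the raw, uncleansed Data Vault layer


def load_inbound(source_rows):
    """Inbound side: hard rules only (key alignment, data type alignment)."""
    for row in source_rows:
        raw_vault.append({
            "customer_key": row["cust_id"].strip().upper(),  # business key alignment
            "amount": Decimal(row["amount"]),                # data type alignment
        })


def build_mart(soft_rule):
    """Outbound side: apply a soft business rule on the way to the data mart."""
    return [{**row, "segment": soft_rule(row)} for row in raw_vault]


def rule_v1(row):
    # Original interpretation (hypothetical): "large" means 1000 or more.
    return "large" if row["amount"] >= 1000 else "small"


def rule_v2(row):
    # Changed interpretation (hypothetical): "large" now means 500 or more.
    return "large" if row["amount"] >= 500 else "small"


load_inbound([{"cust_id": " c-100", "amount": "750.00"}])

mart_v1 = build_mart(rule_v1)
mart_v2 = build_mart(rule_v2)  # rule changed: rebuild the mart, no reload of raw data
```

When the rule changes from `rule_v1` to `rule_v2`, only `build_mart` is re-run. `load_inbound` is not touched and `raw_vault` is not reloaded, which is the lessened impact described above.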
If we look at the other end of the spectrum, at a worst-case scenario, you probably have to integrate a completely new system, with its data, into your environment. The impacts are potentially severe and can require schema modifications in every layer. It can then get pretty complex. With a Data Vault based DW, the impacts are minimal. Existing constructs and load routines are seldom impacted. You only have to build the loads for the difference, which can be as simple as adding new business keys to a hub and adding a new satellite without touching anything that’s already working. It can get more complex too, but the key point is that anything in the existing DW can be kept as is, without impact, in most cases. Existing routines and data are untouched.
The downstream impacts are also easier to handle because the processes are broken down, as explained earlier. There are very good reasons why the automatable, generatable and repeatable processes are delegated to the “engineering” back-end portions of the architecture and why we call it the Data Warehouse (Data Vault).
Decomposing the load process into separate tasks, and the data into separate areas, is very valuable. Yes, it’s been done in Data Warehousing before; however, the Data Vault is the only architecture where the actual data in the DW is required to be raw and uncleansed (only hard business rules, like data type alignment, are permitted). Doing this actually does a LOT more than simply moving the problem downstream.
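Here is a rough Python sketch of that simplest case: a new source contributes new business keys to an existing hub and gets its own new satellite. Everything in it is an assumption for illustration, including the function name `add_source`, the `customer_key` field and the sample `crm` and `webshop` data; a real load would also carry load dates and record sources.

```python
import copy

# Existing warehouse content (hypothetical sample data).
hub_customer = {"C-100", "C-200"}
satellites = {
    "crm": [
        {"customer_key": "C-100", "name": "Acme Ltd"},
        {"customer_key": "C-200", "name": "Globex"},
    ],
}
crm_before = copy.deepcopy(satellites["crm"])  # snapshot to show nothing changes


def add_source(hub, sats, source_name, rows, key_field):
    """Add a new source: new keys go to the hub, context goes to a new satellite."""
    if source_name in sats:
        raise ValueError(f"satellite for {source_name!r} already exists")
    new_keys = set()
    new_satellite = []
    for row in rows:
        key = row[key_field].strip().upper()
        if key not in hub:
            new_keys.add(key)
        context = {k: v for k, v in row.items() if k != key_field}
        new_satellite.append({"customer_key": key, **context})
    hub.update(new_keys)               # purely additive: existing keys are untouched
    sats[source_name] = new_satellite  # new structure next to the existing ones
    return new_keys


added = add_source(
    hub_customer,
    satellites,
    "webshop",
    [
        {"cust_no": "c-200", "email": "ops@globex.example"},
        {"cust_no": "c-300", "email": "hello@initech.example"},
    ],
    key_field="cust_no",
)
```

After the call, the hub has gained only the key it did not already know (`C-300`), the `webshop` satellite exists alongside `crm`, and the `crm` rows are identical to the snapshot taken beforehand.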
P.S. Avoiding re-engineering whenever possible actually saves projects time, money and effort in the long run. It’s actually one of the top 10 process goals and objectives as you’ll see in the second module of the Data Vault Implementation training here: