I’ve been asked many times over the years, what is the best way to model Key-Value pairs in the Raw Data Vault? In this entry, I will answer that question with my suggestions. Remember: there are always at least 3 possible and viable data modeling design choices when choosing any solution (Data Vault or not).
What is a Key Value Pair?
A Key Value Pair is basically a “metadata key” (like a column name for instance) matched with a value (data value). It can be represented in many ways, but theoretically appears as follows:
KEY = VALUE
What makes it difficult to handle?
Traditional Relational Database Management Systems (RDBMS) do not typically do well with “long” data where the column name is also the “data set”. The issue is: to build a “row” of data, it becomes a series of access requests followed by a pivot of the data set. Row-based databases are not built for this sort of thing, where column based databases handle this with ease, elegance and performance.
On to the RDBMS recommendations…
When applying these techniques to the Raw Data Vault, there are some very simple recommendations. You might find it surprising that you already know the answers, but – be that as it may, here they are. Unfortunately there is no “magic bullet” or “special sauce” – again because RDBMS engines are not built for this (especially over VOLUMES of data where you might have billions of key/value pairs….
Always, and I mean always attach your modeling decisions to the business use cases for maximum success!!!
Store it all in Key-Value in a Satellite, and pivot columns in to Point in Time (PIT) and Bridge Structures. Only pivot those columns that meet the needs of the business use cases for query performance, indexing, aggregation, or business rules. Do NOT pivot ALL the columns, even when it’s a fixed setup.
Store it all in Key-Value in a Satellite, pivot critical columns only on the way to the Business Vault or the Information Marts, again – only pivot those columns that meet the needs of the business use cases (as stated above).
Pivot SOME of the columns during the ELT phase of loading the Raw Data Vault. Land the data in two different Satellites, one that has the pivot already applied, and the other that stores ALL values (or only non-pivoted values) in Key-Value Satellite. Leverage queries to access pivoted columns, only join to Key-Value Satellites when absolutely necessary.
Get a good Key-Value or Row Based Database, like SnowflakeDB to handle and store your Raw Data Vault in, perhaps PostGreSQL, or Hive and Hadoop. Join to the Key-Value store when necessary, or if you are in SnowflakeDB, just use it for your entire Data Vault / EDW.
Store ALL your key-value data in Hive / Hadoop platform, then pivot data in to the Raw Data Vault (RDBMS based) when needed for a business use case. This is the most cost effective, but not necessarily the most performant, nor the most flexible – unless the use cases are coming from Data Scientists.
Well folks, that’s all there is from my end on this one, hope you found it valuable.
(C) 2017 all rights reserved