Data Vault 2.0 and Hadoop Map/Reduce

One of the features of Hadoop is MapReduce, a programming model popularized by Google. Basically, MapReduce breaks complex tasks down into simpler chunks that can be executed in parallel.

It’s not new.

All MPP shared-nothing architectures have had something similar for a very long time. Every process Data Vault 2.0 uses, from the models to the data integration routines, takes advantage of shared-nothing parallelism.

It’s been this way for decades, since Data Vault 1.0.

I asked my partner Sanjay to explain the MapReduce functionality because he’s actually done some functional programming in Lisp and Scheme, which is where the map and reduce ideas originally came from.

Dan: Sanjay, can you explain MapReduce?

Sanjay: It’s actually two different components. To understand the map and the reduce steps, it’s best if you have a dynamically typed language with first-class functions.

Dan: What do you mean by first-class functions?

Sanjay: It’s the ability to pass around functions just like you would variables. A function could in fact return a function instead of a variable. More importantly, a function could take another function as an input, just like it would a variable.
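To make that concrete, here’s a small Common Lisp sketch of my own (the TWICE function is made up purely for illustration). It takes a function as input and hands back a brand new function:

(defun twice (f)
  ;; TWICE takes a function F and returns a new anonymous function
  ;; that applies F two times in a row.
  (lambda (x) (funcall f (funcall f x))))

;; Pass a function in, get a function back, then call the result:
(funcall (twice #'(lambda (x) (* x x))) 3)

=> 81   ; (3 squared) squared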

Dan: Most functional programming languages already do this, and some of the newer languages like Python support it too, don’t they?

Sanjay: Yes, they do, and they’re getting better. Python took it from Lisp, and languages like Ruby took it from Smalltalk, where it’s implemented via block closures.

Dan: So, tell us about Map and Reduce.

Sanjay: They’re actually fairly simple to understand.

Map permits you to take a function and apply it to a group of variables like a list or an array.

If list = (x1 x2 x3 … xn)

and f(x) is a function you want to apply (f(x) can be defined on the fly), you’d do:

map [ f(x) list ]

A real example in Common Lisp would be:

(mapcar #'(lambda (x) (* x x))
        '(1 2 3 4))

This is basically an anonymous function that squares its argument, applied across the list, and it results in

=> (1 4 9 16)

Reduce is similar, but instead of applying the function to each element individually, it combines all the elements of the list into a single result. So, for reduce to work, you need elements of similar types and a function that can be applied across them; examples include adding or multiplying numbers.

reduce [ f(x) list ]

A real example is

(reduce #'+ '(1 2 3 4))

Basically, the addition function is reduced over the list, and the output in this case is 10. Think of reduce as an aggregator.

So, let’s put the two functions together, using the same examples.

(reduce #'+ (mapcar #'(lambda (x) (* x x))
                    '(1 2 3 4)))

We get the + function reduced over the squares of the numbers in the list resulting in 30.

Dan: So, it’s fairly powerful. Is this how MapReduce is implemented on platforms like Hadoop?

Sanjay: Well, not exactly, but the idea is borrowed from these principles. See, the map task is used to distribute the work across parallel nodes for processing, and the reduce is used to collect the results back. The actual implementation is VERY different, but the basic idea is the same. Also, within Common Lisp it’s not a parallel operation by default under the hood (with some exceptions); it requires extra coding to make it so.
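Here’s roughly what that “extra coding” can look like, as a quick sketch of mine that assumes the third-party lparallel library is already installed via Quicklisp:

;; Load lparallel and set up a kernel of 4 worker threads.
(ql:quickload "lparallel")
(setf lparallel:*kernel* (lparallel:make-kernel 4))

;; Parallel version of the earlier mapcar example: the squares are
;; computed across the worker threads instead of one after another.
(lparallel:pmapcar #'(lambda (x) (* x x)) '(1 2 3 4))

=> (1 4 9 16)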

They’ve taken these principles down to the lower abstraction layers, like the file system, and the API lets programmers use it seamlessly without worrying about the details. They’ve also built in fault tolerance, which isn’t directly included in the original map and reduce.

The implementation also permits MapReduce to return multiple values, unlike in Common Lisp, where only one value is returned after the reduce step.
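To give a flavor of that difference, here is a plain Common Lisp sketch of mine in the Hadoop key/value style: a word count with made-up function names, run sequentially and in memory, with none of Hadoop’s distribution or fault tolerance. Map emits (word . 1) pairs, a shuffle step groups the values by key, and the reduce step is applied once per key, so the job produces one result per word rather than a single value:

;; MAP-PHASE emits a (word . 1) pair for every word in the input.
(defun map-phase (words)
  (mapcar #'(lambda (word) (cons word 1)) words))

;; SHUFFLE groups the emitted values by their key (the word).
(defun shuffle (pairs)
  (let ((groups (make-hash-table :test #'equal)))
    (loop for (key . value) in pairs
          do (push value (gethash key groups)))
    groups))

;; REDUCE-PHASE applies + to each group, giving one result per key.
(defun reduce-phase (groups)
  (loop for key being the hash-keys of groups using (hash-value values)
        collect (cons key (reduce #'+ values))))

;; Putting the three steps together:
(reduce-phase (shuffle (map-phase '("to" "be" "or" "not" "to" "be"))))

=> (("to" . 2) ("be" . 2) ("or" . 1) ("not" . 1))   ; order may vary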

Dan: So, do you think Data Vault 1.0 or Data Vault 2.0 is tailor-made for Hadoop, then?

Sanjay: They very well could be. After all, you did borrow everything for the shared-nothing MPP architecture, even with Data Vault 1.0. With Data Vault 2.0, which completely dwarfs 1.0 in coverage since it’s a complete system, the elimination of numeric integer surrogates from the architecture does help on parallel platforms, where generating them would be an unnecessarily hard problem. The hash keys are actually a very good idea in my opinion, and I’ll go into details another time. In short, though, with hash keys for composite keys you’ll tend to reduce the Map/Reduce processing needed downstream of a DV.
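Here’s a quick illustration of the hash key idea from me, before Sanjay covers it properly another time. This is only a sketch: real Data Vault 2.0 loads typically compute something like an MD5 or SHA-1 over the business key using a crypto library, whereas the built-in SXHASH stands in here just to keep the example dependency-free, and HUB-HASH-KEY is a made-up name:

;; Derive a deterministic hub key by hashing the concatenated,
;; standardized parts of a composite business key.
(defun hub-hash-key (&rest business-key-parts)
  (sxhash (format nil "~{~A~^||~}"
                  (mapcar #'string-upcase business-key-parts))))

;; Any node can compute the same key independently, in parallel,
;; with no shared sequence generator to coordinate on.
(hub-hash-key "ACME" "INV-10023")

=> the same integer every time, within the same Lisp image

That determinism is the point: the key can be computed wherever the data lands, without waiting on a central next-value sequence.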

Also, it doesn’t really matter from a hardware perspective because the methodology would still be the same whether you use Oracle or DB2 or MySQL or PostgreSQL or Hadoop.

The DV is going to continue requiring that the “soft” business rules be moved downstream and that the data look as close as possible to the source data, albeit in the flexible, integrated constructs of hubs, links and satellites.

It’s reason enough to go with the DV.

Dan: A lot of people don’t understand the power behind that. They’ve been indoctrinated with the BUS architecture, which is in fact the most popular today. Why?

Sanjay: I’ve never been a fan of popularity. Usually it’s a sign of something not being right. Look at Windows. Look at the popular databases. When I started, I was also a victim of the Kimball-style indoctrination, which unfortunately starts either at school or with something that’s already popular and considered “safe,” despite evidence to the contrary.

I’ve yet to see a Kimball BUS architecture withstand change. Each and every change breaks it down and adds a ton of work. You’re also forced to assume design constructs which can be incorrect, and those constructs are not flexible since they’re built for a single purpose. That’s NOT a smart way to build anything, in my personal opinion as well as in my experience.

Dan: If people are wary or scared of using the Data Vault 2.0, what do you recommend they do?

Sanjay: There are a few different things. The Data Vault is so granular that you can pick a piece of your entire model and start with it. See it in action for yourself before you pass judgement on it. There’s also the WWDVC coming up where they can come and actually talk to people from all over the world who’ve implemented DV 2.0 in their organizations and have saved time, effort and money. I know a few people who are attending this year who can show actual numbers in comparative architectures.

I also personally HATE the word life cycle when it’s applied to a data warehouse. A well-designed DW should NEVER have an end of life. After all, it’s only when it holds a ton of data that it truly becomes a really valuable asset, one that should continue to grow in value.

***

Hope you enjoyed that as much as I did.

There is of course a correct way to implement Data Vault 2.0, and there is a wrong way. The worst thing you can do is implement it incorrectly and not get the benefits.

Data Vault 2.0 training is only available through me or through authorized training partners and Sanjay happens to be one of them. The others are listed here:

Authorized Data Vault 2.0 Trainers

Regards,

Dan Linstedt

PS: The online version of this course is also in the works. Day 1 of CDVP2 is almost complete. To get notified about the online version, sign up at:

Data Vault 2.0 Advanced Notification