I blogged about end dates here sometime back: End Dates Are Dead Since then, I’ve received hundreds of emails, tons of comments, and a lot of questions about why, how to work without them, and what does this do to query performance. In this entry I will answer a few of these questions… If you still don’t understand the principles behind the mathematics, then I encourage you to study set logic, the nature of updates at hundred+ terabyte levels, along with the nature of system dates versus business dates.
What’s all the hubbub about?
First off, people have misused and continue to misuse load dates and load-end-dates.
Load Dates are system dates, as are load-end-dates. They should NEVER have business meaning outside the warehouse. (this is a topic we get in to and lecture on, in our CDVP2 (Certified Data Vault 2.0) class). Practitioners insist for some reason on assigning business value to load dates and load end-dates, as if they hold some special value to the business. This couldn’t be more wrong. They back-date load dates, future date end-dates, and when the business “back-dates data”, they incorrectly assign source system dates, and business dates to load-dates, thereby destroying, utterly destroying the purpose of data time-lines.
What about Load End-Dates?
Well, the Load Date discussion is a different one, and I will write about that later, in another entry. Load end-dates were never part of the original architecture in 1990’s. They are a mechanical manufactured span, a system driven field – only there because in the early 2000’s the database vendors did not have OLAP or CTE (common table expression) functions – hence we needed between clauses to understand when data contained delta’s.
Why then are Load End Dates Dead?
Well, let me tell you: 1) you should learn, breathe, eat, sleep Point in Time and Bridge Table structures. If you really really need spans for between clauses, compute them based on business dates in your Satellites!!! 2) You need to learn, study, and understand OLAP and CTE functions. The optimizers in the databases these days do an incredible job of resolving proper query access paths underneath. The optimizers are tuned to perform better over scale using these functions.
and 3) most importantly: PERFORMANCE, PERFORMANCE, PERFORMANCE… The physical act of “updating an end date” on a hundred and fifty terabyte solution (or more) causes problems. In fact, in Hive / Hadoop, you CANNOT update end-dates and have it finish in time for lunch on a single large Satellite structure (something with 10 Billion records or more embedded in it). Furthermore, on Hive in Hadoop, you don’t want Load Dates at all – so why would you even begin working with Load End Dates?
Wait a minute… there is ONE MORE CRITICAL REASON!!!
4) Real-Time data, streaming data… If you load real-time data from message queues or event driven systems, you have no idea what order the data is in, nor do you have any control over when it gets loaded… To top things off, if you are using machine generated data at high rates of speed, you cannot afford to “update and end date during the process”… In fact, load-end-dates are dead before they begin in a full real-time solution. They “mess up” the timing, the delta’s and the processing of data.
That said, in order to build a full real-time Data Vault or Data Warehouse you must have one critical piece of information attached to every single transaction: an origination date and time / transaction creation date and time… otherwise you can never reconstruct the data according to the business flow when building reports or analytical solutions!!
Wait, if you take my load end dates away HOW do I query?
You haven’t gotten it yet?? Sorry about this, but I feel strongly that you need to learn business terms, business data, and business dates. Temporality of data is only “valid” or realistically applied to business time-lines. System time lines (like Load Dates and Load End Dates) hold no business meaning. All they provide is a window in to when the data arrived in the warehouse on that server. Let me ask you this question: What is the meaning of a Load Date or Load End Date if your data warehouse is in the cloud? What time zone do you use? what date/time do you set the data to?
Trouble trouble…. Load Dates are in trouble too…. But that’s a discussion for another time. For now, you need to trust me: load end dates are dead.
To understand how to query you need to start focusing on the business dates. Any other dates that appear in the Satellites that arrive from the source system (planned dates, actual dates, manufacturing dates, scheduled dates, campaign dates, etc…) these are the business dates from which you should be building your analytics and business warehouses/reports downstream.
IF technically speaking you still need a “between clause” for query, then caluclate the SPAN of these dates, or calculate the END of business dates when loading the PIT and BRIDGE tables. Set an index on the business dates, and write your between clauses accordingly.
Wait, what??? I still don’t understand, I won’t give up Load End-Dates…
Ok, let me say this one more time: focus on business dates, STOP FOCUSING ON SYSTEM DATES that have no meaning to the business. STOP ASSIGNING “perceived business value” to SYSTEM DATES!!!
What is the impact? Final Verdict?
The impact is dire: as your warehouse (data vault or not) grows, you will be forced to re-engineer your model, and your load processes, and your queries. Beyond a certain level of volume, the “end dates” no longer can be sustained with an update statement. In fact, if you move to Hadoop with your data warehouse, the same will be applied… your performance for “update in Hive” will be EVEN WORSE!!!
Do not end up re-engineering your solution, remove the load-end dates today, scale without problems, without issues.
Hope this helps,