Case study
Motivation
In this case study we’ll motivate bitemporal principles by demonstrating an example data lifecycle that is not easy to manage or to query in the traditional relational model. We’ll slowly improve upon the relational model until we arrive at the bitemporal model, which elegantly solves the problem at hand.
A bond portfolio
Suppose that we invest in a portfolio of bonds. Each bond pays principle and interest on an independent, known, fixed schedule, and we wish to project our income. In a traditional relational database, we could model this as:
bonds |
bond_payments |
|
---|---|---|
id |
<- | bond_id |
company_name |
payment_date |
|
payment |
||
did_pay |
Since all bonds have the risk of default, we keep track of whether or not the payment was made (did_pay
).
Some initial data
Let’s create some data and make a few projections about our future income.
bonds
id |
company_name |
---|---|
1 | ACME Corp |
2 | Beta Industries |
bond_payments
bond_id |
payment_date |
payment |
did_pay |
---|---|---|---|
1 | 2022-01-01 | $100 | |
1 | 2022-04-01 | $100 | |
1 | 2022-07-01 | $100 | |
1 | 2022-10-01 | $100 | |
2 | 2022-01-01 | $25 | |
2 | 2022-03-01 | $25 | |
2 | 2022-05-01 | $25 | |
2 | 2022-07-01 | $25 | |
2 | 2022-09-01 | $25 | |
2 | 2022-11-01 | $25 |
Suppose that today is 2022-04-01–we’ve received every expected payment to date! Let’s project our future income:
SELECT SUM(payment) AS future_income
FROM bond_payments
WHERE payment_date > DATE'2022-04-01';
=> 300
By the end of the year, we can expect to receive an additional $300. Combined with the $250 that we’ve already received, we’ll receive $550 during the year. Not too shabby!
A complication arises
Here we are in mid-May, and an economic downturn may dash our plans. According to news reports, ACME Corp is entering bankruptcy court, and bondholders are expected to receive only 50% of their scheduled payouts. What’s more, Beta Industries announced on 2022-05-01 that they would miss the payment due that day.
As of 2022-05-15, here’s what our data looks like, including the change to the 2022-05-01 payment:
bond_payments
bond_id |
payment_date |
payment |
did_pay |
---|---|---|---|
1 | 2022-01-01 | $100 | |
1 | 2022-04-01 | $100 | |
1 | 2022-07-01 | $100 | |
1 | 2022-10-01 | $100 | |
2 | 2022-01-01 | $25 | |
2 | 2022-03-01 | $25 | |
2 | 2022-05-01 | $0 | |
2 | 2022-07-01 | $25 | |
2 | 2022-09-01 | $25 | |
2 | 2022-11-01 | $25 |
Suddenly, our future income looks uncertain, but it’s not clear exactly how to query it. We could try a more complicated SQL expression:
SELECT SUM(CASE
WHEN b.company_name = 'ACME Corp' THEN 0.5 * bp.payment
ELSE bp.payment
END) AS future_income
FROM bond_payments bp
JOIN bonds b
ON bp.bond_id = b.id
WHERE bp.payment_date > DATE'2022-05-15';
=> 175
Since we UPDATE
d some data in place ($25
-> $0
for the scheduled payment on 2022-05-01
), we have irreparably limited our future queries–all knowledge of the originally scheduled payment amount is now lost. This loss of data is unacceptable.
We can circumvent this problem with some extra database columns, but we’ll also show that as more fields change, the complexity grows, making it harder and harder to query. What we’ll need is a more rigorous model.
bond_payments
bond_id |
payment_date |
payment |
did_pay |
updated_at |
updated_payment |
---|---|---|---|---|---|
1 | 2022-01-01 | $100 | |||
1 | 2022-04-01 | $100 | |||
1 | 2022-07-01 | $100 | 2022-05-15 | $50 | |
1 | 2022-10-01 | $100 | 2022-05-15 | $50 | |
2 | 2022-01-01 | $25 | |||
2 | 2022-03-01 | $25 | |||
2 | 2022-05-01 | $25 | |||
2 | 2022-05-01 | $25 | 2022-05-01 | $0 | |
2 | 2022-07-01 | $25 | |||
2 | 2022-09-01 | $25 | |||
2 | 2022-11-01 | $25 |
WITH t AS (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY bond_id, payment_date
ORDER BY COALESCE(updated_at, DATE'0001-01-01') DESC
) AS rank
FROM bond_payments
WHERE payment_date > DATE'2022-05-15'
)
SELECT SUM(COALESCE(updated_payment, payment)) AS future_income
FROM t
WHERE rank = 1;
=> 175
That’s not the prettiest SQL in the world, but it does project our future income, and it doesn’t require overwriting existing data.
An announcement
On 2022-06-30, ACME Corp announces that they have reached a deal with creditors to pay 75% of their debt on the original schedule. For investors who expected to receive only 50%, this is welcome news!
Separately, Beta Industries announced that they expect to make scheduled payments in full, and they will make the missed May payment on 2022-07-15. This poses a problem for our traditional relational schema: how do we model the new payment date without losing knowledge of the originally scheduled date? Let’s try adding another column, updated_payment_date
, and projecting our future income.
bond_payments
bond_id |
payment_date |
payment |
did_pay |
updated_at |
updated_payment |
updated_payment_date |
---|---|---|---|---|---|---|
1 | 2022-01-01 | $100 | ||||
1 | 2022-04-01 | $100 | ||||
1 | 2022-07-01 | $100 | 2022-05-15 | $50 | ||
1 | 2022-07-01 | $100 | 2022-06-30 | $75 | ||
1 | 2022-10-01 | $100 | 2022-05-15 | $50 | ||
1 | 2022-10-01 | $100 | 2022-06-30 | $75 | ||
2 | 2022-01-01 | $25 | ||||
2 | 2022-03-01 | $25 | ||||
2 | 2022-05-01 | $25 | ||||
2 | 2022-05-01 | $25 | 2022-05-01 | $0 | ||
2 | 2022-05-01 | $25 | 2022-06-30 | $25 | 2022-07-15 | |
2 | 2022-07-01 | $25 | ||||
2 | 2022-09-01 | $25 | ||||
2 | 2022-11-01 | $25 |
WITH t AS (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY bond_id, payment_date
ORDER BY COALESCE(updated_at, DATE'0001-01-01') DESC
) AS rank
FROM bond_payments
WHERE COALESCE(updated_payment_date, payment_date) > DATE'2022-06-30'
)
SELECT SUM(COALESCE(updated_payment, payment)) AS future_income
FROM t
WHERE rank = 1;
=> 250
These queries are getting progressively harder to write, to read, and to maintain. Moreover, our table structure is rapidly deteriorating; it seems that every field in our table will need a corresponding updated_
field to capture new information while retaining previous knowledge.
Discussion
So far, we’ve shown that it’s possible to model our scenario this way, but it’s difficult to query, let alone to audit. For example, imagine that we want to revisit our projections from April. Can we reproduce them?
WITH t AS (
SELECT CASE
WHEN updated_at IS NULL THEN payment
WHEN updated_at > DATE'2022-04-01' THEN payment
ELSE updated_payment
END AS expected_payment, ROW_NUMBER() OVER (
PARTITION BY bond_id, payment_date
ORDER BY COALESCE(updated_at, DATE'0001-01-01') DESC
) AS rank
FROM bond_payments
WHERE COALESCE(updated_payment_date, payment_date) > DATE'2022-04-01'
AND (updated_at IS NULL OR updated_at < DATE'2022-04-01')
)
SELECT SUM(expected_payment) AS future_income
FROM t
WHERE rank = 1;
=> 300
That produces the expected answer, but it is difficult to reason about.
A principled model
The above produces correct answers with the extra columns because we don’t overwrite any data; if we can write appropriate SQL, then we can reproduce old results. This append-only style is also used in bitemporal modeling. In a bitemporal database, we only append to the historical record, and once written, the view of past data never changes.
Another improvement that bitemporal modeling makes is that it is explicit about the time bounds for each piece of data. In the example above, it is challenging to reason about data because the time bounds are implicit: the data is valid until a new record is inserted with a larger updated_at
. To make this easier to query, we simply need to add an update_valid_until
timestamp (or to treat updated_at
as an interval
type) so that queries may explicitly state, e.g., myTime BETWEEN updated_at AND update_valid_until
.
Lastly, a bitemporal database tracks both the transaction_time
, which is when the database recorded each datum, and the valid_time
, which is the time that a user assigned to a datum. Both times are treated as intervals so that both are easy to query. Having two separate timelines allows us to easily filter the data that we query to “what we want to know” (using the valid_time
) and “when we want to know it” (using the transaction_time
).
As a simple example, let’s revisit our income projection as we knew it on 2022-06-30, but this time, let’s use the bitemporal model:
bond_payments
bond_id |
scheduled_date |
payment |
tt_from |
tt_to |
vt_from |
vt_to |
---|---|---|---|---|---|---|
1 | 2022-01-01 | $100 | 2022-01-01 | 2022-01-01 | ||
1 | 2022-04-01 | $100 | 2022-01-01 | 2022-04-01 | ||
1 | 2022-07-01 | $100 | 2022-01-01 | 2022-05-15 | 2022-07-01 | |
1 | 2022-07-01 | $50 | 2022-05-15 | 2022-06-30 | 2022-07-01 | |
1 | 2022-07-01 | $75 | 2022-06-30 | 2022-07-01 | ||
1 | 2022-10-01 | $100 | 2022-01-01 | 2022-05-15 | 2022-10-01 | |
1 | 2022-10-01 | $50 | 2022-05-15 | 2022-06-30 | 2022-10-01 | |
1 | 2022-10-01 | $75 | 2022-06-30 | 2022-10-01 | ||
2 | 2022-01-01 | $25 | 2022-01-01 | 2022-01-01 | ||
2 | 2022-03-01 | $25 | 2022-01-01 | 2022-03-01 | ||
2 | 2022-05-01 | $25 | 2022-01-01 | 2022-05-01 | 2022-05-01 | |
2 | 2022-05-01 | $0 | 2022-05-01 | 2022-06-30 | 2022-05-01 | |
2 | 2022-05-01 | $0 | 2022-06-30 | 2022-05-01 | 2022-07-15 | |
2 | 2022-05-01 | $25 | 2022-06-30 | 2022-07-15 | ||
2 | 2022-07-01 | $25 | 2022-01-01 | 2022-07-01 | ||
2 | 2022-09-01 | $25 | 2022-01-01 | 2022-09-01 | ||
2 | 2022-11-01 | $25 | 2022-01-01 | 2022-11-01 |
SELECT SUM(payment) AS future_income
FROM bond_payments_bt
WHERE tt_from <= DATE'2022-06-30'
AND tt_to > DATE'2022-06-30'
AND vt_from <= DATE'2022-06-30'
AND vt_to > DATE'2022-06-30';
=> 250
Et voilĂ ! The bitemporal model has reproduced our previous results in a succinct query.
How about auditing our projections from 2022-05-15? That’s just as easy!
SELECT SUM(payment) AS future_income
FROM bond_payments_bt
WHERE tt_from <= DATE'2022-05-15'
AND tt_to > DATE'2022-05-15'
AND vt_from > DATE'2022-05-15';
=> 175
Lastly, let’s audit our projections from 2022-05-15, but this time, let’s use knowledge that we gained on 2022-06-30:
SELECT SUM(payment) AS future_income
FROM bond_payments_bt
WHERE tt_from <= DATE'2022-06-30'
AND tt_to > DATE'2022-06-30'
AND vt_from > DATE'2022-05-15';
=> 250
The previous two queries begin to reveal the power of the bitemporal model. By structuring our data in this way, it’s easy to see why we projected $175 based on what we knew on May 15th. Similarly, it’s easy to see why we projected $250 on June 30th. No matter how our data changes in the future, we will always be able to re-run these queries and get exactly the same results.
Next: learn more about what TwiceDB provides in what we offer.