Case study

Folder Case study

Motivation

In this case study we’ll motivate bitemporal principles by demonstrating an example data lifecycle that is not easy to manage or to query in the traditional relational model. We’ll slowly improve upon the relational model until we arrive at the bitemporal model, which elegantly solves the problem at hand.

A bond portfolio

Suppose that we invest in a portfolio of bonds. Each bond pays principle and interest on an independent, known, fixed schedule, and we wish to project our income. In a traditional relational database, we could model this as:

bonds bond_payments
id <- bond_id
company_name payment_date
payment
did_pay

Since all bonds have the risk of default, we keep track of whether or not the payment was made (did_pay).

Some initial data

Let’s create some data and make a few projections about our future income.

bonds

id company_name
1 ACME Corp
2 Beta Industries

bond_payments

bond_id payment_date payment did_pay
1 2022-01-01 $100 Check
1 2022-04-01 $100 Check
1 2022-07-01 $100
1 2022-10-01 $100
2 2022-01-01 $25 Check
2 2022-03-01 $25 Check
2 2022-05-01 $25
2 2022-07-01 $25
2 2022-09-01 $25
2 2022-11-01 $25

Suppose that today is 2022-04-01–we’ve received every expected payment to date! Let’s project our future income:

SELECT SUM(payment) AS future_income
FROM bond_payments
WHERE payment_date > DATE'2022-04-01';

=> 300

By the end of the year, we can expect to receive an additional $300. Combined with the $250 that we’ve already received, we’ll receive $550 during the year. Not too shabby!

A complication arises

Here we are in mid-May, and an economic downturn may dash our plans. According to news reports, ACME Corp is entering bankruptcy court, and bondholders are expected to receive only 50% of their scheduled payouts. What’s more, Beta Industries announced on 2022-05-01 that they would miss the payment due that day.

As of 2022-05-15, here’s what our data looks like, including the change to the 2022-05-01 payment:

bond_payments

bond_id payment_date payment did_pay
1 2022-01-01 $100 Check
1 2022-04-01 $100 Check
1 2022-07-01 $100
1 2022-10-01 $100
2 2022-01-01 $25 Check
2 2022-03-01 $25 Check
2 2022-05-01 $0
2 2022-07-01 $25
2 2022-09-01 $25
2 2022-11-01 $25

Suddenly, our future income looks uncertain, but it’s not clear exactly how to query it. We could try a more complicated SQL expression:

SELECT SUM(CASE
  WHEN b.company_name = 'ACME Corp' THEN 0.5 * bp.payment
  ELSE bp.payment
  END) AS future_income
FROM bond_payments bp
JOIN bonds b
  ON bp.bond_id = b.id
WHERE bp.payment_date > DATE'2022-05-15';

=> 175

Since we UPDATEd some data in place ($25 -> $0 for the scheduled payment on 2022-05-01), we have irreparably limited our future queries–all knowledge of the originally scheduled payment amount is now lost. This loss of data is unacceptable.

We can circumvent this problem with some extra database columns, but we’ll also show that as more fields change, the complexity grows, making it harder and harder to query. What we’ll need is a more rigorous model.

bond_payments

bond_id payment_date payment did_pay updated_at updated_payment
1 2022-01-01 $100 Check
1 2022-04-01 $100 Check
1 2022-07-01 $100 2022-05-15 $50
1 2022-10-01 $100 2022-05-15 $50
2 2022-01-01 $25 Check
2 2022-03-01 $25 Check
2 2022-05-01 $25
2 2022-05-01 $25 2022-05-01 $0
2 2022-07-01 $25
2 2022-09-01 $25
2 2022-11-01 $25
WITH t AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY bond_id, payment_date
    ORDER BY COALESCE(updated_at, DATE'0001-01-01') DESC
  ) AS rank
  FROM bond_payments
  WHERE payment_date > DATE'2022-05-15'
)
SELECT SUM(COALESCE(updated_payment, payment)) AS future_income
FROM t
WHERE rank = 1;

=> 175

That’s not the prettiest SQL in the world, but it does project our future income, and it doesn’t require overwriting existing data.

An announcement

On 2022-06-30, ACME Corp announces that they have reached a deal with creditors to pay 75% of their debt on the original schedule. For investors who expected to receive only 50%, this is welcome news!

Separately, Beta Industries announced that they expect to make scheduled payments in full, and they will make the missed May payment on 2022-07-15. This poses a problem for our traditional relational schema: how do we model the new payment date without losing knowledge of the originally scheduled date? Let’s try adding another column, updated_payment_date, and projecting our future income.

bond_payments

bond_id payment_date payment did_pay updated_at updated_payment updated_payment_date
1 2022-01-01 $100 Check
1 2022-04-01 $100 Check
1 2022-07-01 $100 2022-05-15 $50
1 2022-07-01 $100 2022-06-30 $75
1 2022-10-01 $100 2022-05-15 $50
1 2022-10-01 $100 2022-06-30 $75
2 2022-01-01 $25 Check
2 2022-03-01 $25 Check
2 2022-05-01 $25
2 2022-05-01 $25 2022-05-01 $0
2 2022-05-01 $25 2022-06-30 $25 2022-07-15
2 2022-07-01 $25
2 2022-09-01 $25
2 2022-11-01 $25
WITH t AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY bond_id, payment_date
    ORDER BY COALESCE(updated_at, DATE'0001-01-01') DESC
  ) AS rank
  FROM bond_payments
  WHERE COALESCE(updated_payment_date, payment_date) > DATE'2022-06-30'
)
SELECT SUM(COALESCE(updated_payment, payment)) AS future_income
FROM t
WHERE rank = 1;

=> 250

These queries are getting progressively harder to write, to read, and to maintain. Moreover, our table structure is rapidly deteriorating; it seems that every field in our table will need a corresponding updated_ field to capture new information while retaining previous knowledge.

Discussion

So far, we’ve shown that it’s possible to model our scenario this way, but it’s difficult to query, let alone to audit. For example, imagine that we want to revisit our projections from April. Can we reproduce them?

WITH t AS (
  SELECT CASE
    WHEN updated_at IS NULL THEN payment
    WHEN updated_at > DATE'2022-04-01' THEN payment
    ELSE updated_payment
  END AS expected_payment, ROW_NUMBER() OVER (
    PARTITION BY bond_id, payment_date
    ORDER BY COALESCE(updated_at, DATE'0001-01-01') DESC
  ) AS rank
  FROM bond_payments
  WHERE COALESCE(updated_payment_date, payment_date) > DATE'2022-04-01'
    AND (updated_at IS NULL OR updated_at < DATE'2022-04-01')
)
SELECT SUM(expected_payment) AS future_income
FROM t
WHERE rank = 1;

=> 300

That produces the expected answer, but it is difficult to reason about.

A principled model

The above produces correct answers with the extra columns because we don’t overwrite any data; if we can write appropriate SQL, then we can reproduce old results. This append-only style is also used in bitemporal modeling. In a bitemporal database, we only append to the historical record, and once written, the view of past data never changes.

Another improvement that bitemporal modeling makes is that it is explicit about the time bounds for each piece of data. In the example above, it is challenging to reason about data because the time bounds are implicit: the data is valid until a new record is inserted with a larger updated_at. To make this easier to query, we simply need to add an update_valid_until timestamp (or to treat updated_at as an interval type) so that queries may explicitly state, e.g., myTime BETWEEN updated_at AND update_valid_until.

Lastly, a bitemporal database tracks both the transaction_time, which is when the database recorded each datum, and the valid_time, which is the time that a user assigned to a datum. Both times are treated as intervals so that both are easy to query. Having two separate timelines allows us to easily filter the data that we query to “what we want to know” (using the valid_time) and “when we want to know it” (using the transaction_time).

As a simple example, let’s revisit our income projection as we knew it on 2022-06-30, but this time, let’s use the bitemporal model:

bond_payments

bond_id scheduled_date payment tt_from tt_to vt_from vt_to
1 2022-01-01 $100 2022-01-01 Check 2022-01-01 Check
1 2022-04-01 $100 2022-01-01 Check 2022-04-01 Check
1 2022-07-01 $100 2022-01-01 2022-05-15 2022-07-01 Check
1 2022-07-01 $50 2022-05-15 2022-06-30 2022-07-01 Check
1 2022-07-01 $75 2022-06-30 Check 2022-07-01 Check
1 2022-10-01 $100 2022-01-01 2022-05-15 2022-10-01 Check
1 2022-10-01 $50 2022-05-15 2022-06-30 2022-10-01 Check
1 2022-10-01 $75 2022-06-30 Check 2022-10-01 Check
2 2022-01-01 $25 2022-01-01 Check 2022-01-01 Check
2 2022-03-01 $25 2022-01-01 Check 2022-03-01 Check
2 2022-05-01 $25 2022-01-01 2022-05-01 2022-05-01 Check
2 2022-05-01 $0 2022-05-01 2022-06-30 2022-05-01 Check
2 2022-05-01 $0 2022-06-30 Check 2022-05-01 2022-07-15
2 2022-05-01 $25 2022-06-30 Check 2022-07-15 Check
2 2022-07-01 $25 2022-01-01 Check 2022-07-01 Check
2 2022-09-01 $25 2022-01-01 Check 2022-09-01 Check
2 2022-11-01 $25 2022-01-01 Check 2022-11-01 Check
SELECT SUM(payment) AS future_income
FROM bond_payments_bt
WHERE tt_from <= DATE'2022-06-30'
  AND tt_to > DATE'2022-06-30'
  AND vt_from <= DATE'2022-06-30'
  AND vt_to > DATE'2022-06-30';

=> 250

Et voilĂ ! The bitemporal model has reproduced our previous results in a succinct query.

How about auditing our projections from 2022-05-15? That’s just as easy!

SELECT SUM(payment) AS future_income
FROM bond_payments_bt
WHERE tt_from <= DATE'2022-05-15'
  AND tt_to > DATE'2022-05-15'
  AND vt_from > DATE'2022-05-15';

=> 175

Lastly, let’s audit our projections from 2022-05-15, but this time, let’s use knowledge that we gained on 2022-06-30:

SELECT SUM(payment) AS future_income
FROM bond_payments_bt
WHERE tt_from <= DATE'2022-06-30'
  AND tt_to > DATE'2022-06-30'
  AND vt_from > DATE'2022-05-15';

=> 250

The previous two queries begin to reveal the power of the bitemporal model. By structuring our data in this way, it’s easy to see why we projected $175 based on what we knew on May 15th. Similarly, it’s easy to see why we projected $250 on June 30th. No matter how our data changes in the future, we will always be able to re-run these queries and get exactly the same results.

Next: learn more about what TwiceDB provides in what we offer.

Copyright © 2024 TwiceDB TwiceDB