Lock contention in PostgreSQL

At deriv.com, our clients can trade binary options. We keep all the related information in a PostgreSQL database. The life cycle of such an option begins when a client buys it; in our language, this is called an open position. Later, the client sells the position back to us, and the difference in price is their profit or loss. A sold option is then called a closed position. When a position is closed, it has reached its final state.

We have translated that behavior into two tables in the database: one for open positions and one for closed positions. The open position table is naturally quite busy. Open positions are inserted all the time, and when a position is closed, it is deleted from the open position table and appended to the closed position table. That way, the open position table remains relatively small, up to roughly 100,000 rows. The closed position table, although quite large at about 10^9 rows, has a nice natural order.
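
Moving a position from one table to the other can be done in a single statement. The following is only a sketch of that step; the table names, the id column, and the assumption that both tables share the same column layout are illustrative, not taken from our actual schema:

WITH del AS (
    DELETE FROM open_positions      -- assumed name of the open position table
     WHERE id = $1                  -- the position being closed
    RETURNING *
)
INSERT INTO closed_positions        -- assumed name of the closed position table
SELECT * FROM del;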

As a responsible company, we have to maintain our risk exposure as well as provide our clients with tools to manage their risk. Hence, we have all sorts of trading limits. This article is about one particular limit we have introduced recently.

The problem

Up to this point, we had only client-specific limits. That way, lock contention was avoided naturally. Now, we wanted to limit our overall exposure as a company. Translated into database terms, that means for each option that is to be bought, we need to aggregate over all open positions and calculate how much the company would lose if all open positions are closed with the maximum profit for our clients.

The first, naïve way to do that, SELECT sum(...) FROM open_positions, was ruled out quickly. Even though the table holds only a few tens of thousands of rows, that query takes hundreds of milliseconds, and it would be added to the overall transaction time of every single buy. That was too long.
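
Spelled out with the column and function names that appear further down in this article (the table name open_positions is an assumption), the naïve query would look roughly like this:

SELECT sum(o.payout_price * e.rate)
  FROM open_positions o
  JOIN account a ON a.id = o.account_id
 CROSS JOIN exchangeToUSD_rate(a.currency_code) e;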

Maintaining aggregates with triggers

The obvious solution was to maintain the aggregates in a separate table:

CREATE TABLE open_contract_aggregates (
    currency_code VARCHAR(3) PRIMARY KEY,
    payout        NUMERIC    NOT NULL,
    cnt           BIGINT     NOT NULL
);

Two triggers on the open position table would then update this table every time an option is bought or sold.

CREATE FUNCTION update_open_contract_aggregates_buy()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO open_contract_aggregates AS a (currency_code, payout, cnt)
  SELECT currency_code, NEW.payout_price, 1
    FROM account
   WHERE id=NEW.account_id
      ON CONFLICT (currency_code)
      DO UPDATE SET payout = a.payout + EXCLUDED.payout,
                    cnt = a.cnt + EXCLUDED.cnt;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION update_open_contract_aggregates_sell()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO open_contract_aggregates AS a (currency_code, payout, cnt)
  SELECT currency_code, -OLD.payout_price, -1
    FROM account
   WHERE id=OLD.account_id
      ON CONFLICT (currency_code)
      DO UPDATE SET payout = a.payout + EXCLUDED.payout,
                    cnt = a.cnt + EXCLUDED.cnt;

  RETURN OLD;
END;
$$ LANGUAGE plpgsql;
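
For completeness, attaching the functions as row-level AFTER triggers would look roughly like this; the trigger names and the open_positions table name are assumptions:

CREATE TRIGGER open_contract_aggregates_buy
  AFTER INSERT ON open_positions
  FOR EACH ROW
  EXECUTE PROCEDURE update_open_contract_aggregates_buy();

CREATE TRIGGER open_contract_aggregates_sell
  AFTER DELETE ON open_positions
  FOR EACH ROW
  EXECUTE PROCEDURE update_open_contract_aggregates_sell();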

Since we are still on Postgres 9.6 and are dealing (mostly) with single-row operations, row-level triggers seemed good enough.

Calculating the company's overall risk would then look like this:

SELECT sum(o.payout * e.rate)
  FROM open_contract_aggregates o
 CROSS JOIN exchangeToUSD_rate(o.currency_code) e;

We are offering binary options only in a handful of currencies. So, this was pretty fast.
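
exchangeToUSD_rate is a helper of ours that returns the conversion rate into USD for a given currency. A minimal sketch of such a function, assuming a hypothetical exchange_rate table with one current rate per currency, might look like this:

CREATE FUNCTION exchangeToUSD_rate(p_currency VARCHAR)
RETURNS TABLE (rate NUMERIC) AS $$
    -- assumption: exchange_rate holds the latest rate_to_usd per currency
    SELECT r.rate_to_usd
      FROM exchange_rate r
     WHERE r.currency_code = p_currency;
$$ LANGUAGE sql STABLE;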

Lock contention

However, when we took this solution into production, we saw occasional spikes in our transaction times. A couple of queries like select * from pg_locks where not granted; revealed that we had a lock contention problem.

Of course: every transaction now updates, and thus locks, one row out of a very small set of rows. Concurrent buys and sells in the same currency therefore have to wait for each other.
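
To see who is waiting and who is blocking, a slightly more detailed diagnostic than the bare pg_locks query can help (this is generic PostgreSQL, nothing specific to our schema):

-- sessions waiting for a lock, and the sessions blocking them;
-- for row-level contention the waiters typically show up
-- waiting for a transactionid lock held by the blocker
SELECT l.pid,
       pg_blocking_pids(l.pid) AS blocked_by,
       l.locktype,
       now() - a.query_start   AS waiting_for,
       a.query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
 WHERE NOT l.granted;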

Avoid locking

To avoid the type of locking problem that bit us here, we'd have to avoid the UPDATE operation in the trigger. INSERT is quite harmless from a locking perspective. The new row is not seen by any other transaction until committed.

So, the idea was to append to the aggregate table instead of updating it. Then, from time to time, some sort of cron job would sum up the existing rows and collapse them into one aggregate row per currency. That way, the table would remain quite small; my gut feeling was that about 100 rows would be a good size.

The aggregate table now looks like this:

CREATE TABLE open_contract_aggregates (
    currency_code VARCHAR(3),
    is_aggregate  BOOLEAN DEFAULT false,
    payout        NUMERIC    NOT NULL,
    cnt           BIGINT     NOT NULL
);
CREATE UNIQUE INDEX open_contract_aggregates_uk
    ON open_contract_aggregates (currency_code)
 WHERE is_aggregate;

The partial unique index allows us to keep using the INSERT ... ON CONFLICT DO UPDATE upsert style. That way, there is no need to pre-populate the aggregates.

The trigger functions then look like this:

CREATE FUNCTION update_open_contract_aggregates_buy()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO open_contract_aggregates AS a (currency_code, payout, cnt)
  SELECT currency_code, NEW.payout_price, 1
    FROM account
   WHERE id=NEW.account_id;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION update_open_contract_aggregates_sell()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO open_contract_aggregates AS a (currency_code, payout, cnt)
  SELECT currency_code, -OLD.payout_price, -1
    FROM account
   WHERE id=OLD.account_id;

  IF random() < 0.05 THEN
      WITH del AS (
          DELETE FROM open_contract_aggregates WHERE NOT is_aggregate
          RETURNING *
      )
      , add AS (
          SELECT currency_code,
                 sum(payout) AS payout,
                 sum(cnt) AS cnt,
                 true AS is_aggregate
            FROM del
           GROUP BY 1
      )
      INSERT INTO open_contract_aggregates AS a
                  (currency_code, payout, cnt, is_aggregate)
      SELECT *
        FROM add
          ON CONFLICT (currency_code) WHERE is_aggregate
          DO UPDATE
         SET payout = a.payout + EXCLUDED.payout
           , cnt = a.cnt + EXCLUDED.cnt;
  END IF;

  RETURN OLD;
END;
$$ LANGUAGE plpgsql;

Somewhere along the way, I realized the cron job could be avoided. Selling options back is much less performance-critical than buying, so doing the aggregation every now and then after a sell transaction is probably not noticeable. With the random() < 0.05 condition, the aggregation runs after every 20th sell transaction on average.

How does it do?

Monitoring the aggregates table

To monitor the aggregates table, I used this command for almost a day:

sql='select count(*) from open_contract_aggregates'
psql service=... -XAtF $'\t' <<<"$sql"'\watch 0.5' \
     >open_contract_aggregates.rowcount&

This gives a column of numbers with roughly 2 samples per second. The highest number of rows recorded was 382.

To get a more detailed view, I produced a diagram by dividing the interval between 0 and the maximum row count (I took 400 for simplicity) into 50 buckets and counting the number of samples per bucket.
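
The bucketing itself is a one-liner. This is just a sketch of how it can be done (50 buckets of width 8 over the collected file); any tool that can build a histogram will do:

awk '{ b = int($1 / 8); c[b]++ }
     END { for (i = 0; i < 50; i++) print i * 8, c[i] + 0 }' \
    open_contract_aggregates.rowcount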

[Figure: distribution of the row-count samples]

How does it affect our buy transaction timing?

Remember, the initial reason for all of this was that we wanted to assess the company's risk every time we sell an option to a client. The query to do this is still the same as in the original design; only now it has to iterate over 43 rows on average.

SELECT sum(o.payout * e.rate)
  FROM open_contract_aggregates o
 CROSS JOIN exchangeToUSD_rate(o.currency_code) e;

Since the time this query takes is added to every single buy transaction, it's a good idea to know how much that is.

I think the most accurate way to measure the execution time is to use EXPLAIN ANALYZE. To fetch just the execution time, I used the dblink technique explained here.

The complete command looked like this:

sql='select (x->0->>$$Execution Time$$)::double precision
       from dblink($$dbname=...$$,
                   $$explain (analyze,format json) $$ ||
                   $$SELECT sum(o.payout * e.rate)
                       FROM open_contract_aggregates o
                       CROSS JOIN exchangeToUSD_rate(o.currency_code) e
                   $$) t(x json)
'
psql service=... -XAtF $'\t' <<<"$sql"'\watch 0.5' \
     >open_contract_aggregates.execution_time&

As before, this produces roughly 2 samples per second, and it ran for the better part of a day.
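
The headline numbers can be pulled from the collected file with another one-liner; again, this is just one way to do it:

awk '{ s += $1; if ($1 > m) m = $1 }
     END { printf "avg %.3f ms, max %.3f ms\n", s / NR, m }' \
    open_contract_aggregates.execution_time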

The average execution time was 0.955 milliseconds. That looks good.

The longest execution time, though, was 54.673 milliseconds. That is quite a lot. However, the picture gets better if we look at the details. Initially, I wanted to use the same approach as for the row count, but with only 50 buckets basically all the samples fall into the first two. Eventually, I divided the whole interval into 400 buckets of 0.15 milliseconds each.

The following picture contains the statistics up to 10 milliseconds duration. Beyond that, the graph is indistinguishable from the x-axis.

[Figure: distribution of the execution-time samples]

Is there anything misbehaving?

By now, after more than a week in production, the table size on disk has grown to about 350 kB and the index to 16 kB. That means both the table and the index carry some bloat, but their sizes remain constant, so autovacuum is able to keep up.
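
The on-disk sizes can be checked with pg_relation_size (the index name is the one from the DDL above):

SELECT pg_size_pretty(pg_relation_size('open_contract_aggregates'))    AS table_size,
       pg_size_pretty(pg_relation_size('open_contract_aggregates_uk')) AS index_size;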

Note: This post has been ported from https://tech.binary.com/ (our old tech blog)