Obfuscation in PostgreSQL

Motivation

I recently worked on a project prohibiting using sequential IDs in its public API. It is a common requirement for many projects, as sequential IDs give a lot of information about the system.

Without obfuscation, an external user can get an idea of how many entities are in the system and how often they are created.

Let’s say we have a user registration system but forgot to set up a user ID obfuscation. In this case, an external user can quickly get an idea of how many users are registered per day by comparing the IDs of the newly created users at the beginning of the day and at the end.

Why not UUIDs?

Of course, I could use UUIDs instead of sequential IDs, but they have some drawbacks that I tried to avoid in this project:

UUIDs are not sortable by creation time
UUIDs require more space to be stored and indexed
UUID index can become fragmented, which can lead to suboptimal query performance

The solution could have been to use UUIDs for external usage, like API, and sequential IDs for internal usage.

But I was thinking about a way to use a single identifier for all purposes - internal referencing, sorting by time, and obfuscation. And luckily, I found a way to do that in the “Sharding & IDs at Instagram” article.

Solution

The solution is to build an integer ID from two parts:

Current timestamp in milliseconds
Serial ID generated by “PostgreSQL sequence”

All we need is to take these two parts and squeeze them into a single BIGINT field. That is how we can get an obfuscated ID that is sortable by creation time.

But let’s do the math first.

The length of BIGINT field in PostgreSQL is limited to 63 bits for positive values. Let’s divide these 63 bits between the timestamp and the serial ID.

The dilemma is:

The more bits we use for a timestamp, the longer we can generate IDs.
The more bits we use for serial ID, the more IDs we can generate per millisecond.

Let’s say we decided to use 53 bits for the timestamp in milliseconds and 10 bits for the serial ID:

Now, we have to answer two questions:

For how long can we generate IDs this way before we overflow 53 bits?

The answer is: 53 bits can store 9007199254740991 milliseconds or 285616 years.

How many serial IDs could be stored in 10 bits?

The answer is: 10 bits can store 1024 serial IDs.

In other words, we can generate 1024 IDs per millisecond for 285616 years long, in the following way:

and so on… so, yeah, 285616 years is definitely quite a lot :)

But enough with the theory, let’s practice.

Generic SQL helper

First of all, let’s kick the tire and calculate some IDs for the first millisecond of January 14, 2024 12:00:00 AM GMT ( timestamp in milliseconds: 1705190400000).

-- The first ID:
SELECT (1705190400000 << 10) | 0;
-- 1746114969600000

-- The second ID:
SELECT (1705190400000 << 10) | 1;
-- 1746114969600001

-- ...

-- The last ID:
SELECT (1705190400000 << 10) | 1023;
-- 1746114969601023

Now, we can reduce the current timestamp by subtracting a base timestamp from it. The base timestamp is like the beginning of time or a starting point for our application. Not to overthink, we can just put it like the beginning of the year: 1704067200000.

-- The first ID:
SELECT (1705190400000 - 1704067200000 << 10) | 0;
-- 1150156800000

As you can see, the first ID is smaller than the previous one, like 1518 times smaller:

1746114969600000 / 1150156800000 = 1518.15

Now, let’s wrap it up into a generic helper next_id() that could be used to generate IDs for different tables:

CREATE OR REPLACE FUNCTION next_id(seq_name TEXT, OUT result BIGINT) AS
$$
DECLARE
    base_timestamp_ms    BIGINT := 1704067200000;
    current_timestamp_ms BIGINT;
    seq_val              BIGINT;
BEGIN
    -- Get current timestamp in milliseconds.     
    SELECT FLOOR(EXTRACT(EPOCH FROM CLOCK_TIMESTAMP()) * 1000) INTO current_timestamp_ms;
    -- Get the next value from the specified sequence
    -- and reduce it to a range from 0 to 1023.
    SELECT (NEXTVAL(seq_name) - 1) % 1024 INTO seq_val;
    -- Combine them using the bitwise OR operator.  
    result := ((current_timestamp_ms - base_timestamp_ms) << 10) | (seq_val);
END;
$$ LANGUAGE plpgsql;

We have a helper, now let’s create a test table book which we want to obfuscate IDs for:

-- Create a sequence for book IDs.
CREATE SEQUENCE book_id_seq AS BIGINT;
-- Create the table.
CREATE TABLE book
(
    id    BIGINT NOT NULL UNIQUE DEFAULT next_id('book_id_seq'),
    title TEXT   NOT NULL
);

Let’s insert some books:

INSERT INTO book (title)
VALUES ('Book 1');
INSERT INTO book (title)
VALUES ('Book 2');

And here we are – our nice and shiny, sortable by time, obfuscated, and easy to reference integer IDs:

SELECT *
FROM book;
-- 1307850102784, Book 1
-- 1307850927105, Book 2

Worth to mention

It’s worth mentioning that timestamps we rely on in our ID obfuscation are not monotonic. They are subject to clock synchronization by the OS and can skip forwards and backwards.

It means that it is only a matter of time before we get a new ID, which is smaller than the previous one.

Is that a problem? Not really, or, I would say it depends on your case. Sometimes, a small disorder in IDs won’t hurt anyone, but sometimes it will. So it’s up to you to decide whether you can live with it or not.

Motivation#

Why not UUIDs?#

Solution#

Generic SQL helper#

Worth to mention#

Motivation

Why not UUIDs?

Solution

Generic SQL helper

Worth to mention