Coerce a string into a uuid in Postgres
From one of my data sources, Segment, gives me event data with a GUID/UUID that looked like this when I queried it
id | received_at |
---|---|
DA588117-0F6A-41AA-B0B0-D0DA20F21745 | 2022-10-04 19:51:04.271+00 |
4ca41477-a9ed-4ea2-a357-7ba5c19cc27c | 2022-10-04 19:46:49.421+00 |
CDB9B507-44BC-4BEE-B9D8-3D230C59FF01 | 2022-10-04 19:45:45.545+00 |
783236a1-1a50-4191-a4f6-fceca77fc991 | 2022-10-04 19:45:41.645+00 |
BC6C1BC0-02D4-461A-9EDD-EBF98BF47574 | 2022-10-04 19:45:28.542+00 |
43F3C421-6CA9-40E0-AA79-AD36A85D5EF5 | 2022-10-04 19:45:19.204+00 |
F45C4B33-40CC-422F-B5F5-1982155656F7 | 2022-10-04 19:44:59.033+00 |
890b2d6b-6383-4af9-a556-ab9480ffc963 | 2022-10-04 19:44:54.164+00 |
7570242a-c646-4224-836a-73637e7bbeeb | 2022-10-04 19:44:54.164+00 |
26a14851-daee-4aee-b13d-08b06ea24287 | 2022-10-04 19:44:54.164+00 |
DE32C925-1528-4008-A404-DD38307C6138 | 2022-10-04 19:44:38.161+00 |
Querying this table was slow, resulting in dashboard views that took 60+ seconds to load, and I wanted a quick win to reduce the size of it. Interestingly, the casing wasn’t consistent, but that shouldn’t matter.
CREATE TABLE tmp_raw_table
SELECT
id as event_id,
received_at
FROM segment.event
LIMIT 100000;
CREATE TABLE tmp_raw_uuid
SELECT
id::uuid as event_id,
received_at
FROM segment.event
LIMIT 10000000;
SELECT pg_size_pretty( pg_total_relation_size('tmp_raw_table') );
SELECT pg_size_pretty( pg_total_relation_size('tmp_uuid_table') );
Running this to test my syntax, and the reported size of the differences in tables goes from 7488 kB to 5096 kB. Not bad for something as low-effort as coercing a type, so I tried to ship it to the full table, and then it failed…
ERROR: invalid input syntax for type uuid: "0qREGUQhbFNoRd-aeFacw"
In my experience, halting functions like these are nasty when used in a production ETL flow. Imagine if a malformed ID like that were introduced months later; now the flow is broken until developer time can be allocated to fix it. Gross. In some situations you want your code to fail-fast, but I just wanted a damned UUID. Or maybe null would have been acceptable. This post had a good snippet, so I turned that into a database function.
CREATE OR REPLACE FUNCTION to_uuid(raw text)
RETURNS uuid IMMUTABLE STRICT
AS $$
BEGIN
RETURN raw::uuid;
EXCEPTION WHEN invalid_text_representation THEN
RETURN uuid_in(overlay(overlay(md5(raw) placing '4' from 13) placing '8' from 17)::cstring);
END;
$$ LANGUAGE plpgsql;
There are a lot of good reasons to avoid database functions, but I think as long as they’re very concise, it shouldn’t be any more annoying than using built-in functions. I added that IMMUTABLE
bit to make sure it’s deterministic and fast/safe to use in indexes.
SELECT
to_uuid(id) as event_id,
received_at
FROM segment.event;