Postgres large sub-string query performance

about | archive

[ 2022-February-27 16:47 ]

Following up on my last post about large JSON queries, I also benchmarked sub-string queries on large variable-length strings. I wanted to check if sub-string queries might be faster than HSTORE or JSONB key lookups. I tested both binary (BYTEA) and Unicode text (TEXT). Unfortunately, Postgres sub-string queries are about 6× slower than HSTORE or JSONB key queries. I also learned that Postgres regular expressions are extremely slow: about 3× slower than using LIKE. Queries on BYTEA are faster than queries on TEXT. Perhaps the most interestingly, TOAST inline compressed storage was faster than the uncompressed storage. However, similar to my results with JSON values, out-of-line storage using a separate TOAST table is quite a bit slower than inline storage.

I spent a bit of time looking at the code that implements Postgres's LIKE operator and the implementation of the POSITION function. I'm pretty sure they could be made quite a bit faster, at least for UTF-8 or binary strings. For example, the BYTEA implementation of POSITION is a function called byteapos. It implements the simple O(nm) implementation, where for each byte of the string, you compare it to the substring. There are much faster implementations that can use SIMD instructions. I think there is an opportunity to make POSITION() and LIKE substantially faster.

Query performance benchmark

For details on the benchmark setup, see my previous article. In this case, I tested TEXT and BLOB columns that are near the 2004 byte limit for inline uncompressed tuples. I measured querying substrings that either did not match, or matched at the very end, which should represent the "worst case" search times. I tested LIKE, regular expressions (~ operator), and the POSITION() function. The entire workload was in memory, and parallel queries were disabled. For more details, see the benchmark source code in Github, which links to a Google sheet with the raw results.

Fastest query times (ms)

This table shows the fastest query times for each of the column types and sizes. In general, BYTEA is faster than TEXT, and using LIKE is better than using POSITION, because it avoids strangly slow queries for TEXT columns that match.

inline uncompressed inline compressed toast uncompressed toast compressed
TEXT 3262 2900 6325 10623
BYTEA 2067 1803 5010 7054

Random observations