Postgres large sub-string query performance (evanjones.ca)

[ 2022-February-27 16:47 ]

Following up on my last post about large JSON queries, I also benchmarked sub-string queries on large variable-length strings. I wanted to check if sub-string queries might be faster than HSTORE or JSONB key lookups. I tested both binary (BYTEA) and Unicode text (TEXT). Unfortunately, Postgres sub-string queries are about 6× slower than HSTORE or JSONB key queries. I also learned that Postgres regular expressions are extremely slow: about 3× slower than using LIKE. Queries on BYTEA are faster than queries on TEXT. Perhaps the most interestingly, TOAST inline compressed storage was faster than the uncompressed storage. However, similar to my results with JSON values, out-of-line storage using a separate TOAST table is quite a bit slower than inline storage.

I spent a bit of time looking at the code that implements Postgres's LIKE operator and the implementation of the POSITION function. I'm pretty sure they could be made quite a bit faster, at least for UTF-8 or binary strings. For example, the BYTEA implementation of POSITION is a function called byteapos. It implements the simple O(nm) implementation, where for each byte of the string, you compare it to the substring. There are much faster implementations that can use SIMD instructions. I think there is an opportunity to make POSITION() and LIKE substantially faster.

Query performance benchmark

For details on the benchmark setup, see my previous article. In this case, I tested TEXT and BLOB columns that are near the 2004 byte limit for inline uncompressed tuples. I measured querying substrings that either did not match, or matched at the very end, which should represent the "worst case" search times. I tested LIKE, regular expressions (~ operator), and the POSITION() function. The entire workload was in memory, and parallel queries were disabled. For more details, see the benchmark source code in Github, which links to a Google sheet with the raw results.

Fastest query times (ms)

This table shows the fastest query times for each of the column types and sizes. In general, BYTEA is faster than TEXT, and using LIKE is better than using POSITION, because it avoids strangly slow queries for TEXT columns that match.

	inline uncompressed	inline compressed	toast uncompressed	toast compressed
TEXT	3262	2900	6325	10623
BYTEA	2067	1803	5010	7054

Random observations

Filters (e.g. length > 100 or IS NOT NULL) make a query slightly slower than implicitly counting the values.
Queries that don't match anything are faster than queries that match everything.
Compressed inline values can make queries faster. This is likely because the test data I used compresses really well (the entire table is 73 MiB compressed versus nearly 2 GiB uncompressed). Likely this is because of some combination of a large reduction in Postgres per-page overhead, and maybe also CPU memory caching.
Queries that don't touch the data (e.g. only checking the string length), are faster when stored out of line in a separate table. Again, this is likely due to per-page overhead and/or improved caching.
For POSITION() with TEXT columns: results where there is a match are 2-3X slower than results where there is not a match. This doesn't make sense to me, but seems to be fairly consistent