Saving Space by Mapping Big Objects to Small Integer IDs

[ 2020-April-19 21:24 ]

On rare occasions, you need to reduce the storage space for a big collection of data. One trick is to de-duplicate "big" objects and replace them with smaller integer IDs. This helps because you store only a single copy of each big object, and can refer to it with an integer that is smaller than a pointer. In this article, I'll describe how to implement this efficiently with a slightly unusual hash table. This is the first time in my career I've written a hash table that might actually get used in an application! (To be clear: as of writing, I am still considering it.)

My analysis is that converting objects to small integers saves space if the objects are reused often (around 5 times each for 8-byte objects, less if the objects are larger). If the objects are large (around 64 bytes) and used in a hash map, this optimization saves space even if the objects are not duplicated, because packing the objects in a dense array reduces the hash map overhead. In short, this is a useful trick on rare occasions.

The rest of this article goes into way too much detail about how this works, including modeling when it might make sense. It exists mostly so I could save my notes somewhere, and it is unlikely you will want to read it unless you need to de-duplicate objects. I also implemented an example in Go.

To make this a bit more concrete, let's say we have a big collection of N photos, each represented by a pointer or ID (P bytes). We need to record a string category label for each one (S bytes each, ignoring the variable-length part). A simple way to solve this problem is to create a hash map from Photo to string, which uses at least N(P+S) bytes, ignoring map overhead. However, if there are only a small number of unique labels, we can instead assign each label a small integer ID, say an int32. Then we can use a map from Photo to int32, which occupies N(P+4) bytes, plus some additional space to store the L unique labels. This optimization only makes sense if the cost of storing the label mapping is smaller than the savings. What is a good way to store the labels, and when does this optimization make sense?
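
As a sketch of the two layouts in Go (Photo here is a placeholder for whatever 8-byte pointer or ID type the application actually uses):

```go
package intern

// Photo stands in for whatever P-byte pointer or ID the application uses.
type Photo int64

// Simple version: N entries of (P + S) bytes, storing a label string
// with every photo.
var simple map[Photo]string

// Optimized version: the big map stores 4-byte IDs, and each unique
// label is stored once in a separate table.
var optimized map[Photo]int32 // N entries of (P + 4) bytes
var labelTable []string       // the L unique labels
```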

Using the standard library

The simple approach is to have an array of L strings for the ID-to-string mapping, which takes L×S space. We also need a map from string to int32 to find the ID for a label, which takes L(S+4) space (again ignoring overhead), for a total of L(2×S+4) bytes. This potential optimization saves N(S-4) bytes in the "big map", but requires an additional L(2×S+4) bytes to store the labels. Let's assume strings take 8 bytes (S=8, the size of an ID or pointer). Then we save 4N bytes and spend 20L bytes, so the optimization makes sense when N ≥ 5L. This means each label needs to be duplicated at least 5 times on average for this to save memory. If the objects are bigger, you need less duplication (e.g. 3 times for 16-byte objects). To pick some realistic numbers: if we have 100k photos and 10k distinct labels, the "big map" takes 1600k bytes, but the ID version takes 1400k bytes, which is a 12.5% savings.
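
In Go, this standard library version might look something like the following sketch (the names are mine, not from a real API):

```go
package intern

// InternTable maps labels to small integer IDs using only the standard
// library. Note each label string is stored twice: once in the slice
// and once as a map key.
type InternTable struct {
	labels []string         // ID → label
	ids    map[string]int32 // label → ID
}

func NewInternTable() *InternTable {
	return &InternTable{ids: make(map[string]int32)}
}

// Intern returns the ID for label, assigning the next ID the first
// time a label is seen.
func (t *InternTable) Intern(label string) int32 {
	if id, ok := t.ids[label]; ok {
		return id
	}
	id := int32(len(t.labels))
	t.labels = append(t.labels, label)
	t.ids[label] = id
	return id
}

// Lookup converts an ID back to its label: a single array index.
func (t *InternTable) Lookup(id int32) string {
	return t.labels[id]
}
```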

The savings comes at the cost of extra memory reads and CPU time to convert to and from IDs. Converting from an ID to a string is an array lookup, so it is fairly efficient (one random memory read). Converting from a label to its ID is a map lookup, so it is more expensive, but still fairly cheap.

This mapping with a hash table and an array is straightforward to implement and should be pretty efficient. I found that GNOME's GLib contains "Quarks", which seem to be implemented exactly this way, which is some evidence that this is useful. However, there is an annoyance: it stores each string twice, once in the map from string to ID, then again in the array that converts from ID to string. We can fix this by implementing our own hash table that only stores indexes.

Custom hash table using only indexes

To reduce the duplication, we store the strings only once, in the array of size L. To convert from string to ID, we build a hash table that stores indexes, rather than the strings themselves. We hash the string to a table slot as usual, but that slot contains an index instead of a string; we then load the corresponding string from the array to do the key comparison. Unfortunately, this is an unusual use of a hash table, so most standard libraries do not make it possible. However, we can build a pretty simple hash table for this use case (about 100 lines of code).

This version only needs to store an array of int32 indexes for the hash map, which is L×4 bytes (ignoring overhead), instead of L(S+4). This means our total overhead, for the hash table plus the array, is L(S+4) instead of L(2×S+4). For 8-byte strings, we still save 4N bytes from the "big map", but our ID lookup now only costs us 12L bytes. This means the optimization makes sense when N ≥ 3L, or an average of 3 duplicates per label, instead of 5 for the previous approach. For our example scenario of 100k photos with 10k labels, this saves 280k bytes, which is a 17.5% reduction, or an extra 80k bytes (5%) compared to the simpler approach.
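
Here is a rough sketch of the idea in Go. This is my own simplified version, not the roughly 100-line implementation described above: it uses linear probing, a fixed power-of-two table size, and omits growing the table when it fills up.

```go
package intern

import "hash/fnv"

const emptySlot = -1

// IndexTable stores each label once, in labels. The hash table itself
// holds only int32 indexes into that slice.
type IndexTable struct {
	labels []string
	slots  []int32 // hash table of indexes; length must be a power of two
}

func NewIndexTable(tableSize int) *IndexTable {
	t := &IndexTable{slots: make([]int32, tableSize)}
	for i := range t.slots {
		t.slots[i] = emptySlot
	}
	return t
}

func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Intern returns the ID for label, inserting it if it is new.
func (t *IndexTable) Intern(label string) int32 {
	mask := uint32(len(t.slots) - 1)
	slot := hashString(label) & mask
	for t.slots[slot] != emptySlot {
		// The slot holds only an index: load the string from the
		// array to compare keys (this is the extra memory read).
		if t.labels[t.slots[slot]] == label {
			return t.slots[slot]
		}
		slot = (slot + 1) & mask // linear probing
	}
	id := int32(len(t.labels))
	t.labels = append(t.labels, label)
	t.slots[slot] = id
	return id
}

// Lookup converts an ID back to a string, same as before.
func (t *IndexTable) Lookup(id int32) string {
	return t.labels[id]
}
```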

As usual, nothing is free. This additional savings means that hash table lookups when converting from a string to an ID are more expensive, because they now require an additional memory read from a separate array. However, my limited benchmarks show that my implementation is basically the same speed as the Go standard library for this specific use case, so the CPU cost should be negligible. The mapping from ID to string is exactly the same as the standard library approach. As a result, if you are going to implement an ID mapping, I suggest you use this approach. It requires a small custom hash table, but will always use less storage.

Including map load factor

The previous analysis ignored the map load factor. Hash tables need additional "empty" slots to reduce the number of hash collisions. For example, Go grows its maps at a load factor of 0.8125, which means the backing array is between 1.23× (= 1/0.8125, just before growing) and 2.46× (just after doubling) the size of the key/value pairs it stores. For large objects, this map overhead can be larger than the space needed to store the ID mapping, so you may not need any duplication at all for this optimization to make sense!

To include the map overhead, the "big map" from Photo to string stores N key/value pairs, which requires N(P+S) bytes of data, so a minimum of 1.23N(P+S) bytes with overhead. This means changing the value type to int32 saves a minimum of 1.23N(S-4) bytes. The ID mapping requires the hash table of int32 indexes, which costs a maximum of 2.46×4×L = 9.84L bytes with overhead, plus the array of values (L×S bytes). Rearranging 1.23N(S-4) ≥ 9.84L + L×S: if S = 8, this optimization is profitable when N ≥ 3.62L; if S = 16, the necessary ratio falls to N ≥ 1.75L. This means the optimization is a bit better once we include map overhead. To reuse our previous example, a single map with 100k photo/label pairs takes at least 1968k bytes including overhead. The optimized version reduces the map to 1476k bytes, plus 138k for the ID mapping, for a total of 1614k bytes, an 18.0% reduction. In the "best case", where the big map is very sparse but the ID map is dense, the savings is 22.7%.
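
As a sanity check on the break-even algebra, here is a tiny program that solves 1.23N(S-4) ≥ 9.84L + L×S for the ratio N/L (the figures above appear to use slightly different rounding):

```go
package main

import "fmt"

func main() {
	// Solve 1.23*N*(S-4) >= 9.84*L + L*S for N/L, the average number
	// of uses per unique object needed to break even.
	for _, s := range []float64{8, 16, 64} {
		ratio := (9.84 + s) / (1.23 * (s - 4))
		fmt.Printf("S = %2.0f bytes: profitable when N >= %.2f L\n", s, ratio)
	}
	// Prints roughly 3.63, 1.75, and 1.00: with 64-byte objects, even
	// worst-case table overhead needs no duplication at all.
}
```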

Effectively, this optimization stores the objects in a dense array, then puts small integer IDs in the "sparse" hash map, so it avoids paying the map overhead on the objects themselves. It turns out that if the objects are big enough, that savings is more than the cost of the extra int32 hash table. By my math, with the Go map load factor and both maps at their densest, this needs 1.23(S-4) ≥ 1.23×4 + S, which holds when the objects are at least 43 bytes. If the objects are that large, this optimization will always save space, even with no duplication at all.