Huge Pages are a Good Idea (evanjones.ca)

[ 2023-January-16 11:46 ]

Nearly all programs are written to access virtual memory addresses, which the CPU must translate to physical addresses. These translations are usually fast because the mappings are cached in the CPU's Translation Lookaside Buffer (TLB). Unfortunately, virtual memory on x86 has used a 4 kiB page size since the 386 was released in 1985, when computers had a bit less memory than they do today. Also unfortunately, TLBs are pretty small because they need to be fast. For example, AMD's Zen 4 Microarchitecture, which first shipped in September 2022, has a first level data TLB with 72 entries, and a second level TLB with 3072 entries. This means when an application's working set is larger than approximately 4 kiB × 3072 = 12 MiB, some memory accesses will require page table lookups, multiplying the number of memory accesses required. This is a brand-new CPU, with one of the biggest TLBs on the market, so most systems will be worse. Using larger virtual memory page sizes (aka huge pages) can reduce page mapping overhead substantially. Since RAM is so much larger than it was in 1985, a larger page size seems like obviously a good idea to me.

In 2021, Google published a paper about making their malloc implementation (TCMalloc) huge page aware (called Temeraire). They report this improved average requests-per-second throughput across their fleet by 7%, by increasing the amount of memory that is backed by huge pages. This made me curious about the "best case" performance benefits. I wrote a small program that allocates 4 GiB, then randomly reads uint64 values from it. On my Intel 11th generation Core i5-1135G7 (Tiger Lake) from 2020, using 2 MiB huge pages is 2.9× faster. I also tried 1 GiB pages, which is 3.1× faster than 4 kiB pages, but only 8% faster than 2 MiB pages. My conclusion: Using madvise() to get the kernel to use huge pages seems like a relatively easy performance win for applications that use a large amount of RAM.

Unfortunately, using larger pages is not without its disadvantages. Notably, when the Linux kernel's transparent huge page implementation was first introduced, it was enabled by default, which caused many performance problems. See the section below for more details. Today's default to use huge pages only for applications that opt-in (aka madvise) should improve this. The kernel's policies for managing huge pages have also changed since then, and are hopefully better now. At the very least, the fact that Google uses transparent huge pages for all their applications is some evidence that this can work for a wide variety of workloads.

The second problem with larger page sizes is software incompatibility, since so much software is only tested on x86 with 4 kiB pages. Linux on ARM64 used to default to 64 kiB pages. However, this caused many problems (e.g. dotnet, Go, Chrome, jemalloc, Asahi Linux list of broken software). It appears that around 2020 most distributions switched to 4 kiB pages to avoid these problems (e.g. RedHat RHEL9 change in 2021, Ubuntu note about the page size change).

Page size historical details

Other CPU architectures have made different page size choices. Notably, iOS and Mac OS X on ARM64 uses 16 kiB pages (ARM64 aka aarch64 supports 4, 16, and 64 kiB pages, although specific CPUs will only support some of them). Alpha and Sparc used 8 kiB pages. PowerPC on Linux uses 64 kiB pages, although Redhat/Fedora are considering switching to 4 kiB due to the same compatibility issues. See page sizes used by Windows on various processors.

Latency and throughput problems with transparent huge pages

The Linux kernel's implementation of transparent huge pages has been the source of performance problems. When introduced, it was initially enabled for all processes and memory regions by default. This caused a large number of problems, which eventually caused the kernel's default to change to madvise, where programs have to opt-in to use huge pages (see Nelson Elhage's summary (2017), and Ubuntu bug that changed the default (2017/released 2019).

The performance problems are rare high latency (e.g. operations being substantially slower than normal), throughput issues due to excess CPU consumption of the kernel background tasks, or substantial increases in memory usage. Some examples are Hadoop (2012), TokuDB/MySQL (2014), Redis/jemalloc (2015), TiKV/TiDB (2020). The problems seem to fall into the following categories:

Increasing memory usage by making fragmentation worse: using transparent huge pages rounds allocations up to 2 MiB. If an application allocates many separate memory regions, this can cause lots of memory to be wasted. Most of the problems have been where an application uses a large amount of memory, then frees a lot of it, leaving "holes" in the large pages. Sometimes the kernel's transparent page policy can decide to turn these back into huge pages, which causes the memory usage to increase. For example, see a Go bug (2015) and the corresponding kernel bug report (2015). The fix for Go was to only return memory on huge page granularity. This also happened to Redis with jemalloc (2015)
malloc implementations that are not huge page aware may add more kernel CPU overhead: When returning memory to the operating system, if the memory allocator is not aware of huge pages, it may return part of a huge page. This causes the kernel to split the huge page back into separate 4 kib pages. This adds overhead, and also fragments memory, making fewer huge pages available, causing the kernel to do more work the next time it tries to allocate a huge page. This article about TokuDB from 2014 suggests that it ran into this problem with jemalloc. The good news is that it now seems like all major malloc implementations (jemalloc, tcmalloc, mimalloc, and glibc malloc) all have some huge page support, which should make this less bad.
slow memory allocations due to fragmentation (latency): When trying to allocate a huge page, the kernel may spend time moving memory around to free up a page. See a detailed thread about impacts on the JVM (2017). The kernel's current default is to only do this for regions that have opted in with madvise. This should mean that other processes won't be penalized too much by this, but it does mean the process that called madvise could be stalled briefly when allocating new pages. One way to avoid this is to immediately touch every huge page in an allocation, to cause the cost to happen up front. This would work well for allocations that are made at program startup, such as caches.
fork() e.g. Redis: Calling fork marks all of the process's pages as copy-on-write. Then when a single byte on a page is modified, the page must be copied. Redis uses fork to create a read-only "snapshot" of memory, when writing a checkpoint to disk. Since huge pages are 512X larger than "normal" pages, the time to copy a page increases by 512X. It also means the memory usage is higher, since modifying a single byte causes 2 MiB to be copied, instead of only 4 kiB. Using fork() in this way with huge pages seems like a bad idea. See details about a workload that causes this behavior (2014).

References

Huge Page Demo Evan Jones 2022-01-18: My huge page demonstration program.
Larger Pages: Richard Sites 2022-05-06: Argues we should increase the minimum page size to 64 kiB, and maintain compatibility by using access flags on 4 kiB sub-pages.
Stack Overflow: Why is the page size 4 KB? Answer by Hadi Brais 2018-04-26: a great look at the history of why 4 kiB pages were chosen.
Using huge pages on Linux: Erik Rigtorp 2020-10-08: A hash table benchmark in C++ with results for both transparent and explicit huge pages.
Reliably allocating huge pages in Linux: Francesco Mazzoli 2021-11-22: Includes C code describing how to verify if an address is a huge page.
Intel Coffee Lake Microarchitecture (2017 aka Core 9th gen): L1 Data TLB: 64 entries for 4 kiB pages / 32 entries for 2 MiB pages / 4 entries for 1 GiB pages ; L2 unified TLB: 1536 for 4 kiB/2 MiB pages; 16 entries for 1 GiB pages.