Huge Pages are a Good Idea

[ 2023-January-16 11:46 ]

Nearly all programs are written to access virtual memory addresses, which the CPU must translate to physical addresses. These translations are usually fast because the mappings are cached in the CPU's Translation Lookaside Buffer (TLB). Unfortunately, virtual memory on x86 has used a 4 kiB page size since the 386 was released in 1985, when computers had a bit less memory than they do today. Also unfortunately, TLBs are pretty small because they need to be fast. For example, AMD's Zen 4 microarchitecture, which first shipped in September 2022, has a first level data TLB with 72 entries, and a second level TLB with 3072 entries. This means that when an application's working set is larger than approximately 4 kiB × 3072 = 12 MiB, some memory accesses will miss the TLB and require page table lookups, multiplying the number of memory accesses required. This is a brand-new CPU, with one of the biggest TLBs on the market, so most systems will be worse. Using larger virtual memory page sizes (aka huge pages) can reduce page mapping overhead substantially. Since RAM is so much larger than it was in 1985, a larger page size seems like an obviously good idea to me.
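To make the arithmetic above concrete, here is a small Python sketch that computes the "TLB reach" (entries × page size) for the Zen 4 entry counts quoted above. The function names are mine, and the 2 MiB rows assume that the same number of entries can hold huge page mappings, which is not true of every TLB, so treat those rows as an upper bound.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory a TLB can map without a page table lookup."""
    return entries * page_size


def human(n: int) -> str:
    """Format a byte count using binary units."""
    for unit, size in (("GiB", GIB), ("MiB", MIB), ("kiB", KIB)):
        if n >= size:
            return f"{n / size:g} {unit}"
    return f"{n} B"


if __name__ == "__main__":
    # Entry counts are the Zen 4 figures quoted in the text.
    # Assumption: every entry can also hold a 2 MiB mapping.
    for name, entries in (("L1 data TLB", 72), ("L2 TLB", 3072)):
        for page in (4 * KIB, 2 * MIB):
            reach = tlb_reach(entries, page)
            print(f"{name}: {entries} x {human(page)} pages -> {human(reach)}")
```

Under that assumption, the second level TLB covers 12 MiB with 4 kiB pages and 6 GiB with 2 MiB pages.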

In 2021, Google published a paper about Temeraire, a redesign that makes their malloc implementation (TCMalloc) huge page aware. They report that this improved average requests-per-second throughput across their fleet by 7%, by increasing the amount of memory that is backed by huge pages. This made me curious about the "best case" performance benefits. I wrote a small program that allocates 4 GiB, then randomly reads uint64 values from it. On my Intel 11th generation Core i5-1135G7 (Tiger Lake) from 2020, using 2 MiB huge pages is 2.9× faster than using 4 kiB pages. I also tried 1 GiB pages, which is 3.1× faster than 4 kiB pages, but only 8% faster than 2 MiB pages. My conclusion: Using madvise() to get the kernel to use huge pages seems like a relatively easy performance win for applications that use a large amount of RAM.
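As a rough illustration of the madvise() approach (this is not the benchmark program described above), here is a minimal Python sketch that maps anonymous memory and hints that it should be backed by huge pages. It uses the standard library's mmap.madvise(), available since Python 3.8. The helper name and the 2 MiB constant are my assumptions, and the hint is skipped on platforms that do not define MADV_HUGEPAGE.

```python
import mmap

# Assumption: 2 MiB, the transparent huge page size on x86-64.
HUGE_PAGE_SIZE = 2 * 1024 * 1024


def alloc_with_huge_page_hint(size: int):
    """Map anonymous memory and ask the kernel to back it with huge pages.

    Returns (buffer, advised), where advised says whether the hint was
    accepted. The kernel is still free to use 4 kiB pages.
    """
    # Round up to a whole number of huge pages.
    size = -(-size // HUGE_PAGE_SIZE) * HUGE_PAGE_SIZE
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    advised = False
    hint = getattr(mmap, "MADV_HUGEPAGE", None)  # only defined on Linux
    if hint is not None:
        try:
            buf.madvise(hint)
            advised = True
        except OSError:
            # For example, a kernel built without transparent huge pages.
            pass
    return buf, advised


if __name__ == "__main__":
    buf, advised = alloc_with_huge_page_hint(64 * 1024 * 1024)
    buf[0] = 1  # touching the memory is what actually allocates pages
    print(f"mapped {len(buf)} bytes, huge page hint accepted: {advised}")
```

The mapping is not guaranteed to start on a 2 MiB boundary, so the first and last parts of the region may still use small pages. On Linux, the AnonHugePages field in /proc/self/smaps shows how much of a mapping ended up backed by huge pages.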

Unfortunately, larger pages have disadvantages. Notably, when the Linux kernel's transparent huge page implementation was first introduced, it was enabled by default, which caused many performance problems. See the section on latency and throughput problems below for more details. Today's default, which uses huge pages only for applications that opt in (the madvise setting), should reduce these problems. The kernel's policies for managing huge pages have also changed since then, and are hopefully better now. At the very least, the fact that Google uses transparent huge pages for all their applications is some evidence that this can work for a wide variety of workloads.

The second problem with larger page sizes is software incompatibility, since so much software is only tested on x86 with 4 kiB pages. Linux on ARM64 used to default to 64 kiB pages. However, this caused many problems (e.g. dotnet, Go, Chrome, jemalloc, and Asahi Linux's list of broken software). It appears that around 2020 most distributions switched to 4 kiB pages to avoid these problems (e.g. the Red Hat RHEL 9 change in 2021, and Ubuntu's note about the page size change).
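One plausible source of this kind of breakage (my assumption; the linked bug reports have the real causes) is code that hard-codes 4096 as the page size. A minimal Python sketch of the alternative is to ask the operating system at run time. The helper name is hypothetical.

```python
import mmap

# Queried from the operating system, not hard-coded to 4096.
PAGE_SIZE = mmap.PAGESIZE


def round_up_to_page(n: int, page_size: int = PAGE_SIZE) -> int:
    """Round n up to the next multiple of page_size (must be a power of two)."""
    return (n + page_size - 1) & ~(page_size - 1)


if __name__ == "__main__":
    print(f"this system uses {PAGE_SIZE} byte pages")
    # The same request needs different amounts of memory on different systems.
    for size in (4096, 16384, 65536):
        print(f"5000 bytes rounds up to {round_up_to_page(5000, size)} with {size} byte pages")
```

Code that assumes the 4096 answer will compute wrong sizes or alignments on a system with 16 kiB or 64 kiB pages.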

Page size historical details

Other CPU architectures have made different page size choices. Notably, iOS and macOS on ARM64 use 16 kiB pages (ARM64, aka aarch64, supports 4, 16, and 64 kiB pages, although specific CPUs will only support some of them). Alpha and Sparc used 8 kiB pages. PowerPC on Linux uses 64 kiB pages, although Red Hat/Fedora are considering switching to 4 kiB due to the same compatibility issues. See also the page sizes used by Windows on various processors.

Latency and throughput problems with transparent huge pages

The Linux kernel's implementation of transparent huge pages has been the source of performance problems. When introduced, it was initially enabled for all processes and memory regions by default. This caused a large number of problems, which eventually led the kernel's default to change to madvise, where programs have to opt in to use huge pages (see Nelson Elhage's summary (2017), and the Ubuntu bug that changed the default (2017/released 2019)).
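To check which mode a particular Linux system is using, you can read /sys/kernel/mm/transparent_hugepage/enabled. The kernel lists the choices (always, madvise, never) and marks the active one with square brackets. Here is a minimal Python sketch with hypothetical helper names, which returns None when the file is missing (for example on a non-Linux system):

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the active choice, which the kernel marks with square brackets."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no bracketed choice in {text!r}")


def current_thp_mode():
    """Return the system-wide mode, or None if it cannot be determined."""
    try:
        return parse_thp_mode(THP_ENABLED.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(f"transparent huge pages: {current_thp_mode()}")
```

For example, the file contents "always [madvise] never" mean that only memory regions marked with madvise() are eligible for huge pages.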

The performance problems are rare latency spikes (e.g. operations being substantially slower than normal), reduced throughput due to excess CPU consumption by the kernel's background tasks, or substantial increases in memory usage. Some examples are Hadoop (2012), TokuDB/MySQL (2014), Redis/jemalloc (2015), and TiKV/TiDB (2020). The problems seem to fall into the following categories: