A Simple Macro
It began with a simple macro: (approximative code)
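Something along these lines – a sketch based on the description below, where buffer, capa and offset are the variables involved (and <stdlib.h> provides realloc and abort):

    /* Append the byte C to the heap-allocated 'buffer' (capacity 'capa' bytes,
       next free position 'offset'), doubling the capacity each time the buffer
       is full, starting at a minimum of 16 bytes. */
    #define ADD_BYTE(C)                                    \
        do {                                               \
            if (offset == capa) {                          \
                capa = capa < 16 ? 16 : capa * 2;          \
                buffer = realloc(buffer, capa);            \
                if (buffer == NULL)                        \
                    abort();  /* simplistic handling */    \
            }                                              \
            buffer[offset++] = (C);                        \
        } while (0)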
For C-illiterates: this simple macro just adds the byte ‘C’ to the dynamically allocated buffer buffer, whose capacity (in bytes) is capa. The next position to be written is identified by offset. The capacity is doubled each time the buffer is filled (starting with a minimum of 16 bytes).
We’re adding bytes to a dynamic buffer – one of the most common things you can do in a program (it is used for strings, arrays, etc.)
The problem is to understand whether this reallocation strategy is good enough. If you take a look at the complexity, assuming realloc is O(N), you quickly conclude that adding a single byte is amortized O(1), which is good: doubling from 16 bytes up to a final size N costs at most 16 + 32 + ... + N < 2N byte copies overall, i.e. a bounded constant cost per appended byte.
But what is the worst-case complexity of a single append, when the buffer has to be reallocated?
Guy: Are you sure this is the right strategy? It looks like you have a big performance issue if you extend the size of a large array – say, one gigabyte for example. Imagine the impact, especially if the buffer has to be moved from the swap.
Me: Hummm, I never really thought about this, but I think it’s fine. I’m pretty sure the system can handle that gently, actually.
Guy: I think a linked structure might be a better idea – maybe with the same exponential strategy, for example.
Me: I’m pretty sure this is a waste of time.
Guy: Okay, show me.
Okay, now I need to back up my arguments. Like all programmers, I am lazy. I should say: like all programmers should be, I am lazy. Laziness is an incentive to be smarter and more efficient: you are too lazy to execute repetitive or boring tasks, so you look for something smarter, something faster. This is sometimes called laziness. I prefer to call it efficiency.
Having a linked list of blocks to store a buffer would be a bit cumbersome. I didn’t say impossible – Impossible n’est pas français, as we say sometimes (well, it did not end well for Napoléon, but never mind) – but cumbersome. Copying a sub-array or storing it on disk would require some work: we would probably need an index of all the blocks’ base pointers, take the base-2 logarithm of the offset to locate the right block, and do some other boring stuff. Sheesh, I’m already tired.
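To give an idea of the bookkeeping involved, here is a minimal sketch of the addressing logic such a chunked buffer would need (all names invented for illustration): block k holds 16·2^k bytes, so finding an offset means taking a base-2 logarithm and indexing a table of base pointers.

    #include <stddef.h>

    #define FIRST_BLOCK 16

    typedef struct {
        unsigned char *blocks[48];  /* base pointer of each block           */
        size_t         nblocks;     /* number of blocks currently allocated */
    } chunked_buffer;

    /* Locate the byte at logical position 'offset': block boundaries lie at
     * FIRST_BLOCK * (2^k - 1), so k = floor(log2(offset / FIRST_BLOCK + 1)). */
    static unsigned char *chunked_at(chunked_buffer *b, size_t offset) {
        size_t n = offset / FIRST_BLOCK + 1;
        unsigned k = 0;
        while (n >>= 1)
            k++;
        size_t block_start = (size_t) FIRST_BLOCK * (((size_t) 1 << k) - 1);
        return b->blocks[k] + (offset - block_start);
    }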
I have the feeling that being lazy is the right thing to do here: for small memory blocks we basically do not care, and for large blocks the system’s virtual memory will gently take care of the issue.
Let’s have a look.
realloc(), By The Way?
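First, where does realloc live? Asking the dynamic symbol table of the C library (the library path is distribution-specific, and the address will differ from system to system) shows the symbol flagged T:

    $ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -w realloc
    00000000000xxxxx T realloc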
For the curious, “T” stands for “text symbol” (the uppercase means that the symbol is public and visible); the text segment of a program being the code, the data segment being initialized data (variables), and the bss segment being uninitialized (well, actually zero-initialized) data (variables).
Allocating, reallocating and freeing memory are done through an allocator (thank you, Captain Obvious!). There are plenty of them (one of the most common being the buddy allocator).
We can implement our own trivial allocator using the venerable sbrk call, which basically expands the end of the data segment to get more space:
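A sketch of such a trivial allocator (simplified: no alignment handling, no thread safety, and the caller passes the old size to the realloc-like function):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Trivial allocator built on sbrk(): the data segment only ever grows. */
    static void *my_malloc(size_t size) {
        void *p = sbrk((intptr_t) size);   /* push the end of the data segment */
        return p == (void *) -1 ? NULL : p;
    }

    static void my_free(void *p) {
        (void) p;                          /* we never actually release memory */
    }

    static void *my_realloc(void *p, size_t old_size, size_t new_size) {
        void *q = my_malloc(new_size);     /* allocate a new block...            */
        if (q == NULL)
            return NULL;                   /* ...keeping the original on failure */
        if (p != NULL) {
            memcpy(q, p, old_size < new_size ? old_size : new_size);
            my_free(p);                    /* ...copy, and "free" the old one    */
        }
        return q;
    }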
[ Note: as someone noted on Hacker News, the old size needs to be known here; for example by using a header placed before the returned address. And no, we shouldn’t free the original block upon realloc failure :) ]
A real allocator would be a bit more clever, of course, with complex data structures aimed at limiting the number of actual system calls.
We need to dig a bit inside the Glibc used on Linux to figure out what’s happening for large blocks.
Digging In The Glibc
Just download a recent glibc release, and have a look at the source tree.
There is an interesting directory out there: the malloc directory. Just pick the malloc.c file and open it in an editor.
[ excerpt of the descriptive comment block at the top of malloc.c ]
The interesting part (for us) is: “For very large requests (>= 128KB by default), it relies on system memory mapping facilities, if supported.”
The 128KB default can actually be controlled by the M_MMAP_THRESHOLD parameter: “For allocations greater than or equal to the limit specified (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from the free list, the memory-allocation functions employ mmap(2) instead of increasing the program break using sbrk(2).”
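This threshold can be tuned at runtime through mallopt() (a glibc-specific knob; the value is in bytes):

    #include <malloc.h>

    int main(void) {
        /* Only use mmap() for allocations of 1MB or more. */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
        return 0;
    }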
Okay, so as I said previously, the system’s virtual memory will handle everything when dealing with large blocks.
Basically, it means that, for such large blocks:
- malloc will boil down to an anonymous mmap
- free will boil down to munmap
- realloc will boil down to mremap (darn, not yet in POSIX? Guys?)
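In user-space terms, that large-block path boils down to something like this (a Linux-specific sketch of the syscalls involved, not what the glibc actually does internally):

    #define _GNU_SOURCE            /* for mremap() */
    #include <sys/mman.h>

    int main(void) {
        size_t old_len = 256UL << 20;          /* 256MB...            */
        size_t new_len = 512UL << 20;          /* ...grown to 512MB   */

        /* "malloc" of a large block: an anonymous mapping */
        void *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* "realloc": let the kernel move the mapping if it has to */
        void *q = mremap(p, old_len, new_len, MREMAP_MAYMOVE);
        if (q == MAP_FAILED)
            return 1;

        /* "free" */
        munmap(q, new_len);
        return 0;
    }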
Okay, so we need to dig a little deeper inside mremap.
man mremap will tell us that:
mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
Humm, very efficient realloc. How efficient?
mremap does have definitions in the glibc. Well, pretty small entry points, actually:
[ the mremap entry points in the glibc source ]
Note that the default entry points are weak symbols in this case, which basically means that they can be overridden by someone else at dynamic link time; e.g. in this case:
[ example of such an override ]
Anyway, these glibc symbols are only thin wrappers around kernel system calls (aka “syscalls”), whether they live in the glibc itself or inside the vdso; see the sysdeps/unix/sysv/linux/mmap64.c default implementation, for example:
[ the default mmap64 implementation from sysdeps/unix/sysv/linux/mmap64.c ]
Okay, so our initial question is no longer glibc-related, but involves the Linux kernel.
Digging In The Kernel
Now download a recent kernel version and let’s have a quick look at what mremap actually does.
If you take a look at The Linux Kernel Howto, you see that there is an interesting directory for us:
mm – “This directory contains all of the memory management code. The architecture specific memory management code lives down in arch/*/mm/, for example arch/i386/mm/fault.c.”
Ah, nice, just what we needed to check!
The mm/mremap.c file seems to be of some interest. And inside the file, you’ll discover the syscall entry point of the mremap function. It’s just here:
[ the mremap syscall entry point (SYSCALL_DEFINE5) in mm/mremap.c ]
This is basically where you enter the kernel when invoking the corresponding syscall in your user code (through the glibc or through the corresponding vdso thunk).
If you read this function’s code, you first see various argument checks, then the trivial cases being handled (especially shrinking a mapped memory block, which just frees the pages at the tail).
The kernel then attempts to expand the mapped area by growing it in place (as an allocator in a C library would do with sbrk), and returns successfully if expanding the area was possible.
All these trivial cases are O(1) (even if entering the kernel has a cost – a lower cost now that interrupts aren’t involved anymore, but still not negligible).
What about the worst case?
[ the rest of the mremap syscall: the non-trivial, worst-case path ]
At first glance, the kernel is doing what we did in our trivial allocator:
- allocate a new area
- copy the previous area to the new one, and free the previous one (i.e. a move)
But let’s have a closer look.
The move_vma call follows:
[ the move_vma() implementation in mm/mremap.c ]
There are two interesting calls here: copy_vma and move_page_tables.
The copy_vma function is inside the sibling mm/mmap.c file, and basically moves the internal kernel structure related to the mmap’ed block, which is a vm_area_struct.
You can find its definition in include/linux/mm_types.h; this small structure contains all the information on the region: the start and end addresses, the mmap’ed file on disk if any, etc.
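An abridged view of that structure (only a handful of representative fields; the real definition has many more):

    struct vm_area_struct {
        unsigned long vm_start;     /* first address of the region              */
        unsigned long vm_end;       /* first address past the end of the region */
        unsigned long vm_flags;     /* permissions and friends                  */
        struct file  *vm_file;      /* backing file, or NULL if anonymous       */
        unsigned long vm_pgoff;     /* offset within the file, in pages         */
        /* ... plus the links into the kernel's VMA bookkeeping ... */
    };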
So moving this one will be cheap, O(1).
The interesting part seems to lie within this loop:
[ the loop in move_page_tables() that walks the region ]
The code is a bit complicated (and a bit light on comments, too), but it is mostly basic arithmetic and accounting.
The move_ptes function itself contains the deepest inner loop:
[ the inner loop of move_ptes() ]
What the hell are we doing in this loop? We are basically moving the PTEs, the Page Table Entries of the mmap’ed region, somewhere else. These are essentially long integers assigned to each page.
We’re pretty much done here: the kernel never actually touched a single byte of the memory backing our allocated block; it just shuffled a few integers to relocate the entire region elsewhere.
We may argue that we are still O(N) technically, but
- instead of copying entire 4KB pages (or 2MB pages for huge pages), we are moving integers inside kernel structures
- we are touching hot kernel structures, not cold memory, nor bytes swapped out to disk
Therefore, we are O(N), but with a huge divisor.
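To put rough numbers on that divisor: resizing a 1GB mapping backed by 4KB pages means relocating about 262,144 page table entries of 8 bytes each – roughly 2MB of hot kernel data – instead of copying the gigabyte of payload itself.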
Oh, by the way, did you know that O(N) is in many cases misleading?
In this case, N can be at most 2^48 (the maximum virtual address space size). Most machines only deal with a few gigabytes of memory, at most 64GB for most architectures, that is, 2^36 bytes. A huge page is 2^21 bytes, leaving at most 2^15 operations for an mremap (yes, that’s only 32768 operations).
Well, the worst-case scenario is cheap in all cases, which is what I suspected from the beginning.
Further reading: do not hesitate to have a look at Mel Gorman’s book Understanding the Linux Virtual Memory Manager.
TL;DR: do not worry, be lazy, and trust realloc.