100 lines or less: Memory Pools
Recently, I found myself with a queue that I had a desperate need to optimize, which had a relatively interesting implementation. The queue would hold all entries in a single data set, and kept a numerical index of the position in that memory which was the front, and back of the queue. If the front of the queue moved too much, (over 33.333~% of the queue size) it would move all of the existing entries back to the front of the data set. When the back of the queue reached the end of the dataset, it would reallocate the queue to be current size * 3 / 2, or 150% of it's current size. Unfortunately, this was a very, very hot queue. In fact, it was effectively the core of the application at hand.
So I sat around for a while and tried to think of a better mechanism for handling this queue. And then I realized, all this figuring and copying is a bit slow, so it would be nice of I had a linked list so I could just insert items where they need to go, a few pointer assignments are a lot less expensive than all of these copies on a potentially large dataset. Of course, the problem with linked lists is that they use a fair amount of memory just for all the pointers.
And then I had a completely unoriginal revelation. malloc() is for most intents and purposes O(1), but it's slow. The problem with linked lists is the constant calls to malloc() and free() per node. So why don't we reuse the data sets we utilize for each node in our linked list. And this lead me down the path of creating a memory pool to implement my linked list.
Let's talk about the overhead of malloc for a minute, so we can understand its pitfalls and how to optimize around them.
malloc() isn't perfect
malloc() is implemented, at some level or other, through a call to sbrk() on most platforms. sbrk() resizes the memory foot print of a process. On virtual memory operating systems, this involves scanning available heap space and locating a suitable data segment, and mapping it to the application running. That's slow, and even modern malloc() implementations will take reasonable measures to try and avoid doing it.
The glibc implementation of malloc() starts out preemptively requesting memory for the operating system as a buffer. It splits this buffer up into multiple indexed bins which it will return to the application when requested. Allocations for small pieces of memory will be deducted from these bins.
| bins | size |
| 64 | 8 |
| 32 | 64 |
| 16 | 512 |
| 8 | 4096 |
| 4 | 32768 |
| 2 | 262144 |
| 1 | what's left |
Book keeping is performed for each allocated node by using a data structure that is 16 bytes in size on a 32 bit machine. The data structure keeps the size of the current chunk, the size of the previous chunk, a pointer to the user memory and a pointer to the next chunk.
struct malloc_chunk {
INTERNAL_SIZE_T prev_size; /* Size of previous chunk (if free). */
INTERNAL_SIZE_T size; /* Size in bytes, including overhead. */
struct malloc_chunk* fd; /* double links -- used only if free. */
struct malloc_chunk* bk;
};
This is a total of 16 bytes of memory per allocated chunk on a 32 bit system. The circular list of allocated chunks must be searched and the appropriate chunk returned to the pool when free() is called.
Allocating memory consists of checking and aligning the requested size, locating a suitable bin for the requested size and pulling the next available data segment if possible. If this is not possible, it will attempt to use condensed free bins to create a continuous segment of memory to return, and if this is not suitable it will call sbrk() resizing the process. Once the process is resized, it will rebuild indexes and re-slice memory, and finally return the suitable data segment.
This mechanism of allocating memory is very fast, but for really high demand items it's less than optimal. A more efficient mechanism can be created when the size data segments needed is known before requests are made.
Memory is arbitrary
In order to make a linked list fast, we mainly only have to avoid deallocating and reallocating memory for each node. While the even sized bins managed by malloc are fast, and they are held in a big circular linked list that is relatively easy to seek, they are unpredictable. malloc() may be very fast, or very slow, depending entirely on the conditions in the process when they are called. Calls to malloc() may even require deligation to sbrk() which is notoriously slow. So the goal is to make acquiring ram to allocate to a node as fast as possible.
One of the great benefits of C is that it is close enough to the system we have the ability to manage resources ourselves. C is a language with no boundary checking or safety, when you ask for memory you are just given memory arbitrarily. That memory is the same as any other memory, thus, you can do what you wish with it. Including split it up and dole it out amongst your process.
Data structures in C are actually represented by a single continuous piece of memory that is the size of the sum of the sizes of the types of each member within the data structure. Each member is represented precisely in the order in which it appears in the data structures definition, within that continuous data segment.
Figure 1.1 - A simple datastructure
struct example {
int foo;
long bar;
char baz[30];
}
In the above example, an instance of example would be represented by 42 continuous bytes. Assignment to foo, would modify the first 4 bytes of the data structure. Assignment to bar would modify the 8 bytes following, as long is a 64 bit digital. Following the first 12 bytes of 30 bytes of space, reserved for the member named baz, respectably.
Figure 1.2 - Illustration of a data segment
Since data structures (and all other types) in C work this way, it is possible to arbitrarily utilize any piece of memory any way one wishes by utilizing a type cast. For instance one could allocate 50kb, and use the first 42 bytes for the data structure example as displayed above. One could then move the pointer by 42 bytes, giving us 49958 bytes remaining. The remaining memory utilized in any way the program wishes to use it.
Figure 1.3 - allocsplit.c
struct one {
char *foo;
int bar;
float baz;
};
struct two {
char foo[35];
int bar;
};
int main (void) {
void *block = malloc(50000);
struct one *a = (struct one *) block;
block += sizeof(struct one);
struct two *b = (struct two *) block;
a->foo = "test one";
strcpy(b->foo, "test two");
printf("%s...%s\n", a->foo, b->foo);
}
Note, the above code sample performs arithmetic on a void pointer. GCC unfortunately does not warn you when you perform this operation by default. In ANSI C arithmetic is not allowed on pointers to incomplete datatypes.
Getting memory faster
So what we would do to optimize hot data structures that need to be allocated frequently (such as the nodes for a linked list) is allocate them in bulk with a reasonable expansion plan, and reuse items after allocating them. This helps us call malloc() less frequently. Also in the case of a data set that has a shifting memory requirement, we can often relatively quickly achieve a state in which allocation is no longer needed. This technique is called a memory pool.
The expansion plan I have chosen for my memory pools is to double on demand. This is an easy and reasonable mechanism for ensuring we always return memory, so long as memory is available, and allow rapid growth with infrequent allocation. In order to double on demand, we have to record the size of what we are allocating for and the number of items currently allocated. That will determine our bookkeeping data structure.
struct chunk {
struct chunk *next;
void *chunk;
};
typedef struct {
long pos; // Top of the stack marker
size_t blocksize; // Size of blocks for this pool
size_t blocks; // Number of blocks we currently own
void **stack; //
struct chunk *chunk;
} apool;
We're going to use the fastest possible implementation, which means that we will keep a stack of available memory chunks. Each time this stack is exhausted, we will allocate more memory, and adjust our record keeping accordingly. For each piece of memory we allocate, we will also need to allocate a small 8 byte data structure for tracking these allocations. We use this to allow us to free all used resources when our pool is no longer going to be used.
First step is obviously to create the pool itself.
apool * apool_new (size_t blocks, size_t blocksize) {
apool *pool = malloc(sizeof(apool));
struct chunk *chunk = malloc(sizeof(struct chunk));
chunk->next = NULL;
pool->chunk = chunk;
char *ram = chunk->chunk = calloc(blocks, blocksize);
pool->blocksize = blocksize;
pool->blocks = blocks;
pool->stack = calloc(blocks, sizeof(void *));
pool->pos = (int) blocks;
/* Use the ram in reverse order */
ram += (blocks - 1) * blocksize;
for (size_t i = 0; i < blocks; ram -= blocksize) {
pool->stack[i++] = ram;
}
return pool;
}
So first things first, enough space is allocated for our main book keeping data structure. Then allocate our small slab tracking structure. An array of void pointers is then allocated, with a maximum capacity of the number of chunks allocated during initialization. This provides a stack that can be easily removed from and added to, and has a small memory foot print.
Note, we put the pointers in the stack from the back to the front. It was suggested to me by a colleage of mine that some systems may optimize sequential payloads in some way, so I might want to make sure that addresses come out of my pool in canonical order. I have nothing to back up this theory, but I figured in a worst case scenario it won't do any harm.
In order to provide memory from this chunk that was initialized, an accessor will have to be created. We will call it alloc for lack of a better term.
void * apool_alloc (apool *pool) {
return pool->stack[--pool->pos];
}
Now it is evident how fast this approach can be for acquiring more memory. However currently there is a problem, in that only up to the amount of data that was initially allocated is available and any subsequent requests to apool_alloc will result in undefined behavior. Therefore, this routine will have to be modified to ensure there is memory to provide the caller.
void * apool_alloc (apool *pool) {
if (pool->pos == 0) {
apool_double(pool);
}
return pool->stack[--pool->pos];
}
A simple check is added prior to returning the available item in the stack. If the top available item in the stack is the bottom of the stack, then memory cannot be returned until the slab being managed by this pool is expanded. For simplicity, and because it's proven to be a decent expansion plan, a double on demand expansion plan has been chosen. So next, the apool_double function must be implemented.
void apool_double (apool *pool) {
size_t blocks = pool->blocks,
blocksize = pool->blocksize;
struct chunk *chunk = malloc(sizeof(struct chunk));
chunk->next = pool->chunk;
pool->chunk = chunk;
char *ram = chunk->chunk = malloc(blocks * blocksize);
/* Our new position */
pool->pos = (int) blocks;
/* This may look strange, but we're freeing the array
* of pointers. I use free and calloc instead of realloc
* because my stack is empty, I don't care about any data
* in this.
*/
free(pool->stack);
pool->blocks *= 2;
pool->stack = calloc(pool->blocks, sizeof(void *));
ram += (blocks - 1) * blocksize;
for (size_t i = 0; i < blocks; ram -= blocksize) {
pool->stack[i++] = ram;
}
}
Note there is some duplication between what the apool_double and apool_new operations do at this point, those are refactored in the end result of this tutorial. I assume most of you understand good refactoring practices.
Now a mechanism must be created for putting pointers back on the stack, effectively making them free for reuse. This is even more simple than the apool_alloc operation, because it should always be assured that there is enough space to store the pointer in the stack.
void apool_free (apool *pool, void * chunk) {
pool->stack[pool->pos++] = chunk;
}
And now we have a complete implementation of a memory pool. And in less than 100 lines of code. I didn't find the duplication satisfactory, so I rewrote apool_new to allocate the requested slab during initialization with a routine called apool_expand, and also made apool_double a macro that calls this routine. This not only made the code cleaner and DRY Compliant?, but also reduced the implementation to 70 lines of code. I placed the macros and data structures in the header, which is 23 lines.
Finished, refactored header
#include <malloc.h>
struct chunk {
struct chunk *next;
void *chunk;
};
typedef struct {
long pos;
size_t blocksize;
size_t blocks;
void **stack;
struct chunk *chunk;
} apool;
#define apool_double(pool) apool_expand(pool, pool->blocks)
#define apool_empty(pool) (pool->pos == 0)
extern apool * apool_new (size_t blocks, size_t blocksize);
extern void apool_expand (apool *pool, size_t blocks);
extern void apool_destroy (apool *pool);
extern void * apool_alloc (apool *pool);
extern void apool_free (apool *pool, void * chunk);
Finished, refactored implementation
#include <malloc.h>
#include <stdio.h>
#include "apool.h"
void apool_expand (apool *pool, size_t blocks) {
size_t blocksize = pool->blocksize;
struct chunk *chunk = malloc(sizeof(struct chunk));
chunk->next = pool->chunk;
pool->chunk = chunk;
char *ram = chunk->chunk = calloc(blocks, blocksize);
/* Stack must be able to hold atleast all available items */
free(pool->stack);
pool->blocks += blocks;
pool->stack = calloc(pool->blocks, sizeof(void *));
/* Use the ram in reverse order */
ram += (blocks - 1) * blocksize;
for (size_t i = pool->pos; i < pool->pos + blocks; ram -= blocksize) {
pool->stack[i++] = ram;
}
pool->pos += (int) blocks;
}
apool * apool_new (size_t blocks, size_t blocksize) {
apool *pool = malloc(sizeof(apool));
pool->blocksize = blocksize;
pool->blocks = 0;
pool->pos = 0;
pool->stack = NULL; /* We have no data, free(null) is ok */
pool->chunk = NULL; /* We must initialize pool->chunk so it's stored */
apool_expand(pool, blocks);
return pool;
}
void apool_destroy (apool *pool) {
if (pool->pos < pool->blocks) {
fprintf(stderr, "runtime error: apool_destroy called, "
"not all resources have been checked in\n");
}
struct chunk *chunk = pool->chunk;
while (chunk) {
struct chunk *next = chunk->next;
free(chunk->chunk);
free(chunk);
chunk = next;
}
free(pool->stack);
free(pool);
}
void * apool_alloc (apool *pool) {
if (apool_empty(pool)) apool_double(pool);
return pool->stack[--pool->pos];
}
void apool_free (apool *pool, void * chunk) {
pool->stack[pool->pos++] = chunk;
}
About this example
This code is available from my repository in trunk/libapool/. You may also check it out from my repository directly.
svn co http://www.blisted.org/svn/trunk/libapool
Or download the tarball attached to this page.
Comment by bsmith@… on Sun Jun 11 04:51:12 2006
Great article! Alternative memory allocation is very useful, especially for helping with strategies for deallocation. Your apool_destroy actually tells programmers that they haven't freed enough.
Comment by scott on Fri Jun 16 16:07:01 2006
Since no one else seems to have noticed, I should mention that there is a bug in apool_expand. If apool_expand is called arbitrarily on a pool who's available data size has not reached zero, the behavior after the newly allocated segment is utilized is undefined. Reason being the value of each entry in the stack member less than or equal to the current size when apool_expand was called is undefined. It may or may not be the memory we were previously using.
This could be fixed relatively easliy by copying from 0 to pool→pos into the newly allocated stack, and then freeing the previously allocated stack. Alternatively realloc() could be used, although it would not yield as good of performance.
Attachments
-
libapool.tgz
(1.9 kB) - added by scott
2 years ago.
Example code.
-
struct.png
(5.6 kB) - added by scott
2 years ago.
Data structure Illustration