Jan. 21, 2014, 4:55 p.m.

Custom Vector Allocation

(Number 6 in a series of posts about Vectors and Vector based containers.)

A few posts back I talked about the idea of 'rolling your own' STL-style vector class, based on my experiences with this at PathEngine.

In that original post and these two follow-ups I talked about the general approach and also some specific performance tweaks that actually helped in practice for our vector use cases.

I haven't talked about custom memory allocation yet, however. This is something that's been cited in a number of places as a key reason for switching away from std::vector so I'll come back now and look at the approach we took for this (which is pretty simple, but nonstandard, and also pre C++11), and assess some of the implications of using this kind of non-standard approach.

I approach this from the point of view of a custom vector implementation, but I'll be talking about some issues with memory customisation that also apply more generally.

Why custom allocation?

In many situations it's fine for vectors (and other containers) to just use the same default memory allocation method as the rest of your code, and this is definitely the simplest approach.

(The example vector code I posted previously used malloc() and free(), but works equally well with global operator new and delete.)

But vectors can do a lot of memory allocation, and memory allocation can be expensive, and it's not uncommon for memory allocation operations to turn up in profiling as the most significant cost of vector based code. Custom memory allocation approaches can help resolve this.

Other good reasons for hooking into and customising allocations include the need to avoid memory fragmentation, or to track memory statistics.

For these reasons generalised memory customisation is an important customer requirement for our SDK code in general, and then by extension for the vector containers used by this code.

Custom allocation in std::vector

The STL provides a mechanism for hooking into the container allocation calls (such as vector buffer allocations) through allocators, with vector constructors accepting an allocator argument for this purpose.

I won't attempt a general introduction to STL allocators, but there's a load of material about this on the web. See, for example, this article on Dr. Dobb's, which includes some example use cases for allocators. (Bear in mind that this is pre C++11, however. I didn't see any similarly targeted overview posts for using allocators post C++11.)

A non-standard approach

We actually added the possibility to customise memory allocation in our vectors some time after switching to a custom vector implementation. (This was around mid-2012. Before that PathEngine's memory customisation hooks worked by overriding global new and delete, and required dll linkage if you wanted to manage PathEngine memory allocations separately from allocations in the main game code.)

We've generally tried to keep our custom vector as similar as possible to std::vector, in order to avoid issues with unexpected behaviour (since a lot of people know how std::vector works), and to ensure that code can be easily switched between std::vector and our custom vector. When it came to memory allocation, however, we chose a significantly different (and definitely non-standard) approach, because in practice a lot of vector code doesn't actually use allocators (or else just sets allocators in a constructor), because we already had a custom vector class in place, and because I just don't like STL allocators!

Other game developers

A lot of other game developers have a similar opinion of STL allocators, and for many this is then also a key factor in the decision to switch to custom container classes.

For example, issues with the design of STL allocators are quoted as one of the main reasons for the creation of the EASTL, a set of STL replacement classes, by Electronic Arts. From the EASTL paper:

Among game developers the most fundamental weakness is the std allocator design, and it is this weakness that was the largest contributing factor to the creation of EASTL.

And I've heard similar things from other developers. For example, in this blog post about the Bitsquid approach to allocators Niklas Frykholm says:

If it weren't for the allocator interface I could almost use STL. Almost.

Let's have a look at some of the reasons for this distaste!

Problems with STL allocators

We'll look at the situation prior to C++11, first of all, and the historical basis for switching to an alternative mechanism.

A lot of problems with STL allocators come out of confusion in the initial design. According to Alexander Stepanov (primary designer and implementer of the STL) the custom allocator mechanism was invented to deal with a specific issue with Intel memory architecture. (Do you remember near and far pointers? If not, consider yourself lucky I guess!) From this interview with Alexander:

Question: How did allocators come into STL? What do you think of them?

Answer: I invented allocators to deal with Intel's memory architecture. They are not such a bad ideas in theory - having a layer that encapsulates all memory stuff: pointers, references, ptrdiff_t, size_t. Unfortunately they cannot work in practice.

And it seems like this original design intention was also only partially executed. From the Wikipedia entry for allocators:

They were originally intended as a means to make the library more flexible and independent of the underlying memory model, allowing programmers to utilize custom pointer and reference types with the library. However, in the process of adopting STL into the C++ standard, the C++ standardization committee realized that a complete abstraction of the memory model would incur unacceptable performance penalties. To remedy this, the requirements of allocators were made more restrictive. As a result, the level of customization provided by allocators is more limited than was originally envisioned by Stepanov.

and, further down:

While Stepanov had originally intended allocators to completely encapsulate the memory model, the standards committee realized that this approach would lead to unacceptable efficiency degradations. To remedy this, additional wording was added to the allocator requirements. In particular, container implementations may assume that the allocator's type definitions for pointers and related integral types are equivalent to those provided by the default allocator, and that all instances of a given allocator type always compare equal, effectively contradicting the original design goals for allocators and limiting the usefulness of allocators that carry state.

Some of the key problems with STL allocators (historically) are then:

  • Unnecessary complexity, with some boilerplate required for features that are not actually used
  • A limitation that allocators cannot have internal state ('all instances of a given allocator type are required to be interchangeable and always compare equal to each other')
  • The fact that the allocator type is included in the container type (with changes to allocator type changing the type of the container)

There are some changes to this situation with C++11, as we'll see below, but this certainly helps explain why a lot of people have chosen to avoid the STL allocator mechanism, historically!
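The third point is easy to demonstrate directly. In the following sketch (a minimal C++11-style allocator, with hypothetical naming, just enough for std::vector to use it), a vector with the custom allocator is a completely different type from a plain std::vector<int>:

```cpp
#include <cassert>
#include <cstdlib>
#include <type_traits>
#include <vector>

// Minimal allocator: C++11 allocator_traits fills in everything else.
// (The class name here is hypothetical, just for illustration.)
template <class T>
struct cCustomAllocator
{
    typedef T value_type;
    cCustomAllocator() {}
    template <class U> cCustomAllocator(const cCustomAllocator<U>&) {}
    T* allocate(std::size_t count)
    {
        return static_cast<T*>(std::malloc(count * sizeof(T)));
    }
    void deallocate(T* ptr, std::size_t)
    {
        std::free(ptr);
    }
    bool operator==(const cCustomAllocator&) const { return true; }
    bool operator!=(const cCustomAllocator&) const { return false; }
};

// Changing the allocator changes the type of the container itself,
// so these two vector types cannot be assigned or swapped directly:
typedef std::vector<int> tPlainVector;
typedef std::vector<int, cCustomAllocator<int> > tCustomVector;
const bool allocatorChangesType =
    !std::is_same<tPlainVector, tCustomVector>::value;
```

This is what makes it awkward to write code that works on 'a vector of ints' without caring how those ints were allocated.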

Virtual allocator interface

So we decided to avoid STL allocators, and use a non-standard approach.

The approach we use is based on a virtual allocator interface, and avoids the need to specify allocator type as a template parameter.

This is quite similar to the setup for allocators in the Bitsquid engine, as described by Niklas here (as linked above; it's probably worth reading that post if you haven't already, as I'll try to avoid repeating the various points he discussed there).

A basic allocator interface can then be defined as follows:

class iAllocator
{
public:
    virtual ~iAllocator() {}
    virtual void* allocate(uint32_t size) = 0;
    virtual void deallocate(void* ptr) = 0;
// helper
    template <class T> void
    allocate_Array(uint32_t arraySize, T*& result)
    {
        result = static_cast<T*>(allocate(sizeof(T) * arraySize));
    }
};

The allocate_Array() method is provided for convenience; concrete allocator objects just need to implement allocate() and deallocate().

We can store a pointer to iAllocator in our vector, and replace the direct calls to malloc() and free() with virtual function calls, as follows:

    static T*
    allocate(size_type size)
    {
        T* allocated;
        _allocator->allocate_Array(size, allocated);
        return allocated;
    }
    void
    reallocate(size_type newCapacity)
    {
        T* newData;
        _allocator->allocate_Array(newCapacity, newData);
        copyRange(_data, _data + _size, newData);
        deleteRange(_data, _data + _size);
        _allocator->deallocate(_data);
        _data = newData;
        _capacity = newCapacity;
    }

These virtual function calls potentially add some overhead to allocation and deallocation. It's worth being quite careful about this kind of virtual function call overhead, but in practice it seems that the overhead is not significant here. Virtual function call overhead is often all about cache misses and, perhaps because there are often just a small number of actual allocator instances active, with allocations tending to be grouped by allocator, this just isn't such an issue here.

We use a simple raw pointer for the allocator reference. Maybe a smart pointer type could be used (for better modern C++ style and to increase safety), but we usually want to control allocator lifetime quite explicitly, so we're basically just careful about this.

Allocators can be passed in to each vector constructor, or if omitted will default to a 'global allocator' (which adds a bit of extra linkage to our vector header):

    cVector(size_type size, const T& fillWith,
        iAllocator& allocator = GlobalAllocator()
        )
    {
        _data = 0;
        _allocator = &allocator;
        _size = size;
        _capacity = size;
        if(size)
        {
            _allocator->allocate_Array(_capacity, _data);
            constructRange(_data, _data + size, fillWith);
        }
    }
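The GlobalAllocator() hook referred to in that constructor can be as simple as a function returning a reference to a single shared default instance. Here's a minimal sketch of that idea (the actual PathEngine implementation may differ in detail):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// allocator interface as defined earlier in the post
class iAllocator
{
public:
    virtual ~iAllocator() {}
    virtual void* allocate(std::uint32_t size) = 0;
    virtual void deallocate(void* ptr) = 0;
};

// default concrete allocator, forwarding to malloc()/free()
class cDefaultMallocAllocator : public iAllocator
{
public:
    void* allocate(std::uint32_t size)
    {
        assert(size); // zero-sized allocations are disallowed
        return std::malloc(size);
    }
    void deallocate(void* ptr)
    {
        std::free(ptr);
    }
};

// returns the same shared instance on every call
inline iAllocator& GlobalAllocator()
{
    static cDefaultMallocAllocator instance;
    return instance;
}
```

This is the 'bit of extra linkage' mentioned above: the vector header now needs a declaration for GlobalAllocator(), with the definition living somewhere in the library.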

Here's an example concrete allocator implementation:

class cMallocAllocator : public iAllocator
{
public:
    void*
    allocate(uint32_t size)
    {
        assert(size);
        return malloc(static_cast<size_t>(size));
    }
    void
    deallocate(void* ptr)
    {
        free(ptr);
    }
};

(Note that you can normally call malloc() with zero size, but this is something that we disallow for PathEngine allocators.)

And this can be passed in to vector construction as follows:

    cMallocAllocator allocator;
    cVector<int> v(10, 0, allocator);

Swapping vectors

That's pretty much it, but there's one tricky case to look out for.

Specifically, what should happen in our vector swap() method? Let's take a small diversion to see why there might be a problem.

Consider some code that takes a non-const reference to vector, and 'swaps a vector out' as a way of returning a set of values in the vector without the need to heap allocate the vector object itself:

class cVectorBuilder
{
    cVector<int> _v;
public:
    //.... construction and other building methods
    void takeResult(cVector<int>& result); // swaps _v into result
};

So this code doesn't care about allocators, and just wants to work with a vector of a given type. And maybe there is some other code that uses this, as follows:

void BuildData(/*some input params*/, cVector<int>& result)
{
    //.... construct a cVectorBuilder and call a bunch of build methods
    builder.takeResult(result);
}

Now there's no indication that there's going to be a swap() involved, but the result vector will end up using the global allocator, and this can potentially cause some surprises in the calling code:

    cVector<int> v(someSpecialAllocator);
    BuildData(/*input params*/, v);
    // lost our allocator assignment!
    // v now uses the global allocator

Nobody's really doing anything wrong here (although this isn't really the modern C++ way to do things). This is really a fundamental problem arising from the possibility to swap vectors with different allocators, and there are other situations where this can come up.

You can find some discussion about the possibilities for implementing vector swap with 'unequal allocators' here. We basically choose option 1, which is to simply declare it illegal to call swap with vectors with different allocators. So we just add an assert in our vector swap method that the two allocator pointers are equal.

In our case this works out fine, since this doesn't happen so much in practice, because cases where this does happen are caught directly by the assertion, and because it's generally straightforward to modify the relevant code paths to resolve the issue.
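For reference, here's a sketch of what the resulting swap() can look like. To keep this self-contained I've reduced a hypothetical cVector-like class to just the members involved in the swap:

```cpp
#include <cassert>
#include <algorithm>
#include <cstdint>

struct cAllocatorTag {}; // stands in for iAllocator in this sketch

// just the slice of a cVector-like class relevant to swap()
struct cVectorSlice
{
    cAllocatorTag* _allocator;
    int* _data;
    std::uint32_t _size;
    std::uint32_t _capacity;

    void swap(cVectorSlice& rhs)
    {
        // option 1 from the linked discussion: swapping vectors with
        // different allocators is simply declared illegal, and caught
        // directly in debug builds
        assert(_allocator == rhs._allocator);
        std::swap(_data, rhs._data);
        std::swap(_size, rhs._size);
        std::swap(_capacity, rhs._capacity);
    }
};

// with equal allocators, swap just exchanges buffer details
inline bool SwapSwapsContents()
{
    cAllocatorTag allocator;
    cVectorSlice a = { &allocator, 0, 2, 2 };
    cVectorSlice b = { &allocator, 0, 5, 8 };
    a.swap(b);
    return a._size == 5 && b._size == 2;
}
```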

Comparison with std::vector, is this necessary/better??

Ok, so I've outlined the approach we take for custom allocation in our vector class.

This all works out quite nicely for us. It's straightforward to implement and to use, and consistent with the custom allocators we use more generally in PathEngine. And we already had our custom vector in place when we came to implement this, so this wasn't part of the decision about whether or not to switch to a custom vector implementation. But it's interesting, nevertheless, to compare this approach with the standard allocator mechanism provided by std::vector.

My original 'roll-your-own vector' blog post was quite controversial. There were a lot of responses strongly against the idea of implementing a custom vector, but a lot of other responses (often from the game development industry side) saying something like 'yes, we do that, but we do some detail differently', and I know that this kind of customisation is not uncommon in the industry.

These two different viewpoints make it worthwhile, I think, to explore this question in a bit more detail.

I already discussed the potential pitfalls of switching to a custom vector implementation in the original 'roll-your-own vector' blog post, so let's look at the potential benefits of switching to a custom allocator mechanism.

Broadly speaking, this comes down to three key points:

  • Interface complexity
  • Stateful allocator support
  • Possibilities for further customisation and memory optimisation

Interface complexity

If we look at an example allocator implementation for each setup we can see that there's a significant difference in the amount of code required. The following code is taken from my previous post, and was used to fill allocated memory with non-zero values, to check for zero initialisation:

// STL allocator version
template <class T>
class cNonZeroedAllocator
{
public:
    typedef T value_type;
    typedef value_type* pointer;
    typedef const value_type* const_pointer;
    typedef value_type& reference;
    typedef const value_type& const_reference;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    template <class tTarget>
    struct rebind
    {
        typedef cNonZeroedAllocator<tTarget> other;
    };
    cNonZeroedAllocator() {}
    ~cNonZeroedAllocator() {}
    template <class T2>
    cNonZeroedAllocator(cNonZeroedAllocator<T2> const&)
    {
    }
    pointer
    address(reference ref)
    {
        return &ref;
    }
    const_pointer
    address(const_reference ref)
    {
        return &ref;
    }
    pointer
    allocate(size_type count, const void* = 0)
    {
        size_type byteSize = count * sizeof(T);
        void* result = malloc(byteSize);
        signed char* asCharPtr;
        asCharPtr = reinterpret_cast<signed char*>(result);
        for(size_type i = 0; i != byteSize; ++i)
        {
            asCharPtr[i] = -1;
        }
        return reinterpret_cast<pointer>(result);
    }
    void deallocate(pointer ptr, size_type)
    {
        free(ptr);
    }

    size_type
    max_size() const
    {
        return 0xffffffffUL / sizeof(T);
    }
    void
    construct(pointer ptr, const T& t)
    {
        new(ptr) T(t);
    }
    void
    destroy(pointer ptr)
    {
        ptr->~T();
    }
    template <class T2> bool
    operator==(cNonZeroedAllocator<T2> const&) const
    {
        return true;
    }
    template <class T2> bool
    operator!=(cNonZeroedAllocator<T2> const&) const
    {
        return false;
    }
};

But with our custom allocator interface this can now be implemented as follows:

// custom allocator version
class cNonZeroedAllocator : public iAllocator
{
public:
    void*
    allocate(uint32_t size)
    {
        void* result = malloc(static_cast<size_t>(size));
        signed char* asCharPtr;
        asCharPtr = reinterpret_cast<signed char*>(result);
        for(uint32_t i = 0; i != size; ++i)
        {
            asCharPtr[i] = -1;
        }
        return result;
    }
    void
    deallocate(void* ptr)
    {
        free(ptr);
    }
};

As we saw previously, a lot of the stuff in the STL allocator relates to obsolete design decisions, and is unlikely to actually be used in practice. The custom allocator interface also completely abstracts out the concept of constructed object type, and works only in terms of actual memory sizes and pointers, which seems more natural, whilst doing everything we need for the allocator use cases in PathEngine.

For me this is one advantage of the custom allocation setup, then, although probably not something that would by itself justify switching to a custom vector.

If you use allocators that depend on customisation of the other parts of the STL allocator interface (other than for data alignment) please let me know in the comments thread. I'm quite interested to hear about this! (There's some discussion about data alignment customisation below.)

Stateful allocator requirement

Stateful allocator support is a specific customer requirement for PathEngine.

Clients need to be able to set custom allocation hooks and have all allocations made by the SDK (including vector buffer allocations) routed to custom client-side allocation code. Furthermore, multiple allocation hooks can be supplied, with the actual allocation strategy selected depending on the actual local execution context.

It's not feasible to supply allocation context to all of our vector based code as a template parameter, and so we need our vector objects to support stateful allocators.

Stateful allocators with the virtual allocator interface

Stateful allocators are straightforward with our custom allocator setup. Vectors can be assigned different concrete allocator implementations and these concrete allocator implementations can include internal state, without code that works on the vectors needing to know anything about these details.
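To illustrate, here's a hypothetical stateful concrete allocator that tracks total bytes requested. The per-instance state is completely hidden behind the iAllocator interface, so code working through an iAllocator pointer needs no knowledge of these details:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// allocator interface as defined earlier in the post
class iAllocator
{
public:
    virtual ~iAllocator() {}
    virtual void* allocate(std::uint32_t size) = 0;
    virtual void deallocate(void* ptr) = 0;
};

// stateful concrete allocator (hypothetical): each instance keeps an
// independent running total of bytes requested through it
class cTrackingMallocAllocator : public iAllocator
{
    std::uint32_t _totalRequested;
public:
    cTrackingMallocAllocator() : _totalRequested(0) {}
    std::uint32_t totalRequested() const { return _totalRequested; }
    void* allocate(std::uint32_t size)
    {
        _totalRequested += size;
        return std::malloc(size);
    }
    void deallocate(void* ptr)
    {
        std::free(ptr);
    }
};

// client code sees only the interface, not the state
inline void RequestThrough(iAllocator& allocator, std::uint32_t size)
{
    void* p = allocator.allocate(size);
    allocator.deallocate(p);
}

// two instances accumulate state independently
inline bool StatePerInstance()
{
    cTrackingMallocAllocator a, b;
    RequestThrough(a, 100);
    RequestThrough(a, 20);
    RequestThrough(b, 7);
    return a.totalRequested() == 120 && b.totalRequested() == 7;
}
```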

Stateful allocators with the STL

As discussed earlier, internal allocator state is something that was specifically forbidden by the original STL allocator specification. This is something that has been revisited in C++11, however, and stateful allocators are now explicitly supported, but it also looks like it's possible to use stateful allocators in practice with many pre-C++11 compile environments.

The reasons for disallowing stateful allocators relate to two specific problem situations:

  • Splicing nodes between linked lists with different allocation strategies
  • Swapping vectors with different allocation strategies

C++11 addresses these issues with allocator traits, which specify what to do with allocators in problem cases, with stateful allocators then explicitly supported. This stackoverflow answer discusses what happens, specifically, with C++11, in the vector swap case.

With PathEngine we want to be able to support clients with different compilation environments, and it's an advantage not to require C++11 support. But according to this stackoverflow answer, you can also actually get away with using stateful allocators in most cases, without explicit C++11 support, as long as you avoid these problem cases.

Since we already prohibit the vector problem case (swap with unequal allocators), that means that we probably can actually implement our stateful allocator requirement with std::vector and STL allocators in practice, without requiring C++11 support.

There's just one proviso, with or without C++11 support, due to allowances for legacy compiler behaviour in allocator traits. Specifically, it doesn't look like we can get the same assertion behaviour in vector swap. If propagate_on_container_swap::value is set to false for either allocator then the result is 'undefined behaviour', so this could just swap the allocators silently, and we'd have to be quite careful about these kinds of problem cases!

Building on stateful allocators to address other issues

If you can use stateful allocators with the STL then this changes things a bit. A lot of things become possible just by adding suitable internal state to standard STL allocator implementations. But you can also now use this allocator internal state as a kind of bootstrap to work around other issues with STL allocators.

The trick is to wrap up the same kind of virtual allocator interface setup we use in PathEngine in an STL allocator wrapper class. You could do this (for example) by putting a pointer to our iAllocator interface inside an STL allocator class (as internal state), and then forward the actual allocation and deallocation calls as virtual function calls through this pointer.

So, at the cost of another layer of complexity (which can be mostly hidden from the main application code), it should now be possible to:

  • remove unnecessary boilerplate from concrete allocator implementations (since these now just implement iAllocator), and
  • use different concrete allocator types without changing the actual vector type.
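A rough sketch of this wrapper could look as follows (assuming a C++11 environment, with allocator_traits filling in the boilerplate; the class names here are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// allocator interface as defined earlier in the post
class iAllocator
{
public:
    virtual ~iAllocator() {}
    virtual void* allocate(std::uint32_t size) = 0;
    virtual void deallocate(void* ptr) = 0;
};

// STL allocator wrapper: holds an iAllocator pointer as internal state,
// and forwards allocation calls through it
template <class T>
class cForwardingAllocator
{
    iAllocator* _target;
public:
    typedef T value_type;
    explicit cForwardingAllocator(iAllocator& target) : _target(&target) {}
    template <class U>
    cForwardingAllocator(const cForwardingAllocator<U>& rhs) : _target(rhs.target()) {}
    iAllocator* target() const { return _target; }
    T* allocate(std::size_t n)
    {
        return static_cast<T*>(_target->allocate(
            static_cast<std::uint32_t>(n * sizeof(T))));
    }
    void deallocate(T* p, std::size_t) { _target->deallocate(p); }
    template <class U> bool operator==(const cForwardingAllocator<U>& rhs) const
    {
        return _target == rhs.target();
    }
    template <class U> bool operator!=(const cForwardingAllocator<U>& rhs) const
    {
        return !(*this == rhs);
    }
};

// concrete allocators now just implement iAllocator; this one counts calls
class cCountingMallocAllocator : public iAllocator
{
public:
    int allocations;
    cCountingMallocAllocator() : allocations(0) {}
    void* allocate(std::uint32_t size) { ++allocations; return std::malloc(size); }
    void deallocate(void* ptr) { std::free(ptr); }
};

// a plain std::vector routes its buffer allocations through iAllocator
inline int CountAllocationsForTenInts()
{
    cCountingMallocAllocator concrete;
    cForwardingAllocator<int> forwarding(concrete);
    std::vector<int, cForwardingAllocator<int> > v(forwarding);
    v.reserve(10);
    for(int i = 0; i != 10; ++i)
    {
        v.push_back(i);
    }
    return concrete.allocations;
}
```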

Refer to the bloomberg allocator model for one example of this kind of setup in practice (and see also this presentation about bloomberg allocators in the context of the C++11 allocator changes).

Although I'm still not keen on STL allocators, and prefer the direct simplicity of our custom allocator setup to covering up the mess of the STL allocator interface in this way, I have to admit that this does effectively remove two of the key benefits of our custom allocator setup. Let's move on to the third point, then!

Memory optimisation

The other potential benefit of custom allocation over STL allocators is basically the possibility to mess around with the allocation interface.

With STL allocators we're restricted to using the allocate() and deallocate() methods exactly as defined in the original allocator specification. But with our custom allocator we're basically free to mess with these method definitions (in consultation with our clients!), or to add additional methods, and generally change the interface to better suit our clients' needs.

There is some discussion of this issue in this proposal for improving STL allocators, which talks about ways in which the memory allocation interface provided by STL allocators can be sub-optimal.

Some customisations implemented in the Bitsquid allocators are:

  • an 'align' parameter for the allocation method, and
  • a query for the size of allocated blocks

PathEngine allocators don't include either of these customisations, although this is stuff that we can add quite easily if required by our clients. Our allocator does include the following extra methods:

    virtual void*
    expand(
            void* oldPtr,
            uint32_t oldSize,
            uint32_t oldSize_Used,
            uint32_t newSize
            ) = 0;
// helper
    template <class T> void
    expand_Array(
            T*& ptr,
            uint32_t oldArraySize,
            uint32_t oldArraySize_Used,
            uint32_t newArraySize
            )
    {
        ptr = static_cast<T*>(expand(
            ptr,
            sizeof(T) * oldArraySize,
            sizeof(T) * oldArraySize_Used,
            sizeof(T) * newArraySize
            ));
    }

What this does, essentially, is to provide a way for concrete allocator classes to use the realloc() system call, or similar memory allocation functionality in a custom heap, if this is desired.

As before, the expand_Array() method is there for convenience, and concrete classes only need to implement the expand() method. This takes a pointer to an existing memory block, and can either add space to the end of this existing block (if possible), or allocate a larger block somewhere else and move existing data to that new location (based on the oldSize_Used parameter).

Implementing expand()

A couple of example implementations for expand() are as follows:

// in cMallocAllocator, using realloc()
    void*
    expand(
        void* oldPtr,
        uint32_t oldSize,
        uint32_t oldSize_Used,
        uint32_t newSize
        )
    {
        assert(oldPtr);
        assert(oldSize);
        assert(oldSize_Used <= oldSize);
        assert(newSize > oldSize);
        return realloc(oldPtr, static_cast<size_t>(newSize));
    }
// as allocate and move
    void*
    expand(
        void* oldPtr,
        uint32_t oldSize,
        uint32_t oldSize_Used,
        uint32_t newSize
        )
    {
        assert(oldPtr);
        assert(oldSize);
        assert(oldSize_Used <= oldSize);
        assert(newSize > oldSize);
        void* newPtr = allocate(newSize);
        memcpy(newPtr, oldPtr, static_cast<size_t>(oldSize_Used));
        deallocate(oldPtr);
        return newPtr;
    }

So this can either call through directly to something like realloc(), or emulate realloc() with a sequence of allocation, memory copy and deallocation operations.
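To make the calling pattern concrete, here's a minimal standalone sketch of a vector grow step driven through expand(), using the realloc()-based version above (this assumes trivially copyable elements, since expand() may move the data with a raw memory copy):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// standalone realloc-based expand, as in the cMallocAllocator version above
inline void* Expand(
    void* oldPtr,
    std::uint32_t oldSize,
    std::uint32_t oldSize_Used,
    std::uint32_t newSize
    )
{
    assert(oldPtr);
    assert(oldSize && oldSize_Used <= oldSize);
    assert(newSize > oldSize);
    return std::realloc(oldPtr, newSize);
}

// a vector capacity increase, expressed in terms of expand():
// the call either extends the block in place or moves the first
// 'size' elements to a new, larger block for us
inline bool GrowPreservesContents()
{
    std::uint32_t capacity = 4;
    std::uint32_t size = 4;
    std::int32_t* data = static_cast<std::int32_t*>(
        std::malloc(capacity * sizeof(std::int32_t)));
    for(std::uint32_t i = 0; i != size; ++i)
    {
        data[i] = static_cast<std::int32_t>(i * 10);
    }
    std::uint32_t newCapacity = capacity * 2;
    data = static_cast<std::int32_t*>(Expand(
        data,
        capacity * sizeof(std::int32_t),
        size * sizeof(std::int32_t),
        newCapacity * sizeof(std::int32_t)));
    bool preserved = true;
    for(std::uint32_t i = 0; i != size; ++i)
    {
        preserved = preserved && (data[i] == static_cast<std::int32_t>(i * 10));
    }
    std::free(data);
    return preserved;
}
```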

Benchmarking with realloc()

With this expand() method included in our allocator it's pretty straightforward to update our custom vector to use realloc(), and it's easy to see how this can potentially optimise memory use, but does this actually make a difference in practice?

I tried some benchmarking and it turns out that this depends very much on the actual memory heap implementation in use.

I tested this first of all with the following simple benchmark:

template <class tVector> static void
PushBackBenchmark(tVector& target)
{
    const int pattern[] = {0,1,2,3,4,5,6,7};
    const int patternLength = sizeof(pattern) / sizeof(*pattern);
    const int iterations = 10000000;
    int32_t patternI = 0;
    for(int32_t i = 0; i != iterations; ++i)
    {
        target.push_back(pattern[patternI]);
        ++patternI;
        if(patternI == patternLength)
        {
            patternI = 0;
        }
    }
}

(Wrapped up in some code for timing over a bunch of iterations, with result checking to avoid the push_back being optimised out.)

This is obviously very far from a real usage situation, but the results were quite interesting:

OS        container type             time
Linux     std::vector                0.0579 seconds
Linux     cVector without realloc    0.0280 seconds
Linux     cVector with realloc       0.0236 seconds
Windows   std::vector                0.0583 seconds
Windows   cVector without realloc    0.0367 seconds
Windows   cVector with realloc       0.0367 seconds

So the first thing that stands out from these results is that using realloc() doesn't make any significant difference on Windows. I double checked this, and while expand() is definitely avoiding memory copies a significant proportion of the time, this is either not significant in the timings, or the memory copy savings are being outweighed by some extra cost in the realloc() call. Maybe realloc() is implemented badly on Windows, or maybe the Windows memory heap is optimised for more common allocation scenarios at the expense of realloc(); I don't know. A quick google search shows that other people have seen similar issues.

Apart from that it looks like realloc() can make a significant performance difference, on some platforms (or depending on the memory heap being used). I did some extra testing, and it looks like we're getting diminishing returns after some of the other performance tweaks we made in our custom vector, specifically the tweaks to increase capacity after the first push_back, and the capacity multiplier tweak. With these tweaks backed out:

OS        container type                        time
Linux     cVector without realloc, no tweaks    0.0532 seconds
Linux     cVector with realloc, no tweaks       0.0235 seconds

So, for this specific benchmark, using realloc() is very significant, and even avoids the need for those other performance tweaks.

Slightly more involved benchmark

The benchmark above is really basic, however, and certainly isn't a good general benchmark for vector memory use. In fact, with realloc(), there is only actually ever one single allocation made, which is then naturally free to expand through the available memory space!

A similar benchmark is discussed in this stackoverflow question, and in that case the benefits seemed to reduce significantly with more than one vector in use. I hacked the benchmark a bit to see what this does for us:

template <class tVector> static void
PushBackBenchmark_TwoVectors(tVector& target1, tVector& target2)
{
    const int pattern[] = {0,1,2,3,4,5,6,7};
    const int patternLength = sizeof(pattern) / sizeof(*pattern);
    const int iterations = 10000000;
    int32_t patternI = 0;
    for(int32_t i = 0; i != iterations; ++i)
    {
        target1.push_back(pattern[patternI]);
        target2.push_back(pattern[patternI]);
        ++patternI;
        if(patternI == patternLength)
        {
            patternI = 0;
        }
    }
}
template <class tVector> static void
PushBackBenchmark_ThreeVectors(tVector& target1, tVector& target2, tVector& target3)
{
    const int pattern[] = {0,1,2,3,4,5,6,7};
    const int patternLength = sizeof(pattern) / sizeof(*pattern);
    const int iterations = 10000000;
    int32_t patternI = 0;
    for(int32_t i = 0; i != iterations; ++i)
    {
        target1.push_back(pattern[patternI]);
        target2.push_back(pattern[patternI]);
        target3.push_back(pattern[patternI]);
        ++patternI;
        if(patternI == patternLength)
        {
            patternI = 0;
        }
    }
}

With PushBackBenchmark_TwoVectors():

OS        container type             time
Linux     std::vector                0.0860 seconds
Linux     cVector without realloc    0.0721 seconds
Linux     cVector with realloc       0.0495 seconds

With PushBackBenchmark_ThreeVectors():

OS        container type             time
Linux     std::vector                0.1291 seconds
Linux     cVector without realloc    0.0856 seconds
Linux     cVector with realloc       0.0618 seconds

That's kind of unexpected.

If we think about what's going to happen with the vector buffer allocations in this benchmark, on the assumption of sequential allocations into a simple contiguous memory region, it seems like the separate vector allocations in the modified benchmark versions should actually prevent each other from expanding. And I expected that to reduce the benefits of using realloc. But the speedup is actually a lot more significant for these benchmark versions.

I stepped through the benchmark and the vector buffer allocations are being placed sequentially in a single contiguous memory region, and do initially prevent each other from expanding, but after a while the 'hole' at the start of the memory region gets large enough to be reused, and then reallocation becomes possible, and somehow turns out to be an even more significant benefit. Maybe these benchmark versions pushed the memory use into a new segment and incurred some kind of segment setup costs?

With virtual memory and different layers of memory allocation in modern operating systems, and different approaches to heap implementations, it all works out as quite a complicated issue, but it does seem fairly clear, at least, that using realloc() is something that can potentially make a significant difference to vector performance, in at least some cases!

Realloc() in PathEngine

Those are all still very arbitrary benchmarks, and it's interesting to see how much difference this actually makes for some real use cases. So I had a look at what difference the realloc() support makes for the vector use in PathEngine.

I tried our standard set of SDK benchmarks (with common queries in some 'normal' situations), both with and without realloc() support, and compared the timings for these two cases. It turns out that for this set of benchmarks, using realloc() doesn't make a significant difference to the benchmark timings. There are some slight improvements in some timings, but nothing very noticeable.

The queries in these benchmarks have already had quite a lot of attention for performance optimisation, of course, and there are a bunch of other performance optimisations already in the SDK that are designed to avoid the need for vector capacity increases in these situations (reuse of vectors for runtime queries, for example). Nevertheless, if we're asking whether custom allocation with realloc() is 'necessary or better' in the specific case of PathEngine vector use (and these specific benchmarks), the answer appears to be no: this doesn't really seem to make any concrete difference!

Memory customisation and STL allocators

As I've said above, this kind of customisation of the allocator interface (to add stuff like realloc() support) is something that we can't do with the standard allocator setup (even with C++11).

For completeness it's worth noting the approach suggested by Alexandrescu in this article where he shows how you can effectively shoehorn stuff like realloc() calls into STL allocators.

But this still depends on using some custom container code to detect special allocator types, and won't work with std::vector.
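The kind of compile-time detection involved could be sketched like this (C++11 SFINAE, with a hypothetical expand() member standing in for the extended allocator interface; this is not the code from Alexandrescu's article):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Detect at compile time whether an allocator provides an expand()
// member, so container code can try in-place expansion before falling
// back to allocate-copy-free. (Hypothetical interface for illustration.)
template <class A, class = void>
struct has_expand : std::false_type {};

template <class A>
struct has_expand<A,
    decltype(std::declval<A&>().expand(nullptr, std::size_t(0)), void())>
    : std::true_type {};

struct PlainAllocator
{
    void* allocate(std::size_t size);
    void deallocate(void* ptr);
};

struct ExpandingAllocator
{
    void* allocate(std::size_t size);
    void deallocate(void* ptr);
    bool expand(void* ptr, std::size_t newSize); // try to grow in place
};
```

A custom container can then branch on has_expand<A>::value, but std::vector has no such branch, which is why this approach needs custom container code.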

Conclusion

This has ended up a lot longer than I originally intended so I'll go ahead and wrap up here!

To conclude:

  • It's not so hard to implement your own allocator setup, and integrate this with a custom vector (I hope this post gives you a good idea about what can be involved in this)
  • There are ways to do similar things with the STL, however, and overall this wouldn't really work out as a strong argument for switching to a custom vector in our case
  • A custom allocator setup will let you do some funky things with memory allocation, if your memory heap will dance the dance, but it's not always clear that this will translate into actual concrete performance benefits

A couple of things I haven't talked about:

Memory fragmentation: custom memory interfaces can also be important for avoiding memory fragmentation, and this can be an important issue. We don't have a system in place for actually measuring memory fragmentation, though, and I'd be interested to hear how other people in the industry actually quantify or benchmark this.

Memory relocation: the concept of 'relocatable allocators' is quite interesting, I think, although this has more significant implications for higher level vector based code, and requires moving further away from standard vector usage. This is something I'll maybe talk about in more depth later on..

Comments (discussion closed)

Marek Knápek, 22 January 2014 15:42

Hi, I have a few comments on your article:

You're probably missing alignment in your allocate_Array(std::size_t numElems, T*& result) helper function, but I'm not sure, because malloc() is supposed to return a pointer aligned to suit the needs of any (built-in) data type. But what if you/your customer defines a custom class with stronger alignment requirements? For example, to be able to work with SSE/AVX and similar data and instructions.

In your custom vector constructor you are taking the allocator by reference and storing a pointer to it in a data member. It is not clear who is responsible for destroying such an allocator. And in the default case you are taking the address of a temporary(!) GlobalAllocator instance – I don't know how this could work. Perhaps std::unique_ptr with zero space and time overhead (compared to a raw pointer) could help here (it also has a customizable deleter).

You are saying that you could omit the capacity member if you use the realloc feature. I'm not sure about this: (your custom) vector should satisfy the amortised constant complexity of the push_back operation. That means a multiplicative growth rate (1.5 or 2 or whatever times the old capacity). But I think it could not do that with realloc, which can return an arbitrary (smaller) value. Sorry, it is possible, I realized that later.

Realloc works badly on Windows: it is probably caused by the heap implementation; for example the LFH (low fragmentation heap) has a bucket for each allocation size from 1 to 64k bytes (I guess). So it returns exactly the size you asked for (no in-place realloc possible here).

I read somewhere that on systems with a „good“ overcommit setup, address space size (32-bit is low) and virtual memory management, it is possible to do a huge (1GB) allocation for your vector buffer. Then you will not need to do any reallocation – so no need for capacity tracking and grow logic. And you will not run out of memory either, because the OS will track unused/untouched pages and commit them only when needed; this will bring some fragmentation at the OS level, but it will be invisible to your application.

Thanks for the article, Marek.

Thomas Young, 23 January 2014 10:05

Hi Marek,

- You're probably missing alignment in your allocate_Array(std::size_t
numElems, T*& result) helper function, but I'm not sure, because
malloc() is supposed to return a pointer aligned to suit the needs of
any (built-in) data type.

Memory returned by malloc() should be aligned for any data type, as I understand, so an alignment value is not required in the general case. But this can definitely be worthwhile in specific situations where there is a memory heap that supports different alignments. The reason we don't have this currently in PathEngine is then basically just because none of our clients have asked for this. (I talked about this a bit, in fact, in relation to the Bitsquid allocator setup.)
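For heaps that only guarantee malloc()-style alignment, stronger alignment can be layered on top; a rough sketch (illustrative only, not PathEngine code, and assuming a power-of-two alignment):

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch of an aligned allocation helper built on a malloc()-style
// heap, e.g. for 32-byte aligned AVX data. Over-allocates, and stashes
// the original pointer just before the aligned address so it can be
// recovered on free. Requires a power-of-two alignment.
void* alignedAlloc(std::size_t size, std::size_t alignment)
{
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if(!raw)
        return nullptr;
    std::uintptr_t start =
        reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned = (start + alignment - 1) & ~(alignment - 1);
    reinterpret_cast<void**>(aligned)[-1] = raw; // remember original block
    return reinterpret_cast<void*>(aligned);
}

void alignedFree(void* ptr)
{
    if(ptr)
        std::free(reinterpret_cast<void**>(ptr)[-1]);
}
```

A heap that supports alignment natively would avoid the extra padding, but this shows the kind of thing that can be done without changing the underlying allocation interface.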

- In your custom vector constructor you are taking allocator by reference
and storing pointer to it in data member. It is not clear who is
responsible for destroying such an allocator.

Yes, I agree that this is a bit messy, and should probably be improved!

- I read somewhere that on systems with a „good“ overcommit setup,
address space size (32-bit is low) and virtual memory management, it is
possible to do a huge (1GB) allocation for your vector buffer. Then you
will not need to do any reallocation – so no need for capacity tracking
and grow logic.

Yes, that's an interesting point.
And an extension of this idea could be to set up a small number of these objects, which are essentially extra stack pointers, in addition to the actual program stack.
I thought about this as one way to solve the 'run time sized buffers' requirement I talked about in my first post in the series.
The point is that the query code paths that lead to this run time sized buffers requirement generally use just a small number of actual vector buffers, i.e. up to about 3, and then these could each be treated in parallel with what would be essentially just very fast stack allocation.

I find this all quite interesting but it's all a bit academic for me at the moment (unfortunately!), since query buffer allocations are not a significant issue (or at least none of our clients are reporting this as an issue) and either no-one has any memory fragmentation issues with PathEngine or nobody is reporting any memory fragmentation issues..
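For what it's worth, the underlying reservation mechanism on POSIX-like systems could be sketched like this (illustrative only; names are mine, and commit-on-first-touch behaviour depends on the OS and its overcommit settings):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Sketch of reserving a large virtual range for a vector buffer so it
// never needs to move; physical pages are only committed by the OS when
// first touched. POSIX-specific, and dependent on overcommit behaviour.
void* reserveBuffer(std::size_t maxBytes)
{
    void* p = mmap(nullptr, maxBytes,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

void releaseBuffer(void* p, std::size_t maxBytes)
{
    if(p)
        munmap(p, maxBytes);
}
```

With a setup like this, 'growing' the buffer is a no-op, which is what makes it attractive as a kind of extra stack.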

Aaron MacDougall, 31 March 2014 20:36

Hi, I'm a bit late to the party but I also have a few comments on the topic in general. Sorry for the long post but I find the topic very interesting!

Like you I use allocators with a virtual interface and all custom containers. In fact I require a valid allocator object to be passed into the constructor or to the initialisation function (sometimes the allocator object isn't available at the point of container construction). I feel that memory budgets and choosing the correct allocator are important throughout the code so I made it impossible to allocate anything without passing in an allocator pointer.

An unexpected micro-optimisation I made to my vector was to use type traits to decide when it is safe to not call placement new at all. It turns out that placement new has an extra branch to check for a null pointer, at least with the Visual C++ compiler. As far as I can tell this is required by the language.
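The trait-based construction skip Aaron describes could be sketched like this (hypothetical helper name; his actual check and types may differ):

```cpp
#include <new>
#include <type_traits>

// Sketch of using type traits to skip construction entirely for types
// that don't need it, avoiding the null-pointer branch some compilers
// emit for placement new. (Hypothetical helper for illustration.)
template <class T>
void constructRange(T* begin, T* end)
{
    if(!std::is_trivially_default_constructible<T>::value)
    {
        for(; begin != end; ++begin)
            ::new (static_cast<void*>(begin)) T();
    }
    // trivially default constructible types are left uninitialised,
    // matching the behaviour of a raw array
}
```

With C++17 this would naturally be an `if constexpr`, but a plain runtime check on a compile-time constant is also optimised out in practice.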

For pushing back I think I had similar performance improvements to your vector over std::vector. Every single function ended up at least a little faster than the stl version. The really big performance benefits I had came from custom allocators (I also profiled std::vector with the same allocator for fairness) and from adding more efficient functions such as PushBackAssumeSpace(). Kind of verbose but it clearly says what it does. I observed that often our code knew the number of items to reserve, so why bother with all the extra work inside PushBack() to check if a resize should occur (except for asserts in debug obviously)? Using that function the inline code basically boils down to pure array accesses without any branches or unnecessary code polluting the instruction stream.
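A PushBackAssumeSpace()-style fast path might look roughly like this (a minimal hypothetical container, just to show the shape of the idea):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a push-back fast path for when the caller has already
// reserved enough capacity: no resize check, just an array write.
// (Minimal hypothetical container over externally supplied storage.)
template <class T>
class SimpleVector
{
    T* _data;
    std::size_t _size;
    std::size_t _capacity;
public:
    SimpleVector(T* storage, std::size_t capacity)
        : _data(storage), _size(0), _capacity(capacity) {}

    void pushBackAssumeSpace(const T& value)
    {
        assert(_size < _capacity); // debug-only guard, no cost in release
        _data[_size++] = value;
    }

    std::size_t size() const { return _size; }
    const T& operator[](std::size_t i) const { return _data[i]; }
};
```

The inline code then boils down to an increment and a store, with no branch polluting the instruction stream on the hot path.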

Regarding relocatable allocators, at one point I implemented support for relocatable memory handles to allow for memory defragmentation. Memory handles could either be relocatable or fixed depending on whether the internal "pointer" had the least significant bit set. This was really interesting from an implementation point of view, but I would warn others that it can come with some big disadvantages, at least in the way I did it. The most important one is that it permeated the entire codebase. I had to make sure that nobody held onto raw pointers when it wasn't safe, add memory locking for dealing with threading, and I had to create custom containers with interfaces that weren't particularly stl-like. Actually this memory system was the reason I wrote my own containers in the first place. For example, I discovered that because of the indirect pointer lookup it was inefficient to allow array indexing on vectors, and iterators weren't particularly suitable either because of the need to do the indirect pointer lookup multiple times for the begin and end iterators. In the end I used ranges instead, which are like a pair of iterators representing the beginning and end of a range, but where the pointer lookup is only done once.
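The tagged-handle representation described there could be sketched roughly like this (hypothetical names and layout, not the actual code):

```cpp
#include <cstdint>

// Sketch of a tagged memory handle: the least significant bit of the
// internal value distinguishes fixed pointers (bit clear, valid because
// heap pointers are at least 2-byte aligned) from relocatable handles
// (bit set; the rest is an index into a table that a defragmenter may
// update). Hypothetical layout for illustration only.
class MemoryHandle
{
    std::uintptr_t _value;
    explicit MemoryHandle(std::uintptr_t v) : _value(v) {}
public:
    static MemoryHandle makeFixed(void* p)
    {
        return MemoryHandle(reinterpret_cast<std::uintptr_t>(p));
    }
    static MemoryHandle makeRelocatable(std::uint32_t tableIndex)
    {
        return MemoryHandle((std::uintptr_t(tableIndex) << 1) | 1);
    }
    bool isRelocatable() const { return (_value & 1) != 0; }
    void* fixedPointer() const { return reinterpret_cast<void*>(_value); }
    std::uint32_t tableIndex() const
    {
        return std::uint32_t(_value >> 1);
    }
};
```

Every dereference then has to branch on the tag and potentially go through the table, which is exactly the indirection cost that made stl-style indexing and iterators awkward.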

After using the system for a while I decided to scrap it because of those disadvantages, and because I found that memory fragmentation can generally be dealt with more efficiently by thoughtful allocation patterns at the source. At the time I was also targeting a platform with relatively slow memory, so the cost of moving memory around was kind of high. My containers now use interfaces which are very similar to stl but still support ranges, for compatibility and because they are nice in some circumstances, such as returning a single range object from a function call.

Thomas Young, 01 April 2014 07:53

Hi Aaron,

Thanks for your comment.

That's some really valuable return from experience there regarding relocatable allocators, and the point about null pointer checking in placement new is also very interesting!

What do you think about the idea I talked about with Marek (in the comments below) about using large overcommitted buffers to create what are effectively additional program stacks?

It seems like this could potentially be a good solution (on some platforms, where this is well supported) for issues like the 'runtime resized buffers' issue I talked about in the first post in the series ('Using STL Vectors').

The point is that the PathEngine query code only ever requires dynamic resizing for a very small number of buffers at any one time, and then, if you have that many stacks available, these buffers could all use simple stack allocation..

Aaron MacDougall, 01 April 2014 19:32

To be honest, until recently I've never considered using virtual memory in a game because of differing levels of OS support on consoles and 32-bit address ranges. On consoles I've always allocated all the physical memory at startup and set strict budgets.

Having said that, on a 64-bit machine with decent virtual memory support the options become interesting. I think you would want to limit the number of vectors using that strategy, because you could waste memory for vectors that should be using only a small amount of memory, but at least one page has to be committed, and the page size could be as large as 2MB. Also I think at least one platform would require manual commits, so the growing logic would still be required, and the OS cost could be quite high. Another thing to consider is that not the whole 64-bit address range will be available. I think current processors limit the range to 48 bits, and then the OS may support only 8TB. Of course that is quite a lot, but it could run out if the same strategy is used throughout a large codebase.

There is one more disadvantage that I can think of, but whether it matters probably depends on your requirements and workflow. I think when dealing with huge address ranges it could make it more difficult to track memory and budgets. Personally I like having a simple memory map that I can view in debugging tools for tracking budgets and debugging fragmentation.

On another topic, I've seen a lot of code in the past that will place a vector object on the stack and call a function that takes a reference in order to get a list of results. The temporary vector is then thrown away almost immediately once the results have been analysed. I think this is one of the cases that you have been trying to deal with (except you keep the objects around as an optimisation). In some cases there isn't really a way around that, but in non-critical code paths I sometimes break the algorithm up into a custom iterator class which processes one result at a time. Because the user code iterates over the results as they are being produced, you can get away with no temporary allocations. Some algorithms can be tricky to break up in this way, but it can restore peace of mind about the temporary allocations and the potential fragmentation.
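The streaming-iterator idea could be sketched like this (a toy hypothetical example, iterating even numbers instead of real query results):

```cpp
// Sketch of a streaming result iterator: instead of filling a temporary
// vector with results, the algorithm is wrapped in an object that
// produces one result per advance(), so no temporary allocation occurs.
// (Toy example: even numbers in a half-open range.)
class EvenNumberIterator
{
    int _current;
    int _end;
public:
    EvenNumberIterator(int begin, int end)
        : _current(begin % 2 == 0 ? begin : begin + 1), _end(end) {}

    bool atEnd() const { return _current >= _end; }
    int value() const { return _current; }
    void advance() { _current += 2; }
};
```

User code then loops with atEnd()/value()/advance() and handles each result as it is produced, with no intermediate buffer.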

I suppose another option would be to use callbacks or function objects to inject result handling into the middle of processing. This would also get rid of the need for temporary buffers.