Dec. 15, 2013, 12:14 p.m.

Roll-Your-Own Vector

(Third in a series of posts about Vectors and Vector based containers.)

STL vectors offer a great combination of dynamically resizing convenience and low level contiguous buffer accessing performance, but std::vector is not the only option, and custom implementations of this container can sometimes be a better fit for your specific requirements.

Roll your own

It seems like it's fairly common practice in the game development industry to replace std::vector (and other standard containers) with custom STL-style vector or array implementations.

Some public examples are the EASTL (Electronic Arts Standard Template Library) and the Bitsquid Foundation Library, and I know that a lot of other developers also do something similar.

I think that a lot of people may have been bitten in the past by STL containers, or at least found that they were able to resolve issues or significantly improve performance by replacing STL containers with their own code. This was definitely the case with PathEngine.

We switched to a 'home grown' version of std::vector some years back as a result of some concrete run time performance issues, and saw some quite significant improvements from this change.

Two parts

There are potentially two main reasons for switching:

  • issues with the mechanism for custom memory allocation for STL containers, and
  • performance considerations

For some people memory allocation is the main reason for switching to a custom vector implementation, but this wasn't an issue for us when we first made the switch. At that time, memory allocation customisation in PathEngine was fairly minimal (based on overriding global new and delete, and only possible for DLL builds) and we didn't use STL allocators.

The initial motivation for switching PathEngine to a custom vector was performance, and I'll be looking exclusively at the performance side of things in this post, but will come back to look at custom allocation issues separately later on.

Initial issue and expectations

The initial issue for us was std::vector methods showing up as significant in performance profiling for certain operations, with clients telling us that they needed these operations to run faster.

Most of our vectors contain simple POD (or at least trivially copyable) elements. I had read the 'Generic typed buffers' series of articles by Andrei Alexandrescu on Dr Dobbs (part 1 here, part 2 here, part 3 here, but you have to register to read more than 2 articles), and I was expecting to see the biggest improvements from type specialisations for these elements, but this turned out not to be the most significant issue.

Basic vector implementation

Let's have a look at what a basic vector implementation might look like.

We'll make some simplifications.

In particular we'll avoid adding any iterator classes and just use index values or raw pointers to elements for iteration. But we'll also ignore things like exceptions. (As with a lot of other game development code, PathEngine doesn't throw or handle exceptions.) And we'll stick to C++03 for backwards compatibility.

Let's start with the following:

#include <string.h>
#include <stdlib.h>
#include <new>

template <class T>
class cVector
{
public:
    typedef size_t size_type;
private:
    size_type _size;
    size_type _capacity;
    T* _data;

    static T*
    allocate(size_type size)
    {
        return static_cast<T*>(malloc(sizeof(T) * size));
    }
    static void
    copyRange(T* begin, T* end, T* dest)
    {
        while(begin != end)
        {
            new((void*)dest) T(*begin);
            ++begin;
            ++dest;
        }
    }
    static void
    deleteRange(T* begin, T* end)
    {
        while(begin != end)
        {
            begin->~T();
            ++begin;
        }
    }

public:
    typedef T* iterator;
    typedef T value_type;

    cVector()
    {
        _size = 0;
        _capacity = 0;
        _data = 0;
    }
    ~cVector()
    {
        deleteRange(_data, _data + _size);
        free(_data);
    }

    // needs some other constructors
    // size(), empty() and a bunch of other methods!

    void
    push_back(const T& value)
    {
        if(_size != _capacity)
        {
            new((void*)(_data + _size)) T(value);
            ++_size;
            return;
        }
        size_type newCapacity = _capacity ? _capacity * 2 : 1;
        T* newData = allocate(newCapacity);
        copyRange(_data, _data + _size, newData);
        new((void*)(newData + _size)) T(value);
        deleteRange(_data, _data + _size);
        free(_data);
        _data = newData;
        _capacity = newCapacity;
        ++_size;
    }
    T&
    operator[](size_type index)
    {
        return _data[index];
    }
    const T&
    operator[](size_type index) const
    {
        return _data[index];
    }
    T*
    begin() const
    {
        return _data;
    }
    T*
    end() const
    {
        return _data + _size;
    }
};

This misses out a lot of the vector interface, obviously, but still manages to give us what are arguably the key benefits of STL style vectors (amortized constant time push_back and automatic buffer memory management) in quite a small amount of code.

Note that we can continue using a load of other stuff from the standard library (such as sort algorithms), because of the way STL design decouples algorithms from actual container implementations through the iterator concept. Raw pointers into our vector buffer meet the same requirements for random access iterators as the iterators returned by std::vector.
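
For example (a minimal sketch; sortExample is just an illustrative name):

#include <algorithm>

void sortExample()
{
    cVector<int> v;
    v.push_back(3);
    v.push_back(1);
    v.push_back(2);
    // begin() and end() return raw pointers, which satisfy the random
    // access iterator requirements
    std::sort(v.begin(), v.end());
}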

The code is pretty straightforward, with one slightly tricky aspect: placement new. This is kind of fundamental to vector containers, and so it's important to understand a bit about what's going on here.

We're using malloc and free for underlying buffer allocation, with constructor and destructor calls invoked separately from actual memory allocation. The call to new((void*)dest) T(*begin) constructs an object of type T at the (already allocated) memory location pointed to by dest, using T's copy constructor with argument *begin. The call to begin->~T() calls T's destructor without freeing the underlying memory.
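
As a standalone illustration (hypothetical code, separate from the vector class), the full manual object lifecycle looks like this:

#include <stdlib.h>
#include <new>

struct cThing
{
    int _value;
    explicit cThing(int value) : _value(value) {}
};

void lifecycleExample()
{
    // allocation and construction as two separate steps
    void* memory = malloc(sizeof(cThing));
    cThing* thing = new(memory) cThing(42); // placement new, no allocation

    // ... use *thing ...

    // destruction and deallocation, also as two separate steps
    thing->~cThing(); // explicit destructor call, memory not freed
    free(memory);
}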

Malloc and free will do for now, but I'll look at customising the underlying memory allocation methods in my next article.

A lot of stuff like size(), empty(), front(), back() has been omitted to save space, but can be added easily enough.
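
For reference, here's a sketch of some of those methods in the same style (const overloads omitted; clear() here deliberately keeps the buffer allocation, which ties in with the semantics discussion below):

    size_type
    size() const
    {
        return _size;
    }
    bool
    empty() const
    {
        return _size == 0;
    }
    T&
    front()
    {
        return _data[0];
    }
    T&
    back()
    {
        return _data[_size - 1];
    }
    void
    clear()
    {
        // destroys the elements but deliberately keeps the buffer
        // allocation in place
        deleteRange(_data, _data + _size);
        _size = 0;
    }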

Doing it yourself

So far it's unlikely that this code will actually give us any speedups over standard library versions, at least in release builds, but there are already some interesting implications and potential advantages of 'doing it yourself' like this:

  • We can control exact container semantics, independent of compiler version
  • The code is simpler and easier to understand
  • All of the code is available to us
  • It can work out faster in debug builds
  • There are very minimal dependencies

Knowing exactly what semantics will be applied for underlying buffer allocation can be important in order to use vectors appropriately; for example, it's nice to know for sure whether calling clear() will free the underlying buffer (as discussed in this previous post).

If you step into the push_back operation of your standard library vector implementation in a debugger, you'll probably find this goes through a bunch of cryptically named template methods, which may then also end up with some calls to standard library binary components (for which source may not be available). It's not so easy to understand what's going on, and although this should all get optimised out in release builds, debug builds are obliged to actually perform each of these calls (so that everything makes sense in the debugger).

And the vector implementation in your standard library also probably depends on a bunch of other stuff, with the vector header pulling in a bunch of other headers. In many situations we can depend on the C++ standard library being available and working correctly, but sometimes this isn't the case, and extra dependencies can be an issue. A case in point was the Android NDK, particularly in early versions, where C++ standard library support was limited and it wasn't possible to depend on all of this working completely.

But beyond the above points, I think it's great just to be able to have a look under the hood, understand a bit about how this stuff works (or can work), and dispel some of the mystery that can otherwise shroud our standard library invocations!

Reinventing the wheel

Conversely, there are a lot of reasons not to do something like this. We're told not to 'reinvent the wheel', and this is pretty good general advice.

Standard vector implementations get a lot of testing, and there may be non-obvious corner cases or tricky situations which we don't consider, but which are handled correctly by std::vector.

A vector implementation written by your compiler vendor may take advantage of compiler specifics, or otherwise work together with the compiler to get better performance than an implementation written in generic C++.

Cutting stuff out of the vector interface means that some standard vector based code won't work, and other semantic differences from std::vector can be a potential source of confusion.

Standard vectors may be tooled up to the teeth with stuff like debug checking instrumentation, which can help catch some tricky bugs. (But note that this is partly orthogonal to the issue of faster debug builds, because we can add some debug checking back in without the same kind of call complexity as standard vector implementations.)

Also, standard vector implementations are under development by an external party, and if we stick with the standard implementation we can get the benefits from new releases in the future without any work on our side (kind of like middleware licensing).

One case in point is C++11 move semantics. The PathEngine vector implementation was written before move semantics, and so unless we update the vector class implementation you probably won't get much benefit from implementing move constructors in the contained elements when using this class, whereas if we had stuck with std::vector this is something we would have got 'for free'. (In practice PathEngine avoids using non-simple element types for vectors in performance critical code, so move semantics are not so important for us, but the basic point about external updates and improvements remains valid.)

Some good points on both sides of the argument then, but keeping these various considerations in mind, let's go on to look at what we can actually do with custom vector code in place..

Resize methods

We'll need to look at what goes on in the resize() methods, so let's go ahead and add these:

    void
    resize(size_type newSize)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize <= _capacity)
        {
            constructRange(_data + _size, _data + newSize);
            _size = newSize;
            return;
        }
        size_type newCapacity = newSize;
        if(newCapacity < _size * 2)
        {
            newCapacity = _size * 2;
        }
        reallocate(newCapacity);
        constructRange(_data + _size, _data + newSize);
        _size = newSize;
    }
    void
    resize(size_type newSize, const T& fillWith)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize <= _capacity)
        {
            constructRange(_data + _size, _data + newSize, fillWith);
            _size = newSize;
            return;
        }
        size_type newCapacity = newSize;
        if(newCapacity < _size * 2)
        {
            newCapacity = _size * 2;
        }
        reallocate(newCapacity);
        constructRange(_data + _size, _data + newSize, fillWith);
        _size = newSize;
    }

These depend on the following additional private helper methods:

    void
    reallocate(size_type newCapacity)
    {
        T* newData = allocate(newCapacity);
        copyRange(_data, _data + _size, newData);
        deleteRange(_data, _data + _size);
        free(_data);
        _data = newData;
        _capacity = newCapacity;
    }
    static void
    constructRange(T* begin, T* end)
    {
        while(begin != end)
        {
            new((void*)begin) T();
            ++begin;
        }
    }
    static void
    constructRange(T* begin, T* end, const T& fillWith)
    {
        while(begin != end)
        {
            new((void*)begin) T(fillWith);
            ++begin;
        }
    }

Potential gotcha: push back reference to an existing element

We've moved some shared logic out into that reallocate() method. This is quite similar to some code in the push_back() method, and it can be tempting to use the same reallocate call there also, but there's a subtle trap to watch out for.

The temptation is to move copy construction of the pushed element after the buffer copy and delete so we can refactor push_back() as follows:

    void
    push_back(const T& value)
    {
        if(_size != _capacity)
        {
            new((void*)(_data + _size)) T(value);
            ++_size;
            return;
        }
        size_type newCapacity = _capacity ? _capacity * 2 : 1;
        reallocate(newCapacity);
        new((void*)(_data + _size)) T(value); // this line moved to after the buffer delete
        ++_size;
    }

But this will now fail (potentially unpredictably!) for certain cases where the value being pushed back actually references back into the existing vector buffer, for example with something like v.push_back(v.front()).

There's some discussion about this in this stackoverflow question, but it looks like standard vector implementations all support this kind of push_back operation, and so this is something we should try to support in our custom vector.
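
For example (a contrived sketch):

cVector<int> v;
v.push_back(1); // _size == _capacity == 1 after this
// the next call has to reallocate; with the refactored version above,
// 'value' still points into the old buffer, which is freed before the
// new element is copy constructed
v.push_back(v.front());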

[Update: turns out this gotcha got me again! As pointed out by DeadMG in the comments on a later post, the same issue also applies to one of the resize() methods I've posted here, e.g. with v.resize(v.size() + 3, v.back()).]

Code repetition, resizing without a fill value

There's still a lot of code repetition between the two resize methods, but we want to make sure that this is fast, and so I'm kind of treating this as the first priority. For example, we could combine these into a single method by setting a default value for the fillWith argument (fillWith = T()), but this would most likely result in a load of unnecessary copy operations for the case where no fill value is actually required.

In fact, we've added a variation on the placement new call for that no-fill construction case. Let's look at this a bit more closely.

Construction subtleties and zero initialisation

To construct elements without a fill value in the above code, we use new((void*)begin) T(). The idea is to call the default constructor for class T with memory already allocated, but it turns out that there are some subtleties here in certain cases, notably when T is a built-in type.

It turns out that when this version of placement new is called for a built-in type we get 'zero initialisation', but we can rewrite this as new((void*)begin) T (no brackets!) and get 'default initialisation'.

There's a good explanation of the difference between these two initialisation types on this stackoverflow question.

Note that the same issue applies on vector construction when we do something like cVector<int> v(10), which uses quite similar code but with the corresponding constructor not shown in the example code.
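
A standalone sketch of the difference, for a built-in element type:

#include <stdlib.h>
#include <new>

void initialisationExample()
{
    void* memory = malloc(sizeof(int));

    int* zeroed = new(memory) int(); // zero initialisation: *zeroed == 0

    int* untouched = new(memory) int; // default initialisation: the value
                                      // of *untouched is indeterminate
    (void)zeroed;
    (void)untouched;
    free(memory);
}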

When we first switched to a custom vector implementation we decided to change these construction semantics and use default initialisation, and this turned out to be the main concrete benefit in terms of actual speedups for our vector use cases.

I tested this again with the current SDK build, in a modern compiler environment (Clang 3.0, with -std=c++0x), and this still makes a significant difference in certain situations, although now mostly limited to specific load and save methods.

For example, loading unobstructed space for our flatshead_dungeon benchmark mesh:

  • with standard zero initialisation semantics takes 0.000143 seconds
  • with modified default initialisation semantics takes 0.000113 seconds

It looks like the difference in this specific case is then actually mainly down to the following code for loading persistent data into a buffer:

// _backStore is a cVector<char>
// 'is' is an input stream object
    uint32_t arraySize;
    is.openNextElementAsArray_GetSize(arraySize);
    _backStore.resize(arraySize);
    is.finishReadElementAsArray(&_backStore[0]);

So with standard vector semantics this will normally zero the buffer before loading, and this can be a noticeable overhead.

If we want _backStore to correctly report its size, and we're avoiding nasty hacks, it doesn't look like there is any way to avoid this overhead with a standard vector implementation, but we could decide just not to use a vector in this case and revert to raw buffer allocation.

Changed semantics

When using our custom vector implementation with this change we need to be aware of the changed semantics, but to be quite honest I actually wasn't aware of this aspect of std::vector behaviour (elements being zeroed on vector construction or resize) before we switched, and so was already writing code as if element values were undefined.

I guess an alternative approach for the resize() method could be to add a separate method called resize_Uninitialised(), allowing this to be chosen explicitly by the calling code, but it's not so easy to provide this kind of choice on construction.
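
For what it's worth, a minimal sketch of how that hypothetical resize_Uninitialised() might look (only valid for element types where skipping construction and destruction is ok, such as built-in types):

    void
    resize_Uninitialised(size_type newSize)
    {
        // only valid for element types with trivial construction
        // and destruction, e.g. built-in types
        if(newSize <= _size)
        {
            _size = newSize;
            return;
        }
        if(newSize > _capacity)
        {
            size_type newCapacity = newSize;
            if(newCapacity < _size * 2)
            {
                newCapacity = _size * 2;
            }
            reallocate(newCapacity);
        }
        // no element construction
        _size = newSize;
    }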

Standalone benchmark for vector resize

Checking this against std::vector, for completeness, with the following benchmarking code (with Clang 3.0 and -std=c++0x):

template <class tVector> static int
ResizeBenchmark()
{
    int sum = 0;
    for(int i = 0; i != 400; ++i)
    {
        tVector v;
        v.resize(100000);
        v.back() += 1;
        sum += v.back();
    }
    return sum;
}

I get:

container type                   build    time            sum
std::vector                      debug    0.0924 seconds  400
std::vector                      release  0.0125 seconds  400
cVector, zero initialisation     debug    0.0930 seconds  400
cVector, zero initialisation     release  0.0338 seconds  400
cVector, default initialisation  debug    0.0002 seconds  79801
cVector, default initialisation  release  0.0001 seconds  79801

The sum variable is there to help ensure the benchmark logic is not optimised out, but also shows us that different semantics are being applied (and we can see that, in the default initialisation cases, the OS is actually giving us back the same buffer again each time through the loop).

In general stand-alone benchmarks like this should be taken with a handful of salt (because of cache issues and because the actual effects on real code performance will usually be quite different), but can be useful at least to demonstrate that there is a difference.

(Note that there can be other ways to avoid the initialisation cost shown in this benchmark, however. The next post in this series discusses this in more detail.)

Type specialisation

The other main change we made was to detect POD types and use specialised code paths for vectors with these element types.

The point is that, for built-ins and certain user defined types, constructor and destructor calls will have no effect and can be omitted, and copy construction can be replaced with memcpy.

With C++11 this becomes quite straightforward, as we can use is_trivial and is_trivially_copyable to ask the compiler whether or not the corresponding shortcuts can be applied for a given type.
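
For illustration, a sketch of how that dispatch can look in C++11 (assuming a standard library recent enough to supply std::is_trivially_copyable; shown here as free functions rather than the class members used elsewhere in this post):

#include <string.h>
#include <new>
#include <type_traits>

template <class T> void
copyRange(T* begin, T* end, T* dest, std::true_type) // trivially copyable
{
    memcpy(dest, begin, (end - begin) * sizeof(T));
}
template <class T> void
copyRange(T* begin, T* end, T* dest, std::false_type)
{
    while(begin != end)
    {
        new((void*)dest) T(*begin);
        ++begin;
        ++dest;
    }
}
template <class T> void
copyRange(T* begin, T* end, T* dest)
{
    copyRange(begin, end, dest,
        typename std::is_trivially_copyable<T>::type());
}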

Before C++11, however, we had to set this up for ourselves, and we used the following mechanism for this:

struct cPOD_Tag {};
struct cNonPOD_Tag {};
template <class T>
struct cTypeTraitsHelper
{
    typedef cNonPOD_Tag tPODType;
};
#define DECLARE_POD_TYPE(t) template<> struct cTypeTraitsHelper<t*> { typedef cPOD_Tag tPODType; };
DECLARE_POD_TYPE(int)
DECLARE_POD_TYPE(char)
//...... add declarations for other types, as required
template <class T>
typename cTypeTraitsHelper<T>::tPODType PODType(T&)
{
    return typename cTypeTraitsHelper<T>::tPODType();
}

With this set up appropriately we can then switch our buffer helper methods depending on whether or not the contained element is declared as a POD type, for example:

    static void
    constructRange(T* begin, T* end, cPOD_Tag)
    {
    }
    static void
    constructRange(T* begin, T* end, cNonPOD_Tag)
    {
        while(begin != end)
        {
            new((void*)begin) T;
            ++begin;
        }
    }
    static void
    constructRange(T* begin, T* end)
    {
        constructRange(begin, end, PODType(begin));
    }

(And similarly for stuff like destruction, copy construction and buffer fill.)
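
For example, copy construction might be switched like this, replacing the copyRange() helper shown earlier (memcpy is declared in string.h, which we already include):

    static void
    copyRange(T* begin, T* end, T* dest, cPOD_Tag)
    {
        // trivially copyable elements can be block copied
        memcpy(dest, begin, (end - begin) * sizeof(T));
    }
    static void
    copyRange(T* begin, T* end, T* dest, cNonPOD_Tag)
    {
        while(begin != end)
        {
            new((void*)dest) T(*begin);
            ++begin;
            ++dest;
        }
    }
    static void
    copyRange(T* begin, T* end, T* dest)
    {
        copyRange(begin, end, dest, PODType(begin));
    }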

This kind of type specialisation made some difference to performance historically, but in practice it turned out that the compiler is often clever enough to optimise this stuff out by itself, and the actual performance difference resulting from this change was not hugely significant.

I've tested this again in the current release (with Clang 3.0, with and without -std=c++0x), and if I remove all the POD type declarations it looks like the only stuff still significantly affected is some methods for saving preprocess objects to persistence, which are not so performance critical.

Tweaking initial push_back behaviour

Another optimisation is to adjust the initial behaviour when calling push_back on an empty vector, to avoid small allocations.

After some benchmarking we changed the following line in the push_back() method:

    size_type newCapacity = _capacity ? _capacity * 2 : 1;

to

    size_type newCapacity = _capacity ? _capacity * 2 : 8;

What this does is to avoid initial allocations of size 1, 2 and 4 and jump straight to a buffer of size 8, which works out as a net performance gain for us.

If you think about what's going on in the underlying memory allocation subsystem this makes sense, because small allocations are probably usually going to be aligned and padded up to a certain minimum size anyway.

In-class buffers

Another way to avoid small allocations can be to reserve a fixed amount of space in the vector itself for use when the buffer allocation requirement is below a certain size.

I know of at least one developer who uses this technique, but we don't currently do this at PathEngine.
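
As a rough sketch of the general idea (hypothetical code, not PathEngine's implementation; note that a real version also needs to guarantee correct alignment of the in-class buffer for T, which takes some care in C++03):

template <class T, int N>
class cVectorWithInternalBuffer
{
    typedef size_t size_type;
    size_type _size;
    size_type _capacity;
    T* _data;                       // points at _internal until outgrown
    char _internal[N * sizeof(T)];  // alignment for T needs handling!

    bool
    usingInternalBuffer() const
    {
        return _data == reinterpret_cast<const T*>(_internal);
    }
    // allocate() returns _internal for requests of N elements or fewer,
    // and the destructor and reallocate() must then only call free() for
    // buffers that were actually heap allocated
};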

Amortize faster!

The above code doubles vector capacity each time a new buffer is allocated to ensure amortized constant time push_back, but there's nothing special about the number 2 here. In fact, we should be able to use any constant > 1 and still get this amortized complexity.

There's a trade-off here between number of allocations (which take time) and buffer overallocation (which costs memory, and also potentially reduces cache efficiency), but in PathEngine, changing this constant to 3 seems to give us a net benefit:

    size_type newCapacity = _capacity ? _capacity * 3 : 8;

Note that it's all the more important with this setting to make sure we avoid permanent overallocations wherever possible, e.g. by applying the 'shrink-to-fit' idiom, as discussed here.
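
With the custom class we can also provide this directly as a method; a minimal sketch, building on the reallocate() helper shown earlier (shrink_to_fit is named after the C++11 std::vector method, which doesn't exist in C++03):

    void
    shrink_to_fit()
    {
        if(_capacity == _size)
        {
            return;
        }
        if(_size == 0)
        {
            free(_data);
            _data = 0;
            _capacity = 0;
            return;
        }
        reallocate(_size);
    }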

Other optimisations and benchmarking

There are other tweaks you could try out, but this can start to be a bit hit and miss if you don't have some good high level benchmarks set up to capture the real performance situation needing to be optimised.

With benchmarks in place a sampling profiler can help you pin down areas where you can most benefit from optimisations, but watch out for situations where the real issues are with higher level vector management and using vectors effectively. Maybe a whole bunch of allocations could be avoided just by reusing a vector across calls or across frames, but a profiler isn't going to tell you this directly.

Note that a number of the above tweaks target the underlying memory allocations because this is often the most significant cost associated with vector operations, and there may then be other ways to reduce these memory allocation costs.

I'll look into the issue of memory allocation customisation more generally in my next post, and we'll see, for example, that using realloc() is something that can also potentially improve performance.

Conclusion

It's good to know that we can replace std::vector, first of all. There's a bit of placement new trickery, and at least one gotcha to look out for, but overall this turns out to be not as hard as you might expect. After that, whether or not this is a good idea is something to be decided on a case by case basis!

You should have some ideas, now, about how a custom vector class can potentially improve performance and in my next post we'll see how this can also be desirable if you need a lot of control over buffer memory allocation.

Optimising your vector class is something of a low level optimisation and there are often higher level changes you can make to your program that make a lot more difference, and can even remove the need for this kind of low level optimisation.

In the case of PathEngine, the various low level vector optimisations I describe here are less significant than when we first switched, because of higher level optimisations made since then that avoid or reduce costly vector operations in many of the cases where these were an issue, but our custom vector nevertheless still helps to improve SDK performance.

When measuring performance differences it's important to test vector operations in the actual target run-time context, and to include the effects of memory allocation and processor caching in real situations. If you think that a custom vector class could improve your application performance, then perhaps the best way to measure this could be to actually go ahead and plug in a custom vector implementation (perhaps based on the code shown above) and compare some high level benchmark timings..

Comments (discussion closed)

Herb Sutter, 15 December 2013 17:15

Thanks for this article. It will get a little more than the usual scrutiny given that it's recommending replacing a standard component. :) So a few quick questions:

You seem to have one goal: performance. Presumably you reach that goal, but that's not well documented. What are you comparing against -- which compiler (incl. switches), which OS, and which std::vector implementation (even saying "Clang 3.0" is insufficient to distinguish which standard library implementation)? Have you ditched exception safety? Are the performance improvements portable (to different compilers and OSes, and compared against different std::vector implementations some of which may be better than others)? Are the performance improvements limited to a single or very few applications, and how are those characterized so we can determine when to consider going down this road? Finally, what are the new rules for using vector?

I enjoyed the article, just asking a few direct questions because many people have thought they can build a better string or vector, and many such efforts seemed good initially but later failed for non-obvious reasons -- often ones their std::vector implementer had already taken into account.

Thomas Young, 16 December 2013 13:51

Herb,

Thanks for the kind words.
To answer some of those questions:

The benchmark I showed above was against GNU libstdc++ 6.

I tried this on Windows, under Visual Studio 2012, and I get the following:

std::vector
debug 0.112 seconds sum = 400
release 0.011 seconds sum = 400
cVector with zero initialisation semantics
debug 0.178 seconds sum = 400
release 0.031 seconds sum = 400
cVector with default initialisation semantics
debug 0.022 seconds sum = -1852730912
release 0.0002 seconds sum = 80200

(Note that the cVector with default initialisation case includes type specialisations but
the cVector with zero initialisation case has these replaced with simple looped constructor calls.)

But it's not really surprising that the default initialisation code is much faster,
and I don't think that stl implementation or exact build details are going to make much difference to this,
since our cVector implementation is basically cheating.

It's a requirement for std::vector to perform zero initialisation (I assume, although I didn't actually check this in the standard),
and so std::vector _has to_ fill the buffer with zeros, whereas we can just ignore this requirement, in our version, and skip all these memory writes.

And that's kind of the whole point here (cheating).

I'm not claiming that the custom vector class used in PathEngine is a better general vector implementation than the standard library implementations,
but just that it's possible for us to tweak the detail implementation for a custom vector class, once we have this implemented, to get more performance for our specific use cases.

It's a similar situation for the other optimisation points.
We're free, for example, to adjust the behaviour for initial push_backs, because we know that most of our vectors contain relatively small element types,
and we can adjust the capacity multiplier because we know we take a lot of care to shrink to fit.

> Have you ditched exception safety?

PathEngine doesn't throw or handle exceptions, and, like most of our clients, we basically then just ignore the whole issue of exception safety.

(Life is good, no?)

> Are the performance improvements portable (to different compilers and OSes, and compared against different std::vector implementations some of which may be better than others)?

When we first switched to a custom vector implementation, our main optimisation targets were 32 bit Windows and Xbox 360 with Visual Studio 2005 and 2008,
but we also checked the results fairly carefully for other platforms at that time.

At that time the improvements were reasonably significant, and across all platforms.

It _looks_ like custom vector is still an improvement, although less significant than before.
But I haven't done the same kind of rigorous benchmarking in more up to date build environments, since our custom vector class is working fine, and is no longer actually a performance bottleneck to be resolved.

I'll look at doing some more general benchmarking with my next post, though, (which will look at the memory allocation side of things,
which can also have performance implications) and see if I can get a more conclusive answer about this point..

> Finally, what are the new rules for using vector?

Our custom vector provides a subset of std::vector operations (e.g. no rbegin() or rend()) and has slightly altered semantics (e.g. default initialisation), but none of this is really a big deal and we use it pretty much exactly the same way we would use std::vector.

Of course reimplementing stuff in a 'non-standard' way is a bad thing, generally speaking,
but in practice in this case I've found that it actually works out much easier to be able to see the exact detail semantics implemented
in our own code (identically on all platforms) than it is to try and work out what exactly is supposed to be guaranteed by the C++ standard and to what extent this is
then actually followed by the various implementations against which our clients can potentially link our source code!

Thomas

Herb Sutter, 17 December 2013 18:45

Thanks. I think maybe you didn't realize you can get the effect you seem to be looking for with std::vector's .reserve()?

As you point out, your test is so fast because it's 'cheating' as you put it -- it's really not measuring std::vector vs. an alternative, it's just measuring initializing an array vs. not initializing it at all. The code isn't constructing/initializing any of the elements, not even with memset.

So in that test what you are actually measuring is not cVector vs. std::vector, but initializing an array vs. not initializing it, and the latter is known to be faster and it's one reason why std::vector offers that too as .reserve() (so you can go back and initialize values only as they're actually used). So your cVector.resize() as used in this test is actually like std::vector.reserve(), and an apples-to-apples comparison would be to call .reserve() in both cases, and I think you will likely see your performance difference disappear once you compare equivalent code.

Thomas Young, 17 December 2013 20:33

I don't think we can get the required behaviour with std::vector reserve(). Note that there are some key differences between cVector resize() and std::vector reserve():

1. cVector resize() increases the vector size and we can (officially) write data into the vector elements after calling this method (i.e. without some kind of nasty hack involving writing into data above vector buffer size).

2. cVector resize() will call a user supplied element class constructor if one exists (because of the way the new((void*)begin) T (no brackets!), placement new expression works). The 'no initialisation' semantics then only apply for stuff like built-in types.

Consider the actual use case I mention, for example, which is where we need to load some data from persistence into a vector object.

So we need to increase vector size to the size required of the data being loaded. But we're going to write stuff to the buffer straight away, so element initialisation is not required. Unless I'm missing something, it looks like there is no way to achieve this with std::vector without an extra zero initialisation operation (without some kind of nasty hack).

Nathan, 18 December 2013 04:56

std::vector::reserve() does not initialize the array it merely reserves the memory for future storage (as it does not call std::vector::uninitalize_default_fill()).

std::vector::resize() reserves the size for the number provided and also std::vector::uninitalize_default_fill() to initialize the elements thus having the elements initialized ready for use. (which from my understanding you wish to avoid).

In your use case, if I understand it correctly you wish to reserve the array for storage and than fill it with elements which you have already constructed.

I believe and correct me If im wrong that herb was stating that in your comparison you were not reserving the array before hand and thus every std::vector::push_back() was causing reallocation and copying etc which is not comparing the same as functionality as you are reserving the area than copying or moving into place in your vector but using the std::vector without std::vector::reserve() you are than reallocating and copying upon each std::vector::push_back(), and with std::vector::resize() you are reserving, allocating than destructing and copying upon each std::vector::push_back().

In order to have a fair comparison you must call std::vector::reserve() than std::vector::push_back() which is what herb was stating.

Also if you wish to avoid the copy on std::vector::push_back() you can simply move construct in place with std::vector::emplace_back() or move copy construct the element by using std::move() (providing you have your move constructor as no except as I believe at going native STL mentioned their was an issue with using std::move() and std::vector as it has a concept of having to have your move constructor throw no exceptions with no except and if it doesnt it default to a copy construct).

Thomas Young, 18 December 2013 08:21

Not sure where you're coming from with the idea of 'fair comparison'!

The point is:

1) That this semantic detail exists for vector element construction, first of all (_I_ didn't know that vectors of built in types did zero initialisation, originally)

2) That this can be changed in a custom vector class by using different placement new syntax

3) And that this then gave us some fairly significant performance improvements in our concrete situation (PathEngine SDK)

So there is a decision about whether or not to use a modified version of std::vector, and the motivation for this change comes from performance measures for some concrete code.

A simplified example based on the actual concrete example I gave from our code:

Imagine you have a vector with element type char.

Sometimes you build this vector with push_back operations, so the vector container type is useful, but in some cases you want to build the vector directly from persistent data.

This vector is currently empty, but you have a 10000 byte file on disk that you want to load into the vector as fast as possible.

The challenge (if you like): do this with std::vector without the cost of unnecessary zero initialisation. Attention! We may want to query vector size later on, and this should be set correctly..

Nathan, 18 December 2013 09:36

Sorry I more meant in comparing the functionality of your custom vector and std::vector I assume now you did call reserve.

Also these articles are very informative, I didn't mention that before :) Its great seeing more software engineers in gaming posting their findings I've been following bitsquids for a while aswell and that one is also very good.

I have also ran into a similar issue as you and had to implement my own vector for our game engine at the game development company I work for, as we found in things that required extremely high performance from storage (in regards to retrieval, updating and insertion) such as image conversion and compression the std::vector simply was not cutting it due to the overhead of bounds checking etc, especially when I had the code very tightly looped and controlled and knew it would not go out of the bounds based on knowing the pitch of the data and the resolution of the image.

Also in regards to std::vector::reserve() on windows it uses ::operator new() to create the pointer to the data (an array of size n *sizeof(T)) which with my understanding just reserves the memory with the OS and does NOT zero initialize it. I performed your test with a unsigned chars and chars and they were not zero initialized it than performs a memcpy if their is any data in the vector to copy which copies nothing in the case of a new vector obviously, and than push_back calls ::new ((void*)_ptr) T(Args).

Could you tell me if you have seen the same behaviour (or if your original tests were calling reserve and if so apologies), and also if so was your custom implementation still faster.

Their is also the new std::vector::data() function available on the std::vector so you can access the raw pointer of the std::vector's underlying array to also further avoid any other code the std implementation does for bounds checking, exceptions etc.

Thomas Young, 18 December 2013 10:13

Reserve doesn't actually help here, as far as zero initialisation is concerned, because we still need to call resize() _after_ calling reserve(), and zero initialisation will then be performed in the std::vector::resize() call.

Nathan, 18 December 2013 10:25

Sorry maybe I'm missing something about your use case :(

Why are you calling resize after reserve, after your reserve you can start inserting data into the vector.

You would simply create the std::vector, call reserve to reserve however much data you need, than either memcpy() the data into the underlying array (using std::vector::data()) this however would not update the size :( but obviously :).

Or

You would simply create the std::vector call std::vector::reserve() to reserve however much data you need than start inserting the data using std::vector::push_back() in your own loop.

(Both of the above the data can come from either a file, another vector, network, persistent storage wherever)

Both of the ways above simply do the following.

1. Create the vector filling out the fields (size, capacity, etc) and setting the data to nullptr.
2. Reserve gets you a block of data of however much size you need NOT zero initialized (which from your post is what your after and sets the capacity to the size your requested)
3. than you either
std::vector::push_back() your data (which uses placement new just like your code and updates the size).
memcpy your data (but this would not update the size of the array but the capacity would be updated from the reserve call).

Both ways are very similar to your code except obviously using the std library and both avoid the zero initialization which from your post above seems to be the biggest performance booster, obviously however with std::vector their are other things to take into consideration as I mentioned about the extra code for bounds checking exception safety etc which in a tightly controlled code base you can avoid as you know what the code is always going to do.

Thomas Young, 18 December 2013 10:37

> You would simply create the std::vector, call reserve to reserve however
much data you need, than either memcpy() the data into the underlying
array (using std::vector::data()) this however would not update the size
:( but obviously :)

This is what I meant by 'nasty hack'. :)

Officially, accessing the buffer outside of current size is most likely undefined behaviour.
And in practice, although this might work for you, you won't be able to use the vector as a vector after that point, e.g. doing stuff like querying size or pushing back additional elements.

Nathan, 18 December 2013 10:55

Ahh yes!!

Fair enough :)

You could also use the following method which I mentioned aswell (also without having to call std::vector::resize() so you still avoid zero initialization).

using std::vector::push_back() will update the size (and if it grows to large cause their wasn't enough space reserved etc it will grow) and will either move construct or copy construct the element or alternatively you can construct in place using std::vector::emplace_back().

I would NOT consider using this option as being a hack (the memcpy one fair enough definitely agree with you on that one).

The added bonus of this method (apart from also avoiding zero initialization) is that access to the array is controlled by the overloaded subscript operator std::vector::operator []() or the std::vector::at() method and their is no undefined behaviour as using the methods of the std::vector after std::vector::reserve() cannot result in undefined behaviour using this method as they have checking on them much like your code does aswell and you can still use the vector as a vector :)

Their is however the overheads of the amount of checking that occurs but still that;s their to ensure exception safety and bounds checking (which your code does as well minus the exception stuff). Which is what lead me to also implement my own vector class very much similar to your one! :)

Either way hacks, are we not game programmers who have to optimise isn't that what we do for a living sometimes to get games out the door that don't run at 0.5fps lol.

Its sad sometimes that the highest performing code is usually the most hackish and worse looking code.

Also looking forward to your future posts :).

Thomas Young, 18 December 2013 11:09

Right.

Doing reserve() and push_back() with std::vector(), instead of resize() and writing to elements, avoids unnecessary construction or zero initialisation, and it's good to be aware of this.

But in some situations this may be either not possible or not convenient, for example in the described use case. And there can be other use cases such as where we want to random access write into a buffer without any specific initial value required.

Herb Sutter, 19 December 2013 22:18

Thomas, I think the point is that "reserve + push_back/emplace_back" is the recommended style and the one people do use. Really resize() is not used often in the code I've seen; you use resize() only when you want that many extra default-constructed elements, since that's what it's for.

You're basically saying you want to use the vector to manage an "out-only" array (one that you only write to, no initial values). The right way to do that with std::vector is not to .resize() to the target size and then set values, but to start with an empty vector and then either use the range constructor/assignment if you have the data already in another buffer, or loop and emplace_back the data one at a time.

Your example doesn't do any of the writes which are the interesting part of your use case and that will dominate even when using your cVector. To be apples-to-apples you should at minimum (a) show a complete benchmark example that actually stores data in the vector, and (b) for std::vector use reserve + emplace_back. I would be curious to see if you still see any benefits.

Would you be willing to update the article to show that more reasonable benchmark case? Here is the code for std::vector:

template <class tVector>
int RealResizeBenchmark()
{
    int sum = 0;
    for(int i = 0; i != 400; ++i)
    {
        tVector v;
        v.reserve(100000);

        for( int j = 0; j < 100000; ++j )
            v.emplace_back( GetNextInt() );
        // or replace above loop with "v.assign( src, src+srclen );"
        // if your cVector code would do a bulk memcpy

        v.back() += 1;
        sum += v.back();
    }
    return sum;
}

Thomas Young, 20 December 2013 15:49

Hi Herb,

I've just made another blog post to discuss this issue in more depth, including an updated benchmark.

Short story: yes the above code is effective in removing initialisation overhead for the 'construction as buffer copy' case, but there are nevertheless other cases where this can be an issue, and we still want to use resize() in PathEngine code, in some places, in particular when interfacing with lower level code.

Thanks for the feedback! I've definitely got a more rounded understanding of std::vector coming out of this..

BruceDawson, 12 January 2014 01:30

I think that the use case that Thomas is suggesting is basically this:
vector<char> buffer;
buffer.resize(n);
file.read(&buffer[0], n);
If you use std::vector for this then you end up with each element being initialized twice. That is the crux of the issue.
For file.read() you can also insert anything that produces a batch of data. There are definitely cases where reserve() plus push_back() or emplace_back() works perfectly, but I don't think that it covers all cases with perfect efficiency.

Thomas Young, 13 January 2014 12:16

Hi Bruce,

Yes, that's exactly it.
And the next post in the series includes some more detail discussion about this.

It was interesting, for example, to see how you can populate a vector from file without resize, i.e. by assign from iterators (something I hadn't seen previously), but this works out a lot slower, and it seems a bit of a stretch to have to convert everything to an iterators paradigm just to avoid this initialisation issue. (Sometimes good old fashioned buffers are just simpler!)

Tom Forsyth, 20 December 2013 16:59

> But this will now fail ... with something like v.push_back(v.front()).

Rather than add rare features like this, adding bloat & unreadability and perf problems, I prefer to add asserts that say "this isn't a good idea". It's often a bug rather than a deliberate action. If it isn't a bug and is wanted, the options are (a) change the calling code to make a temp copy of the item - most times it's not a perf-critical case (b) switch the declaration to a real std::vector or (c) if you really need this functionality and it's perf-critical, add an explicitly "special" function like push_back_aliased() or push_back_generic() that does handle all the odd cases.

Thomas Young, 21 December 2013 09:44

I don't think it's unreasonable to make this an assert condition, first of all. (Although we should definitely _either_ assert or support this properly!)

In the stack overflow question I quoted in the post I think the final consensus is that this is _required_ for std::vector, but there was also some argument on the other side, and this didn't seem to be considered self-evident, obvious, or a well known fact about std::vector!

Personally I prefer to support this, though, for the following reasons:

1) I want to minimise friction for switching between std::vector and the custom cVector class, as far as possible. It looks like standard vector implementations all support this kind of calling code, and asserting means that switching some code that runs fine with std::vector to cVector can introduce assertion failures.

2) It's not actually so difficult to support this kind of calling code.

In the code posted there isn't actually any extra work involved, we just have to take care of the order of our operations. In the future I'll discuss actually using realloc() in the reallocate method and in this case there _is_ a potential performance implication but it's still not very significant in practice, given that this is something that only occurs when pushing back past current capacity and we do a bunch of other things (e.g. reusing vectors, initial push back tweaks, etc.) to reduce how often this happens..

and

3) I actually quite like the fact that it's possible to depend on this in calling code.

The v.push_back(v.front()) example is maybe a bad choice, as this looks a bit wacky, but the actual real examples I came across in PathEngine didn't look so strange and actually seemed like pretty reasonable vector based code.
In general these are cases involving referencing vector elements by index, e.g. v.push_back(v[i - 1]), with the element references looking just like element references in other surrounding code. We're not referencing the elements by pointer or iterator, and so this is something that is normally ok to do for a vector even across push_back operations.
So it's easy to end up doing something like this, I think, in certain situations, without realising that you're doing anything that could potentially cause a problem.

Also, the fact that I can depend on this enables me to write more concise and clearer vector based code, which I also think is a good thing..

ssylvan, 21 December 2013 02:38

A few more things you could try:
1) Have space for one extra element in the buffer. That way the push_back call will *always* construct an element at the end (because there's *always* space for one element). After the construction check if there's enough space for at least one more element and if not resize. This makes the common case unconditional. Unclear how much benefit this gives on a modern superscalar architecture, but worth trying.
2) Move the book-keeping data to the actual buffer. It's pretty likely that anyone using the array will access the data anyway, so it could be beneficial to put all the book keeping data in the buffer itself to avoid cluttering up the owner's cache line with things like _size and _capacity (a nullptr would imply that both are zero, btw). So the size of your cVector would be exactly sizeof(void*). For the relatively common case of iteration you'll be touching the first cache line of the buffer immediately anyway.
3) If you do 2), you could try putting the _capacity field at the *end* of the currently used data, since this value is mostly read when you do push_back, and would therefore coincide with the cache line you're about to write into anyway. This does mean you have to keep copying the _capacity value to the end (referenced by _size) each time you add an element. To make the most of 1) you may want to put _capacity at sizeof(T) past the end (this way you don't have to wait for the read of _capacity to finish before you start writing into that same place during element construction).

EDIT: Actually, I just realized another thing to try which may work out pretty nicely. Make the vector just a pointer to a buffer like suggested in 2) with all the book-keeping in the buffer itself, but make the pointer always point to the end of the current user data and put *both* the size and capacity immediately following the current data. This way push_back will touch only one cache line for small elems, without taking up more than a pointer's worth of space in the owner (which will make *other* stuff faster). Iteration will probably be similar cost (it's sort of symmetric with what currently happens). Random indexing may potentially be slightly slower because you now have to subtract _size*sizeof(T) from the buffer pointer before you index into it.

Thomas Young, 21 December 2013 10:00

I think it's quite important to make sure empty vectors don't allocate, first of all (unless this is a known semantic difference and the calling code is specifically written to take this into account).

And that then either rules out (1) or adds a bit more complication to this that makes it less attractive (such as extra conditionals).

For (2) I would have thought that the vector class data memory would already need to be cached to get the buffer pointer, and it seems like this buffer pointer is something that _always_ needs to be accessed, whereas we won't _always_ access either the start or the end of the buffer.

But if the vector is just a buffer pointer I guess we _could_ effectively start passing that buffer pointer around by value (instead of pointers or references to vector objects) and lose a level of indirection that way.
Is this what you're intending?

Using a null pointer to represent an empty vector would stop empty vectors allocating, but would also add an extra conditional to all your push_back operations.
A better way to handle this could then perhaps be to point to some common static data with zero size and capacity values..

ssylvan, 21 December 2013 11:44

Yes, the vector would become just a pointer, and any owner would just store that pointer directly (by just storing the object itself by-value). It would be hard to benchmark because the value comes from the fact that all the owners get smaller and therefore experience fewer cache misses while accessing their *other* members (which are completely unrelated to the vector operations). The key thing to move out of the way is probably the capacity value since it's so infrequently used and would therefore be pretty damn cold for the vast majority of objects that have vectors. Even the size is relatively infrequently used (e.g. unchecked indexing doesn't need it) so it makes sense to pull that one out too.

EDIT: To expand on the cache saving point: The idea isn't to make vectors themselves faster (you're right that if you access the buffer pointer in a normal vector, both the size and capacity are likely to be in the cache), the idea is to not pollute the cache with cold data, which introduces cache misses for the rest of the members of any object containing a vector. If you only ever read those values in order to fetch from the data buffer (which is mostly true), why keep them hot in the cache the rest of the time when they're just wasting space and making everything else slower?

Thomas Young, 21 December 2013 13:40

> It would be hard to benchmark because the value comes from the fact that
all the owners get smaller and therefore experience fewer cache misses
while accessing their *other* members (which are completely unrelated to
the vector operations).

It would be hard to benchmark _in isolation_, but I think that's actually true for most things these days!
If you have some high level benchmarks set up, though, you should be able to 'just' plug it in and check the difference in overall timings before and after this change.

One thing that can be a bit tricky, I guess, is the issue of putting different data types in the same contiguous buffer, with alignment issues to watch out for, and because this kind of thing doesn't fit very naturally into 'proper' C++ code.
But it shouldn't be too hard to just hack something in for one specific target platform (and build) initially, just for benchmarking purposes.

Try it! I'd be interested to hear how this goes..

ssylvan, 21 December 2013 17:32

For the alignment stuff putting just the capacity near the end of the current user data would be relatively easy (reading an uint32 or whatever can be done relatively efficiently even unaligned on most platforms), that way you're not messing with the alignment of actual data elements.