Nov. 18, 2013, 10:39 a.m.

Using STL Vectors

(First in a series of posts about Vectors and Vector based containers.)

STL style vectors are a pretty useful construct.

I remember coming to C++ (and the STL) from a background in lower level programming, and finding STL vectors a bit surprising, with the idea of amortized constant time complexity and so on.

And then it was actually something of a minor revolution just to be able to set up a contiguous data buffer, without any significant programming effort, and without knowing the maximum required buffer size in advance.

Before we had C++ and the STL I remember a common source of errors being dynamically generated data overflowing fixed buffers, which is something we just don't see any more, thanks in no small part to the humble vector (and brethren).

But convenience can come at a cost, and there's some stuff you need to be careful about when applying STL vectors in performance critical situations.

In this post I'll discuss the main ways in which STL style vectors are applied in PathEngine, and look at some of the key performance and memory points coming out of this.

Prefer vector over list

Generally speaking, contiguous buffers are good, and memory allocations and scattered access patterns are to be avoided, and we're advised to prefer std::vector over std::list in most situations.

See this presentation by Stroustrup, for example.

And this has certainly been our experience with PathEngine. In fact, the SDK now uses vector-like data structures almost completely to the exclusion of pointer based data structures (even for things like dynamic meshes).

In many ways I think vectors can offer us the best of both low level and higher level coding paradigms, with the flexibility and automatic memory management of dynamically sized containers, but with the the performance benefits of low-level direct access to a contiguous data buffer and the simplicity of integer element indexing.

Vector use cases

The stuff we do with vectors can be broadly categorised into two main use cases:

the construction of (potentially complex) preprocess objects, and
run-time buffers for queries, where buffer size requirements are fundamentally dynamic in nature

Preprocess objects include things like the pathfinding visibility graph. These objects can take time to build, but this is something that can be done by the game content pipeline, with preprocess data saved and loaded back from persistence by the game runtime.

Key issues

Over-allocation in preprocess build

It's a great simplification in preprocess build code to be able to just build objects in one go, with buffers created and written to without the need to calculate buffer sizes in advance. Vectors let us do this, and then also double up as buffer memory management (and buffer size tracking) once the preprocess objects have been built.

But if we're not careful, due to the way vectors allocate additional capacity, we can easily end up with a lot of memory wasted in over-allocation.

For a vector built with push_back() calls, if we assume that the vector doubles its capacity each time capacity is exceeded by a push_back() operation (a pretty standard implementation), then we'll end up on average with capacity being 1.5 times the actual final required buffer size.

We can avoid this overallocation with the shrink to fit idiom, e.g. with something like the following:

template <class T> void
ShrinkToFit(T& container)
{
    if(container.capacity() != container.size())
    {
        T(container).swap(container);
    }
}

This has the effect of replacing the vector buffer with a newly allocated buffer of exactly the required size. So we avoid overallocation at the cost of a buffer copy operation (usually not significant at preprocess generation time), whilst also retaining all the convenience of the vector object buffer management (automatic buffer deletion and buffer size query).

With C++11, you can replace the helper function with a call to the new vector::shrink_to_fit() method, which does the same thing.

Iterator validity

It's great to be able to work with vector buffer contents while the vectors are being built, but it's quite a common gotcha to forget that vector resize operations will sometimes invalidate existing iterators, or raw pointers into the buffer.

A great way to sidestep this issue is to use simple integer indices instead of iterators or pointers, and this also has the advantage that indices into a vector remain valid also across load and save operations.

Hidden costs for non simple elements

Vectors are good for managing simple data buffers, but things can get costly with elements with significant set up or teardown costs, or with their own internal memory management.

One specific example is the 'vector of vectors' construct.

I'll talk about this in more detail in a future post, but the short story is that building vectors of vectors can potentially be a big performance hit, and the resulting data structure can also be the source of quite significant memory inefficiencies, and so this construct should be avoided as far as possible.

Run-time dynamic buffers

Memory allocation is expensive (in general, but particularly so if you're sharing a heap with other threads!), and run-time allocations can lead to memory fragmentation, so the ideal situation, if possible, would be to avoid any run-time memory allocations!

In some cases we can avoid run-time allocations, by preallocating suitably sized buffers, but this isn't always possible.

We can't always predict query buffer requirements, or if there is a theoretical maximum buffer requirement this may be too large to be practical.

For example, a lot of queries in PathEngine do some work local to a specific part of a ground mesh, and will usually only need to work with a small number of elements (e.g. mesh faces, or whatever), but with the actual number of elements relating a given query depending on stuff like the complexity of local geometry or the query distance or range.

Ground meshes can potentially be very large, and if we preallocate buffers based on total ground mesh size, but then only ever make locally restricted queries then we're potentially wasting a lot of memory.

We could choose some arbitrary size for preallocated buffers, but this is inflexible and prone to failure (in a similar way to 'old style' MAX_PATH_LENGTH and fixed buffer path formatting), with queries that fail unexpectedly when query complexity takes us just over the preallocated size requirement, and it's also much nicer if we can support arbitrary query sizes. Maybe the client does want to query over a large area, and can spare the memory for this.

So often the only choice is to then go ahead and embrace dynamically sized run-time buffers.

Vectors in run-time query code

The simplest thing to do is then just to create vectors directly in the code, as and when needed in the query code.

So this could look something like the following:

void
cLocalisedShape::getFacesOverlapped(cVector<cFace>& result)
{
    std::vector<cFace> openQueueBuffer;
    FacesOverlappedByConvexPoly_SI(_mesh, _points, _centreF, _poly, openQueueBuffer, results);
}

This is a made up example, but the idea is that we're going to perform a kind of fill operation through local ground mesh, and we need to keep track of 'open' faces during the process.

We don't know how many faces will actually need to be explored, but the fill operation will be restricted by mesh locality, and in general the number of faces explored is expected to be much smaller than the total number of faces in the mesh.

But the problem is that sequentially pushing faces on to the buffer is going to lead to multiple memory allocations, and if the query code is optimised then these memory allocations can easily become a significant bottleneck.

Preallocating at query time (be careful about clear() semantics!)

If we have an idea about common query situations then we could do something like the following:

void
cLocalisedShape::getFacesOverlapped(cVector<cFace>& result)
{
    std::vector<cFace> openQueueBuffer(20);
    openQueueBuffer.clear();
    FacesOverlappedByConvexPoly_SI(_mesh, _points, _centreF, _poly, openQueueBuffer, results);
}

(Although it turns out this is bad general purpose vector useage, see the discussion in this later post.)

The idea is to reserve some capacity in the vector to ensure that the buffer doesn't subsequently need to be expanded in common query situations.

The first thing to note is that the exact semantics of the clear() call here are important.

What we want this to do is to resize the vector to zero, whilst leaving the buffer allocated. If the vector buffer is freed by the clear call than that leaves us worse off than before.

According to the current standard it should be guaranteed behaviour that clear() does not free the underlying buffer, but this wasn't always the case. In some versions of Microsoft Visual Studio calling clear() does frees the buffer, but this can be worked around by replacing clear() with resize(0) (which doesn't).

In our case we actually use our own STL style class implementation, so we know we can depend on this, but if you use std::vector and your code will run on different platforms and compilation environments then it's worth checking the actual semantics being applied (e.g. by asserting that the vector capacity is unchanged after the clear).

Assuming the buffer is not cleared however, this is still not an ideal solution:

there's still at least one buffer allocation made for every query (we'd rather see none!), and
the starting buffer size is arbitrary, and suffers from the same kind of issues as other similar arbitrary constants - maybe for some clients the common query situations require exploration of 25 to 30 faces..

Holding on to vectors across queries

What we do in PathEngine is to hold on to a standard set of vectors and reuse these across queries (and frames).

The code for this looks something like the following:

void
cLocalisedShape::getFacesOverlapped(cQueryContext& qc, cVector<cFace>& result)
{
    std::vector<cFace>& openQueueBuffer = qc.lockFaceBuffer();
    openQueueBuffer.clear();
    FacesOverlappedByConvexPoly_SI(_mesh, _points, _centreF, _poly, openQueueBuffer, results);
    qc.releaseFaceBuffer();
}

Query context objects get passed down through the various query code paths as required.

In single threaded use there's just one query context passed in to each query, and in multi-threaded use a pool of query contexts is created.

This still is not an ideal solution, for various reasons, (for example with the intrusive requirement to pass in a query context object), and I've got a feeling there's probably a much nicer general way to handle this run-time buffer requirement out there somewhere, but it does the job for us!

What we get at runtime is then a bunch of allocations triggered when the game starts to call queries, but with the number of buffer reallocations allocations trailing off as the game 'warms up'. Which is generally fine, from a performance point of view, and also minimises fragmentation issues.

Note that this leaves us with the same over-allocation issue as with preprocess construction, but in practice runtime buffer sizes are smaller than preprocess buffers and so this has turned out not to be significant, with both allocation time overheads and fragmentation being higher priority issues. (If runtime buffer overallocation is an issue for you, just periodically shrinking to fit all run-time buffers could potentially be good enough.)

Conclusion

The bottom line is that dynamically sized containers make coding easier and, if you know how to use them, STL style vectors can give you this convenience without sacrifices in performance or memory efficiency.

Note that there are issues with the actual stl::vector specification, and standard library stl::vector implementations. In some cases you may want to replace stl::vector with your own custom vector class, and this is something I'll cover in more detail in a future post.

The points in this post are intended to apply equally whether you are using stl::vector, or your own custom STL style vector class.

To iterate the main points:

Look out for over-allocation issues. It's easy to overlook, and in extreme cases 'shrinking to fit' can be give you significant memory gains!
Indexed element accessing is a good way to go with STL style vectors.
Be careful about using vectors with non-simple element types.
Making sure you know what's going on with your vector buffer allocations. Vectors do a good job of hiding the underlying buffer allocation details, which is a good thing, but sometimes we need to know these details. It can be worth checking the actual allocation semantics being applied explicitly. (Check that clear() isn't freeing the buffer, for example.)
Avoiding runtime allocations can be difficult, but can be the only way to get the best performance, and if you need to then there are ways to do this.