Avoid resize()?
(Uninitialised buffers and std::vector)
(Number 4 in a series of posts about Vectors and Vector based containers.)
This post is essentially a response to feedback to this previous post.
In that post I talked about a change we made to the initialisation semantics for PathEngine's custom vector class, and described a specific use case where this can make a difference, with that use case involving calls to the vector resize() method.
In the comments for that post, Herb Sutter says:
Thomas, I think the point is that "reserve + push_back/emplace_back" is the recommended style and the one people do use. Really resize() is not used often in the code I've seen; you use resize() only when you want that many extra default-constructed elements, since that's what it's for.
In this post I'll be looking into our use case in a bit more detail, and in particular whether or not a resize() call is actually required.
Avoiding unnecessary initialisation
As a result of writing reader feedback to previous posts (including Herb's comments) I've realised that I've got a bit of a bad habit of sizing vectors directly, either in the vector constructor, or with calls to vector resize(), where in many cases it's actually more efficient (or better coding style) to use reserve().
We can see an example of this in the first post in this series, where I posted some code that constructs a vector with a fixed size and then immediately clears the vector, as a way of avoiding memory allocations:
std::vector<cFace> openQueueBuffer(20); openQueueBuffer.clear();
Memory allocations are bad news for performance, and avoiding unecessary allocations should be the first priority in many situations, but there's also a secondary semantic issue here related to element initialisation and it's better general purpose vector code to use the resize() method for this, as follows:
std::vector<cFace> openQueueBuffer; openQueueBuffer.reserve(20);
The difference is that the first version asks for initialisation of 20 cFace objects (with what actually happens depending on the way cFace constructors are set up), whereas the second version doesn't require any initialisation (independantly of how cFace is defined).
Note that this issue doesn't just apply to vector construction - exactly the same issue applies to vector resize() operations where vector size is increasing.
In PathEngine, because cFace is a POD class, but also because of the way initialisation is implemented in PathEngine's vector class, both versions actually work out as doing the same thing, i.e. no initialisation is performed in either case. But could we change the PathEngine source code to avoid calling resize() and so avoid the need for non-standard vector initialisation semantics?
Resize benchmark
In my previous post I posted the following minimal benchmark to shows how initialisation semantics can make a difference to timings.
template <class tVector> static int ResizeBenchmark() { int sum = 0; for(int i = 0; i != 400; ++i) { tVector v; v.resize(100000); v.back() += 1; sum += v.back(); } return sum; }
In the comments Herb points out that we shouldn't actually use code like this in many situations, but rather something like the following (avoiding the resize call and resulting element initialisation):
template <class tVector> int RealResizeBenchmark() { int sum = 0; for(int i = 0; i != 400; ++i) { tVector v; v.reserve(100000); for( int j = 0; j <100000; ++j ) v.emplace_back( GetNextInt() ); // or replace above loop with "v.assign( src, src+srclen );" // if your cVector code would do a bulk memcpy v.back() += 1; sum += v.back(); } return sum; }
Lets look at the buffer copy version, specifically (i.e. with the call to v.assign()), and set up an updated version of our benchmark to use this construction method.
We'll use std::vector to set up some source data, initially, for the buffer copy:
int sourceDataLength = 100000; std::vector<int> source; source.reserve(sourceDataLength); for(int i = 0; i != sourceDataLength; ++i) { source.push_back(i); }
And then the benchmark can be rewritten as follows:
template <class tVector> static int ResizeBenchmark(const int* sourceData, int sourceDataLength) { int sum = 0; for(int i = 0; i != 1000; ++i) { tVector v; v.reserve(sourceDataLength); v.assign(sourceData, sourceData + sourceDataLength); sum += v.back(); } return sum; }
This could be written more concisely with direct construction from iterators but reserve and assign are closer to what we would do in our original use case where we're actually reusing an existing vector.
There's a problem, however, when we try and apply this benchmark to PathEngine's cVector. cVector only provides a subset of the std::vector interface, and doesn't provide an assign() method, or construction from iterators, (illustrating one of the potential disadvantages of rolling your own vector, btw!), so we end up putting resize() back in for the cVector version of the benchmark:
template <class tVector> static int ResizeBenchmark_NoAssign(const int* sourceData, int sourceDataLength) { int sum = 0; for(int i = 0; i != 1000; ++i) { tVector v; v.resize(sourceDataLength); memcpy(&v[0], sourceData, sourceDataLength * sizeof(v[0])); sum += v.back(); } return sum; }
I ran this quickly on my main desktop machine (Linux, Clang 3.0, libstdc++ 6) and got the following results:
| container type | build | time | sum | 
|---|---|---|---|
| std::vector | release | 0.0246 seconds | 99999000 | 
| cVector (default initialisation) | release | 0.0237 seconds | 99999000 | 
I'm not going to go into more benchmarking specifics because (as with the original benchmark) the point is just to show whether or not there is an issue, and the exact timings values obtained aren't really very meaningful.
But, yes, with this version of the benchmark (construction as buffer copy), and with std::vector used in this way it's fairly clear that there's basically nothing in it, and therefore no longer any advantage to modified initialisation semantics.
But what if..
But what if we want to load this data directly from a file?
With resize(), we can do something like the following (with good old low level file IO):
void LoadFromFile(const char* fileName, cVector<char>& buffer) { FILE* fp = fopen(fileName, "rb"); // error checking ommitted for simplicity fseek(fp, 0, SEEK_END); buffer.resize(ftell(fp)); if(!buffer.empty()) { fseek(fp, 0, SEEK_SET); fread(&buffer.front(), 1, buffer.size(), fp); } fclose(fp); }
At first glance this seems a bit harder to rewrite 'the right way' (i.e. with reserve() instead of resize()), without doing something like loading each individual element separately, and without explicitly creating a separate buffer to load into. After a short search, however, it turns out we can do this by using stream iterators, as described in this answer on stackoverflow.
Let's implement another version of LoadFromFile() then, without resize(), as follows:
void LoadFromFile(const char* fileName, std::vector<char>& buffer) { std::ifstream source(fileName, std::ios::binary); std::vector<char> toSwap((std::istreambuf_iterator<char>(source)), std::istreambuf_iterator<char>()); buffer.swap(toSwap); }
I used the following code to knock up a quick benchmark for this:
template <class tVector> int FileSum(const char* fileName) { tVector buffer; LoadFromFile(fileName, buffer); int sum = 0; for(int i = 0; i != buffer.size(); ++i) { sum += buffer[i]; } return sum; }
Calling this for a largish file (on my main desktop with Linux, Clang 3.0, libstdc++ 6, release build) gave me the following:
| container type | Method | time | 
|---|---|---|
| std::vector | iterators | 0.1231 seconds | 
| std::vector | resize+fread | 0.0273 seconds | 
| cVector | resize+fread | 0.0241 seconds | 
So, on this machine (and build etc.) stream iterators work out to be a lot more expensive than 'simple' direct file IO to a buffer.
Note that I threw in a call to the old style file IO version for std::vector as well, for completeness, and it looks like the overhead for extra buffer initialisation in resize() is actually insignificant in comparison to the cost of using these stream iterators.
I guess this could depend on a lot of things, and I haven't checked this across different machines, different standard library implementations, and so on, and I don't have any practical experience to draw on with optimising C++ stream iterator use, but this does show quite clearly that it's nice to have an option for direct low level buffer access for some situations where performance is critical.
Full disclosure
In fact, for portability reasons, the PathEngine runtime doesn't actually include any direct filesystem accesses like those shown in the code above. Loading from persistence is handled by client side code loading data into a buffer in memory (using the relevant operating system file loading methods) and then passing this buffer into PathEngine load calls.
So we could theoretically switch to using std::vector assign() and get rid of this specific use case for changing vector initialisation semantics.
But there are some other reasons why I prefer not to do this.
Dependencies and low level interfaces
The use case I showed in my last post involved the following methods on a custom PathEngine stream object:
void openNextElementAsArray_GetSize(uint32_t& arraySize); void finishReadElementAsArray(char* destinationBuffer);
This is low level, works with raw pointers, and is not really modern c++ coding style, but I actually prefer this in many ways to passing in a std::vector reference or templated iterator arguments. It's nice to avoid dependencies on a vector class, for example, or the additional complexity of templating, and it's nice to be able to use these methods with buffers that are not managed as vectors.
Some of PathEngine is written as quite low level code, and I think that one of the great things about the STL is how easy it is to interface STL components with lower level code. Being able to taking pointers into the buffer of an STL vector for low level buffer access is one example of this, but sometimes this kind of low level buffer access implies direct vector resizing, and resizing vectors without element initialisation can then often be desirable.
Working with partially initialised buffers
Another use case we come across sometimes in PathEngine is when we want a buffer to store some kind of data corresponding with a given set of index values, but don't need to set values for all elements, and we have information available somewhere else which tells us which elements have actually got meaningful values set.
One example could be where we want to form a sequence of elements from a large set of potential elements, and so we set 'next' values for each element currently included in our sequence, but we don't need to initialise next values for elements not included in the sequence at all.
This is perhaps a less common use case, however, and I'm not sure how much difference this actually makes to timings in practice.
Wrapping up
It's good to know the right way to do things in modern c++. With vectors the right way usually means preferring reserve() over resize(), and this will avoid unnecessary initialisation in many cases (and the need to worry about details of vector initialisation semantics).
In the case of PathEngine however, I quite like being able to mix low and high level code. We still find it useful in some cases to resize() vectors without explicit element initialisation and changing the initialisation semantics in our custom vector class helps to ensure this is as efficient as possible!
Comments (discussion closed)
Thanks Thomas, this is a great step forward.
Note that in the code you don't even need to default-construct + .reserve + .assign -- vector's range constructor does all that for you. That is, instead of:
for std::vector you could write just:
I should probably have mentioned that when offering the sample code but it only applied to the .assign alternative and I didn't want to confuse the issue.
And for your LoadFromFile code, you could write it more simply and idiomatically as follows -- note that we replace the 'out' parameter with just returning the potentially-huge vector by value, which is efficient in C++11 onward because it is guaranteed that the vector will be moved, not copied (i.e., the storage and contents of the vector are untouched, not copied or reallocated), and I used 'auto' and range-for just because it was natural.
To sum up a bit, it looks like the use case you are most interested in is an "out-only" parameter style buffer -- one whose original contents will be ignored and will be overwritten with other data. That's a perfectly valid use case, and the usual modern C++ way to express it is to let the called function create a vector, populate it, and return it by value -- "return" directly expresses out-only semantics, and move guarantees the handoff of the buffer back to the caller will be efficient (on the order of copying a pointer to the underlying buffer and two integers for its size and capacity -- zero reallocation or data copying will occur).
Does that make sense?
Thanks Herb.
The C++11 coding style is very cool, I think, but we can't use that yet for PathEngine..
Firstly, I'd like to say that it's nice to see a bit higher-quality discussion here. Recently I've been super-saddened by "Used array instead of std::list, now so much faster! STL suxxors".
But your post could use a lot of work.
There's no consideration of using something like template<typename t=""> struct uninitialized { T member; uninitialized() {} }; or other similar means to meet your uninitialization needs through std::vector.
You should also look into "swaptimization" which MSVC2008 did with their vector. Your vector is going to get utterly trashed against that implementation for, say, vector<vector<t>> or vector<string>.
Move semantics aren't just about calling the move constructors of the contained elements, it's also about moving the vector itself. This is a very important optimization. There's no discussion here of how much performance you've lost by ignoring it. Obviously not everyone is compiling on a C++11 compiler, but you're taking a massive hit on those that are.
Virtually all existing implementations of std::vector use a constant factor of 1.5 for resizing, and this is for a fundamental mathematical reason you might want to investigate- namely, that having a constant below the Golden Ratio permits re-use of memory blocks and improved allocator caching.
When considering std::vector against cVector with zero initialization, cVector is more than three times slower! This sounds like a bit of a crippling problem, but there's no discussion of why this occurs or whether you resolved it.
I also didn't see any discussion of some of the more nuanced issues, like exposing allocator details or adding a reallocate method to the allocator interface.
Unfortunately, the filestream iterators *are* hideously slow on basically every implementation. Herb has the right idea here but <fstream> isn't really ideal. It's possible to code up your own iterators on top of C-style I/O or OS API that should give quite competitive performance. I am not denying that this is not a problem, but vector isn't at fault here- it's iostreams.
I might also mention that libstdc++ actually cheats by using a COW implementation, which I believe they have not fixed yet due to ABI concerns. This is not only illegal in C++11 but offers performance benefits only in some scenarios in C++03 and can have absolutely abhorrent performance implications in others- it's banned in C++11 for good reason. Any benchmarks you make against libstdc++ should document whether or not you use it in a way that is beneficial or detrimental to COW.
As for the whole "If the argument to push_back is inside the vector" issue, don't feel bad. Microsoft got this one wrong *twice*. The solution (which, incidentally, is also the only way to implement the method in an exception-safe way) is to allocate the new buffer, insert the new element, and THEN move/copy the old elements over and deallocate the old buffer. This protects you from a number of nasty situations, like resize(size, v[0]) or insert(v.end(), v.begin(), v.end());. You don't show any insert implementation but your resize() has the same bug too. It also protects you from needing an additional branch just to check if the argument is inside, which is what MSVC has done for a few versions (and is still broken). Infact, speaking of which, I think that the bug report I submitted about this to Microsoft probably didn't check resize() and my memory of their source suggests that it is probably broken too.
Oh yeah, and Clang 3.0 is hideously old *and* they have known performance bugs even in more recent versions which manifest with the more common std::vector implementations and push_back in particular. I'm not using this as a case against rolling your own (although there's a much lower probability that you'll know about and counteract specific optimizer bugs), merely saying that as a general picture, it's not very representative. As a minimum, you should be looking at the latest public release of Clang, MSVC and G++, and to be "good", you'll also want to include ICC, and mention which Standard library you're using and what version it was.
So basically, the existing implementations of std::vector are done that way for a reason, and I'm just not seeing that you've considered why they're done that way. I'm not saying that rolling your own std::vector isn't the right thing to do in any case or even in your case- it's not supposed to be all things to all people. I'm just saying that you don't seem to have explored much of the problem or solution here.
Hi, thanks for the feedback.
The bug in resize() is a bit embarrassing. Thanks for pointing that out. I added a note in the relevant post.
> There's no consideration of using something like template<typename t=""> struct uninitialized { T member; uninitialized() {} }; or other
similar means to meet your uninitialization needs through std::vector.
Not sure what you mean here! Note that the problem is with undesired zero fill operations for either built-in types or classes that are already declared explicitly with no default constructor or empty default constructor. I've just posted another followup post about this, since this seems to be a not very well understood issue..
> You should also look into "swaptimization" which MSVC2008 did with their vector. Your vector is going to get utterly trashed against that
implementation for, say, vector<vector<t>> or vector<string>.
I prefer using a custom class for vector of vectors, in fact, for our use cases, since this also works out much better for memory allocation. (See my previous post about this exact issue!)
And for performance critical stuff we only actually use vectors with simple types, so move optimisations don't really make much difference to us.
I guess we've kind of taken a two pronged approach to this, optimising a custom vector class for our specific use cases, but also optimising our vector use cases to be fast with our custom vector class..
> Virtually all existing implementations of std::vector use a constant factor of 1.5 for resizing, and this is for a fundamental mathematical
reason you might want to investigate- namely, that having a constant below the Golden Ratio permits re-use of memory blocks and improved allocator caching.
Ok, yes that sounds like something I should look into (although setting this to 3 _does_ give us noticeable better performance than 2).
Do you have any links to more discussion about this?
> I also didn't see any discussion of some of the more nuanced issues, like exposing allocator details or adding a reallocate method to the allocator interface.
This is planned for my next post, in fact.
(Being able to use realloc is another main reason for a custom vector implementation, in fact, and can also make a significant difference to performance.)
> Oh yeah, and Clang 3.0 is hideously old *and* they have known performance bugs ... to be "good", you'll also want to include ICC, and mention which Standard library you're using and what version it was.
Maybe this would make sense, if we could chose the platform and compiler setup for each of our clients. ;)
As it is I don't think it makes sense to get hung up on these kinds of benchmarking details.
(The benchmarks I show are only really to demonstrate _that an issue exists_, and you could even really change that to 'can exist'.)
> I'm just saying that you don't seem to have explored much of the problem or solution here.
For sure!
But it's potentially all a bit of a rabbit hole, and if I can find a simple solution that solves our concrete performance issues then I'm actually happier with that.
(And it does definitely seem to me from all this that making our own vector class _has_ been a simple solution, and _has_ solved our concrete performance issues.)
Ah, I must apologize about vectors of vectors. I did not see your previous entry on this issue. Swaptimization does indeed only address one half of your point, it essentially gives you move semantics in C++03. But if you need something like boost::multi_array anyway, then that's not going to be as helpful.
Try http://crntaylor.wordpress.... as a brief introduction on why a factor lower than Golden Ratio is important. If you resize often then having a more aggressive factor will help, but as far as I am aware, because of this block re-use, in the general case you can't do better than ~1.5. Obviously your exact milage may vary.
>Maybe this would make sense, if we could chose the platform and compiler setup for each of our clients. ;)
That's fair enough. I'm not here to tell you what is or isn't important to you or your customers. I'm just saying that, if you're not interested in the general case for modern compilers/libraries and your customer is only interested in one particular combination, then you might want to make an explicit note of that.
> Try http://crntaylor.wordpress.com...
as a brief introduction on why a factor lower than Golden Ratio is
important. If you resize often then having a more aggressive factor will
help, but as far as I am aware, because of this block re-use, in the
general case you can't do better than ~1.5. Obviously your exact milage
may vary.
Thanks, that's quite interesting.
For the benchmarking I did in our case, I should note that this was actually with a 'reallocing' allocator, which can obviously affect how this all works out..