Sept. 23, 2017, 3:12 p.m.

    Static Libs Do Not Modular Make

    A cautionary tale about statically-linked libraries, as generated by C/C++ build tools.

    As a project accumulates features and complexity, it gets harder to understand exactly what's going on, and to find your way around the source code. You need to find some way to organise the code and try to keep things manageable.

    A common idea, in this situation, is to group some source files together to split out as a static library.

    I'm going to argue that this actually does very little, in itself, to increase modularity, can have the effect of significantly increasing dependencies, and is maybe not such a good idea, after all.

    Break it apart?

    So yeah, when something gets too big to work with, it makes sense to try to break it into pieces.

    Given a whole bunch of source files to work with, we probably want 'pieces' bigger than individual source files, and that means grouping source files together.

    Perhaps there are files that can be grouped together by theme (e.g. a bunch of source files related to 'geometry'). Or perhaps some kind of layered decomposition is possible (e.g. we can identify a bunch of 'core' source files). And then we can separate this group of files from the rest of our source code by putting them in a library.

    Making things modular

    Wiktionary defines 'modular' as follows:

    Consisting of separate modules; especially where each module performs or fulfills some specified function and could be replaced by a similar module for the same function, independently of the other modules.

    Libraries are a classic archetype for a software module, and splitting our code into libraries already kind of nails the first part of that (the bit before the semicolon), right?

    The bit after the semicolon is probably also worth consideration, but we can tweak the code to better address this bit, incrementally, later on, by firming up the interface, hiding implementation details, and so on. Having our source code 'consisting of modules' already feels like a good start, and a step in the right direction.

    Static or dynamic

    Ok, so maybe I'm being a bit sarcastic about the definition of 'modular', but the idea of splitting related files into a library to improve project structure does seem fairly convincing, nevertheless, and we decide to go ahead with this.

    One implementation detail is then: static, or dynamic linkage?

    There are a bunch of considerations to take into account when making this decision, in the general case.

    Dynamic linkage works differently on different platforms, but for the sake of argument, let's say we're building for Windows, and have the choice between classic statically-linked library and Windows DLL.

    In this case, and given that we're not designing a library from scratch (but want to make a library out of a bunch of existing source files), the decision is probably clear.

    We may be concerned about DLL linkage changing the performance characteristics of our code, and tricky issues with passing data structures across DLL boundaries could mean a lot of rework around calls into (or out of) the library. In general, wherever we look into the technical details of splitting our code off, the static library option stands out as the path of least resistance.

    Taking the plunge

    So we decide on the 'static library' option, and take the plunge.

    We make some changes in our project setup. We move the source files off into their own directory and add an extra static library build step to compile these source files and build them into a suitably named static library. Object files for split off source code are no longer passed into the final project link, and the linker is supplied, instead, with the newly built static library.

    This is kind of C/C++ build process 101, how you're taught to set things up at college.

    It's surprisingly easy. We didn't need to change compiler settings for individual source files, and soon enough everything builds again, with some spanking new project structure, but without any real struggle.

    It's nice when a plan comes together, and it feels like we made some kind of improvement here, but how does this change actually work out? Is project structure after the change actually better than what we had before?

    What really is a static library?

    The key to answering this question, for me, lies in understanding how static libraries actually work.

    I have to admit that I spent many years using Visual Studio on Windows and not really understanding this (although in my defense it's often not so clear, with Visual Studio, exactly what's going on under the hood).

    At some point I came to set up a build on Linux, however. I remember researching the commands to use for the various build steps, coming across the Linux man page for 'ar', and realising with a kind of mild shock that static libraries are actually just archives of object files.

    From Wikipedia:

    The archiver, also known simply as ar, is a Unix utility that maintains groups of files as a single archive file. Today, ar is generally used only to create and update static library files that the link editor or linker uses and for generating .deb packages for the Debian family; it can be used to create archives for any purpose, but has been largely replaced by tar for purposes other than static libraries.

    There's some extra information bolted on to this file format to enable a symbol index to be stored with the archive, but that's about it.
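
    As a rough illustration of how simple this container format is, here is a minimal sketch that lists the members of an archive held in memory. The function name listArchiveMembers is made up for this example. The sketch assumes the common 'ar' layout (an 8 byte magic string followed by members, each with a 60 byte text header), and it makes no attempt to resolve the long file name extensions used by the GNU and BSD variants, which are reported as stored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lists the member names stored in an 'ar' archive held in memory.
// Assumed layout (the common 'ar' format):
//   "!<arch>\n"            8 byte magic string
//   then, for each member: a 60 byte text header
//     name (16), mtime (12), uid (6), gid (6), mode (8),
//     size (10, decimal), end marker (2)
//   followed by 'size' bytes of data, padded to an even length.
// Returns an empty list if the magic string is missing.
// std::stoul throws if a size field does not start with a number.
std::vector<std::string> listArchiveMembers(const std::string& archive)
{
    const std::string magic = "!<arch>\n";
    const std::size_t headerSize = 60;
    std::vector<std::string> names;
    if (archive.compare(0, magic.size(), magic) != 0)
        return names; // not an 'ar' archive

    std::size_t pos = magic.size();
    while (pos + headerSize <= archive.size()) {
        std::string name = archive.substr(pos, 16);
        const std::string sizeField = archive.substr(pos + 48, 10);

        // Header fields are padded with trailing spaces.
        name.erase(name.find_last_not_of(' ') + 1);
        // GNU ar terminates ordinary names with '/'. The special members
        // "/" (symbol index) and "//" (long name table) are left as they are.
        if (name.size() > 1 && name.back() == '/' && name != "//")
            name.pop_back();
        names.push_back(name);

        const std::size_t dataSize = std::stoul(sizeField);
        pos += headerSize + dataSize;
        if (pos % 2 != 0)
            ++pos; // member data is padded to an even offset
    }
    return names;
}
```

    Reading a static library file into a std::string and passing it to this function should give back the names of the bundled object files. With GNU ar the symbol index typically shows up as an extra member named '/'.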

    So all we've really done, with our build process changes, is change the way the split off object files are supplied to the linker.

    Before the change, these object files were supplied individually. After the change, the object files are combined together into a simple archive, and passed as one large file. Nothing changed with regard to symbol visibility. All symbols visible in the original object files remain visible in the static library. No calling convention changes were required, function calls work exactly as they did before, and we may even end up with exactly the same machine code for the final binary.

    In retrospect, this explains why the project change was so easy. The project change was easy because nothing really changed, but on the flip side, you don't get something for nothing, there's no magic bullet, and, without any actual code changes, can you really expect to end up with better, more maintainable code?

    Information hiding

    I'm not saying that static libraries are entirely irrelevant to project structure.

    Consider the following endorsement of static library decomposition, from 'Organising Source Code', on accu.org:

    One of our usual objectives in defining modules is that we wish to practice information hiding. It seems to me that the correct level to define our modules is the level at which we can actively hide some information, that is, hide some implementation.

    Once our C or C++ code is compiled to object code we can hide the implementation since we only need distribute the header files and the object file. Still, the object file has a one-to-one relationship with the source file so we're not hiding much.

    We need a bigger unit to hide in. When we bundle many object files together we get a static library. This is more promising. Our code can interface to the library by including one or more header files and we shouldn't need to care whether the library is made up of one file, two files, or 25.

    Static libraries are simple to create and use. Once compilation and linking are complete then static libraries present no additional overheads or requirements, we can have as many of them as we want at no additional run-time cost. Hence, they are well suited to be building blocks when decomposing a system into discrete chunks.

    Yes, information hiding is important for good code structure, and bundling up a bunch of objects in an archive does achieve some information hiding.

    If we restrict ourselves strictly to the step where object files are bundled together, however, no information is hidden by this step other than the list of names of bundled object files.

    (The bulk of the implementation detail information in our original source files should already be hidden by the compilation step. Some information remains accessible in the object files, as linker symbols, but the static library build step has no effect on the accessibility of this information.)
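
    To make that parenthetical concrete: which names an object file exposes as linker symbols is decided in the source code, not by the archiving step. The following is a minimal sketch with made up names (Distance.cpp, square, distanceSquared), not code from any real project.

```cpp
// Distance.cpp (hypothetical source file, compiled to Distance.o)

struct Point { int x, y; };

namespace {
// Unnamed namespace: internal linkage. Other object files cannot link
// against this helper, whether Distance.o is passed to the linker
// directly or bundled into a static library.
long long square(long long value)
{
    return value * value;
}
} // unnamed namespace

// External linkage: this function is exposed as a linker symbol in
// Distance.o, and remains just as accessible after Distance.o has been
// bundled into a static library.
long long distanceSquared(const Point& a, const Point& b)
{
    return square(static_cast<long long>(a.x) - b.x)
         + square(static_cast<long long>(a.y) - b.y);
}
```

    Any hiding here comes from the choice of linkage in the source file. Putting Distance.o into an archive neither hides distanceSquared nor changes anything for square.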

    So there's not actually a huge amount of information hiding here. And the usefulness of this information hiding varies, depending on how the static library is organised.

    We can imagine an ideal scenario, on one end of the spectrum, where external code calls into a static library through one single (hopefully small) 'API header'. Perhaps one of the objects being bundled into the library provides a bunch of externally accessible functions described by this API header, and calls on to other objects as needed.

    Importantly, in this ideal scenario, the external code doesn't need to access any individual 'internal' library object headers directly. We can potentially change a bunch of details about how the library works internally, rename a bunch of objects, and end up with a completely different internal layout, without any effect on external code, and that's kind of cool.
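
    A minimal sketch of what this ideal scenario might look like, squeezed into one listing for brevity. The file names, the geometry namespace and the polygonArea function are all assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// ---- GeometryAPI.h (hypothetical) ----
// The only header that external code includes.
namespace geometry {
struct Point { double x, y; };

// Returns the area enclosed by a simple polygon.
double polygonArea(const std::vector<Point>& vertices);
} // namespace geometry

// ---- Inside the library (hypothetical internal object) ----
// In a real layout this would be declared in an internal header that
// external code never includes, so it could be renamed, merged or split
// without touching any caller.
namespace geometry { namespace internal {
// Shoelace formula: positive for counter-clockwise vertex order.
double signedArea(const std::vector<Point>& vertices)
{
    double twiceArea = 0.0;
    for (std::size_t i = 0; i < vertices.size(); ++i) {
        const Point& a = vertices[i];
        const Point& b = vertices[(i + 1) % vertices.size()];
        twiceArea += a.x * b.y - b.x * a.y;
    }
    return twiceArea / 2.0;
}
}} // namespace geometry::internal

// ---- GeometryAPI.cpp (hypothetical) ----
// The API object forwards to the internal objects as needed.
namespace geometry {
double polygonArea(const std::vector<Point>& vertices)
{
    return std::abs(internal::signedArea(vertices));
}
} // namespace geometry
```

    Note that the separation here is purely a matter of which headers are published. In a static library, internal::signedArea is still an ordinary external symbol that any code could declare and link against.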

    But that's not the situation I'm talking about here.

    In the situation I describe, the object files being split off are not going to be some perfectly designed software module, just groups of object files that seem like they should go together, two examples being groups of files that share a theme, and groups of files that correspond to some application layer.

    In this kind of static library decomposition external code tends to end up needing direct access to headers for the majority of the bundled objects, and you end up with very little benefit, in practice, from static library information hiding.

    Not a straw man

    This is the point where I admit that this film is 'based on a true story'.

    There came a time during the development of PathEngine where the code started to get unwieldy, with just too many object files.

    I went through pretty much the same thought process I have described here (as far as I can remember), used similar object file groupings, and ended up with a bunch of 'internal' static libraries, as you can see on this archived version of a page from the PathEngine docs.

    And I've seen other people follow exactly the same chain of reasoning. It seems this can be quite compelling, and it's compounded by the fact that, the worse the state of your source code (and the more you actually need to organise), the easier it is to end up going down this path.

    In the case of PathEngine the expected benefits of the internal decomposition never really materialised, even after source code changes and refactoring around the new structure.

    The static library decomposition turned out to be more of a hindrance than a help, and is long since gone, with PathEngine code structure now expressed through source code organisation into directories, and direct dependency relationships between object files.

    Dependency structure and a significant disadvantage of static library decomposition

    So I've talked about static library decomposition failing to improve project structure in particular for certain kinds of object file groupings, but it gets worse, and static library decomposition can also actively hurt your project.

    What we need to consider here is project dependency structure, as this is where static library decomposition can turn around and bite you on the arse!

    The point is, with a bunch of object files in a directory, you can use some object files, and ignore others, but with object files bundled into a static library it's now 'all or nothing'.

    Let's consider the following (made up) example.

    Maybe I have some functions for testing collision with rectangles, in 'RectangleCollision.cpp' (and associated header), and some other functions for testing collision with circles, in 'CircleCollision.cpp' (and associated header).

    I have an application that works with both rectangles and circles, and uses both sets of functions, and because both of these source files share a 'geometry' theme I decided to bundle both generated objects (together with a load of other geometry themed objects) into a 'Geometry' static library.

    But then I find I need to make another application, which only ever works with rectangles, perhaps with the source code for this application distributed separately, to another client.

    I no longer use any of the functions in CircleCollision.cpp, and it shouldn't matter to me (when building the new application) if the CircleCollision.cpp source file has been modified, or if this source file fails to build for some reason.

    But my code has a dependency on the Geometry static library as a whole, and the Geometry static library cannot be built without compiling CircleCollision.cpp.

    That's kind of annoying already, but not the worst part. Now I have to distribute the circle related source code to my second client, and maintain this (even though they're not actually using this code). And dependencies can break -- imagine I need to update my build tools, but 'CircleCollision.cpp' brings in a dependency on some third party 'TrigonometryStuff' library, and 'TrigonometryStuff' has not yet been updated to build with the latest version of the build tools.

    Grouping object files into static libraries (often without any significant benefit) has the effect of clumping those object files together and creating a bunch of artificial dependencies. This prevents you from separating things which should come apart, if a situation arises where that would otherwise make sense, and tends to lock you in to monolithic development practices.
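
    One way to see the effect is to model the made up example as a dependency graph, and compare what the rectangle-only application needs in order to build with and without the library grouping. This is just a sketch: the node names follow the example, but 'RectangleApp', the graph edges and the function names are assumptions made for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each build product to the things it is built from.
using DependencyGraph = std::map<std::string, std::vector<std::string>>;

// Returns everything that 'target' depends on, directly or indirectly.
std::set<std::string> buildRequirements(const DependencyGraph& graph,
                                        const std::string& target)
{
    std::set<std::string> required;
    std::vector<std::string> pending{target};
    while (!pending.empty()) {
        const std::string current = pending.back();
        pending.pop_back();
        const auto found = graph.find(current);
        if (found == graph.end())
            continue; // leaf node, e.g. a source file
        for (const std::string& dependency : found->second) {
            if (required.insert(dependency).second)
                pending.push_back(dependency);
        }
    }
    return required;
}

// Object files are passed to the application link step individually.
DependencyGraph objectLevelGraph()
{
    DependencyGraph graph;
    graph["RectangleApp"] = {"RectangleCollision.o"};
    graph["RectangleCollision.o"] = {"RectangleCollision.cpp"};
    graph["CircleCollision.o"] = {"CircleCollision.cpp", "TrigonometryStuff"};
    return graph;
}

// The same objects, but bundled into the 'Geometry' static library.
DependencyGraph libraryLevelGraph()
{
    DependencyGraph graph = objectLevelGraph();
    graph["RectangleApp"] = {"Geometry static library"};
    graph["Geometry static library"] = {"RectangleCollision.o", "CircleCollision.o"};
    return graph;
}
```

    Calling buildRequirements(objectLevelGraph(), "RectangleApp") gives just the rectangle object and source file, whereas the same call on libraryLevelGraph() also pulls in CircleCollision.cpp and TrigonometryStuff, even though the application never calls any circle function.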

    Other benefits of static libraries

    There were good reasons for the introduction of static libraries, of course, and it's good to be aware of some other concrete reasons for this setup, e.g. as discussed in the answers to this Stack Overflow question:

    • the fact that it can be easier or more efficient to process a single large file in place of lots of small files, and
    • the possibility for a symbol index to be cached with the library

    This is more about the concrete details of your build process, however, and less directly related to code structure, and I think these kinds of technical limitations have also been obviated to some extent by the evolution of operating systems and build tools.

    In my experience, for example, passing large numbers of object files directly into your project link step is not much of a problem these days. Limits on command line length (command buffer size) tend to be much, much larger than when static libraries were first implemented, and linker features such as 'response files' provide a workaround where those limits are still an issue.

    Other dynamic linkage mechanisms

    To simplify things above I limited the choice of linkage options to a choice between classic static library and Windows DLL linkage.

    But maybe you're developing on another platform, and there are other options available to you. Perhaps you're developing on Linux, for example, and can consider splitting the code out as a Linux shared object.

    In this case there's a bit more going on in the shared library linkage phase, and I can't lean so much on the technical details of static linkage, but I think the general point stands.

    With popular build tools such as gcc, symbols in shared objects default to being externally visible, and this means (as with the static library case) you can often pretty much split code out into a shared library 'as is'.

    But well, no pain, no gain, you get what you pay for, and all that. Without code changes (and some careful thought about interfaces and information hiding) the result is likely to be the same, i.e. a bunch of object files clumped together artificially, without any information hiding benefit.

    One detail to note, in this case of shared object linkage, is that you can change build settings to make symbols hidden by default (with -fvisibility=hidden), at the cost of some additional work to define and mark the symbols and functions constituting the library external interface.
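
    A minimal sketch of what that marking can look like with GCC or Clang. The GEOMETRY_API macro and the function names are hypothetical. The visibility attribute itself is a real GCC/Clang feature, and the macro expands to nothing on other compilers so the code still builds there.

```cpp
// GeometryAPI.h (hypothetical)
#if defined(__GNUC__) && !defined(_WIN32)
  // Overrides -fvisibility=hidden for the symbols we want to export.
  #define GEOMETRY_API __attribute__((visibility("default")))
#else
  // Assumption: other toolchains would need their own export marking.
  #define GEOMETRY_API
#endif

// Part of the library's external interface: stays visible to users of
// the shared object.
GEOMETRY_API long long rectangleArea(int width, int height);

// Geometry.cpp (hypothetical)

// Not marked: when the shared object is built with -fvisibility=hidden,
// this function is not exported, although code inside the library can
// still call it as usual.
long long multiply(long long a, long long b)
{
    return a * b;
}

GEOMETRY_API long long rectangleArea(int width, int height)
{
    return multiply(width, height);
}
```

    Built into a shared object with -fvisibility=hidden, only rectangleArea should then appear in the library's exported symbols, which is the kind of interface decision that the 'as is' split never forces you to make.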

    Wrapping up

    There's a fine line between breaking things apart and lumping them together!

    Key points:

    • Static libraries are really just object file archives, with a bit of additional symbol indexing.

    • Bundling object files into static libraries doesn't affect code linkage, which is why it's really easy to decompose a project into 'internal' static libraries.

    • The actual information hiding from the static library build step is minimal (just a list of object files), and, unless your code is already well structured, your project is unlikely to benefit from this information hiding anyway.

    • But grouping objects together in this way adds artificial dependencies between the objects in the library (all or nothing linkage), and tends to lead to monolithic development practices.

    There's more to be said about alternatives to this kind of static library decomposition, and how we ended up doing things at PathEngine, but that's a story for another day.

