You are currently browsing the category archive for the ‘GCC’ category.

I’ve seen a few people assert the precompiled headers are a pain in the butt, or not workable for large scale projects.  In fact, it’s incredibly easy to add precompiled headers to a GCC-based project, and it can be quite beneficial.

Since I’m mostly familiar with Makefiles, I’ll present an example here that uses Make. It trivially extends to other similar build systems such as SCons, NMake, or Ant, but I’m not so sure about Visual studio projects. This example builds a single static library and several test applications. I’ve stripped out most of my compiler flags for brevity.

# boilerplate settings...
SHELL = /bin/bash
CXX = g++ -c
CXXFLAGS += -std=c++98 -pedantic -MMD -g -Wall -Wextra
LD = g++
LDFLAGS += -rdynamic -fno-stack-protector
AR = ar

# generic build rules
# $@ is the target, $< is the first source, $^ is all sources
define compile
$(CXX) -o $@ $< $(CXXFLAGS)

define link
$(LD) -o $@ $(filter %.o,$^) $(filter %.a,$^) $(LDFLAGS)

define ar
$(AR) qsc $@ $(filter %.o,$^)

# all library code is in src/
# all test applications are single-source
# e.g. testfoo is produced from from testfoo.cxx and libsseray.a
TEST_SRC = $(wildcard test*.cxx)
LIB_SRC = $(wildcard src/*.cxx)
TESTS = $(basename $(TEST_SRC))
LIB = libsseray.a
all : $(TESTS)

$(TESTS) : $(LIB)
$(TESTS) : % : %.o

%.o : %.cxx

$(LIB) : $(LIB_SRC:cxx=o)

# gcc-provided #include dependencies
-include $(TEST_SRC:cxx=d) $(LIB_SRC:cxx=d)

clean :
	rm -f $(LIB) $(TESTS) $$(find . -name '*.o' -o -name '*.d')

In order to use a precompiled header, this is what needs to be added to the Makefile. There are no source code modifications at all. I created a file pre.h that includes all of the system C and C++ headers that I use (in particular <iostream> is a big expense for the compiler).

# PCH is built just like all other source files
# CXXFLAGS must match everything else
pre.h.gch : pre.h

# all object files depend on the PCH for build ordering
$(TEST_SRC:cxx=o) $(LIB_SRC:cxx=o) : pre.h.gch
# this is equivalent to adding '#include <pre.h>' to the top of every source file
$(TEST_SRC:cxx=o) $(LIB_SRC:cxx=o) : CXXFLAGS += -include pre.h

# pre.h.gch should be cleaned up along with everything else
clean :
	rm -f (...) pre.h.gch

The project itself is fairly small—16 source files totaling 1800 SLOC—but this small change just decreased my total build time from 12 to 8 seconds. This is entirely non-intrusive to the code, so adding it is a one-time cost (as opposed to the Microsoft stdafx.h approach, which is O(N) cost in the number of source files). It is also easy to disable for production builds, if you only want to use the PCH machinery for your day-to-day edit-compile-test cycle.

I like to sneak bitshifts into interviews—not because they’re used commonly in modern C++ code, but because they used to be common, as a way of getting good performance out of poor compilers. It’s very useful to know the tricks of the past, if you ever find yourself maintaining code written by an earlier generation of programmer. Take this example:

unsigned imul(unsigned x)
    return x * 10;

unsigned bitshift(unsigned x)
    return (x << 3) + (x << 1);

Below is the assembler produced by gcc -c -O3 -march=core2, using objdump -d --no-show-raw-insn to get the assembler from the compiled output. There are two interesting things to note:

  1. The compiler uses address calculation hardware for simple arithmetic: lea is basically a strided array access, base + stride * i.
  2. The compiler doesn’t use any shifts at all.
00000000 <imul>:
   0:	push   %ebp
   1:	mov    %esp,%ebp
   3:	mov    0x8(%ebp),%eax
   6:	pop    %ebp
   7:	lea    (%eax,%eax,4),%eax    # n + 4n
   a:	add    %eax,%eax             # (n + 4n) + (n + 4n)
   c:	ret    

00000010 <bitshift>:
  10:	push   %ebp
  11:	mov    %esp,%ebp
  13:	mov    0x8(%ebp),%eax
  16:	pop    %ebp
  17:	lea    0x0(,%eax,8),%edx     # 8n
  1e:	lea    (%edx,%eax,2),%eax    # 8n + 2n
  21:	ret    

For comparison, here is the same code compiled with gcc -c -O0 -march=i386. Note that shifts are used in both cases. If you try a few other values of -O and -march, you’ll see some other interesting results, but I’m not going to bother to paste them all here.

00000000 <imul>:
   0:	push   %ebp
   1:	mov    %esp,%ebp
   3:	mov    0x8(%ebp),%edx
   6:	mov    %edx,%eax
   8:	shl    $0x2,%eax    # n << 2
   b:	add    %edx,%eax    # (n << 2) + n
   d:	shl    %eax         # ((n << 2) + n) << 1
   f:	leave  
  10:	ret    

00000011 <bitshift>:
  11:	push   %ebp
  12:	mov    %esp,%ebp
  14:	mov    0x8(%ebp),%eax
  17:	lea    0x0(,%eax,8),%edx      # 8n
  1e:	mov    0x8(%ebp),%eax
  21:	shl    %eax                   # n << 1
  23:	lea    (%edx,%eax,1),%eax     # 8n + (n << 1)
  26:	leave  
  27:	ret    

If you go through some of the major Intel processor models, you will see that the actual assembler output varies quite a bit. What does this mean? Mostly that micro-optimizations designed to produce ever-so-slightly better assembler are usually the wrong approach for long-lived software. Yet, it’s a fact of software that this code will be seen on any sufficiently large system, and it must be understood and fixed if possible.

Developers working on embedded systems with a restricted set of compilers… YMMV. Sorry.

I recently stumbled across an article referencing macros defined by gcc. That list is pretty daunting! If you compare it to GCC Common Predefined Macros (the official source), you realize quite fast that a lot of those macros exist for the benefit of libc and libstdc++ library authors—not compiler end users such as myself.

For whatever reason, it takes quite a bit of Google-Fu (or luck) to get GCC’s official page to show up on the first page of search results. Whenever I search for it, if I don’t remember the page title exactly, I have to dig through about 20 other sites before finding it. In any case, both of those lists are frigging huge, so here’s a bit of a smaller list that I usually have up my sleeves:

Use this instead of __CHAR_BIT__. Required by the C and C++ language standards (IIRC) and defined in <limits.h> or <climits>. Number of bits in a byte. I’m not old enough to have worked on any machines that didn’t have 8-bit bytes, but they certainly existed at one point (and it’s the reason why my university networking class used the term octets instead of bytes).
GCC-specific versioning info, considered as a tuple. For example, if I’m using GCC 4.3.1, those values will respectively be (4, 3, 1). Simple, and useful for code that needs to be compatible across multiple versions of the compiler, or requires language features that are newer (such as using GCC’s new builtin atomic intrinsics).

If you want to simply identify any GCC-family compiler (including the Intel C/C++ compilers), simply look for __GNUC__‘s existence.

size_t, ptrdiff_t
Use these instead of __SIZE_TYPE__, __PTRDIFF_TYPE__. Required by the C++ language standard and defined in <stddef.h> or <cstddef>. Occasionally, I’ll write code that should follow STL semantics, but it doesn’t bring in any other parts of the STL. I always forget that <cstddef> is the header to bring in for the machine-agnostic size types.

Still not quite as good as nullptr, GCC does have some magic that provides better type-checking than just using the literal 0 in C++. However, I’ve recently found that in GCC 4.3.0 that it introduces ambiguouity: std::vector<T*>::push_back(NULL) confuses the compiler, and I have to instead write T *nullptr = NULL; std::vector<T*>::push_back(nullptr). Once that breaks, I’ll just delete the first line.
Defined by GCC if the -ansi flag is set. You probably don’t need to depend upon it, but if you really want to be anal (or a jerk, depending on your point of view), you could error out if __STRICT_ANSI__ was not set, thereby forcing users to crank up the standardization in their compilers. In my experience, this in combination with Visual Studio’s DisableLanguageExtensions setting makes it much easier to write code that works across both GCC/Linux and MSVC/Win32.
Ever need to get automated build metrics out of a compilation phase? __TIMESTAMP__ records the system time when the compile took place, and __BASE_FILE__ records the name of the source file being compiled (as opposed to __FILE__, which lists the current file). You could craft up a common project header file that somehome embedded these values into the object file (say by using a static global in an anonymous namespace that writes those strings to an empty file descriptor). Not useful in 99% of the cases, but it’s a good way to add forensic traceability data to a build.
__FILE__, __LINE__, __func__, __PRETTY_FUNCTION__
Diagnostics for “where am I?” Note that the first three are standard and provided by any compiler, __PRETTY_FUNCTION__ is a GCC-ism and contains the “decorated name”, which includes all of the arguments and template instantiations. The only good reason I’ve heard to use __PRETTY_FUNCTION__ is the template instantiation inclusion (e.g. “this problem happens with std::vector<MyCrappyType>, and not with just any vector type).
This macro is also provided by the Visual Studio compiler. It simply expands to an increasing integer value, and the only use I’ve found for it is to provide unique ids in auto-generated code (say, for the TUT C++ testing framework).
A necessity for writing headers compatible with both C and C++. ‘Nuff said.
linux, unix, i386
Since I don’t really write code that’s cross-UNIXish-platform, I’ve never really had to mess with GCC on other platforms (BSD/Solaris/AIX/Win32), nor have I had to use other non-GCC-family compilers on Linux. I imagine these would come in handy for those cases, but I can’t speak to their exact uses.

Since I just noticed the original article I referred to has a snide comment about development environments, I’ll include this just for kicks: Predefined Macros in VC++ 2005.

First, a word of warning: This is not portable. Secondly… being able to produce stack traces (outside of the debugger) is something that’s usually reserved for languages like Python or Java… but it’s quite nice to have them in C++. There’s several hurdles to overcome, however.

Acquire Stack

This part is pretty easy, but unless you’re nosy with the header files in /usr/include, it’s not likely to stumble upon this by chance.

#include <execinfo.h>
void print_trace(FILE *out, const char *file, int line)
    const size_t max_depth = 100;
    size_t stack_depth;
    void *stack_addrs[max_depth];
    char **stack_strings;

    stack_depth = backtrace(stack_addrs, max_depth);
    stack_strings = backtrace_symbols(stack_addrs, stack_depth);

    fprintf(out, "Call stack from %s:%d:\n", file, line);

    for (size_t i = 1; i < stack_depth; i++) {
        fprintf(out, "    %s\n", stack_strings[i]);
    free(stack_strings); // malloc()ed by backtrace_symbols

Demangle C++ Names

GCC also provides access to the C++ name (de)mangler. There are some pretty hairy details to learn about memory ownership, and interfacing with the stack trace output requires a bit of string parsing, but it boils down to replacing the above inner loop with this:

#include <cxxabi.h>
for (size_t i = 1; i < stack.depth; i++) {
    size_t sz = 200; // just a guess, template names will go much wider
    char *function = static_cast(malloc(sz));
    char *begin = 0, *end = 0;
    // find the parentheses and address offset surrounding the mangled name
    for (char *j = stack.strings[i]; *j; ++j) {
        if (*j == '(') {
            begin = j;
        else if (*j == '+') {
            end = j;
    if (begin && end) {
        *begin++ = '';
        *end = '';
        // found our mangled name, now in [begin, end)

        int status;
        char *ret = abi::__cxa_demangle(begin, function, &sz, &status);
        if (ret) {
            // return value may be a realloc() of the input
            function = ret;
        else {
            // demangling failed, just pretend it's a C function with no args
            std::strncpy(function, begin, sz);
            std::strncat(function, "()", sz);
            function[sz-1] = '';
        fprintf(out, "    %s:%s\n", stack.strings[i], function);
        // didn't find the mangled name, just print the whole line
        fprintf(out, "    %s\n", stack.strings[i]);

There. You could do a bit more optimization, but I’ll leave that as an exercise to the reader. The important thing is to obey exactly what the ABI requires regarding dynamic memory:

  • you pass me a buffer created by malloc, along with the current size of the buffer.
  • i might realloc your buffer to make space for the whole name, and I’ll return the result, which may be different. Or I’ll fail and return NULL, because you didn’t pass in a mangled name I could understand.
  • you free my return value when you’re done (unless I returned NULL).

Otherwise, you’ll get a segmentation fault somewhere along the line, and a stack trace that blows up the program isn’t useful except for post-mortem analysis in gdb.

Export Symbols

Even if you go through all of the other steps, if you don’t account for this in your build phase, you will get a pretty useless stack trace. Here’s what I got from my experiment initially:

Call stack from backtrace.cxx:105:

After poking around, I realized I had to add -rdynamic to my linker flags so that all symbols would be exported into the executable. I haven’t experimented, but I would guess this applies to shared-object building as well.

With -rdynamic, my stack trace looks a lot nicer:

Call stack from backtrace.cxx:105:
    debug/backtrace:hot_potato::pass(double, double)

Bingo! I only get a source file and line number from the call site, but it’s not too difficult to trace back through callers from this point. In this case, my executable is named debug/backtrace and hot_potato is just an example class that calls its own functions to give me a pretty chain to look at.


Being able to create a stack trace is only half of it. Now you actually have to use it to get any value out of it. Here’s the rub: This is most useful when an exception gets caught, but the data is already gone by that time, because the exception’s been thrown. Logically then, it seems to make sense to encapsulate this functionality into an exception class (e.g. class app_exception : public std::exception) that gets thrown. Simply replace all of the fprintf calls with something that generates a string. execute it in the exception class’s constructor, and print it out in the catch block.

Another option would be to allocate thread-local storage for stack traces (similar to the current thread-local errno), and then have all of your important functions call set_thread_stacktrace, which populates that thread-local storage. Then the exception handlers can just pull that data regardless of the type of exception thrown. I think this is better from the flexibility aspect, but I haven’t actually investigated the feasability of it nor the performance impact of recalculating this all the time.

I spent about two hours today trying to debug a race condition in a multi-threaded C++ app today… definitely not a fun thing to do. The worst part? The runtime diagnostics weren’t giving me anything useful to work with! Sometimes things just worked, sometimes I got segmentation faults inside old, well-tested parts of the application. At one point, I saw this error pop up:

pure virtual method called
terminate called without an active exception

What? I know I can’t instantiate a class that has any pure-virtual methods, so how did this error show up? To debug it, I decided to replace all of the potentially-erroneous pure virtuals with stub functions that printed warnings to stderr. Lo and behold, I confirmed that polymorphism wasn’t working in my application. I had a bunch of Deriveds sitting in memory, and yet, the Base methods were being called.

Why was this happening? Because I was deleting objects while they were still in use. I don’t know if this is GCC-specific or not, but something very curious happens inside of destructors. Because the object hierarchy’s destructors get called from most-derived to least-derived, the object’s vtable switches up through parent classes. As a result, at some point in time (nondeterministic from a separate thread), my Derived objects were all really Bases. Calling a virtual member function on them in this mid-destruction state is what caused this situation.

Here’s about the simplest example I can think of that reproduces this situation:

#include <pthread.h>
#include <unistd.h>
struct base
    virtual ~base() { sleep(1); }
    virtual void func() = 0;
struct derived : public base
    virtual ~derived() { }
    virtual void func() { return; }
static void *thread_func(void* v)
    base *b = reinterpret_cast<base*>(v);
    while (true) b->func();
    return 0;
int main()
    pthread_t t;
    base *b = new derived();
    pthread_create(&t, 0, thread_func, b);
    delete b;
    return 0;

So what’s the moral of the story? If you ever see the error message pure virtual method called / terminate called without an active exception, check your object lifetimes! You may be trying to call members on a destructing (and thus incomplete) object. Don’t waste as much time as I did.

Anyone who has used GNU Make on a nontrivial project surely has at some point wanted to separate inputs from outputs and intermediates. For example, take a C++ shared library using GCC (using -MMD to autogenerate GNU make dependency information):

  • baz.cxx
  • bar.cxx
  • baz.o
  • baz.dep
  • bar.o
  • bar.dep
  • (symlink to
  • (symlink to

First, anyone who hasn’t already should read How Not to Use VPATH. Now, suppose that I want to put intermediates into $(CURDIR)/tmp and outputs into $(CURDIR)/lib, and those directories must be created as a part of the build. That means that at some point in time, the build script needs to execute mkdir tmp and mkdir lib, because GCC will not auto-create them. These are the solutions I’ve come upon (aside from the obvious but useless “don’t use make”):

  1. Order the rules so that the mkdirs occur before the g++s. Of course, this breaks parallel builds, so it’s not really an option. But it’s the most obvious solution, and it doesn’t break in serial builds.
  2. Modify the rule to depend upon the directory (tmp/%.o : %.cxx tmp). There’s only one call to mkdir, but now we get a new sort of error. Whenever the directory gets modified, it triggers recompilation of all of the objects!
  3. Make the compilation rule (tmp/%.o : %.cxx) execute the mkdir before executing the g++. There’ll be a lot of spurious calls to mkdir, but it should never break.
  4. make a local g++ wrapper script that creates the directory before compiling. Invoke the wrapper script from the Makefile instead of invoking the raw compiler.

Personally, I don’t think that any of those solutions is that great. The problem, as I see it, is the combination of how Make handles dependencies (by modified-timestamp) and how directory timestamps work on Linux (if you add or remove files, it “modifies” the directory). I’ve dealt with it using solutions #1-3, but I haven’t tried #4. I suppose for a large project that #4 isn’t such a bad idea… you can wrap up all of your system-wide rules into it, and then the make output gets shorter, as well. Take this as an example:


while read arg ; do
    if [ "$arg" = "-o" ] ; then
        read outdir
        mkdir -p $(dirname "$outdir")
done < <(echo "$@")

g++ -c -ansi -pedantic -std=c++98 \
    -O3 -m{arch=core2,sse2,fpmath=sse,inline-all-stringops} \
    -W{all,extra,format=2,write-strings,init-self,error} \
    -W{cast-align,cast-qual,pointer-arith,old-style-cast,overloaded-virtual} \
    -f{omit-frame-pointer,strict-aliasing,fast-math,tracer} \
    -I{/usr/local/include,/usr/java/jdk1.5/include,/usr/java/jdk1.5/include/linux} \

Ok, now I’m convinced. I do like #4… It seems like a lot of extra work for small one-off projects, but it’s a marginal cost for a larger environment (where you end up doing more complex tasks like code generation, combining disparate projects into a single build location, automated source-control integration).

Here’s a few tidbits that I’ve run into when building latency-sensitive applications in Linux…

gcc -c -fno-guess-branch-probabilities

If GCC is not provided with branch probability data (either by using -fprofile-generate or __builtin_expect), it will run its own estimate. Unfortunately, very minor changes in source code can have very major changes in these estimates, which can foil someone attempting to measure the effect of source-level optimizations. Personally, my opinion is to use -fprofile-generate and -fprofile-use for maximum performance, but the next best is to have consistent results while developing.

Also, see this brief gcc mailing list discussion about the topic.

ld -z now
Equivalent to gcc -Wl,-z,now (which is probably preferrable, since I write C++ and always use the g++ frontend), this tells the dynamic linker to resolve symbols on program start instead of on first use. I’ve noticed that sometimes, the first run through a particular code path results in a latency of 800 milliseconds. That’s huge; that’s big enough for a human to notice with the naked eye. For server processes, that generally also means that the first request serviced on startup blows chunks, latency-wise. While it may not make a huge difference in the grand scheme of things, it’s nice to be able to pay the cost of symbol lookup up front instead of at first use.

Get every new post delivered to your Inbox.