I like to sneak bitshifts into interviews. Not because they’re common in modern C++ code, but because they used to be common, as a way of getting good performance out of poor compilers. It’s very useful to know the tricks of the past if you ever find yourself maintaining code written by an earlier generation of programmers. Take this example:
unsigned imul(unsigned x)
{
    return x * 10;
}

unsigned bitshift(unsigned x)
{
    return (x << 3) + (x << 1);
}
Below is the assembler produced by gcc -c -O3 -march=core2, using objdump -d --no-show-raw-insn to get the assembler from the compiled output. There are two interesting things to note:
- The compiler uses address calculation hardware for simple arithmetic: lea is basically a strided array access, base + stride * i.
- The compiler doesn’t use any shifts at all.
00000000 <imul>:
   0: push %ebp
   1: mov %esp,%ebp
   3: mov 0x8(%ebp),%eax
   6: pop %ebp
   7: lea (%eax,%eax,4),%eax     # n + 4n
   a: add %eax,%eax              # (n + 4n) + (n + 4n)
   c: ret

00000010 <bitshift>:
  10: push %ebp
  11: mov %esp,%ebp
  13: mov 0x8(%ebp),%eax
  16: pop %ebp
  17: lea 0x0(,%eax,8),%edx      # 8n
  1e: lea (%edx,%eax,2),%eax     # 8n + 2n
  21: ret
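To make the lea arithmetic concrete, here is a minimal C sketch that mirrors the two instruction sequences above. The helper `lea_model` and the names `imul_o3`, `bitshift_o3` and `check_o3_model` are made up for illustration; they only model the arithmetic and say nothing about how gcc works internally.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: models the arithmetic of lea, base + index * scale.
   On x86 the scale can only be 1, 2, 4 or 8. */
unsigned lea_model(unsigned base, unsigned index, unsigned scale)
{
    return base + index * scale;
}

/* Mirrors the -O3 body of imul():
   lea (%eax,%eax,4),%eax ; add %eax,%eax */
unsigned imul_o3(unsigned n)
{
    unsigned eax = lea_model(n, n, 4);  /* n + 4n = 5n */
    return eax + eax;                   /* 5n + 5n = 10n */
}

/* Mirrors the -O3 body of bitshift():
   lea 0x0(,%eax,8),%edx ; lea (%edx,%eax,2),%eax */
unsigned bitshift_o3(unsigned n)
{
    unsigned edx = lea_model(0, n, 8);  /* 8n */
    return lea_model(edx, n, 2);        /* 8n + 2n = 10n */
}

/* Spot-check both models against plain multiplication. Unsigned
   arithmetic wraps, so the results agree even when 10n overflows. */
void check_o3_model(void)
{
    const unsigned samples[] = { 0u, 1u, 7u, 12345u, UINT_MAX / 10u, UINT_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        unsigned n = samples[i];
        assert(imul_o3(n) == n * 10u);
        assert(bitshift_o3(n) == n * 10u);
    }
}
```

Calling `check_o3_model()` from a `main` of your own is enough to run the checks.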
For comparison, here is the same code compiled with gcc -c -O0 -march=i386. Note that shifts are used in both cases. If you try a few other values of -O and -march, you’ll see some other interesting results, but I’m not going to bother to paste them all here.
00000000 <imul>:
   0: push %ebp
   1: mov %esp,%ebp
   3: mov 0x8(%ebp),%edx
   6: mov %edx,%eax
   8: shl $0x2,%eax              # n << 2
   b: add %edx,%eax              # (n << 2) + n
   d: shl %eax                   # ((n << 2) + n) << 1
   f: leave
  10: ret

00000011 <bitshift>:
  11: push %ebp
  12: mov %esp,%ebp
  14: mov 0x8(%ebp),%eax
  17: lea 0x0(,%eax,8),%edx      # 8n
  1e: mov 0x8(%ebp),%eax
  21: shl %eax                   # n << 1
  23: lea (%edx,%eax,1),%eax     # 8n + (n << 1)
  26: leave
  27: ret
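Written back out as C, the -O0 sequences look roughly like the sketch below. The names `imul_o0`, `bitshift_o0` and `check_o0_model` are invented for illustration, and the local variables are named after the registers in the listing only to make the correspondence easy to follow.

```c
#include <assert.h>
#include <limits.h>

/* Mirrors the -O0 body of imul(): even without optimization,
   the multiply by 10 comes out as shift, add, shift. */
unsigned imul_o0(unsigned n)
{
    unsigned eax = n << 2;   /* shl $0x2,%eax : 4n */
    eax += n;                /* add %edx,%eax : 5n */
    return eax << 1;         /* shl %eax      : 10n */
}

/* Mirrors the -O0 body of bitshift(): the << 3 becomes a lea with
   scale 8, the << 1 stays a shift, and a final lea adds the two. */
unsigned bitshift_o0(unsigned n)
{
    unsigned edx = n * 8u;   /* lea 0x0(,%eax,8),%edx : 8n */
    unsigned eax = n << 1;   /* shl %eax              : 2n */
    return edx + eax;        /* lea (%edx,%eax,1),%eax: 10n */
}

/* Spot-check both sequences against plain multiplication,
   including a value where 10n wraps around. */
void check_o0_model(void)
{
    const unsigned samples[] = { 0u, 1u, 9u, 4096u, UINT_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        unsigned n = samples[i];
        assert(imul_o0(n) == n * 10u);
        assert(bitshift_o0(n) == n * 10u);
    }
}
```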
If you go through some of the major Intel processor models, you will see that the actual assembler output varies quite a bit. What does this mean? Mostly that micro-optimizations designed to produce ever-so-slightly better assembler are usually the wrong approach for long-lived software. Yet code like this turns up in any sufficiently large system, so it must be understood and, where possible, fixed.
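When you do run into one of these shift-and-add expressions, the first job is to work out which constant it encodes. Here is a minimal sketch, assuming the expression is a plain sum of left shifts of the same unsigned value. The helper `shift_sum_multiplier` and the other names are hypothetical, made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: for an expression of the form
   (x << s0) + (x << s1) + ..., return the constant k such that the
   expression equals x * k in unsigned (wrapping) arithmetic.
   Assumes every shift amount is smaller than the bit width of unsigned. */
unsigned shift_sum_multiplier(const unsigned *shifts, size_t count)
{
    unsigned k = 0;
    for (size_t i = 0; i < count; ++i)
        k += 1u << shifts[i];
    return k;
}

/* The legacy form from the example above, and the plain replacement. */
unsigned legacy_times_ten(unsigned x) { return (x << 3) + (x << 1); }
unsigned plain_times_ten(unsigned x)  { return x * 10; }

/* Spot-check that the replacement preserves behaviour,
   including a value where the result wraps around. */
void check_replacement(void)
{
    const unsigned shifts[] = { 3, 1 };
    assert(shift_sum_multiplier(shifts, 2) == 10u);
    for (unsigned x = 0; x < 100000u; ++x)
        assert(legacy_times_ten(x) == plain_times_ten(x));
    assert(legacy_times_ten(~0u) == plain_times_ten(~0u));
}
```

This only holds for unsigned operands. Left-shifting a negative signed value is not something you can rely on, so signed code needs a closer look before it gets the same treatment.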
Developers working on embedded systems with a restricted set of compilers… YMMV. Sorry.
