Most software developers have stumbled across the humble C atoi() function. And for most simple jobs, it is sufficient. However, I’ve actually run into cases where it becomes the performance bottleneck, so I’ve spent a little bit of time playing around with various optimizations.

So, what are the problems with the standard atoi() that would cause me to even think about reimplementing it myself for a custom case?

  • It doesn’t provide useful error indicators. It just returns 0 on failure, which is indistinguishable from a successful conversion of “0”. It also doesn’t consider trailing non-digits to be errors, so its return value can’t really be trusted for a total conversion anyway.
  • It’s at least an order of magnitude slower than rolling your own.

Overall, I’d probably stick with boost::lexical_cast or std::istringstream for the average case, where the performance difference is negligible. Since I always use C++, I don’t see any reason to drop down to strtol, but I suppose I would use it if I were writing C code.

A Simple Replacement

For your edification, here’s a simple replacement function that covers most of the stock C version’s behavior:

int atoi(const char *str) {
    while (*str == ' ') ++str; // skip leading whitespace
    int sign = 1;
    switch (*str) { // handle leading +/-
        case '-': ++str; sign = -1; break;
        case '+': ++str; break;
    }
    int value = 0;
    while (char c = *str++) {
        if (c < '0' || c > '9' || value < 0) return 0; // non-digit, or overflow
        value *= 10;
        value += c - '0';
    }
    if (value < 0) return 0; // integer overflow
    return value * sign;
}

Optimizing my Special Case

In the cases I care about, there are a few additional constraints I can typically take advantage of:

  • Negative numbers can show up, but positive numbers never have a leading + character. Leading 0s never show up.
  • Size of the string is known a priori, so I can work from the back to the front.
  • Maximum size of the string is known (a 32-bit integer isn’t ever going to be more than 10 characters), so I can manually unroll the loop.
  • The caller has validated that there are no non-digits (or doesn’t care if an input error cascades to an output error silently), so I don’t need to check if all values are between 0 and 9 inclusive.

So here’s the optimized version. For some reason, wrapping this up in a function object (called many times per instantiation) performed significantly better than using a plain old global function.

class atoi_func
{
public:
    atoi_func(): value_() {}

    inline int value() const { return value_; }

    inline bool operator() (const char *str, size_t len)
    {
        value_ = 0;
        int sign = 1;
        if (str[0] == '-') { // handle negative
            sign = -1;
            ++str;
            --len;
        }

        switch (len) { // handle up to 10 digits, assume we're 32-bit; cases deliberately fall through
            case 10:    value_ += (str[len-10] - '0') * 1000000000;
            case  9:    value_ += (str[len- 9] - '0') * 100000000;
            case  8:    value_ += (str[len- 8] - '0') * 10000000;
            case  7:    value_ += (str[len- 7] - '0') * 1000000;
            case  6:    value_ += (str[len- 6] - '0') * 100000;
            case  5:    value_ += (str[len- 5] - '0') * 10000;
            case  4:    value_ += (str[len- 4] - '0') * 1000;
            case  3:    value_ += (str[len- 3] - '0') * 100;
            case  2:    value_ += (str[len- 2] - '0') * 10;
            case  1:    value_ += (str[len- 1] - '0');
                if (value_ < 0) return false; // overflow wrapped around
                value_ *= sign;
                return true;
            default:
                return false;
        }
    }
private:
    int value_;
};

Results

I compared this optimized version with the standard library function by running each 1 billion times in a row. I know this benchmark methodology is flawed, particularly because it reduces the impact of instruction cache misses. However, I expect that bias only pessimizes my specialization, which I would expect to exhibit much better locality due to aggressive inlining. Here are the times:

standard atoi()
79142 milliseconds, or approximately 79 nanoseconds per iteration amortized.
class atoi_func
131 milliseconds. On my machine, that works out to about 4 clock cycles per call, which can’t be right. My best guess is that the numbers are too small for my clock’s resolution. I’ll take this to work and try it on our machines there; they have much higher resolution for clock_gettime(CLOCK_REALTIME).