Sunday, 12 April 2015

GCC Bug 323 - a journey to the heart of floating-point darkness

What does the -O option for GCC do? Does it ever cause problems?

Here is a "spot the difference" puzzle:
These are screenshots from a Quake 2 demo. They come from exactly the same point in the demo. They are nearly the same. The one on the left comes from Quake 2 compiled in debug mode (no optimisation). The one on the right is from release mode (optimisations on). Where is the difference?

It is very hard to see it, even if I zoom in:
It's still hard to spot, unless I apply some image filters and highlight the affected area:
This difference occurs because of the way that the compiler, GCC, has optimised the code.

Most people who use GCC will have used settings such as -O2, a typical setting for GCC's code optimiser. In almost all cases, it increases the speed of compiled code, and it often reduces code size too. -O2 enables a large collection of individual optimisations, mostly aimed at removing redundant operations, e.g. keeping temporary data in registers to avoid the need for extra load/store instructions. Usually it is safe to assume that optimisation doesn't change the meaning of code - it just gets faster. But there are special cases.

During the Easter break, I worked on a benchmark program based on Quake 2. Some time ago, I made such a program based on Doom (and I'm pleased to say I have now got around to putting it on Github). I used that for testing FPGA hardware, a code generator for LLVM, at least two CPU simulators, and most recently the test coverage and timing tools made by my company Rapita Systems.

However, a problem with Doom is that it runs very quickly! There's a demo called "Doom Done Quick" which takes slightly under 20 minutes to play in real-time, completing every level in the game, but on a modern PC the whole demo can be played in under 10 seconds. Doom is too simple. Therefore, I'd like to also use Quake 2, which (like Doom) has a software 3D engine (though OpenGL support was also included).

For my "headless" version of Doom, I incorporated a test to prove that the benchmark really was rendering the graphics correctly, rather than taking shortcuts. The test computed a CRC-32 of each frame at the point where it would normally be drawn on the screen. This could be checked against an existing reference file. Differences would indicate a hardware or software problem.

This proved to be a good way to find bugs. I found at least three bugs in Doom just by compiling the program on different platforms and with different optimisation settings. When I saw differences in the output, I traced backwards through the code to see where the difference was coming from.

I wanted to do the same thing for Quake 2: draw each frame of the game while playing through some demo file, compute a CRC-32, and check it. It's the same sort of test, but using a larger, more complex program. I wrote a new "headless" graphics driver for Quake 2 which would do the CRC-32 calculations, storing them and/or checking them against a reference. And, fairly soon after doing so, I found the difference shown above. I traced the data backwards, to see where it was coming from.

The journey begins at the point where the pixel is drawn: D_DrawSpans16. The important parts of the drawing code are:

   t = (int)(tdivz * z) + tadjust;
   ...

   do {
      *pdest++ = *(pbase + (s >> 16) + (t >> 16) * cachewidth);
      s += sstep;
      t += tstep;
   } while (--spancount > 0);


pdest is a pointer to the current pixel location, and (s, t) is a location in a cached copy of the texture. The pixel is copied from the texture to the screen. In the optimised version, the pixel is copied from (68, 67). In the unoptimised version, the pixel is copied from (68, 68). This happens because the value of tadjust is greater (by 1):
  • optimised tadjust = 7449641
  • unoptimised tadjust = 7449642
Why is tadjust different? tadjust comes from D_CalcGradients in r_edge.c:

    tadjust = ((fixed16_t)(DotProduct (p_temp1, p_taxis) * 0x10000 + 0.5)) -
            ((pface->texturemins[1] << 16) >> miplevel)
            + pface->texinfo->vecs[1][3]*t;

All the inputs are the same in both cases, except for p_temp1[0]:
  • optimised p_temp1[0] = -147.310593
  • unoptimised p_temp1[0] = -147.310608
Notice that these are not integer values - we are now in floating-point territory. Quake 2 uses floating-point numbers in its game engine, mostly the "float" data type, which stores each number in 32 bits.

Why is p_temp1[0] different? It came from D_DrawSurfaces in r_edge.c, where a procedure called TransformVector is called to calculate it from another variable named modelorg. modelorg is different:
  • optimised modelorg[0] = -18.1849995
  • unoptimised modelorg[0] = -18.1850014
Why is modelorg[0] different? Tracing it back, modelorg[0] turns out to be a copy of cl.vieworg[0], so:
  • optimised cl.vieworg[0] = -18.1849995
  • unoptimised cl.vieworg[0] = -18.1850014
And cl.vieworg[0]? That comes from cl_ents.c in the Quake 2 source:

    int         i;
    float       lerp, backlerp;
    ps_t*       ps;
    ps_t*       ops;
...
    ps = &cl.playerstate;
    ops = &cl.oldplayerstate[cl.n];
    lerp = cl.lerpfrac;
...
    // calculate the origin
    if ((cl_predict->value) && ...)
    {   // use predicted values
        unsigned    delta;

        backlerp = 1.0 - lerp;
        for (i=0 ; i<3 ; i++)
        {
            cl.vieworg[i] = cl.predicted_origin[i] + ops->viewoffset[i]
                + cl.lerpfrac * (ps->viewoffset[i] - ops->viewoffset[i])
                - backlerp * cl.prediction_error[i];
        }
    }
    }


I think cl.vieworg represents the player's viewpoint in the virtual world. cl is a global variable containing all sorts of client-side state information; ops and ps are pointers to substructures within cl.

I found that all of the inputs to the calculation of cl.vieworg[0] are identical, but somehow, the output value of cl.vieworg[0] is different. What has the optimiser done? Has it introduced a bug?

Notice the local variables "lerp" and "backlerp", calculated as follows:

    lerp = cl.lerpfrac;
    backlerp = 1.0 - lerp;


The x86 floating-point unit (FPU) knows about three types of floating point number. One occupies 32 bits: this is known as "float" in C, and it's widely used in Quake 2. lerp and backlerp are both floats. Another type occupies 64 bits ("double" in C) and a third occupies 80 bits ("long double" in C). Larger sizes mean greater precision.

However, the FPU only distinguishes between the three types when copying floating-point data between the FPU and RAM. Internally, it always uses the 80 bit type. All of the registers are 80 bits wide. More precision is good, right?

Well... not entirely.

The code to calculate backlerp is just before the "for" loop:

    backlerp = 1.0 - lerp;

The optimised x86 code is as follows:

    flds 0x18(%esp)   # load lerp (local variable)
    fsubrs 0x8084104  # compute 1.0 - lerp (address 0x8084104 contains 1.0)

This stores the value of backlerp within the FPU register %st0, where the precision is 80 bits. The unoptimised x86 code is different:

   fld1              # load 1.0
   fsubs -0x20(%ebp) # compute 1.0 - lerp
   fstps -0x24(%ebp) # store backlerp

This stores the value of backlerp in RAM, where the precision is 32 bits. The 80-bit register value, with its 64-bit significand, is rounded down to the float's 24-bit significand, and the low-order bits are lost.

To illustrate the difference in precision, here are two values for backlerp computed from the above code. Start with lerp = 0.479999989:
  • optimised backlerp = 0.5200000107288360595703125
    (long double, 80 bits, more precise)
  • unoptimised backlerp = 0.519999981
    (float, 32 bits, less precise)
At this point, the two numbers are equivalent when converted to the "float" type. But that equivalence is not maintained when backlerp is used in a calculation. The small difference is propagated, and becomes significant enough to produce a different result:
  • optimised cl.vieworg[0] = -18.1849995
  • unoptimised cl.vieworg[0] = -18.1850014
That different result is why a different pixel is drawn on the screen.

Is this a compiler bug? At first glance, it seems like it: a different optimisation setting produces a different output. That impression is confirmed by trying out other versions of GCC on other architectures. You get the correct answer (-18.1850014) with GCC for AMD64 and with GCC for ARMv6. You also get the correct answer with old versions of GCC for x86 (e.g. version 3.3 or version 2.95). It looks like this is a bug, introduced in GCC 4.x, affecting x86 platforms only.

However, it's not a compiler bug. In fact, the dark heart of this problem is the FPU. The x86 FPU does not provide any way to insist that calculations should take place at a particular precision. Everything happens at 80-bit precision. So the results of a calculation may be different depending upon whether the temporary variables are stored in RAM (which is typical of unoptimised code) or stored in registers (which is typical of optimised code).

The x86 FPU design dates back to 1980. The design has certain disadvantages which do not exist in more recent designs. For instance, on AMD64, GCC uses the SSE vector unit for floating-point arithmetic. Every SSE floating-point operation has an associated precision. However, SSE was only introduced with Pentium III. Older x86 CPUs don't support the special SSE instructions, and so the original FPU is the only option.

The "bug" we are looking at here has been reported many, many times to the GCC developers. The bug number is 323, and here is the Bugzilla record.

Bug 323 is "the most often reported non-bug." It is legend. It is rediscovered at least every few months and reopened under a new name. And that only counts the people who bothered to report it on GCC's bug tracker. Many others may have noticed it but never reported it.

It doesn't help that the bug can appear in various forms, and unless you understand something of the low-level details of the x86 FPU, you may not realise that you are looking at a duplicate of bug 323.

Why can't GCC just work around it? Well - a hardware design flaw is one of the worst things that a compiler designer may encounter. The only thing that may be worse is a problem with the language itself! The x86 FPU design forces the compiler designer to make a choice:
  • Sacrifice speed by storing temporary values in RAM.
  • Sacrifice compatibility by using a different FPU, e.g. the SSE vector unit.
  • Sacrifice accuracy, and accept that programs using floating point may behave strangely as a result of optimisation.
None of these are great choices. By default, GCC 4.x sacrifices some accuracy, and allows the user to select the other behaviours with command-line options:
  • Store temporary values in RAM using the GCC option "-ffloat-store", and force a specific level of precision using "-fexcess-precision=standard".
    Disadvantage: the speed of the program is reduced.
  • Enable SSE using the GCC options "-mfpmath=sse -msse".
    Disadvantage: the program won't run on older x86 CPUs.
I found that the Clang compiler (version 3.5) uses SSE by default, so it gets the correct answer, but at the cost of compatibility. Perhaps this behaviour will also be the default in future versions of GCC. These days, x86 CPUs without SSE are quite rare.

When I use either of the options, I get the same output from x86 Quake 2 regardless of the optimisation setting, and I get the same results when I compile in debug mode for AMD64 and ARM. Work is ongoing: differences still exist when I compile in optimised mode for AMD64/ARM, but I don't yet know whether they are due to inaccuracies in the floating-point implementation or to other bugs. I have already found and fixed two 64-bit-specific bugs in the software renderer; there are probably more.

Summary:

Due to the design of the FPU that is part of the x86 architecture, compiler optimisations can cause floating-point operations to behave differently in subtle ways. These differences show up in tests that rely on perfect floating-point accuracy, such as the test I have been building from the Quake 2 source code. The problem is well known to compiler designers and is documented in the legendary GCC bug 323, which is frequently duplicated as GCC users "discover" the "bug". There is no easy solution - all possible workarounds reduce compatibility, speed or accuracy.

3 comments:

  1. Great article! Thanks for sharing your experience!

    My solution: use 80 bits for storing floating-point data in RAM for unoptimised x86 code (non-SSE).
    Disadvantages: larger memory consumption and therefore a performance penalty, but...
    ...who cares? It is unoptimised code, after all.

    Regards,

    ReplyDelete
    Replies
    1. Thanks.

      I suppose you might "#undef float" and then "typedef long double float;" (!) Though it might be quite a big change - every "float" would suddenly be a different size, which would affect data structures throughout the 3D engine, and perhaps break it, if there is code that assumes something about the size (which is quite likely to exist).

      Turns out there is another solution that I didn't think of. Contrary to what I already wrote, the x87 FPU *can* be forced to work entirely in "float" or "double" precision by setting flags in the control word. But the setting applies to the whole program: http://www.reddit.com/r/programming/comments/32o069/gcc_bug_323_a_journey_to_the_heart_of/cqd0br8 - and so it is still preferable to use SSE/SSE2 if possible.

      Delete
  2. > The x86 FPU does not provide any way to insist that calculations
    > should take place at a particular precision.

    That is not completely true. The x86 FPU certainly doesn't have a convenient or perfect way of doing this. However you can set the FPU's precision such that the mantissa will be rounded to 24 bits (float), 53 bits (double) or 64 bits (long double) after each operation.

    It's inconvenient because changing the rounding mode is expensive -- it is impractical to change it on a per-instruction basis. It is imperfect because the exponent is not clamped so the results are not identical.

    It turns out that SSE is not actually a fix. If you calculate "d = a + b + c" where they are all floats then the results may vary because the intermediate precision is not specified by C, C++, or IEEE. Expanding to double may be more accurate (or, maybe not) and staying at float precision may be less accurate and will definitely be faster. Different compilers and compiler versions make different choices:

    https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/

    ReplyDelete