DragonFly commits List (threaded) for 2011-12
git: kernel - Add workaround support for a probable AMD cpu bug related to cc1
Author: Matthew Dillon <firstname.lastname@example.org>
Date: Sun Dec 25 13:47:39 2011 -0800
kernel - Add workaround support for a probable AMD cpu bug related to cc1
* Add supporting inlines and a #define. See the followup commit to
the gcc-4.4 code in the DFly codebase.
* This bit of code is used to add a single NOP instruction just prior to
the pop/ret sequence in cc1's fill_sons_in_loop() which works around
what we believe to be a very difficult to reproduce AMD cpu bug. The
bug appears to be present on contemporary AMD cpus and was replicated
on a Phenom(tm) II X4 820 Processor (Origin = "AuthenticAMD" Id = 0x100f42
Stepping = 2) and on a 12-core AMD Opteron(tm) Processor 6168
(Origin = "AuthenticAMD" Id = 0x100f91 Stepping = 1).
* The bug is extremely sensitive to %rip and %rsp values as well as
stack memory use patterns and appears to cause either the %rip or the
%rsp to become corrupt during the multi-register-pop/ret sequence at
the end of fill_sons_in_loop() in the GCC 4.4.7 codebase. This
procedure is called as part of a deep tree recursion which exercises both
the AMD RAS (Return Address Stack) hardware circuitry and probably also
the write combining circuitry.
* I have so far only been able to reproduce the bug on DragonFly but have
to the best of my ability eliminated the OS as a possible source of the
problem over the last few months. I am currently attempting to reproduce
the bug running FreeBSD on the same hardware but it's virtually impossible
to replicate the exact environment without adding DragonFly binary emulation
to FreeBSD (which I just might have to do to truly verify that the bug is
not a DragonFly OS bug).
* Bug reproducibility: DragonFly utilizes a 0-1023 (~16 byte aligned)
random stack gap. Under normal buildworld -j 25 or similar conditions
it can take anywhere up to 2 days to cause a failure. Using a fixed
stack gap of 904 (sysctl kern.stackgap_random=-904) on a particular cc1
line during the compilation of gcc-4.4 using gcc-4.4, compiling gcc/mcf.c,
with a carefully constructed environment and command path (to replicate
a precise starting %rsp for main() of 0x7fffffffe818), I was
able to replicate the bug in around a 60-second time frame with
approximately one out of every 16 compiles hitting the bug and failing.
* Changing the stackgap and/or modifying the code in any way (e.g. causing a
shift in the %rip values) changes the characteristics of the bug, sometimes
causing it to stop appearing entirely.
It was found that an adjustment of the stackgap in 32768 byte increments
starting at the gap known to fail also reproduces the bug with the same
consistency as the original stackgap value.
* Only the fill_sons_in_loop() function in cc1 in a few particular cases
appears to be able to trigger the bug, across all the compiles we've
done over a year.
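The stack gap arithmetic described above can be sketched in C. The observation that 32768-byte steps reproduce the bug is equivalent to saying that the low 15 bits of %rsp stay the same, since 32768 is 2^15. This is only an illustration: the helper names are hypothetical, the constants are the ones quoted in this message, and the mapping from gap to main()'s %rsp (a larger gap lowering %rsp by the same number of bytes) is an assumption, not something taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the commit message. */
#define FAILING_GAP      904L
#define FAILING_MAIN_RSP UINT64_C(0x7fffffffe818)
#define GAP_PERIOD       32768L		/* 2^15 bytes */

/*
 * Hypothetical helper: the k-th fixed gap expected to reproduce the bug,
 * written as the negative value passed to kern.stackgap_random
 * (k = 0 gives the -904 quoted in the message).
 */
static long
stackgap_sysctl_value(long k)
{
	assert(k >= 0);
	return -(FAILING_GAP + k * GAP_PERIOD);
}

/*
 * ASSUMPTION: enlarging the gap by N bytes lowers main()'s starting
 * %rsp by exactly N bytes.
 */
static uint64_t
predicted_main_rsp(long k)
{
	assert(k >= 0);
	return FAILING_MAIN_RSP - (uint64_t)k * GAP_PERIOD;
}

/* Low 15 bits of an address; unchanged by any multiple of 32768. */
static uint64_t
low15(uint64_t addr)
{
	return addr & (uint64_t)(GAP_PERIOD - 1);
}
```

Under that assumption every gap in the series leaves low15() of the starting %rsp at 0x6818, which would be consistent with the failure depending on the low-order address bits rather than on the absolute stack address.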
Summary of changes:
sys/cpu/i386/include/cpufunc.h | 32 ++++++++++++++++++++++++++++++++
sys/cpu/x86_64/include/cpufunc.h | 32 ++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+), 0 deletions(-)
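For readers without the diff at hand, the kind of inline plus #define described above might look roughly like the following. This is a sketch only: HAVE_CPU_BUG_NOP, cpu_bug_nop() and sum_to() are hypothetical names chosen for illustration, not the identifiers added to cpufunc.h, and the GCC/Clang extended asm syntax is assumed.

```c
#include <assert.h>

/*
 * HYPOTHETICAL names. The #define lets other code (for example a
 * compiler built against these headers) test whether the workaround
 * inline is available.
 */
#define HAVE_CPU_BUG_NOP 1

/* Emit a single NOP instruction; "volatile" keeps it from being dropped. */
static inline void
cpu_bug_nop(void)
{
	__asm__ __volatile__("nop");
}

/*
 * Example caller: place the NOP at the end of the function body, just
 * ahead of the return. The compiler is still free to schedule the
 * epilogue around it, so exact placement before the pop/ret sequence
 * would have to be confirmed in the disassembly.
 */
static int
sum_to(int n)
{
	int total = 0;

	assert(n >= 0);
	for (int i = 1; i <= n; i++)
		total += i;
#ifdef HAVE_CPU_BUG_NOP
	cpu_bug_nop();
#endif
	return total;
}
```

The one-byte NOP does no work of its own; it shifts the %rip of the instructions that follow it, which matches the sensitivity to %rip values noted in the commit message.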