
                       FPU Handling in the Fiasco Microkernel

                    Udo A. Steinberg <us15@os.inf.tu-dresden.de>

                                  31 March 2003


1. FPU State Handling
---------------------

In a multi-threaded environment like the Fiasco microkernel, the operating
system must arbitrate concurrent access to the FPU so that each thread has
the illusion of using the FPU exclusively. Because Fiasco threads can be
preempted at virtually any time, the kernel must save and restore the FPU
state across context switches. There are two approaches: eager and lazy FPU
handling. With eager handling, the kernel saves the FPU state of the
outgoing thread and restores the state of the incoming thread on every
context switch. With lazy handling, the kernel defers this work until a
thread other than the one whose state is loaded actually accesses the FPU.
Intel x86 processors and compatibles support this by allowing the kernel to
mark the FPU busy by setting the TS-Bit in the CR0 register. While the FPU
is marked busy, any access to it causes the processor to generate fault #7
(Device Not Available). Operating systems use this bit to implement various
forms of lazy FPU saving. Whereas older Fiasco kernels implemented eager FPU
saving and lazy FPU restoration, the current FPU code handles both
operations lazily.

At bootup the FPU is marked busy (TS-Bit=1) and the FPU owner thread is set
to none. A thread which accesses the FPU will cause the generation of
exception #7. The kernel's exception handler saves the FPU state of the
previous FPU owner thread and then restores the FPU state of the current
thread which caused the fault to be generated. There are two special cases:
First, if there is no previous FPU owner, the saving part is omitted.
Second, if the current thread does not yet have an FPU state, one is
allocated for that thread and the FPU is reinitialized to default values,
giving the thread a clean FPU state to work with. The current thread then
becomes the new FPU owner and the FPU is no longer marked busy (TS-Bit=0)
until the next context switch. When a context switch occurs, the TS-Bit is
set or cleared depending on the switch target: if the target is not the FPU
owner, the FPU is marked busy; if the target is the FPU owner, the FPU is
no longer marked busy.
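
The lazy scheme described above can be sketched as a small user-space model.
This is not Fiasco's code: the types Cpu, Thread and FpuState and all member
names are hypothetical, CR0.TS is modelled as a plain bool, and the FPU
registers are reduced to a control and a status word. The value 0x037F is
the control word the FPU holds after initialization.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical, simplified model; none of these names are Fiasco's real API.
struct FpuState {
  std::uint16_t control = 0x037F;  // control word after FPU initialization
  std::uint16_t status = 0;        // status word, no exception bits set
};

struct Thread {
  std::unique_ptr<FpuState> fpu_state;  // allocated on first FPU access
};

struct Cpu {
  bool ts = true;               // models CR0.TS: FPU marked busy at bootup
  FpuState regs;                // models the live FPU registers
  Thread *fpu_owner = nullptr;  // no FPU owner at bootup

  // Models the handler for fault #7 (Device Not Available).
  void device_not_available(Thread &current) {
    assert(ts && fpu_owner != &current);
    if (fpu_owner)                     // special case 1: no owner, no save
      *fpu_owner->fpu_state = regs;
    if (!current.fpu_state) {          // special case 2: first FPU access
      current.fpu_state = std::make_unique<FpuState>();
      regs = FpuState{};               // reinitialize FPU to default values
    } else {
      regs = *current.fpu_state;       // restore the thread's saved state
    }
    fpu_owner = &current;
    ts = false;                        // FPU no longer marked busy
  }

  // Models the TS-Bit update during a context switch.
  void switch_to(Thread &next) { ts = (fpu_owner != &next); }

  // Models the running thread executing an FPU instruction that sets
  // the given status bits; faults first if the FPU is marked busy.
  void use_fpu(Thread &current, std::uint16_t status_bits) {
    if (ts)
      device_not_available(current);
    regs.status |= status_bits;
  }
};
```

In this model a thread that never touches the FPU never causes a save or a
restore, which is the point of handling both operations lazily.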

The data structures used by the FPU handling code are as follows:

- A context pointer "Fpu::owner" which points to the thread which most
  recently used the FPU, i.e. whose FPU state is currently loaded in the FPU.

- An FPU slab for each thread which is allocated on demand the first time
  the thread accesses the FPU and which is used to save/restore the FPU state.

- A bit "Thread_fpu_owner" in each thread's state which is set if that
  thread is currently the FPU owner. It speeds up the test whether a thread
  is the FPU owner, because the thread state is almost always hot in the
  cache whereas Fpu::owner is not.
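
A minimal sketch of how these three pieces could fit together follows. Only
the names Fpu::owner and Thread_fpu_owner come from the text above; the
Context and Fpu_state_slab types, the bit position and the set_owner helper
are assumptions made for illustration. The slab is sized for the fxsave
format, which needs 512 bytes aligned to 16 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit position; the real value of Thread_fpu_owner may differ.
constexpr std::uint32_t Thread_fpu_owner = 1u << 12;

// Hypothetical save area, large enough for the fxsave image.
struct alignas(16) Fpu_state_slab {
  unsigned char bytes[512];
};

// Hypothetical per-thread context.
struct Context {
  std::uint32_t state = 0;              // thread state bits
  Fpu_state_slab *fpu_state = nullptr;  // allocated on first FPU access

  // Fast test: touches only the thread state, not Fpu::owner.
  bool is_fpu_owner() const { return (state & Thread_fpu_owner) != 0; }
};

struct Fpu {
  static Context *owner;  // thread whose state is loaded in the FPU

  // Hypothetical helper keeping the pointer and the state bit in sync.
  static void set_owner(Context *next) {
    if (owner)
      owner->state &= ~Thread_fpu_owner;
    owner = next;
    if (next)
      next->state |= Thread_fpu_owner;
  }
};

Context *Fpu::owner = nullptr;

static_assert(sizeof(Fpu_state_slab) == 512, "fxsave image is 512 bytes");
static_assert(alignof(Fpu_state_slab) == 16, "fxsave needs 16-byte alignment");
```

Updating the bit and the pointer in one place keeps the invariant that the
bit is set for exactly the thread Fpu::owner points to.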

2. FPU Exception Handling
-------------------------

The FPU reports errors like "division by zero" by means of exceptions. If
such an exception occurs, the FPU sets the respective exception bit in the
FPU status word and proceeds with the execution of the next FPU instruction.
"Waiting" FPU instructions check the FPU status for pending exceptions
beforehand and in case of a pending exception, a fault #16 (Math Fault) is
signalled to the CPU. "Non-waiting" FPU instructions do not check for pending
exceptions and thus any pending FPU exception will not be delivered until
the execution of the next "waiting" instruction. Because CPU and FPU execute
concurrently, the operating system must ensure that deferred exceptions are
delivered to the thread which caused them. Consider a thread that generates
an FPU exception and is then preempted in favor of another thread. When the
second thread executes a "waiting" FPU instruction, the exception is
delivered "out of context" of the originating thread. The eager exception
handling approach avoids this by executing "fwait" before a thread is
suspended during a context switch, which immediately delivers any pending
FPU exceptions. However, as the lazy approach shows, this is not necessary.
Because the exception bits are part of the FPU state, they are saved and
restored the same way as the rest of the FPU state. After an FPU state with
exception bits set has been restored, the next "waiting" FPU instruction
recognizes the exception state and signals a math fault to the CPU. For this
to work, the kernel must not consume, and thereby lose, pending user FPU
exceptions, and ideally it should not generate any FPU exceptions in kernel
mode at all.
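
The following model illustrates why saving and restoring the exception bits
is sufficient. It is a sketch under simplifying assumptions: the FPU is
reduced to its control and status words, only the six exception flags
(status bits 0 to 5) and their mask bits (control bits 0 to 5) are
considered, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint16_t kExceptionBits = 0x003F;  // flag/mask bits 0..5
constexpr std::uint16_t kZeroDivide = 1u << 2;    // zero-divide flag/mask

struct FpuRegs {
  std::uint16_t control = 0x037F;  // default: all exceptions masked
  std::uint16_t status = 0;
};

// Unmask the given exceptions in the control word.
inline void unmask(FpuRegs &live, std::uint16_t bits) {
  assert((bits & ~kExceptionBits) == 0);
  live.control = static_cast<std::uint16_t>(live.control & ~bits);
}

// Record an exception in the status word without delivering it yet.
inline void raise(FpuRegs &live, std::uint16_t bits) {
  assert((bits & ~kExceptionBits) == 0);
  live.status = static_cast<std::uint16_t>(live.status | bits);
}

// A "waiting" instruction faults if an unmasked exception is pending.
inline bool waiting_instruction_faults(const FpuRegs &live) {
  return (live.status & ~live.control & kExceptionBits) != 0;
}

// Models a non-waiting save that also reinitializes the FPU (as fnsave does).
inline FpuRegs save_and_reinit(FpuRegs &live) {
  FpuRegs saved = live;
  live = FpuRegs{};
  return saved;
}

// Models restoring a previously saved state into the FPU.
inline void restore(FpuRegs &live, const FpuRegs &saved) { live = saved; }
```

If thread A leaves a zero-divide exception pending and is preempted, the
pending bit travels with A's saved state: thread B sees a clean FPU, and the
math fault is raised only after A's state has been restored.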

Fiasco does not use the FPU itself. However, it executes FPU instructions
to save and restore the FPU state. These are:

fnsave/frstor        on CPUs without Fast FPU-Save/Restore (FXSR)
fxsave/fxrstor       on CPUs with Fast FPU-Save/Restore (FXSR)
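
As an illustration, the choice between the two instruction pairs could be
derived from the CPUID feature flags. FXSR support is reported in bit 24 of
EDX for CPUID leaf 1. The function and type names below are hypothetical,
the input is a plain integer rather than a real CPUID query, and the sizes
are those of the save images (108 bytes for fnsave in 32-bit protected mode,
512 bytes for fxsave).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// CPUID leaf 1, EDX bit 24: fxsave/fxrstor supported.
constexpr std::uint32_t kCpuidEdxFxsr = 1u << 24;

// Hypothetical description of the chosen instruction pair.
struct SaveRestoreChoice {
  const char *save;        // mnemonic used to save the FPU state
  const char *restore;     // mnemonic used to restore the FPU state
  std::size_t area_bytes;  // size of the save image in bytes
};

// Pick the instruction pair from the EDX value of CPUID leaf 1.
inline SaveRestoreChoice choose_save_restore(std::uint32_t cpuid1_edx) {
  if (cpuid1_edx & kCpuidEdxFxsr)
    return SaveRestoreChoice{"fxsave", "fxrstor", 512};
  return SaveRestoreChoice{"fnsave", "frstor", 108};
}

// Hypothetical sanity check that a save area can hold the chosen image.
inline void check_slab_size(const SaveRestoreChoice &c, std::size_t slab) {
  assert(slab >= c.area_bytes);
}
```

Sizing every save area for the larger fxsave image would allow the same
allocation to be used on both kinds of CPU; whether Fiasco does so is not
stated here.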

fnsave, fxsave and fxrstor are non-waiting FPU instructions and thus will
not cause the delivery of pending user FPU exceptions in kernel mode. The
frstor instruction is a "waiting" FPU instruction. Therefore special care
must be taken to ensure that no FPU exceptions are pending when frstor is
executed. There are two possible scenarios:

- If a previous FPU owner exists, fnsave has been executed to save that
  owner's state before the current thread's FPU state is restored via
  frstor. fnsave completely reinitializes the FPU after storing its state
  and thus ensures that no exception bits are set.

- If no previous FPU owner exists, fnsave has not been executed and
  exceptions may still be pending in the FPU, namely if the thread which
  most recently used the FPU left it in an exceptional state and was then
  shut down. In that case the kernel executes "fnclex", which itself is
  non-waiting, to clear any stale pending exception bits.
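
The two scenarios can be expressed as a small model of the restore path on
CPUs without FXSR. This is a sketch, not the kernel's code: the functions
with the _model suffix only imitate the documented effect of the
corresponding instructions on the control and status words, and
switch_fpu_state is a hypothetical name. fnclex clears status bits 0 to 7
and bit 15, which is why the mask 0x7F00 is used.

```cpp
#include <cassert>
#include <cstdint>

struct FpuRegs {
  std::uint16_t control = 0x037F;  // default: all exceptions masked
  std::uint16_t status = 0;
};

inline bool unmasked_exception_pending(const FpuRegs &r) {
  return (r.status & ~r.control & 0x003F) != 0;
}

// fnsave: non-waiting, stores the state, then reinitializes the FPU.
inline void fnsave_model(FpuRegs &live, FpuRegs &area) {
  area = live;
  live = FpuRegs{};
}

// fnclex: non-waiting, clears the exception flags, ES, SF and B.
inline void fnclex_model(FpuRegs &live) { live.status &= 0x7F00; }

// frstor: "waiting", so it must never see a pending unmasked exception.
inline void frstor_model(FpuRegs &live, const FpuRegs &area) {
  assert(!unmasked_exception_pending(live));  // would fault in kernel mode
  live = area;
}

// prev_area is the previous FPU owner's save area, or nullptr if none.
inline void switch_fpu_state(FpuRegs &live, FpuRegs *prev_area,
                             const FpuRegs &next_area) {
  if (prev_area)
    fnsave_model(live, *prev_area);  // scenario 1: save reinitializes FPU
  else
    fnclex_model(live);              // scenario 2: drop stale exceptions
  frstor_model(live, next_area);
}
```

In both branches the live FPU holds no pending unmasked exception when the
restore runs, so the assertion in frstor_model cannot fire.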

For all the other gory details about how the FPU works, see the "IA-32 Intel
Architecture Software Developer's Manual" or read the Fiasco source.
