DragonFly kernel List (threaded) for 2005-06
Re: SPL vs. Critical Section vs. Mutexing for device synchronisation
:Hi Matt, hi all,
:as I said before, we should not start to fall into an adhoc-change mode,
:but think carefully about what we want and need first. I want to
:describe the advantages and problems of the mechanisms used here first
:and what I want to have afterwards.
There are certain things we HAVE to do, regardless of where we
want to end up.
Both SPLs and critical sections work only on the local cpu, which
means that as a general interrupt protection mechanism they only
really work when the BGL (Big Giant Lock) is being held. This is
the case with the current system.
We obviously want to get rid of the BGL. At a minimum this means that
SPLs will no longer function. Therefore we have to get rid of SPLs.
Without the BGL critical sections still serve a purpose... they
interlock against interrupts on the local cpu. More specifically,
they interlock against *IPI* interrupts on the local cpu and thus
still represent an excellent mechanism for interlocking in subsystems
which use cpu-local threading (such as our networking subsystem),
or need to protect cpu-local variables (such as our LWKT threading
subsystem).
No matter what we do we have to get rid of SPLs. We have a
chicken-and-egg problem here. SPLs have to go in order to progress
towards our MP goals, but we have not yet rewritten the hundreds of
drivers that depend on SPLs. Yet they have to go... chicken and
egg. We can't do it all in one go so we have to take a baby step and
the first baby step is to replace SPLs with critical sections.
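
What a cpu-local critical section provides can be illustrated with a
small userland model. This is only a sketch, not DragonFly kernel code:
every name in it (model_cpu, model_crit_enter, model_crit_exit,
model_interrupt) is hypothetical, and it assumes that an interrupt
arriving inside a critical section is recorded in a per-cpu pending mask
and run when the outermost section exits.

```c
/* Userland model of a cpu-local critical section (hypothetical names). */
#include <assert.h>
#include <stdint.h>

struct model_cpu {
    int      crit_count;   /* nesting depth of critical sections */
    uint32_t pending;      /* interrupts deferred while crit_count > 0 */
    int      handled[32];  /* how many times each handler has run */
};

/* Run every handler that was deferred on this cpu. */
static void model_run_pending(struct model_cpu *cpu)
{
    for (int irq = 0; irq < 32; irq++) {
        uint32_t bit = UINT32_C(1) << irq;
        if (cpu->pending & bit) {
            cpu->pending &= ~bit;
            cpu->handled[irq]++;
        }
    }
}

static void model_crit_enter(struct model_cpu *cpu)
{
    cpu->crit_count++;
}

static void model_crit_exit(struct model_cpu *cpu)
{
    assert(cpu->crit_count > 0);
    /* Only the outermost exit releases deferred interrupts. */
    if (--cpu->crit_count == 0)
        model_run_pending(cpu);
}

/* An interrupt delivered to THIS cpu. */
static void model_interrupt(struct model_cpu *cpu, int irq)
{
    assert(irq >= 0 && irq < 32);
    if (cpu->crit_count > 0)
        cpu->pending |= UINT32_C(1) << irq;  /* defer until crit_exit */
    else
        cpu->handled[irq]++;                 /* run immediately */
}
```

Note that nothing in the model involves a second cpu. The state is
entirely per-cpu and needs no locked instruction, which is why it is
cheap, and also why it cannot protect against code running elsewhere.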
This isn't an adhoc change... SPLs have to go, period. But we can't
achieve a lockless goal all in one step. It simply isn't possible.
The system is too complex. This change allows us to consolidate the
code and remove the SPL support from the interrupt paths entirely.
Since SPLs won't work in an SMP environment without the Big Giant Lock,
this represents considerable forward progress. Once SPLs are gone I
will be able to easily remove a hundred lines or more of
hybrid C and assembly. The only assembly left will be the assembly that
handles critical sections, and since critical sections are necessary
even once the BGL is removed, the removal of the SPL and CPL checks
will represent a major cleanup of our low level interrupt and preemption
code and serious forward progress towards our goals.
:(A) Deferred interrupt processing via critical sections
:The primary advantage of critical sections is the simplicity. They
:are very easy to use and as long as there is no actual interrupt
:they are very cheap too. As a result they work best when used over
:short periods of time. Interrupt processing can be implemented either
:via a FIFO model or via interrupt masks.
This only works because we hold the BGL. Without the BGL critical
sections cannot be used to protect against interrupts. Therefore,
while the interrupt subsystem is able to depend on critical sections
now, IT WON'T BE ABLE TO IN THE FUTURE. A new MP-SAFE API is needed.
At the moment I have created a mutex-like (locked bus cycle)
serialization API that is MP safe. The abstraction is general enough
that we should be able to replace the internals with something
better (aka lockless) in the future. But right now it's the only
thing we have which is inter-cpu safe.
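
As an illustration of what a mutex-like serializer built on a locked bus
cycle looks like, here is a minimal userland sketch using C11 atomics.
It is not the API that was committed: the names model_serializer,
model_serialize_init, model_serialize_try, model_serialize_enter and
model_serialize_exit are hypothetical, and a plain spin is assumed where
a real kernel implementation might block.

```c
/* Userland sketch of a spin-style serializer (hypothetical names). */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_serializer {
    atomic_flag locked;   /* set while some cpu holds the serializer */
};

static void model_serialize_init(struct model_serializer *s)
{
    atomic_flag_clear(&s->locked);
}

/* One atomic test-and-set; returns true if the caller now holds it. */
static bool model_serialize_try(struct model_serializer *s)
{
    return !atomic_flag_test_and_set_explicit(&s->locked,
                                              memory_order_acquire);
}

/* Spin until the serializer is acquired. */
static void model_serialize_enter(struct model_serializer *s)
{
    while (!model_serialize_try(s))
        ;   /* busy-wait; another cpu holds the serializer */
}

static void model_serialize_exit(struct model_serializer *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Callers only ever see enter and exit, so the internals could later be
swapped for something cheaper without touching the drivers that use it.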
:The down-side of critical sections is the coarse granularity, making
:them unsuitable for anything that does not take a short period of time. Similar
:issues to the mutex concept below apply. It also means we have to be
:on the same CPU as the interrupt.
But this isn't necessarily true. What really matters here is
* What percentage of the time is a cpu holding a critical section, and
* What percentage of interrupts are being delayed due to code being in
a critical section.
Critical sections are not often held for long periods of time. Sure
there are a few exceptions, but even 'long' procedural paths such as
malloc() or lwkt_switch() typically only hold a critical section for
a microsecond or two. The paths we really care about are the ones
that either hold a critical section for a very long period of time,
such as through a DELAY() call, or processing loops (such as in CAM
or in device drivers) which can potentially process hundreds or thousands
of events and thus take an unbounded amount of time with a critical
section held. Processing loops, at least, can be dealt with by adding
an splz() call in the loop, but they still incur the additional
interrupt overhead of delaying the interrupt.
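
The effect of polling inside a processing loop can be sketched with a
toy model. This is a userland illustration only: model_process and
model_run_pending are hypothetical names, and the sketch assumes that
the polling call stands in for splz() and lets a deferred interrupt run
at that point. Delay is counted in events processed, not in time, and
the extra cost of taking and deferring the interrupt is not modeled.

```c
/* Toy model: interrupt delay in a long loop, with and without polling. */
#include <assert.h>

struct model_loop {
    int pending;     /* nonzero while an interrupt is waiting to run */
    int raised_at;   /* event index at which the interrupt was raised */
    int delay;       /* events processed before the handler ran */
};

/* Run the deferred handler, if any, and record how long it waited. */
static void model_run_pending(struct model_loop *l, int now)
{
    if (l->pending) {
        l->delay = now - l->raised_at;
        l->pending = 0;
    }
}

/*
 * Process nevents with a critical section held. One interrupt is raised
 * just before event irq_at. If poll_every > 0, pending interrupts are
 * given a chance to run after every poll_every events.
 */
static int model_process(int nevents, int irq_at, int poll_every)
{
    struct model_loop l = { 0, 0, 0 };

    assert(nevents > 0 && irq_at >= 0 && irq_at < nevents);
    assert(poll_every >= 0);
    /* critical section entered here */
    for (int i = 0; i < nevents; i++) {
        if (i == irq_at) {
            l.pending = 1;
            l.raised_at = i;
        }
        /* ... handle event i ... */
        if (poll_every > 0 && (i + 1) % poll_every == 0)
            model_run_pending(&l, i + 1);
    }
    /* critical section exited here; anything still pending runs now */
    model_run_pending(&l, nevents);
    return l.delay;
}
```

With 1000 events and an interrupt raised at event 0, the model reports
a delay of 1000 events without polling and 16 events when polling every
16 events, so the poll interval bounds the delay instead of the loop
length.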
:(B) Per-device mutex
:This is the model chosen by FreeBSD and Linux. Ignoring dead-locks,
:this is actually very simple to use too. Whenever the device-specific
:code is entered, the mutex is acquired, and it is released when the code is left.
:The down-side of this is two-fold. First of all it does require *two*
:bus-locked instructions, which is quite expensive especially under SMP.
:This holds true independent of whether the mutex is contested or not.
:The second big problem is that it can dramatically increase the interrupt
:latency. (Just like long-term critical sections). The results have been
:measured for the Linux and FreeBSD implementations and are the one reason
:for the preemption mess they have.
Yes, locked bus cycle instructions can potentially be very expensive.
But there are only a limited number of ways to get around it. In
the DragonFly model the only way we can get around a locked bus cycle
is to use cpu localization to turn the lock into a critical section.
This means that all operations have to execute on the same cpu.
When a device driver is interacting between its upper and lower layers
you have to remember that the upper layers can be called from any
process and thus any cpu. If that layer must interact with a lower
layer the only way to do it is to either use a locked bus cycle or
to use an IPI message to forward the operation to the same cpu that
the interrupt is bound to.
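
The IPI alternative can be sketched as a single-threaded userland
simulation. All names here (model_send_ipi, model_process_ipis,
model_dev_submit and friends) are hypothetical, and the single queue per
target cpu is an assumption made for brevity: on real hardware the queue
itself would need an inter-cpu-safe design, which this sketch does not
attempt.

```c
/* Single-threaded simulation of forwarding work to the owning cpu. */
#include <assert.h>
#include <stddef.h>

#define MODEL_NCPUS 4
#define MODEL_QLEN  8

typedef void (*model_ipi_fn)(void *arg);

struct model_ipi_msg {
    model_ipi_fn fn;
    void        *arg;
};

struct model_cpu {
    struct model_ipi_msg q[MODEL_QLEN];
    size_t head;   /* next message to run */
    size_t tail;   /* next free slot */
};

static struct model_cpu model_cpus[MODEL_NCPUS];

/* Queue fn(arg) for execution on cpu `target`; -1 if the queue is full. */
static int model_send_ipi(int target, model_ipi_fn fn, void *arg)
{
    struct model_cpu *cpu = &model_cpus[target];

    if (cpu->tail - cpu->head == MODEL_QLEN)
        return -1;
    cpu->q[cpu->tail % MODEL_QLEN] = (struct model_ipi_msg){ fn, arg };
    cpu->tail++;
    return 0;
}

/* Called "on" cpu `self`: run everything forwarded to it. */
static int model_process_ipis(int self)
{
    struct model_cpu *cpu = &model_cpus[self];
    int ran = 0;

    while (cpu->head != cpu->tail) {
        struct model_ipi_msg m = cpu->q[cpu->head % MODEL_QLEN];
        cpu->head++;
        m.fn(m.arg);
        ran++;
    }
    return ran;
}

struct model_dev {
    int owner_cpu;         /* cpu the device interrupt is bound to */
    int queued_requests;   /* only ever touched on owner_cpu */
};

static void model_dev_start(void *arg)
{
    ((struct model_dev *)arg)->queued_requests++;
}

/* Upper layer entry point; may be called from any cpu. */
static int model_dev_submit(struct model_dev *dev, int current_cpu)
{
    if (current_cpu == dev->owner_cpu) {
        model_dev_start(dev);   /* already local, no forwarding needed */
        return 0;
    }
    return model_send_ipi(dev->owner_cpu, model_dev_start, dev);
}
```

Because queued_requests is only modified on the owning cpu, the device
state itself needs nothing stronger than a critical section there. The
price is the latency of the forwarded message.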
:(C) Deferred interrupt processing via SPL masks
:This is the mutual exclusion mechanism traditionally used by the BSDs.
:It allows certain device classes to be serialised at once, e.g. to
:protect the network stack from interfering with the network drivers.
:Currently in use are splvm (anything but timer), splbio (for block devices)
:and splnet (for network drivers). The nice part of this approach is that
:it has a similar performance as critical sections on UP, but is finer
:grained.
:The down-side is the big complexity for managing the masks. It is also
:more coarse-grained than it often has to be.
The down side is that it doesn't work in an SMP environment unless
you are holding the Big Giant Lock.
:Conclusion: I'd like to have two basic mechanisms in the tree:
:(a) critical sections for *short* sections
:(b) *per-device* interrupt deferral
I like the idea of per-device interrupt deferral but you have to
realize that in an SMP environment this almost certainly requires
the use of locked bus cycle instructions.
In the case where the entity wishing to defer a particular device
interrupt is on a different cpu from the interrupt handler we have
two race conditions we have to deal with:
(1) cache coherency race condition with entity A attempting to mark
the interrupt for deferral simultaneously with the interrupt
handler beginning execution and marking the interrupt is being
(2) threading race where the interrupt handler is ALREADY RUNNING on
another cpu and the entity wishes to defer execution of the
handler. In this case the entity must spin or block until the
handler has finished running.
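
Both races can be illustrated with a userland sketch built on C11
atomics. The names (model_intr, model_handler_begin, model_handler_end,
model_defer, model_undefer) and the three state bits are hypothetical,
and the sketch assumes the handler is never entered twice concurrently.
Packing the state into one atomic word lets a single read-modify-write
decide race (1); the spin in model_defer covers race (2).

```c
/* Userland sketch of cross-cpu interrupt deferral (hypothetical names). */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_DISABLED 0x1u   /* some entity has deferred the handler */
#define MODEL_RUNNING  0x2u   /* the handler is executing right now */
#define MODEL_PENDING  0x4u   /* an interrupt arrived while deferred */

struct model_intr {
    atomic_uint state;
};

/*
 * Interrupt side. Atomically either claims RUNNING or, if the handler
 * is deferred, records PENDING. Returns true if the handler may run.
 */
static bool model_handler_begin(struct model_intr *it)
{
    unsigned old = atomic_load(&it->state);

    for (;;) {
        unsigned next = (old & MODEL_DISABLED) ? (old | MODEL_PENDING)
                                               : (old | MODEL_RUNNING);
        /* On failure `old` is reloaded and the decision is redone. */
        if (atomic_compare_exchange_weak(&it->state, &old, next))
            return (old & MODEL_DISABLED) == 0;
    }
}

static void model_handler_end(struct model_intr *it)
{
    atomic_fetch_and(&it->state, ~MODEL_RUNNING);
}

/* Deferring side: block new runs, then wait out a run in progress. */
static void model_defer(struct model_intr *it)
{
    atomic_fetch_or(&it->state, MODEL_DISABLED);
    while (atomic_load(&it->state) & MODEL_RUNNING)
        ;   /* spin until the handler on the other cpu finishes */
}

/* Returns true if an interrupt was missed and the handler should run. */
static bool model_undefer(struct model_intr *it)
{
    unsigned old = atomic_fetch_and(&it->state,
                                    ~(MODEL_DISABLED | MODEL_PENDING));
    return (old & MODEL_PENDING) != 0;
}
```

On x86 the fetch-or, fetch-and and compare-exchange operations compile
to locked instructions, so this sketch does not escape the locked bus
cycle; it only shows where it is spent.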
Frankly I don't see how we can possibly avoid the use of a
locked bus cycle instruction for either case. The only way to truly
avoid locked bus cycles is through cpu localization (aka using IPIs
to execute all related operations on the same cpu).
At the moment, per-device interrupt deferral is achieved through the
serialization API that I committed a few days ago, which the IF_EM
driver is now using.
The use of cpu localization is one of DragonFly's major goals. I
*WANT* to use a cpu localization mechanism whenever possible, and indeed
we are using such a mechanism in our networking code with incredible
results. But that doesn't mean that cpu localization works in every
case. FreeBSD is using mutexes for just about everything. We may
still need to use mutexes but when we do it won't be for everything, it
will only be for those things that cannot be efficiently implemented
with cpu localization. Interrupt interlocks could very well be one of
those things for which cpu localization is not the best solution.
In any case, these are all very complex issues. Even if we come up with
a clean solution, we can't go from step A to step Z in a single step.
We still have to take baby steps to achieve our goals. And that is
what we are doing right now by removing SPLs.