DragonFly kernel List (threaded) for 2005-12
Re: Software Transactional Memory?
:Someone used a "transactions system" that is found in modern databases to
:handle SMP. Quoting:
I did a quick browse of the paper. I believe the concept is quite valid.
In fact, I am a big fan of opportunistic locking (which they are using, or
at least partially using). What they are describing would not be useful
in all situations considering that there is still a lock involved. The
pure DragonFly model is to avoid having to use the lock in the first
place, but even so I recognize the fact that there are still planty of
cases where we don't have a choice.
I would argue that a pure opportunistic model might be an even better
solution then the one described in the paper. With a pure opportunistic
model the structures are basically just guarenteed not to go away from
the point of view of SMP access and a transaction involves creating a
new copy of the structure containing the changes. Synchronization
occurs at the very end of the transaction and either gets a thumbs up
or a thumbs down. If it succeeds, the new structures replace the old
by changing the originating pointer to the structure. The old structure
remains around only as long as is necessary to satisfy other accessors
(who will then fail when they try to commit their transaction).
Clearly this model is not useful in all circumstances. To be efficient
the structures involved have to be fairly small. So, for example,
the model would work for a route table entry or a VM page, but would
not work for TCP state.
Another model that works very well is a replication model, which is
what Jeff's route table code does (that is going in after the release).
With a replication model the data is physically replicated on each
cpu involved in accessing or modifying it, giving lock-free read-only
access. Modifications to the data are then chained through the cpu's
involved (that is, modifications are made locally in the context of
each cpu). A simple 9 nS critical section is all that is needed.
But the best model, in my view, is one that is inherently contention-free.
Our TCP implementation is a good example of this. Only one cpu manages
any particular connection so there is no contention and no locking
required at all. The trade-off is that one winds up doing more context
switches but this also has the effect of concentrating the bad news in
one place where it can be optimized well.
This is what the focus of my BayLisa talk was... getting actual hard
numbers from some of our algorithms:
What I found, basically, was that overheads for certain MP operations
were far lower then I originally thought they would be, and that it is
possible for algorithms to naturally aggregate operations into very
efficient batches to not only reduce or remove the context switching
overhead from the performance equation, but also to take maximum advantage
of the cpu's code and data caches.
If you look at the end of the slide show there are some numbers from
naturally aggregated work initiated by received packets (multiple packets
per interrupt) and flowing through to the TCP stack running lock-free
on another cpu.
Earlier in the slide show there are slides showing the positive cache
effects of work batching, such as when batching multiple cross-cpu
free() operations. There are actually two slides. The 'take 1' slide
had bad numbers for the target cpu due to KTR logging overhead. The
'take 2' slide has correct numbers (after I removed the extra arguments
from the KTR calls I was making).