DragonFly kernel List (threaded) for 2003-07
Re: token passing & IPIs
Thanks for the answers.
On Thu, Jul 17, 2003 at 12:46:59PM -0700, Matthew Dillon wrote:
> The key to using an IPI properly is to use it asynchronously whenever
> possible... that is, to not have to wait for a reply to your IPI message.
You will eat a lot of overhead from the interrupt processing
when receiving one on the target CPU. You can probably make it a fast path
(handler in pure assembler etc.), but it still will be somewhat costly.
Just take a look at the pseudo code of "INT" in the Intel manual.
It's several pages. Processing that just must be slow. And it even
includes several locked cycles on the bus (IDT/GDT accesses are normally
> Well, I disagree somewhat on locks not being avoidable. Consider the
> case where you have two entirely independent machines operating in a
> cluster. These machines obviously do not have to get low level locks
> to protect each from the other, mainly because each machine has its
> own entirely independent set of caches.
But they do not present a single image either.
If they share the same /, then every time a path name starts with /
you have to read-lock the root inode and the cache where it is stored
somehow. The same applies to other frequently accessed file objects.
Unix unfortunately has some shared data by design that cannot be easily split.
You could avoid the problem of a shared root by using Plan 9-like name spaces,
but that would be quite a radical design change and may not be very nice
for the user.
As long as you have a single name space you need some amount of
synchronization, because the name space itself is shared data.
> This implies that it is possible to implement caching algorithms which
> are far more aware of an MP environment and make better use of per-cpu
> caches (which do not require locks), possibly even to the point of
> duplicating meta-data in each cpu's cache.
You are probably referring to McVoy's cc/cluster concept here. But even
that requires some amount of synchronization, e.g. for the single
file name space.
> Take a route table for example. Traditionally you lock a route table up
> by obtaining a global mutex. But what if you duplicated the entire
> route table on each cpu in the system? If the route table is a bottleneck
> in an MP system and only eats, say, 1MB of ram, then eating 8MB of ram
> on an 8cpu system with 8G of ram is not a big deal, and you suddenly have
> an entirely new situation where now route table lookups can be performed
> without any locking whatsoever at the cost of making route table updates
> (which have to be sent to all cpus) more expensive. But if you are
Most modern TCPs allocate new routes on the fly to save path MTU discovery
data. You don't want to do that duplication for each new connection.
In theory you could save that information per CPU, but it may not
be a very good trade-off, because sending packets with the wrong MTU
can be very costly in performance.
Read-Copy-Update might be a better solution for routing tables than
duplication, but it is probably still too slow.
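
To make the trade-off in the quoted route table proposal concrete, here is a
minimal single-threaded sketch in C of per-CPU replication. It is not code
from any real kernel: the names (cpu_routes, route_lookup, route_update_post,
route_update_drain), the table sizes, and the mailbox that stands in for
queued IPI messages are all assumptions made for illustration. A real
implementation would need a proper per-CPU message queue and memory ordering.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU    4   /* hypothetical CPU count */
#define NROUTE  8   /* hypothetical table size */
#define NMSG    8   /* hypothetical mailbox depth */

struct route { uint32_t dst; uint32_t gateway; int used; };

/* One private replica and one update mailbox per CPU. */
struct cpu_routes {
    struct route table[NROUTE];
    struct route mbox[NMSG];   /* stand-in for queued IPI messages */
    int          nmsg;
};
static struct cpu_routes cpus[NCPU];

/* Lookup touches only the calling CPU's replica, so it takes no lock. */
int route_lookup(int cpu, uint32_t dst, uint32_t *gateway)
{
    assert(cpu >= 0 && cpu < NCPU);
    for (int i = 0; i < NROUTE; i++) {
        const struct route *r = &cpus[cpu].table[i];
        if (r->used && r->dst == dst) {
            *gateway = r->gateway;
            return 1;
        }
    }
    return 0;
}

/* Update cost grows with NCPU: one message is queued per CPU. */
int route_update_post(uint32_t dst, uint32_t gateway)
{
    for (int cpu = 0; cpu < NCPU; cpu++)
        if (cpus[cpu].nmsg == NMSG)
            return 0;   /* some mailbox is full; caller must retry */
    for (int cpu = 0; cpu < NCPU; cpu++) {
        struct cpu_routes *c = &cpus[cpu];
        c->mbox[c->nmsg++] = (struct route){ dst, gateway, 1 };
    }
    return 1;
}

/* Run by each CPU on its own replica, e.g. from its IPI handler. */
void route_update_drain(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    struct cpu_routes *c = &cpus[cpu];
    for (int m = 0; m < c->nmsg; m++) {
        int slot = -1;
        for (int i = 0; i < NROUTE; i++) {
            if (c->table[i].used && c->table[i].dst == c->mbox[m].dst) {
                slot = i;              /* overwrite the existing entry */
                break;
            }
            if (!c->table[i].used && slot < 0)
                slot = i;              /* remember the first free slot */
        }
        if (slot >= 0)
            c->table[slot] = c->mbox[m];
    }
    c->nmsg = 0;
}
```

Note that a CPU keeps seeing its old replica until it drains its mailbox.
That staleness, multiplied by one new route per connection for path MTU
data, is the cost argued against above.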
Another example would be the local ports hash table of your UDP and TCP.
You could try to split it per CPU, but it would limit the user
unduly (the 16-bit TCP/UDP port space is a very limited resource;
restricting each CPU to less than the full 16 bits would not be very nice).
The IPID is another common issue: normally it is a single global variable.
[I did implement a "cookie lair" for one stack where it allocates
the IDs in batches and saves them per CPU, but it has the nasty side effect
that it makes the 16bit IPID space deplete faster, making it more
likely that you get data corruption over fragmented NFS when the ipid
wraps at Gigabit speeds]
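
The batching scheme described in the bracketed note can be sketched roughly
as follows. This is only an illustration under assumptions, not the stack
mentioned above: the names (ipid_global, ipid_cache, ipid_alloc), the CPU
count, and the batch size are invented here, and the per-CPU slot is assumed
to be touched only by its owning CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCPU        4    /* hypothetical CPU count */
#define IPID_BATCH  16   /* hypothetical batch size */

/* Shared counter: touched once per batch, not once per packet. */
static _Atomic uint16_t ipid_global;

/* Per-CPU cache of reserved IDs; only the owning CPU touches its slot. */
struct ipid_cache {
    uint16_t next;   /* next ID to hand out */
    uint16_t left;   /* IDs remaining in the current batch */
};
static struct ipid_cache ipid_percpu[NCPU];

/* Return a fresh IP ID for a packet sent from `cpu`. */
uint16_t ipid_alloc(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    struct ipid_cache *c = &ipid_percpu[cpu];
    if (c->left == 0) {
        /* Reserve a whole batch with one atomic operation. */
        c->next = atomic_fetch_add(&ipid_global, IPID_BATCH);
        c->left = IPID_BATCH;
    }
    c->left--;
    return c->next++;
}
```

A CPU that sends little traffic sits on the unused remainder of its batch
while busy CPUs keep reserving new ones, so the global counter advances
faster than the number of packets actually sent. That is the faster
depletion of the 16-bit space mentioned in the note.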
Or a queue for a single listen socket. Often big boxes only listen
on one port (like port 80). It is updated often, so duplication would
be rather slow.
I fear even with radical system redesign (one nice example for that
is the K42 research OS) you will still have plenty of "hot"
data structures to manage and protect. And that requires efficient
> VM Objects and other infrastructure can be worked the same way, or worked
> in a hybrid fashion (e.g. only duplicate VM objects for threaded processes,
> for example).
Just hope that you don't have a multithreaded application that uses
all CPUs with a single address space then.
Another related issue would be swapping. Sometimes it's a good idea
to have an asynchronous swapper on another CPU clean up/age your
address space while you're working in it (it already costs IPIs
for the remote TLB flushes, but that's a different issue). For that
you also need a reasonably cheap way to access the data structures
of the object that may also be in use on the other CPU.
> Well it isn't written yet, but yes, it will just do something like an
> INT 0x80. You can't really fast-path user<->kernel transitions, you
SYSENTER would be much faster on modern x86.