DragonFly BSD
DragonFly kernel List (threaded) for 2005-02
[Date Prev][Date Next]  [Thread Prev][Thread Next]  [Date Index][Thread Index]

User/Kernel MPI [LONG, Newbie, Questions]


From: "Nathaniel W. Filardo" <nwf@xxxxxxxxxxxxxx>
Date: Mon, 7 Feb 2005 05:11:20 -0500 (EST)

Hello all.

I'm writing mostly because I'm curious about the topic; I'm probably not sufficiently an ubercoder to actually write the implementation but I'd still like to understand the design and implementation. If I'm jumping the gun and wondering about something that's not yet on the radar, let me know. Also, apologies that this email is damn long.

First off, is there any documentation on the messaging and threading model beyond those on www.dragonflybsd.org/goals ?

Anyway, my friend and I spent a lot of time thinking about the proposed threading/messaging model and developed it more fully than the docs we'd read (he's looking for a project for a class; seems to have moved on to L4/HURD but I'm still curious about DFly). So really, can somebody tell me if the below is correct or at least on the right track?

=== Scheduler changes

First off, we introduce a userspace thread scheduler to replace the current, entirely in-kernel userland scheduler (similar in concept to NetBSD's scheduler activation sa_* API). This exposes to the kernel an entrypoint which may be called with arguments (such as which signals are pending, possibly with messages pending - if scatter/gather queues are used the head of each new chain must be passed in; if the buffer is static, it can be created once, registered with the kernel, and never passed back from the kernel - this is really an implementation detail, though I think the former is preferable as an initial opinion). Details of when this is called are entangled below.

Secondly, we have to introduce some kind of shared state between the kernel and userspace. If somebody has a way of avoiding this, do let me know [the NetBSD people seem to use lots of syscalls which seems worse, to me]. Firstly, the userland scheduler must inform the kernel of how many threads are currently runnable (RUNNABLE_COUNTER). This implies that the userland scheduler needs to provide, as one would expect, thread_yield() and similar functions which manage this counter in a MP-safe way. The kernel merely reads this number and the worst case of misreading is a delayed schedule() or a spurious schedule(). If the kernel additionally knows the locking mechanism employed it can delay reading until the lock is released. Note that while this number is redundant against the data structures below, its presence allows the kernel not to keep persistent pointers to thread data structures.

In addition to the userland scheduler's per thread data structures, there must be a per-thread shared data structure as well - primarily it must contain the thread's runnable state and its context (stored registers). We introduced several values for this thread state to take on: ACTIVE, RUNNABLE_YIELDED, RUNNABLE_PREEMPTED, BLOCKED_USERLAND, and BLOCKED_KERNELSPACE. Note well that despite the presence of these data structures, they are *NOT* persistently pointed to by kernelspace entities (excepting for BLOCKING_KERNELSPACE threads - see below), which means that userland is free to create or destroy threads at its whim without notifying the kernel (though failure to update the RUNNABLE_COUNTER may have adverse effects on system performance - both cases [spurious schedule() and missed schedule()] are probably traceable in the kernel and the process can be reprimanded appropriately).

+ ACTIVE threads are those that are currently running. The userland scheduler NEVER jumps into an active thread; doing so would lead to undefined results, as one expects. The kernel will only see threads of this type, excepting when it has to annotate one to BLOCKED_KERNELSPACE (see below).

+ RUNNABLE_YIELDED threads are those that voluntarily gave up their timeslice via yield(), while RUNNABLE_PREEMPTED are those that the kernel preempted at the end of their timeslice. The distinction is drawn merely to enable different thread scheduling algorithms, in some sense. Note that _YIELDED state is assertable by userland in thread_yield() (which must also save %eip and the other registers...) but that the kernel must WRITE TO the shared data-structure to transition the thread to _PREEMPTED (in addition to updating the state the kernel must write its register dump there). The kernel will never see a thread in this state [though it will put one in _PREEMPTED... see below]. Preemptable threads requires kernel-side support as far as we can tell, but cooperative multitasking could do away with that little bit of instrumentation.

+ BLOCKED_USERLAND is the result of a mutex_lock() or msg_waitfor() on a purely userland construct. In all probability the userland-only per-thread data-structure will contain additional data for the thread scheduler's decision process. These processes will never be running, so the kernel will never see this state.

+ BLOCKED_KERNELSPACE is a very unfortunate state. It means that the thread is in the process of page faulting or similar but that the kernel was unable to service the request immediately. In the event of a fast-fault (e.g., mmap'd region with the backing data in page cache, some syscalls), this state is avoided as the ACTIVE state prevents the scheduler from running the thread and we'll soon be able to return (detrampoline) to it. However, in slow-faults (e.g., swapstorm and we just asked for a page way off on disk, most syscalls), to avoid truly blocking, the essentially synchronous message that was the page fault should be downgraded to an ASYNC message and the kernel's userland scheduler reentered. The exact details of this SYNC->ASYNC conversion are not exactly clear but the end result must be that a (set of) LWKT(s) is spawned to wait on the result with ultimate goal of reversing the thread's annotation to RUNNABLE_PREEMPTED state [in order to facilitate this, the kernel obviously needs to keep a pointer out to userland. It will produce undefined results (probably crash) to delete the data structure while the kernel is in this state because then it will blindly annotate something else. This (and RUNNABLE_PREEMPTED, really) smells like a rich area for bugs and exploits... if somebody has something cleaner, I'm all ears, of course.]

The reason that BLOCKED_KERNELSPACE must be introduced is to decouple the kernelspace from userland so that the kernel does not have to have contexts stored per thread, merely per blocking thread, as well as so the userland thread scheduler can, on the next reentry, note that a previously active thread has been blocked an can reduce the RUNNABLE_COUNTER appropriately. [This update cannot be done from kernel due to MP-safeness].

So the biggest change, all in all, is that the kernel's userland scheduler no longer resumes a context - it merely jumps in to the process' thread scheduler and leaves the decision of which thread runs to that process. This allows for different threading models to match workloads rather than trying to tune the global scheduler to deal with it.

=== Message Passing

Now, given all of the above we can start using message passing system calls in a real way. It's possible to use the above without MPI syscalls as well as to use MPI syscalls without the above, but neither of those seem to be desirable.

In the MPI syscall model, instead of faulting, we will call the targetPort->send_msg() function, which may fault or SYSENTER or as appropriate in the SYNC case, but we won't see that code or any of the instrumentation (debugging, IPC, userland drivers / VFS, etc) that the MPI allows - it's been nicely encapsulated away from us. [Incidentally, the MPI can be used safely for IPC as long as the ports are used essentially as file descriptors and the ->send_msg() functions are provided by a trusted library, but that's another discussion.]

If we send an ASYNC message, our behavior is quite clear. We go on our business and at some point inform the userland thread scheduler that we're willing to be blocked on the arrival of a message for us on targetPort (timeout optional).

In the event of a SYNC message, there are several cases:
+ The message is handled in full, synchronously, right now.
+ The message is immediately converted to ASYNC and somebody else blocks for us [immediate return in this case would be a bug].
+ The message is synchronously trampolined into the kernel and later would block. In this case the kernel must downgrade the remainder of the message to ASYNC, spawn off a context (or several, again, depending on details of implementation) and sleep the context on the reception of an answer to the ASYNC message in flight. Here the thread is annotated by the detrampoline code to BLOCKING_KERNELSPACE and the kernel userland scheduler is invoked.


Example of the latter case (possibly incorrect in terminology / details, but should give the idea). This case is overly long but is robust against subsystems taking locks, which merely swinging the entire stack to the side (spawning LWKT at the innermost layer only) would not be:
user calls read() in a way that triggers SYNC messages
read() generates a SYNC message and sends it off
->send_msg() trampolines into the kernel synchronously
the message is routed to the VFS and parsed
the VFS calls (C style) into the filesystem
the FS calls (C style) into the block device layer
the block device layer realized that is has to block. At this point
there is a kernel context (stack) associated with everything up to
this point. Here the message will be downgraded to ASYNC and
dispatched to the appropriate block device after the block layer
creates or reserves a message port. The block layer spawns a thread
to wait on the message it sent.
the block device layer returns a handle to that port back to the FS
along with an error indicating that it had to block.
the FS spawns a thread to wait on this port as well as generate a port
and error return code to the VFS.
the VFS does the same, returning to
the detrampoline code recognizes the error and spawns off a thread to
wait for reception of the message from the VFS to enqueue the result
for the process' userland thread scheduler. It annotates the
currently running thread to BLOCKING_KERNELSPACE. It then jumps over
to the kernel process scheduler.


At the end, the kernel stack is fully unwound, having been transferred to LWKTs waiting for an answer. This is probably not ideal, but it works and is robust against locking of any type inside the subsystems; I'm sure somebody has a better idea.

Why SYNC messages at all? The short answer is performance. If every inter-subsystem call is translated into an asynchronous message, the results can be disastrous for performance, as the kernel can end up spending more cycles in message queue management than in actually doing the task requested.

=== Implementation details & misc.

+ The mechanisms of segmentation that can be used to give rise to thread local storage can be used for all of this with the exception of RUNNABLE_COUNTER. By merely reserving a chunk of memory at the top of the TLS area (alternatively, the thread's stack and use a segment for that), the kernel can always know the current thread's data structure location (It'd be in the segment register) without having to have any notion of thread IDs or similar. This may slightly impact the ABI but I do not believe it must as TLS already exists. This may also have impacts against such VMMs as Xen.

+ RUNNABLE_COUNTER can instead be implemented as a syscall-updated in-kernel quantity (rather than shared state in the memory area). This has the advantage that the kernel can decrement it on transition to BLOCKED_KERNELSPACE as well as always ensure its validity, but the disadvantage is that the userland must trampoline up for every thread spawned or killed, which may be bad for performance of certain, thread-heavy workloads (MozartOZ programs leap to mind, ignoring that Mozart currently does its own threading - in an ideal world it could spawn real threads for it).

+ ps & friends now have no process-agnostic way of showing threads if RUNNABLE_COUNTER is managed in userland. Every process must be asked for how many threads it has. Perhaps this is another advantage of it being a syscall. Note that it does not otherwise impact viewing process information.

+ Register store and restore must be done in userspace on occasion. I assume this isn't much of a problem but I am unfamiliar with the exact details of all the permission bits present on x86.

OK, well, that's enough for now. I eagerly await responses.
--nwf;



[Date Prev][Date Next]  [Thread Prev][Thread Next]  [Date Index][Thread Index]