DragonFly kernel List (threaded) for 2006-02
Re: namecache coherency, 2nd turn
On 2006-02-08, Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx> wrote:
> The problem is not what happens after you lock the group, the problem
> is the locking operation itself.
> Lets say you have A->B(L).
> Process #2 locks B. The lock succeeds.
> Process #1 tries to lock A, and blocks on the lock residing in B.
> Process #2 detaches A from B, then destroys B. From its point of view,
> it has clear access to the entire group when, in fact, it doesn't.
The point in the "stable groups" approach is that the creator of the
A->B link should be able to count on others respect of this linkage.
That is, process #2 simply should not do such things.
By this POV, the shadow list (of shadow kids, ie. overlays) is private
data and API users shouldn't tinker with it. Or at least read-only
Or... Technically, we could go as far as readjusting/deleting shadow
links, just changing lock related info or the root of the group is what
should be avoided. You can here point out the asymmetry regarding the
group root. Yet my point is that group operation rights shouldn't be
pushed as far as this asymmetry shows up (even if it's technically
cache_setunresolved() is a nice medium to send notifications upwards.
Just a notification, not intervention. Doesn't it scale good enough?
> Process #1 blows up because B(L) no longer exists.
If the lower layer dropped off its overlays, that would be problematic
for some reasons (I describe these below, see * and **). But I don't think
corruption of locking related data would be among these (see next
> This case cannot occur pre-patch because any given namecache record
> will be ref'd (nc_refs) and this prevents the namecache record from
> being destroyed while a contended lock is pending on it.
> But post-patch B(L)'s only refs will be from process #2 and from the
> shadow association. B(L) will not have a ref from process #1 as far
> as I can tell. Both refs go away when process #2 detaches A
> from B and then unlocks, sees that the ref count on B is zero, and
> then destroys B. Process #1 then breaks.
Per patch, as part of the detachment routine, A's lock is migrated back
to A. A's lock contender re-reads A->nc_lockby in each turn, and it will
always refer to a coherent entity, be it A itself or something else.
In fact, I just don't see how could it work out differently. As we
agreed, group membership means lock sharing, first and foremost. If we
expel A from the group, then after that A should have nothing to do with
the group, in particular, it shouldn't get locked via it.
If you feel that a "half-expelled" status is desirable, please provide
a proper group semantics for it and some motivation.
> Embedding the lock in B in this case greatly reduces our coding
> flexibility during splits and merges. The code needs to be more
> flexible. For example, having a separately allocated shadow
> information structure allows us to implement a symmetric algorithm
> with respect to a process attempting to lock A and a process attempting
> to lock B. They would bump a ref count in the shared shadow info
> structure and then simply check that the association still exists
> after the lock has been successfully obtained.
This check would happen either within the namecache code or from the
- Having to check the association from the user side is tedious and error
It's a general argument against unstable groups and intervening
overlays, too: it wouldn't make me happy if the nullfs vop routines
should start with checking the shadow association, and if it's not
found, then either bail out or call internally for the resolver, etc...
- ... and I don't see how would we check the association within the
backend code: what expectations would we have, how to react if
the lock routine finds the expected link missing when the control
> embedded in B, aka B(L), then A would have to bump B's ref count
> prior to attempting to obtain the lock, to prevent B from going away.
> This is highly assymetric and could introduce hard to find bugs.
"Lock is at B" is equivalent with "B is ref'd". These are moved off from
B in one atomic step.
Would de-Gianting break this? If yes: uncureably?
> I understand what you are trying to do by defering tree cleanups until
> re-resolution, but I think its a mistake. It's always a bad idea to
> allow inconsistent data topologies. Implementation details can come
> out and bite you and tracking down the bugs becomes almost impossible
> because the code that caused the bug might have run seconds or minutes
> before the code that, during re-resolution, hits the bug.
Who is in the position of deeming a data topology "inconsistent"? Is it
an intrinsic property of the group?
A possible answer is: no; the one who can cry "inconsistent" is the one
who made and/or makes use of it.
> Algorithmic simplicity is a lot more important then avoiding malloc()'s.
> It's too easy to program yourself in a corner.
The need of making ad hoc choices is a sign of looking for solutions in
the wrong problem domain. And in particular, they don't contribute to
And I do have questions for which I don't have unambigous answers when
it comes to unstable groups. Say we have the following chain:
A -> B -> C -> D -> E
An fs syscall handler bumps into the chain at C. It wraps down to E,
which in turn gets passed off to cache_setunresolved(). As we are into
unstable groups and immediate cleanups, the group is blown up.
Q1. Should the group be totally disintegrated, should it be just the
part over E, or should it be just the single D -> E link which gets
cut? Why this or that way?
Q2. After the blowup, how to set the lock state of the certain