DragonFly BSD
DragonFly kernel List (threaded) for 2004-04

Re: pipe testing and kernel copyin/copyout/bcopy performance


From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Thu, 29 Apr 2004 03:40:43 -0700 (PDT)

:Matthew Dillon wrote:
:
:>     Just to let people know, in case anyone is wondering why I have been so
:>     quiet lately :-)
:> 
:>     I've been running some major pipe benchmarks to compare various pipe 
:>     optimizations as part of a paper (FreeBSD's) Alan Cox and I are writing.
:> 
:>     At the same time I've delved deeply into the AMD64 and have been working
:>     on optimizing the kernel bcopy, memcpy, copyin, and copyout to use
:>     XMM instructions when possible.
:
:	Didn't you mention to me something about FPU context switch
:	overhead?  Secondly, wouldn't the XMM based copyin, bcopy etc
:	make small transfers slow?
:
:		-Hiten
:		hmp@xxxxxxxxxxxxx

    Yah, the code has a check for small copies and just runs an integer
    loop.  XMM/MMX is only beneficial for larger buffers.
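
    A minimal userland sketch of that size check, not the actual kernel
    code (which is assembly).  The names copy_small, copy_wide and
    sketch_copy and the WIDE_COPY_MIN cutoff are made up for illustration,
    and SSE2 intrinsics stand in for the hand-written XMM loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical cutoff: the break-even is only known to be around 2-4K. */
#define WIDE_COPY_MIN 2048

/* Small path: plain integer loop, one machine word at a time. */
static void copy_small(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len >= sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, s, sizeof(w));   /* alignment-safe word load */
        memcpy(d, &w, sizeof(w));   /* alignment-safe word store */
        d += sizeof(w);
        s += sizeof(w);
        len -= sizeof(w);
    }
    while (len-- > 0)               /* trailing bytes */
        *d++ = *s++;
}

/* Large path: 16-byte XMM moves where SSE2 is available. */
static void copy_wide(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        len -= 16;
    }
#endif
    copy_small(d, s, len);          /* tail, or everything without SSE2 */
}

/* Dispatch on size.  Returns 1 if the wide path was taken, else 0. */
static int sketch_copy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len < WIDE_COPY_MIN) {
        copy_small(dst, src, len);
        return 0;
    }
    copy_wide(dst, src, len);
    return 1;
}
```

    In userland the wide path costs nothing extra, so this only shows the
    shape of the dispatch.  In the kernel the wide path also has to pay
    for getting at the FP registers, which is why the cutoff matters.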

    That said, I came up with a neat solution that allows the kernel to
    avoid the fxsave/fxrstor.  Since the kernel is likely to make multiple
    copyout() calls to break down larger buffers, and (when dealing with
    larger buffers) the userland code is not likely to execute any FP ops
    in the core of the read/write loop, I have the kernel's optimized 
    bcopy/copyin/copyout code save off the FP state from userland and *not*
    attempt to restore it.  That is, userland will take a fault to restore
    its fpstate in that particular situation.  This means that multiple
    entries into the kernel can be made and/or the kernel can make multiple
    bcopy/copyin/copyout calls (w/ buffers > 2K) and use the FP registers
    at the cost of only a single fxsave.
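
    The bookkeeping can be modeled in a few lines of plain C.  This is
    only a simulation under assumed names (struct fp_thread,
    kernel_fp_copy, user_fp_op); it counts where an fxsave or fxrstor
    would happen and touches no real FP state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread bookkeeping for the simulation. */
struct fp_thread {
    bool user_fp_live;  /* userland FP state is currently in the registers */
    int  saves;         /* number of simulated fxsave operations */
    int  restores;      /* number of simulated fxrstor operations */
};

/* A large kernel copy that wants to use the FP registers. */
static void kernel_fp_copy(struct fp_thread *td)
{
    assert(td != NULL);
    if (td->user_fp_live) {
        td->saves++;                /* save userland state once */
        td->user_fp_live = false;   /* registers now belong to the kernel */
    }
    /* ... the copy itself would run here using the FP registers ... */
    /* Deliberately no restore on the way out. */
}

/* Userland executes an FP instruction. */
static void user_fp_op(struct fp_thread *td)
{
    assert(td != NULL);
    if (!td->user_fp_live) {
        td->restores++;             /* fault handler reloads saved state */
        td->user_fp_live = true;
    }
    /* ... the FP instruction then proceeds normally ... */
}
```

    Starting from a thread with live FP state, three kernel_fp_copy()
    calls in a row count one save and no restores.  The restore is only
    counted when user_fp_op() runs, and a later copy pays for one more
    save.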

    I should be able to commit that tomorrow.  I basically rewrote nearly
    all of i386/i386/support.s, and broke out the zeroing and copying
    routines into their own .s files.

    This allows the kernel to use the FP registers with basically only 
    an 'fninit' call, and it would even be possible to avoid that with some
    additional logic.  Unfortunately, there are a lot of other overheads
    involved that, while small, do add up.  The minimum buffer size 
    where kernel use of FP registers begins to make sense is around 
    2-4K.

					-Matt
					Matthew Dillon 
					<dillon@xxxxxxxxxxxxx>