DragonFly BSD
DragonFly kernel List (threaded) for 2010-02

kernel work week of 3-Feb-2010 HEADS UP - TESTING & WARNINGS


From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Feb 2010 17:14:00 -0800 (PST)

    WARNING WARNING WARNING

    This warning concerns the work that has been, and continues to be,
    committed to the development branch with regard to the vm.swapcache
    sysctls.

    These features are HIGHLY EXPERIMENTAL.  If you turn on the swapcache
    by setting vm.swapcache.{read,meta,data}_enable to 1 (read_enable being
    the most dangerous since that actually turns on the intercept), you
    risk losing EVERY SINGLE FILESYSTEM mounted RW to corruption.

    I want to be very clear here.  The swap cache overrides vn_strategy()
    reads from the filesystem and gets the data from the swap cache instead.
    It will do this for regular file data AND (if enabled) for meta-data.
    Needless to say, if it gets it wrong and the filesystem then modifies
    and writes back some of that meta-data, the filesystem will blow up and
    the media will be corrupted.  It has been tested for, well, less than a
    day now.  Anyone using these options needs to be careful for the next
    few weeks.

    If you do not enable any of these sysctls you should be safe.

    People who wish to test the swap cache should do so on machines they
    are willing to lose the ENTIRE machine's storage to corruption.  The
    swap cache operates system-wide.  Any direct-storage filesystem (aka
    UFS or HAMMER) is vulnerable.  NFS is safer but data corruption is
    still possible.

	vm.swapcache.read_enable	Very dangerous, enables intercept.
					Moderately dangerous if meta_enable
					is turned off.

	vm.swapcache.meta_enable	Very dangerous.

	vm.swapcache.data_enable	Moderately dangerous.
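
    For reference, turning the whole thing on on a throw-away test box
    would look something like this (sysctl names as listed above; only do
    this on a machine whose storage you can afford to lose):

	# HIGHLY EXPERIMENTAL - only on a box you can afford to corrupt
	sysctl vm.swapcache.meta_enable=1
	sysctl vm.swapcache.data_enable=1
	sysctl vm.swapcache.read_enable=1	# this one turns on the intercept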

    If you want to play with this but do not have a dedicated machine with
    a dedicated HD and SSD to test on, VMs and vkernels are recommended.

    But, of course, these features are designed for SSD swap and you will
    not be able to do any comparative tests unless you have SSD swap and
    a normal HD (or NFS) for your main filesystem(s).  And, as it currently
    stands, using a 15K rpm HD for swap hasn't been worked in yet; writes
    are currently very fragmented.  So it is SSD swap or nothing, basically.

    You are not going to see any improvement unless you actually have a SSD.

    WARNING WARNING WARNING

					---

    Ok, that said, there is still a ton of work I have to do.  I am not
    doing any write-clustering yet and I am not doing any proactive disposal
    of stale swap cache data to make room for new data yet.  Vnode recycling
    can cause swap cache data to be thrown away early, as well.  I expect
    to add improvements in the next week and a half or so.  So keep that in
    mind.

    Also note that the write rate is limited... in particular, the initial
    write rate is limited by vm.swapcache.maxlaunder, so do not expect the
    data you are reading from your HD at 60MB+/sec to all get cached to SSD
    swap on the first pass.  The current algorithms are very primitive.

    With all these caveats, the basic functionality is now operational.
    The commit message details how the sysctls work:

    http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/c504e38ecd4536447026bf29e61a391cd2340ec3

    You have fine control over write bursting and the long-term average
    write bandwidth via sysctl.  The defaults are fairly generous.
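
    For example, the kind of adjustment made partway through the test run
    below looks roughly like this (values taken straight from the test
    notes; adjust to taste):

	# open up the burst limits and raise the launder rate
	sysctl vm.swapcache.curburst=10000000000
	sysctl vm.swapcache.maxburst=10000000000
	sysctl vm.swapcache.maxlaunder=1024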

    Currently since there is no proactive recycling of ultra stale swap
    (short of vnode recycling), the only testing that can really be done
    is with data sets smaller than 2/3 swap space and larger than main
    memory.
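
    (On the 16G-swap / 3G-ram test box used below, for example, that works
     out to data sets roughly between 3G and 10-11G; the 6.6G test file
     falls in that window.)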

    Default maximum swap on i386 is 32G.
    Default maximum swap on x86_64 is 512G.

    The default maximum swap on i386 can be changed with the kern.maxswzone
    loader tunable.  This is a KVM allocation, at roughly one megabyte of
    KVM per gigabyte of configurable swap.  So e.g. kern.maxswzone=64m would
    allow you to configure up to ~64G of swap.  The problem on i386 is the
    limited KVM.
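
    That would go in /boot/loader.conf, e.g. something along the lines of:

	# ~64MB swzone KVM allocation allows roughly 64G of swap on i386
	kern.maxswzone="64m"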

    On x86_64 you can configure up to 512G of swap by default.

					---

    Sample test:

    * md5 6.6G test file on machine w/ 3G of ram, in a loop.  This is on
      my test box, AHCI driver, Intel 40G SSD (SATA INTEL SSDSA2M040 2CV1).
      16G of swap configured.

	-rw-r--r--  1 root  wheel  6605504512 Feb  4 15:42 /usr/obj/test4

	MD5 (test4) = aed3d9e3e1fe34620f40e4f9cb0dbcda
	15.344u 5.272s 2:19.28 14.7%    83+93k 8+0io 4pf+0w
	15.194u 5.788s 2:05.37 16.7%    79+88k 6+0io 2pf+0w
		(1G initial swap burst exhausted)
		(write rate now limited to 1MB/s)
	15.459u 5.861s 2:04.82 17.0%    76+85k 6+0io 2pf+0w
	15.318u 6.194s 2:03.70 17.3%    82+92k 6+0io 6pf+0w
	15.286u 5.960s 2:01.09 17.5%    95+106k 4+0io 2pf+0w
	15.321u 6.179s 1:59.48 17.9%    80+90k 4+0io 4pf+0w
	15.391u 5.687s 1:58.71 17.7%    81+91k 6+0io 4pf+0w
		(set curburst and maxburst to 10G (10000000000))
		(write rate limited by vm.swapcache.maxlaunder, set to 1024)
		(write rate to SSD is approximately 4-8MB/sec)
	15.181u 6.437s 1:53.42 19.0%    82+92k 6+0io 2pf+0w
	15.276u 5.891s 1:42.72 20.5%    82+92k 6+0io 2pf+0w
	15.581u 5.774s 1:31.11 23.4%    81+91k 4+0io 0pf+0w
	15.643u 6.062s 1:27.76 24.7%    81+90k 4+0io 0pf+0w
		(SSD now doing about 50-100MB/sec, mostly reading)
		(HD now doing about 15-30MB/sec reading)
		(5G now cached in the SSD)
	14.910u 5.477s 1:15.48 27.0%    86+97k 6+0io 6pf+0w
	15.182u 5.633s 1:21.64 25.4%    82+92k 4+0io 0pf+0w (glitch)
	14.762u 5.712s 1:12.13 28.3%    87+97k 6+0io 2pf+0w
	14.932u 5.804s 1:16.70 27.0%    84+94k 4+0io 0pf+0w
		(HD activity now sporadic, but has bursts of 20-30MB
		 occasionally)
	15.183u 5.625s 1:09.28 30.0%    85+95k 6+0io 4pf+0w
	15.245u 5.648s 1:12.79 28.6%    83+93k 4+0io 0pf+0w
	15.332u 5.852s 1:08.02 31.1%    80+90k 4+0io 0pf+0w
	15.505u 5.712s 0:59.95 35.3%    85+96k 6+0io 4pf+0w
		(HD activity mostly 0, but still has some activity)
	15.521u 5.485s 0:59.20 35.4%    81+91k 4+0io 2pf+0w
	15.381u 5.334s 0:54.01 38.3%    84+94k 6+0io 2pf+0w
	16.022u 5.455s 0:50.13 42.8%    78+88k 4+0io 0pf+0w
	15.702u 5.345s 0:50.16 41.9%    77+86k 6+0io 2pf+0w
		(HD activity now mostly 0, SSD reading 120-140MB/sec,
		 no SSD writing)
	15.850u 5.243s 0:50.12 42.0%    82+92k 6+0io 2pf+0w
	15.397u 5.337s 0:50.21 41.2%    82+92k 4+0io 0pf+0w

	That appears to be steady state.  The SSD is doing around
	3000 tps, 90-100% busy, 120-140MB/sec reading continuously.

	Average data rate 6.6G over 50 seconds = 132MB/sec.

	test28:/archive# pstat -s
	Device          1K-blocks     Used    Avail Capacity  Type
	/dev/da1s1b      16777088  6546900 10230188    39%    Interleaved

    So that essentially proves that it's doing something real.
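
    For anyone who wants to run the same kind of comparison, the loop
    amounts to something like this (a sketch only; the test file is the
    one from the listing above):

	# re-md5 a file larger than ram, over and over; each pass
	# corresponds to one timing line above
	while :; do
		time md5 /usr/obj/test4
	done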

    Next up I am going to work on some write clustering and ultra-stale
    data recycling.

						-Matt



