DragonFly users List (threaded) for 2008-01

Re: cvsup


From: Garance A Drosihn <drosih@xxxxxxx>
Date: Fri, 18 Jan 2008 15:33:57 -0500

At 9:16 AM +0000 1/18/08, Vincent Stemen wrote:

> I realize that everything I read comparing cvsup to rsync indicates
> that cvsup is faster with mirroring cvs repositories.  So I decided
> to run my own tests this evening.  I thought everybody might be
> interested in the results.
>
> My results are not even close to what others are claiming.  Rsync was
> vastly faster.  Granted, so far as I know, this was not right after
> a large number of files have been tagged, but as you mentioned, that
> does not happen very often.  If anybody wants to email me after that
> does happen, I will try to make time to re-run the tests.

This is a very inadequate benchmark. Certainly rsync works very well, and the dragonfly repositories have enough capacity that they can handle whatever the load is. So, I realize that it is perfectly fine to use rsync if that's what works for you. And I realize that there is the (unfortunate) headache due to needing modula-3 when it comes to cvsup. So, I'm not saying anyone has to use cvsup, and I am sure that rsync will get the job done. I'm just saying that this specific benchmark is so limited that it is meaningless.

What was the load on the server?  How well does rsync scale when there
are thousands of people updating at the same time?  (in particular, how
well does the *server* handle that?).

How big of an update-interval were you testing with?  If I'm reading
your message right, the largest interval you tested was 2-days-worth
of updates.  For most larger open-source projects, many end-users are
going at least a week between syncs, and many of my friends update
their copy of the freebsd repository once every three weeks.  Some
update their copy only two or three times a year, or after some
significant security update is announced.  Note that this means the
server sees a huge spike right after security updates, because there
are connections from people who haven't sync'ed in months, and who
probably would not have sync'ed for a few more months if it wasn't
for the security update.

Tags occur rarely, but they do occur.  And in the case of dragonfly,
there are also the sliding tags that Matt likes to use.  So while he
doesn't create a new tag very often, he does move the tag in a group
of files.  (Admittedly, I have no clue as to how well cvsup does
with a moved tag, but it would be worthwhile to know when it comes
to benchmarking rsync-vs-cvsup for dragonfly.  It is quite possible
that cvsup will actually get confused by a moved-tag, and thus not
be able to optimize the transfer of those files.)

The shorter the update-interval, the less likely that all the
CVS-specific optimizing code in cvsup will do any good.  Note, for
instance:

For a 1.5 hour old repository:
    rsync total time: 34.846
    cvsup total time: 3:40.77
    =========================================
    cvsup took 6.33 times as long as rsync

For a 2 day old repository:
    rsync total time: 2:03.07
    cvsup total time: 9:14.73
    =========================================
    cvsup took 4.5 times as long as rsync

Even with just two data points, we see that the larger the window,
the less well rsync does compared to cvsup.
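
To make that comparison easy to redo as more data points come in, here
is a minimal sketch in Python that converts the timings quoted above
into seconds and computes the cvsup/rsync ratio for each window.  The
function names and the layout of the `timings` table are my own
assumptions for illustration; only the four timing values come from
the quoted results.

```python
def to_seconds(stamp):
    """Convert a 'M:SS.ss' or 'SS.ss' wall-clock string into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def slowdown(rsync_time, cvsup_time):
    """Return how many times longer cvsup took than rsync."""
    return to_seconds(cvsup_time) / to_seconds(rsync_time)


# (rsync total time, cvsup total time), as quoted in the message above.
timings = {
    "1.5 hours": ("34.846", "3:40.77"),
    "2 days": ("2:03.07", "9:14.73"),
}

if __name__ == "__main__":
    for window, (rsync_time, cvsup_time) in timings.items():
        ratio = slowdown(rsync_time, cvsup_time)
        print(f"{window:>10}: cvsup took {ratio:.2f} times as long as rsync")
```

Adding a row for a 1-week or 3-week window would show whether the
ratio keeps shrinking, which two points alone cannot establish.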

In that 1.5 hour old repository, how many files were changed?  10?
100?  If there are only 100 files to do *anything* with, then there
isn't much for cvsup to optimize on.  It's pretty likely that rsync
is going to be faster than cvsup at "sync"ing a file which has zero
changes which need to be sync'ed.

If you have users who are regularly sync-ing their repository
every 1.5 hours, 24 hours a day, 7 days a week, then there are some
cvsup servers which would block that user's IP address for being such
an annoying pest.  The only people who need to sync *that* often are
people who themselves are running mirrors of the repository.  For all
other users, syncing that often is an undesirable and unwanted load
on the server.  The people running a sync-server wouldn't want to
optimize for behavior patterns which they don't want to see in the
first place.

I would say the *smallest* window that you should bother testing is
a six-hour window (which would be four updates per day), and that
the most interesting window to test would be a 1-week window.
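
As a rough illustration of what a slightly less limited test might
look like, the sketch below times an arbitrary command several times
and summarizes the runs.  It is only a harness under stated
assumptions: the window sizes are the ones suggested above plus a
one-day window I added, the helper names are hypothetical, and the
command being timed is a placeholder.  Preparing a local copy that is
actually six hours or one week stale, and supplying the real rsync or
cvsup command line, is left to whoever runs it.

```python
import statistics
import subprocess
import sys
import time

# Update windows worth testing, in hours (6 hours, 1 day, 1 week).
WINDOWS_HOURS = [6, 24, 168]


def time_command(argv, runs=3):
    """Run argv `runs` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to median, fastest, and slowest run."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Placeholder command: substitute the real sync command, and reset the
    # local copy to the desired staleness before every single run.
    placeholder = [sys.executable, "-c", "pass"]
    for hours in WINDOWS_HOURS:
        stats = summarize(time_command(placeholder))
        print(f"{hours:>4}h window: median {stats['median']:.3f}s")
```

Note that this only measures the client side.  It says nothing about
the load on the server, which is the other half of the question.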

It took more than a year to write cvsup, by someone who was working
basically full-time at it.  (that's what he told me, at least!  :-)
He wouldn't have put in all that work if there was no point to it,
and he would have based his work on a wide range of usage patterns.

> Unless I am overlooking something obvious, I think I am going to
> stick with updating our repository via rsync :-).

As I said earlier, rsync is certainly a reasonable solution. I'm just commenting on the "benchmark". And I realize I haven't done *any* benchmarks, so I can't claim much of anything either. But you would need a much more elaborate and tedious set of benchmarks before you could draw any significant conclusions.

--
Garance Alistair Drosehn            =   gad@gilead.netel.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu


