[svlug] rsync efficiency on trees with large numbers of files
svlug at ml.altern8.net
Fri Oct 29 12:04:38 PDT 2004
On Thu, Oct 28, 2004 at 07:17:58PM -0700, J C Lawrence wrote:
> My current thoughts are that there are two primary expense sources:
> 1) stat(2) on 10M files.
> 2) transfer for the stat(2) data to the other host so as to determine
> what needs to be exchanged.
> Whether the stat(2) is called by find(1) or by rsync(1) wouldn't seem to
> matter much. 10M stat(2) calls is just gonna hurt no matter how you
> colour it.
Are you using --size-only so rsync compares file size alone, instead of
its usual size-plus-mtime quick check? (It only checksums if you pass
--checksum.)
> As the generating application can't/won't generate the
> changelist (multi-threaded multi-process database), that leaves
> me running find(1) over the tree which would seem to pretty
> well put me back where I started minus some of the state
> transfer overhead in #2. I don't know how big a chunk the
> transfer time is (I haven't profiled), but I could imagine it
> being non-trivial. Hurm. Time for some testing!
In the past, I resorted to rsyncing parts of the source directory
at different times. Kludgey, but effective.
If you are inclined to experiment, I would look into clustering the
filesystem, say with GFS or NBD.
More information about the svlug mailing list