[svlug] rsync efficiency on trees with large numbers of files

Vince Hoang svlug at ml.altern8.net
Fri Oct 29 12:04:38 PDT 2004


On Thu, Oct 28, 2004 at 07:17:58PM -0700, J C Lawrence wrote:
> My current thoughts are that there are two primary expense sources:
> 
>   1) stat(2) on 10M files.
> 
>   2) transfer of the stat(2) data to the other host so as to determine
>   what needs to be exchanged.
> 
> Whether the stat(2) is called by find(1) or by rsync(1) wouldn't seem to
> matter much.  10M stat(2) calls is just gonna hurt no matter how you
> colour it.

Are you using --size-only to skip the checksum?
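
For reference, rsync's default "quick check" compares file size and
modification time, and --size-only drops the modification-time part. A
file that passes the check is skipped without any further checksumming.
Below is a minimal sketch of that decision in Python. It is an
illustration only, not rsync's actual code: needs_transfer is a
hypothetical name, and comparing whole-second mtimes is an assumption.

```python
import os

def needs_transfer(src_path, dst_path, size_only=False):
    """Return True if dst is missing or looks different from src.

    Illustrative approximation of a size/mtime quick check.
    """
    src = os.stat(src_path)
    try:
        dst = os.stat(dst_path)
    except FileNotFoundError:
        return True  # nothing on the receiving side yet
    if src.st_size != dst.st_size:
        return True  # sizes differ, so the file must be sent
    if size_only:
        return False  # same size is treated as "unchanged"
    # Assumption: compare modification times at one-second resolution.
    return int(src.st_mtime) != int(dst.st_mtime)
```

Either way, each side still pays for one stat(2) per file, so this
addresses the transfer work rather than the cost of walking the tree.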

> As the generating application can't/won't generate the
> changelist (multi-threaded multi-process database), that leaves
> me running find(1) over the tree which would seem to pretty
> well put me back where I started minus some of the state
> transfer overhead in #2. I don't know how big a chunk the
> transfer time is (I haven't profiled), but I could imagine it
> being non-trivial. Hurm. Time for some testing!

In the past, I resorted to rsyncing parts of the source directory
at different times. Kludgey, but effective.
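
A rough sketch of that split in Python, assuming the tree divides
sensibly at its top-level entries. The function names and the
destination string are hypothetical, and -a is just one plausible
option set. Each batch would then be run from its own cron slot or
similar.

```python
import os

def split_into_batches(root, n_batches):
    """Distribute root's top-level entries round-robin into n_batches lists."""
    names = sorted(os.listdir(root))
    batches = [[] for _ in range(n_batches)]
    for i, name in enumerate(names):
        batches[i % n_batches].append(name)
    return batches

def rsync_command(root, names, dest):
    """Build an rsync argument list for one batch (dest is hypothetical)."""
    sources = [os.path.join(root, name) for name in names]
    return ["rsync", "-a", *sources, dest]
```

For example, rsync_command("/data", ["a", "c"], "backup:/data/") gives
an argument list that could be passed to subprocess.run. Round-robin
by name does nothing to balance file counts across batches, so a real
split would probably want to weight by directory size.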

If you are inclined to experiment, I would look into clustering the
filesystem, say with GFS or NBD.

-Vince



