[svlug] rsync efficiency on trees with large numbers of files
charney at charneyday.com
Fri Oct 29 14:32:32 PDT 2004
I have two questions:
a) Is the time taken in stat()-ing the directories?; or
b) Is the time taken in rsyncing across the network?
This problem also suggests two questions with a variety of solutions:
1) Are you trying to rsync parts of the database?; or
2) Are you trying to rsync output from the application?
If you are dealing with 1), many RDBMS have either hooks or triggers you
can add to the database. From there you can instrument when additions or
changes are made. If the RDBMS modifies the database, can you use the
RDBMS to find the changes and additions, i.e., using a SQL command with
a WHERE clause? Often, RDBMS have utilities to operate on the database,
including dumping to text. That allows you to do a diff on the dumps.
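As a minimal sketch of the WHERE-clause idea — the table, column names,
and timestamps here are all made up for illustration; the real schema
would come from your application — you could select only rows modified
since the last sync, assuming the RDBMS (or a trigger you add) maintains
a modification timestamp:

```python
import sqlite3

# Hypothetical schema: a "records" table with a "modified_at" column
# kept up to date by the application or by a trigger in the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, data TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "old row", "2004-10-01"), (2, "new row", "2004-10-28")],
)

# Select only the rows changed since the last synchronization.
last_sync = "2004-10-15"
changed = conn.execute(
    "SELECT id, data FROM records WHERE modified_at > ?", (last_sync,)
).fetchall()
print(changed)
```

Only the rows returned by that query would need to be shipped to the
other system, instead of comparing the whole database.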
Lastly, many modern RDBMS support distributed synchronization.
If you are rsyncing the output of the application, independent of the
database, your problem then becomes one of identifying which files were
added or changed. Let me deal with additions and changes separately.
For additions, are the file names predictable? That is, given the last
file created, can you predict what the next file name will be? If that
is the case, keep a log of the last file you rsynced and rsync any files
created after that. If the file names are created randomly, you may try
this: keep a sorted list of the files when you last rsynced; then,
create a new sorted file list; next do a diff on the two files. This
will give you a list of the newer files. BTW, doing this in memory with
a purpose-built program, say in Python because it is so easy, could
speed things up immensely.
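The in-memory version of that comparison might look like this — the file
names are invented for illustration; in practice the old list would be
the one saved from the previous run and the new list would come from
os.listdir() or os.walk():

```python
# Sorted list saved at the time of the last rsync (hypothetical names).
old_list = sorted(["a001.dat", "a002.dat", "a005.dat"])

# Freshly built sorted list of the directory's current contents.
new_list = sorted(["a001.dat", "a002.dat", "a005.dat", "a007.dat"])

# Set difference gives the additions directly, without writing either
# list to disk and running diff(1) on the two files.
new_files = sorted(set(new_list) - set(old_list))
print(new_files)
```

The resulting list of new files can then be fed to rsync (for example
via --files-from) so only those files are considered.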
Changes to existing files are the most expensive to detect, especially
if we count on examining the details of the directory entries. One
possible solution is the symlink suggested by Szii. At this point, the
question becomes whether a) or b) is the gating factor. I can't see a
way around a) that others haven't recommended. If rsyncing across the
network is the problem, then it can be mitigated by minimizing the data
sent across the nodes. To do that, concatenate the contents of the
files, making sure each file's contents end with a '\n', and save the
result. The next time the application runs, repeat the concatenation and
do a diff as above. Only the diffs have to be shipped across the network.
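A sketch of that concatenate-and-diff scheme, with made-up file contents
standing in for the real snapshots (each file's contents end in a
newline so diff lines map back to files):

```python
import difflib

# Concatenated snapshot saved at the last sync, one line per small file
# (hypothetical contents for illustration).
old_snapshot = ["file1: alpha\n", "file2: beta\n"]

# Snapshot built on the current run.
new_snapshot = ["file1: alpha\n", "file2: gamma\n"]

# Only these diff lines would need to cross the network; the far side
# can apply them to its own copy of the snapshot.
delta = list(difflib.unified_diff(old_snapshot, new_snapshot,
                                  fromfile="last", tofile="now"))
for line in delta:
    print(line, end="")
```

For millions of 8-16 byte files the two snapshots stay small enough to
diff in memory, which is the point of the exercise.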
Hope I have given some useful information.
J C Lawrence wrote:
>What can be done to make rsync more efficient for synchronising
>directories with large numbers of files?
>I have a directory hierarchy which contains a little over ten million
>files (~12 million and growing) where most of the files are very small
>at 8 - 16 bytes, with roughly 0.5% being variously larger (Meg+). The
>file contents change relatively rarely with the most common operation
>being the addition of new files. Currently I use rsync to synchronise
>that directory hierarchy with another system. However the IO time
>required to determine what needs to be copied (usually nothing) is a bit
>extreme (many hours even over fast links with fast machines). I'd like
>to make this faster.
> No, the data storage format cannot be changed. It is defined and
> controlled by an external application which is not subject to change.
>J C Lawrence
>---------(*) Satan, oscillate my metallic sonatas.
>claw at kanga.nu He lived as a devil, eh?
>http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.