[svlug] rsync efficiency on trees with large numbers of files

Reg. Charney charney at charneyday.com
Fri Oct 29 14:32:32 PDT 2004


I have two questions:

a) Is the time being spent stat()-ing the directories? Or
b) Is the time being spent rsyncing across the network?

The problem also raises two further questions, each with a variety of
solutions:

1) Are you trying to rsync parts of the database? Or
2) Are you trying to rsync output from the application?

If you are dealing with 1), many RDBMSs have hooks or triggers you
can add to the database. With those you can record when additions or
changes are made. If the RDBMS is what modifies the database, can you
use the RDBMS to find the changes and additions, e.g., using a SQL
command with a WHERE clause? Often, RDBMSs have utilities to operate on
the database, including dumping to text. That allows you to do a diff
on the dumps. Lastly, a lot of modern RDBMSs support distributed
synchronization.
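
As a rough illustration of the WHERE-clause idea, here is a minimal
sketch using Python's sqlite3 module. The table name `records`, its
columns, and the `modified_at` timestamp are all hypothetical; a real
schema would need some comparable "last modified" column (possibly
maintained by a trigger) for this to work.

```python
import sqlite3


def rows_changed_since(conn, last_sync):
    """Return rows added or modified after last_sync (an ISO 8601 string)."""
    # ISO 8601 timestamps in a uniform format compare correctly as text.
    cursor = conn.execute(
        "SELECT id, payload, modified_at FROM records "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_sync,),
    )
    return cursor.fetchall()


def make_demo_db():
    """Build a tiny in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records "
        "(id INTEGER PRIMARY KEY, payload TEXT, modified_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO records (id, payload, modified_at) VALUES (?, ?, ?)",
        [
            (1, "old row", "2004-10-01T09:00:00"),
            (2, "changed row", "2004-10-29T10:15:00"),
            (3, "new row", "2004-10-29T13:40:00"),
        ],
    )
    return conn
```

Calling rows_changed_since(make_demo_db(), "2004-10-28T00:00:00")
selects only the rows touched after the last sync time, so only those
would need to be shipped.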

If you are rsyncing the output of the application, independent of the
database, your problem then becomes one of identifying which files were
added or changed. Let me deal with additions and changes separately.

For additions, are the file names predictable? That is, given the last 
file created, can you predict what the next file name will be? If that 
is the case, keep a log of the last file you rsynced and rsync any files 
created after that. If the file names are created randomly, you might
try this: keep a sorted list of the files from when you last rsynced;
then create a new sorted file list; next, do a diff on the two lists.
This will give you a list of the newer files. BTW, doing this in memory
with a purpose-built program, say in Python because it is so easy,
could speed things up immensely.
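
The file-list comparison could be sketched roughly as follows. This is
only an illustration: the function names and the snapshot file are made
up for the example, it assumes file names contain no newline
characters, and it uses a set difference in memory rather than running
diff on two sorted files.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def load_snapshot(path):
    """Read the file list saved at the last sync (empty if none exists)."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {line.rstrip("\n") for line in fh if line.strip()}
    except FileNotFoundError:
        return set()


def save_snapshot(path, files):
    """Write the file list, sorted, one name per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for name in sorted(files):
            fh.write(name + "\n")


def new_files(root, snapshot_path):
    """Return files present now but absent from the last snapshot."""
    current = list_files(root)
    previous = load_snapshot(snapshot_path)
    added = sorted(current - previous)
    # Record the current state so the next run only reports newer files.
    save_snapshot(snapshot_path, current)
    return added
```

The returned list could be written to a file and handed to rsync via
its --files-from option, so that rsync only looks at the new names
instead of walking the whole tree. The snapshot file should live
outside the tree being scanned, otherwise it shows up as a new file.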

Changes to existing files are the most expensive to detect, especially
if we rely on examining the details of the directory entries. One
possible solution is the symlink suggested by Szii. At this point, the
question is whether a) or b) is the gating issue. I can't see a way
around a) that others haven't already recommended. If rsyncing across
the network is the problem, then it can be reduced by minimizing the
data sent between the nodes. To do that, cat the contents of the files
into one file, where each file occupies at least one line ending in
'\n'. Save this file. The next time the application runs, do the
concatenation again and then do a diff as above. Only the diffs have to
be shipped across the network.
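
A minimal sketch of this, with one deliberate substitution: instead of
the raw file contents, each line of the combined file holds the file's
relative path and a SHA-256 digest of its contents, which keeps every
entry on a single line even for the Meg+ files. The function names are
invented for the example, and it assumes paths contain no tabs or
newlines. Note that it still reads every file locally, so it helps with
b) and not with a).

```python
import difflib
import hashlib
import os


def build_manifest(root):
    """Return sorted lines of 'relative/path<TAB>sha256-of-contents'."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            rel = os.path.relpath(full, root)
            lines.append(f"{rel}\t{digest}\n")
    lines.sort()
    return lines


def manifest_diff(old_lines, new_lines):
    """Unified diff of two manifests; this is all that needs shipping."""
    return list(
        difflib.unified_diff(old_lines, new_lines, "previous", "current", n=0)
    )


def changed_paths(old_lines, new_lines):
    """Paths that are new, or whose digest differs from the old manifest."""
    old = set(old_lines)
    return sorted(
        line.split("\t", 1)[0] for line in new_lines if line not in old
    )
```

The output of manifest_diff() is small when little has changed, and
changed_paths() gives the list of files that actually need to be
copied to the other node.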

Hope I have given some useful information.

Reg.

J C Lawrence wrote:

>What can be done to make rsync more efficient for synchronising
>directories with large numbers of files?
>
>I have a directory hierarchy which contains a little over ten million
>files (~12 million and growing) where most of the files are very small
>at 8 - 16 bytes, with roughly 0.5% being variously larger (Meg+).  The
>file contents change relatively rarely with the most common operation
>being the addition of new files.  Currently I use rsync to synchronise
>that directory hierarchy with another system.  However the IO time
>required to determine what needs to be copied (usually nothing) is a bit
>extreme (many hours even over fast links with fast machines).  I'd like
>to make this faster.
>
>Suggestions?
>
>  No, the data storage format cannot be changed.  It is defined and
>  controlled by an external application which is not subject to change.
>
>--
>J C Lawrence
>---------(*)                Satan, oscillate my metallic sonatas.
>claw at kanga.nu               He lived as a devil, eh?
>http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.
>
>_______________________________________________
>svlug mailing list
>svlug at lists.svlug.org
>http://lists.svlug.org/lists/listinfo/svlug