
Super fast backup: copy, move many files

posted Jul 6, 2016, 10:49 PM by Dong Xu   [ updated Jul 6, 2016, 11:09 PM ]
To copy a large amount of data, tar is generally faster than rsync. The idea is to start with tar and finish off with a final rsync:

tar -cpf - src/ | tar -xpf - -C dst/
rsync -avhW --no-compress --progress --exclude=.gvfs /src/ /dst/


Note on the trailing slash of /src/:
with the trailing slash, the contents of /src are copied straight into /dst (no new folder is created in /dst)
without the trailing slash, a src folder is created inside /dst
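
For example (the paths here are just placeholders):

rsync -avh /src/ /dst/    # trailing slash: contents of /src go straight into /dst
rsync -avh /src /dst/     # no trailing slash: creates /dst/src and copies into it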

-a archive, which preserves ownership, permissions, etc.
-v verbose, so I can see what's happening (optional)
-h human-readable, so the transfer rate and file sizes are easier to read (optional)
-W copy whole files only, without the delta-xfer algorithm, which should reduce CPU load
--no-compress as there's no lack of bandwidth between local devices
--progress so I can see the progress of large files (optional)

rsync is slower here because it needs to stat each file on both ends as it rolls along. Big files flew over the wire, but most of what we were moving were tiny, tiny files, and as I said, millions and millions of them.
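
If you want to see what you are up against before picking a method, a quick file count of the source tree (assuming it lives at /src as above) shows how much per-file overhead rsync would pay:

find /src -type f | wc -l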

The beauty of this method is that you do not need double the space because you never actually create an intermediate tar file. The tar before the pipe packs the data and streams it to stdout, and the tar after the pipe grabs it from stdin and unpacks it.
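
As an optional variation (assuming the pv utility is installed), you can drop pv into the middle of the pipe to watch the throughput of the stream, still without any intermediate file:

tar -cpf - src/ | pv | tar -xpf - -C dst/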
===========================================


tar over ssh is the most reliable way to get good speed.

If you are doing file transfers on a local LAN and are not too concerned about strong encryption, you may want to add: -c arcfour (note that newer OpenSSH releases have dropped the arcfour ciphers, so this only applies to older versions).

Host key checking can cause considerable pauses when establishing connections. Once again, if you are on a LAN, you may want to add: -o StrictHostKeyChecking=no

And one other note: if you use publickey authentication, adding -o PreferredAuthentications=publickey can also cut down on the time it takes to establish a connection.

For example:
tar -cpf - * | ssh -2 -4 -c arcfour -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -lroot ${REMOTE_HOST} tar -C ${REMOTE_DIRECTORY}/ -xpf -
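
The same idea works in the pull direction as well; a rough sketch, where ${LOCAL_DIRECTORY} is a placeholder of my own for the local destination:

ssh -c arcfour -o StrictHostKeyChecking=no -lroot ${REMOTE_HOST} "tar -C ${REMOTE_DIRECTORY}/ -cpf -" | tar -C ${LOCAL_DIRECTORY}/ -xpf -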

================================================

SSH really isn't all that CPU-intensive. Sure, it's encryption, but symmetric encryption (which is used after the initial authentication) can be quite fast even in software. For instance, try "dd if=/dev/zero bs=1048576 | openssl enc -blowfish -k somepass > /dev/null". On my machine it runs at 91.2 MB/s, almost enough to fully saturate a gigabit line with payload, and faster than many disks even in sequential read.
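
A bounded variant of the same test (1 GiB of zeros so it finishes on its own; aes-128-cbc is just one example cipher, and the numbers will differ on your hardware):

dd if=/dev/zero bs=1048576 count=1024 | openssl enc -aes-128-cbc -k somepass > /dev/null

openssl speed -evp aes-128-cbc also gives a quick per-cipher benchmark without involving dd.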

Regarding rsync vs. tar: rsync is optimized for efficient network usage, not efficient disk usage. In your particular case, it sounds like many small files were the challenge. While rsync usually wins for partial updates, it has a strong disadvantage on the first run, since it first scans the directory tree on both machines to determine which files to transfer.

Also, if the network is the bottleneck rather than the CPU, you can optimize the tar example with compression, for instance bzip2 or xz/lzma. My /var/lib/dpkg went from 90 MB to 13 MB using bzip2, at a throughput of 6.1 Mbit/s, so adding -j to the tar line might improve the low-bandwidth example even more than rsync, if you want to prove a point.
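
For example, a sketch with bzip2 (user@host and /dst are placeholders): -j compresses on the sending side and decompresses on the receiving side:

tar -cjpf - src/ | ssh user@host "tar -C /dst/ -xjpf -"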

Also, rsync only compresses when -z/--compress is given (it is not on by default), so making sure compression stays off /might/ improve performance in the high-bandwidth case.
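
In other words, something along these lines (placeholder hosts and paths):

rsync -avhz /src/ user@host:/dst/   # slow WAN link: -z compression can help
rsync -avh /src/ user@host:/dst/    # fast LAN: leave -z off, the CPU would just become the bottleneck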

------------------------------

scp -qrp /var/lib/dpkg [server]:/tmp
rsync -ae ssh /var/lib/dpkg [server]:/tmp
tar -cf - /var/lib/dpkg | ssh [server] tar -C /tmp -xf -
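
To compare the three on your own data, each can be wrapped in time (same placeholders as above); the third is wrapped in sh -c so the whole pipeline is timed:

time scp -qrp /var/lib/dpkg [server]:/tmp
time rsync -ae ssh /var/lib/dpkg [server]:/tmp
time sh -c 'tar -cf - /var/lib/dpkg | ssh [server] tar -C /tmp -xf -'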
