Force rsync to compare local files byte by byte instead of by checksum

I wrote a Bash script to back up a folder. The script is built around this rsync command:

rsync -abh --checksum /path/to/source /path/to/target

      

I use --checksum because I don't want to rely on file size or modification time to determine whether a file in the source path needs to be backed up. However, most of the time, if not always, I run this script locally, that is, with an external USB drive attached that holds the backup destination folder; no network is involved. Thus there is no need for a delta transfer, since both files are read and processed entirely by the same computer. Calculating checksums even slows things down in this case. It would be better if rsync simply diffed the files when both are stored locally.

After reading the manpage, I came across the --whole-file option, which seems to avoid the expensive checksum calculation. The manpage also states that this is the default if the source and destination are local paths.

So I'm going to change the rsync command to

rsync -abh /path/to/source /path/to/target

      

Will rsync now compare the local source and target files byte by byte, or will it use modification time and/or size to determine whether the source file needs to be backed up? I definitely do not want to rely on file size or modification time to decide whether a backup should be performed.

UPDATE

Note the -b option in the rsync command. It means that destination files are backed up before they are replaced. Therefore, blindly rsync'ing all files in the source folder, for example by passing --ignore-times as suggested in the comments, is not an option. It would create too many duplicate files and waste storage space. Keep in mind that I am trying to reduce backup time and the workload on my local machine. Simply backing up everything would defeat that goal.

So my question can be rephrased as: is rsync able to compare files byte by byte?

+3




2 answers


There is no way to make rsync do a byte-by-byte file comparison instead of a checksum, as you would like it to.

The way rsync works is to create two processes, a sender and a receiver, which build a list of files and their metadata in order to decide between them which files need to be updated. This happens even for local files, but in that case the processes communicate over a pipe rather than a network socket. Once the list of changed files has been determined, the changes are sent as deltas or as whole files.



In theory, it would be possible to send the whole files along with the file list so that the other side could do a diff, but in practice this would be inefficient in many cases. The receiver would need to keep these files in memory in case it detects that a file needs updating, or otherwise the changes in the files would still have to be sent. Neither of the possible solutions is very efficient here.

There is a good overview of the (theoretical) mechanics of rsync: https://rsync.samba.org/how-rsync-works.html
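Since rsync itself offers no such mode, a byte-by-byte check has to come from another tool. The following is a minimal sketch using cmp, not anything rsync does internally; the function name needs_backup is hypothetical.

```shell
#!/bin/sh
# needs_backup SRC DST
# Succeeds (exit status 0) if DST is missing or differs from SRC in at
# least one byte. cmp -s prints nothing and stops at the first difference.
needs_backup() {
    src=$1
    dst=$2
    # A missing destination always needs a backup.
    [ -f "$dst" ] || return 0
    # cmp -s exits 0 when the files are identical, so negate it.
    ! cmp -s -- "$src" "$dst"
}

# Usage (placeholder paths, not executed here):
#   if needs_backup /path/to/source/file /path/to/target/file; then
#       echo "file changed"
#   fi
```

No hash is computed here, and cmp can stop at the first differing byte, but identical files are still read in full on both sides.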

+1




Question: is rsync able to do a byte-by-byte file comparison?

Strictly speaking, yes:

  • It is a block-by-block comparison, but you can change the block size.
  • You can use --block-size=1 (but it would be unreasonably inefficient and inappropriate for practically every use case)

Block-based checksumming is the default behavior over the network.

Use the --no-whole-file option to force this behavior locally (see below).

Statement 1. Calculating checksums even leads to a decrease in speed in this case.

This is why it is disabled by default for local transfers.

Using the --checksum option forces a read of the entire file, as opposed to the default block-by-block delta-transfer checksum check.

Statement 2. Will rsync now check the local source and target files byte by byte, or will it use modification time and/or size to determine whether the source file needs to be backed up?

By default it will use the size and modification time.

You can use a combination of --size-only, --(no-)ignore-times, --ignore-existing and --checksum to change this behavior.

Statement 3. I definitely do not want to rely on file size or modification time to decide whether a backup should be performed.

Then you need to use --ignore-times and/or --checksum.

Statement 4. Passing --ignore-times, as suggested in the comments, is not an option.

Perhaps using --no-whole-file and --ignore-times is what you want? This forces the delta-transfer algorithm to be used, but for every file regardless of timestamp or size.

You would (in my opinion) only ever use this combination of options if it was critical to avoid unnecessary writes (and it would have to be specifically unnecessary writes that you are trying to avoid, not system inefficiency, since performing delta transfers for local files would not actually be more efficient), and you had reason to believe that files with the same modification stamp and byte size could indeed be different.

I fail to see how the modification stamp and the size in bytes are anything but a logical first step in identifying changed files.

If you compared the following two files:

  • File 1 (local): File.bin - 79776451 bytes, modified on 15 May 07:51
  • File 2 (remote): File.bin - 79776451 bytes, modified on 15 May 07:51

The default behavior is to skip these files. If you are not sure that the files should be skipped and want them compared, you can force a block-by-block comparison and differential update of these files with --no-whole-file and --ignore-times.



So, a summary on the matter:

  • Use the default method for the most efficient backup and archive
  • Use --ignore-times and --no-whole-file to force a delta transfer (block-by-block checksums, transferring only differential data) if this is necessary for any reason
  • Use --checksum and --ignore-times to be totally paranoid and wasteful.
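The three strategies in this summary can be written out as option sets. This is only a sketch: the function name rsync_mode_opts and the mode labels (default, delta, paranoid) are made up for illustration, and -abh is carried over from the command in the question.

```shell
#!/bin/sh
# rsync_mode_opts MODE
# Prints the rsync options for one of the three strategies.
# The mode names are hypothetical labels, not rsync terminology.
rsync_mode_opts() {
    case $1 in
        default)  printf '%s\n' '-abh' ;;
        delta)    printf '%s\n' '-abh --ignore-times --no-whole-file' ;;
        paranoid) printf '%s\n' '-abh --checksum --ignore-times' ;;
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Usage (placeholder paths, not executed here; the substitution is left
# unquoted on purpose so the options split into separate words):
#   rsync $(rsync_mode_opts delta) /path/to/source /path/to/target
```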

Statement 5. Note the -b option in the rsync command. It means that destination files are backed up before they are replaced.

Yes, but this can work however you want it to. It doesn't necessarily mean a full backup every time a file is updated, and it certainly doesn't mean that a full transfer will take place at all.

You can configure rsync to:

  • Keep 1 or more versions of a file
  • Use --backup-dir to act as a complete incremental backup system.

Done this way, no space is wasted beyond what is required for storing the differential data. I can vouch for that in practice, as my backup disks would not have enough space if all of my previous versions were full copies.
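As an illustration of the --backup-dir idea, the sketch below builds a dated directory to receive the files that a run replaces. The directory layout (old-versions/DATE next to the current copy) and the function name dated_backup_dir are assumptions made for this example; rsync does not prescribe them.

```shell
#!/bin/sh
# dated_backup_dir BASE [DATE]
# Prints BASE/old-versions/DATE. DATE defaults to today as YYYY-MM-DD.
dated_backup_dir() {
    base=$1
    day=${2:-$(date +%Y-%m-%d)}
    printf '%s/old-versions/%s\n' "$base" "$day"
}

# Usage (placeholder paths, not executed here):
#   rsync -ah --backup \
#         --backup-dir="$(dated_backup_dir /path/to/backups)" \
#         /path/to/source /path/to/backups/current
```

With this layout, each dated directory receives only the previous versions of the files that were replaced in that run, while the current copy stays complete.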


Additional Information


Why isn't delta transfer more efficient than copying the entire file locally?

Because you are not tracking the changes to each of your files. If you have a delta file, you can merge just the changed bytes, but you need to know what those changed bytes are first. The only way to find out is to read the whole file.

For example:

  • I change the first byte of a 10MB file.
  • I use rsync with delta transfer to sync this file.
  • rsync immediately sees that the first byte (or a byte within the first block) has changed and proceeds (when --inplace is used; it is not the default) to change only that block
  • However, rsync does not know that only the first byte has changed. It will keep checksumming until the entire file is read.
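The first-byte example can be reproduced with cmp instead of rsync. This is only an analogy under stated assumptions: cmp -l stands in for a full comparison, a 1 MiB file of zero bytes replaces the 10MB file to keep it quick, and both function names are hypothetical.

```shell
#!/bin/sh
# count_differing_bytes A B
# Prints how many byte positions differ. cmp -l lists every differing
# byte, so it has to read both files to the end to produce the count.
count_differing_bytes() {
    # cmp exits non-zero when the files differ; that is expected here.
    { cmp -l -- "$1" "$2" || true; } | wc -l | tr -d ' '
}

# demo_first_byte_change
# Creates a 1 MiB file of zeros, copies it, overwrites only the first
# byte of the copy, and prints the number of differing bytes.
demo_first_byte_change() {
    dir=$(mktemp -d) || return 1
    dd if=/dev/zero of="$dir/old.bin" bs=1024 count=1024 2>/dev/null
    cp "$dir/old.bin" "$dir/new.bin"
    # conv=notrunc keeps the rest of the copy intact.
    printf 'X' | dd of="$dir/new.bin" bs=1 count=1 conv=notrunc 2>/dev/null
    count_differing_bytes "$dir/old.bin" "$dir/new.bin"
    rm -rf "$dir"
}

demo_first_byte_change   # prints 1
```

The difference is found at the very first byte, yet the count of 1 can only be confirmed after the remaining bytes have been read as well.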

For all intents and purposes:

  • Consider rsync a tool that conditionally performs a --checksum depending on whether the timestamp or file size has changed. Overriding this with --checksum is essentially equivalent to --no-whole-file and --ignore-times, since both will:
    • Operate on every file regardless of time and size
    • Read every block of the file to determine which blocks to synchronize.

What's the benefit?

It's all about the tradeoff between transfer bandwidth and speed / overhead.

  • --checksum is a good way to only ever send differences over the network.
  • --checksum while ignoring files with the same timestamp and size is a good way both to send only differences over the network and to maximize the speed of the entire backup operation.

Interestingly, it is probably much more efficient to use --checksum as a "blanket" option than to force a delta transfer for every file.

+2








