Rsync for incremental backups
Introduction
Archiving files to tape is still considered one of the cheapest ways of making backups. However, with the prices of disk and solid-state storage decreasing rapidly, it won't be long before users switch to faster disk storage for all their backup needs. The problem is that if you want to do anything more than mirror data to remote storage, there aren't many good free tools for the job. This writeup explains one interesting way to do incremental backups with snapshot capability using a popular tool called rsync.
Traditional Backup
Traditional backup applications not only support backing up and restoring files, directories, partitions and drives, but also allow incremental backups to reduce the time taken to back up large file systems. Since it's not practical to restore the complete data from tape just to apply minor changes to it, most tape backup software stores the differences in a separate file or location on tape, which can be used to patch the last full backup image. Storing incremental updates this way also allows administrators to maintain multiple versions of data without keeping as many full physical copies. Incremental updates to a backup repository are an extremely valuable feature in highly dynamic environments, where it is important to maintain multiple, frequently taken snapshots of the data.
Rsync
Rsync was one of the first tools I used that can update a copy of data by sending only incremental updates, which dramatically cuts down the time needed to bring the copy up to date. The problem is that although rsync lets you make a copy of data and update it incrementally, it was not designed with tape-style backups in mind. Used as a plain mirroring tool, it doesn't keep the incremental updates in a different directory or file the way backup applications do. This limits the number of snapshots one can maintain using rsync alone.
The cp command
And that brings us to the last piece of the puzzle we need in order to do incremental backups. The "cp" command on many Unix-like operating systems can create hard links to files instead of copying the actual data. This lets you maintain two separate directory trees on the same partition (with different names) whose entries point to the same physical set of files. On Linux this can be accomplished with the command "cp -al $sourcedir $targetdir".
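As a minimal sketch of what this does, assuming GNU cp on Linux and a throwaway directory created with mktemp (the names "src", "snap" and the variables below are purely illustrative), the following script makes a hard-link copy and checks whether a file and a directory in the two trees are the same object:

```shell
#!/bin/sh
# Sketch: hard-link copy of a small tree with GNU cp.
# Assumption: GNU coreutils (Linux); all paths live in a scratch directory.
work=$(mktemp -d)
mkdir -p "$work/src/subdir1"
echo data > "$work/src/subdir1/file3.txt"

# -a: archive mode (recursive, preserve attributes)
# -l: create hard links to files instead of copying their contents
cp -al "$work/src" "$work/snap"

# Print inode numbers: the file lines should match, the directory lines should not.
ls -i  "$work/src/subdir1/file3.txt" "$work/snap/subdir1/file3.txt"
ls -di "$work/src/subdir1" "$work/snap/subdir1"

# "-ef" is true when both paths refer to the same device and inode.
if [ "$work/src/subdir1/file3.txt" -ef "$work/snap/subdir1/file3.txt" ]; then
    same_file=yes
else
    same_file=no
fi
if [ "$work/src/subdir1" -ef "$work/snap/subdir1" ]; then
    same_dir=yes
else
    same_dir=no
fi
echo "file shared: $same_file, directory shared: $same_dir"
```

Only regular files are shared between the two trees. The directories themselves are created anew, because directories cannot be hard-linked.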
Rsync + cp
To demonstrate how these two can work together to provide snapshot capability, I ran some tests on my Linux box. The first step was to create a small directory structure to use for this exercise. "ls -ila" on the directory shows the inode number (column 1) assigned to each of the files and directories within the test directory I created.
Original Directory Structure
List of files
==============================
/tmp/test/primary
/tmp/test/primary/file1.txt
/tmp/test/primary/file2.txt
/tmp/test/primary/subdir1/file3.txt
==============================
la:/tmp/test # ls -ila primary/*
358315 -rw-r--r-- 1 root root 5 Aug 21 23:06 primary/file1.txt
358316 -rw-r--r-- 1 root root 5 Aug 21 23:07 primary/file2.txt
primary/subdir1:
total 4
358313 drwxr-xr-x 2 root root 80 Aug 21 23:07 .
358312 drwxr-xr-x 3 root root 136 Aug 21 23:07 ..
358314 -rw-r--r-- 1 root root 5 Aug 21 23:07 file3.txt
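For readers who want to follow along, a tree like the one listed above could be recreated roughly as follows. This is only a sketch: the article does not show the file contents, so "test" plus a newline is an assumption chosen because it happens to be 5 bytes, and the BASE variable is a hypothetical convenience, not part of the original setup.

```shell
#!/bin/sh
# Recreate a test tree similar to the one in the listing.
# Assumption: file contents are invented; only the 5-byte size matches the listing.
base=${BASE:-$(mktemp -d)}    # set BASE=/tmp/test to use the article's path

mkdir -p "$base/primary/subdir1"
echo test > "$base/primary/file1.txt"
echo test > "$base/primary/file2.txt"
echo test > "$base/primary/subdir1/file3.txt"

# Column 1 is the inode number; the actual values will differ on every system.
( cd "$base" && ls -ila primary/* )
```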
The next step is to do a traditional recursive copy from "primary" to "directory1". You can accomplish this with either "cp" or "rsync"; in the following example I used "cp". Notice that when I list the inodes after the cp command, each file and directory in the new directory structure has a new inode. This means that the traditional "cp" command did an actual recursive copy of the file contents to new locations, and that there now exist two identical copies of each object. (A freshly copied file has a link count of 1 in column 3; the count of 2 in this listing is what you see once the hard-link copy in the next step has been made.)
After Copy using "cp -rp src target"
la:/tmp/test # cp -rp primary directory1
la:/tmp/test # ls -ila directory1/*
358307 -rw-r--r-- 2 root root 5 Aug 21 23:06 directory1/file1.txt
358308 -rw-r--r-- 2 root root 5 Aug 21 23:07 directory1/file2.txt
directory1/subdir1:
total 4
358305 drwxr-xr-x 2 root root 80 Aug 21 23:07 .
115666 drwxr-xr-x 3 root root 136 Aug 21 23:07 ..
358306 -rw-r--r-- 2 root root 5 Aug 21 23:07 file3.txt
Now let's see how "cp" behaves when we ask it to create hard links instead of copies. In this example we copy "directory1" into a new directory "directory2". Notice that the file inodes in the new directory are the same as the ones in "directory1", while the directories themselves get new inodes, since directories cannot be hard-linked. This means that although there are two directory trees that look alike, the files listed in each are the very same files. Any modification made to a file in one directory (without changing its inode) will affect the file in the other directory. This resembles symbolic linking, except that unlike a symbolic link the file won't disappear from "directory2" if I delete it from "directory1". In other words, each of these inodes now has more than one name referring to it (the link count of 2 in column 3), which can be a little hard to digest at first.
After hard-link copy using "cp -al src target"
la:/tmp/test # cp -al directory1 directory2
la:/tmp/test # ls -ila directory2/*
358307 -rw-r--r-- 2 root root 5 Aug 21 23:06 directory2/file1.txt
358308 -rw-r--r-- 2 root root 5 Aug 21 23:07 directory2/file2.txt
directory2/subdir1:
total 4
358310 drwxr-xr-x 2 root root 80 Aug 21 23:07 .
358309 drwxr-xr-x 3 root root 136 Aug 21 23:07 ..
358306 -rw-r--r-- 2 root root 5 Aug 21 23:07 file3.txt
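The behaviour described above (an in-place change is visible under both names, while replacing or deleting one name leaves the other untouched) can be checked with a short sketch. It assumes GNU cp and uses a scratch directory; the directory names "a" and "b" and the variable names are illustrative only.

```shell
#!/bin/sh
# Sketch: what happens to a hard-linked file when one name is modified,
# replaced, or deleted. Assumption: GNU cp (for -al); paths are throwaway.
work=$(mktemp -d)
mkdir "$work/a"
echo old > "$work/a/f.txt"
cp -al "$work/a" "$work/b"            # b/f.txt is a second name for the same inode

# 1. In-place write: the shell truncates and rewrites the existing inode,
#    so the change is visible through both names.
echo changed > "$work/a/f.txt"
inplace_view=$(cat "$work/b/f.txt")

# 2. Replace by writing a new file and renaming it over the old name:
#    a/f.txt now points at a new inode, b/f.txt keeps the previous contents.
echo replaced > "$work/a/f.txt.tmp"
mv "$work/a/f.txt.tmp" "$work/a/f.txt"
replaced_view=$(cat "$work/b/f.txt")

# 3. Delete one name: the data survives as long as another name refers to it.
rm "$work/a/f.txt"
after_delete=$(cat "$work/b/f.txt")

echo "in-place: $inplace_view, after replace: $replaced_view, after delete: $after_delete"
```

Case 2 is the one that matters for snapshots: a tool that replaces files by renaming, rather than rewriting them in place, leaves the hard-linked copy frozen at its old contents.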
So we know how hard links work, and we know how to create multiple directories that look exactly the same without creating as many copies of the actual data. A little more research on your part would reveal that if you modify "subdir1/file3.txt" and sync the change, the only inode that changes is that of "subdir1/file3.txt"; "subdir1" keeps its inode and only its modification time is updated. I didn't show the inodes of the "primary" directory in the dumps below, but I did show how the inodes look after I rsync the changes from "primary" to "directory1". Notice that after the rsync, "directory1/subdir1/file3.txt" has a new inode (358322 instead of 358306) and a link count of 1. This is because rsync by default doesn't overwrite an existing file in place. Instead it writes the updated file under a temporary name and renames it over the old one. Interestingly, "directory2" still shows the old inode, size and timestamp for the file that was modified.
"directory2" has now become a "snapshot" of "directory1" without actually having a duplicate copy of all the data in "directory1".
After modifying file3.txt in "primary" and running "rsync"
la:/tmp/test # rsync -rvgoutl primary/* directory1/
building file list ... done
subdir1/
subdir1/file3.txt
wrote 161 bytes read 40 bytes 402.00 bytes/sec
total size is 21 speedup is 0.10
la:/tmp/test # ls -ila directory1/*
358307 -rw-r--r-- 2 root root 5 Aug 21 23:06 directory1/file1.txt
358308 -rw-r--r-- 2 root root 5 Aug 21 23:07 directory1/file2.txt
directory1/subdir1:
total 4
358305 drwxr-xr-x 2 root root 80 Aug 21 23:14 .
115666 drwxr-xr-x 3 root root 136 Aug 21 23:07 ..
358322 -rw-r--r-- 1 root root 11 Aug 21 23:14 file3.txt
la:/tmp/test # ls -ila directory2/*
358307 -rw-r--r-- 2 root root 5 Aug 21 23:06 directory2/file1.txt
358308 -rw-r--r-- 2 root root 5 Aug 21 23:07 directory2/file2.txt
directory2/subdir1:
total 4
358310 drwxr-xr-x 2 root root 80 Aug 21 23:07 .
358309 drwxr-xr-x 3 root root 136 Aug 21 23:07 ..
358306 -rw-r--r-- 1 root root 5 Aug 21 23:07 file3.txt
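Putting the two commands together, one backup cycle could be sketched as below. This is not a script from the original experiment: the function name take_snapshot, the "current" and "snapshot-<label>" directory names, and the demo values are all hypothetical, and it assumes GNU cp plus an rsync that is run without --inplace (which would rewrite files in place and so alter the snapshots too).

```shell
#!/bin/sh
# Sketch of one backup cycle: freeze the current mirror as a hard-linked
# snapshot, then let rsync bring the mirror up to date.
# Assumptions: GNU cp, rsync installed, backup root on a single filesystem.

take_snapshot() {
    src=$1            # directory to back up
    backup_root=$2    # where the mirror and snapshots live
    label=$3          # snapshot label, e.g. the output of: date +%Y%m%d-%H%M%S

    mkdir -p "$backup_root/current"

    # 1. Hard-link copy of the mirror: uses directory entries, not data blocks.
    #    The target must not exist yet, or cp would copy into it instead.
    cp -al "$backup_root/current" "$backup_root/snapshot-$label"

    # 2. Update the mirror. Changed files are written under a temporary name
    #    and renamed into place, so the snapshot keeps the old inode.
    #    --delete removes files that no longer exist in the source.
    rsync -a --delete "$src/" "$backup_root/current/"
}

# Small demonstration in a scratch directory.
work=$(mktemp -d)
mkdir -p "$work/primary"
echo v1 > "$work/primary/file.txt"
take_snapshot "$work/primary" "$work/backup" 1      # first run: mirror is created

echo v2-longer > "$work/primary/file.txt"           # change the source
take_snapshot "$work/primary" "$work/backup" 2      # snapshot-2 freezes the v1 state

snap_view=$(cat "$work/backup/snapshot-2/file.txt")
cur_view=$(cat "$work/backup/current/file.txt")
echo "snapshot-2: $snap_view, current: $cur_view"
```

Each snapshot directory holds the state of the mirror just before the corresponding rsync run, and unchanged files are shared between all snapshots and the mirror, so only changed files consume additional space.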