As part of my Linux backup scheme (which I need to write up someday), I've recently been swapping and upgrading/replacing some USB hard disks at home. There's a Linux box there (a Thinkpad T43p running Ubuntu, if you must know) with a 320GB disk attached and mounted as /mnt/backup, and it was running fairly low on space.

jzawodn@wasp:/mnt$ df -h /mnt/backup
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             276G  211G   51G  81% /mnt/backup

That was after I moved about 50GB of stuff off it last night.

I want to replace it with a newly attached 750GB disk and need to move all the data over to the new disk. But since much of the data consists of remote filesystem snapshots produced using rsnapshot, which makes copious use of hard links, it's rather important that I do this correctly. If I don't, the data won't even fit on the 750GB disk!

(If that seems impossible, you don't quite grok hard links on a filesystem yet.)
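You can see the effect directly with du, which counts each inode only once per invocation, so hard-linked snapshots that look huge individually add almost nothing to the total. Something like this, where daily.0 and daily.1 stand in for whatever rsnapshot interval names are in use:

du -sh /mnt/backup/snaps/daily.0    # one snapshot, reported at full size
du -sh /mnt/backup/snaps/daily.1    # the next one looks nearly as big on its own
du -sh /mnt/backup/snaps            # far less than the sum: shared blocks are counted once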

Digging deep into my Unix past, I remember needing to do this once before. The trick was not to use any of the usual suspects: cp, tar, rsync, or mv. Instead, you use either dump (yuck) or a combination of find and cpio.

It looks something like this:

mkdir /mnt/backup2/snaps
cd /mnt/backup/snaps
find . -print | cpio -Bpdumv /mnt/backup2/snaps

Then you just wait a long time while stuff scrolls by and you wish you were using disks in eSATA enclosures rather than in USB 2.0 enclosures.

The trouble is that cpio didn't properly preserve timestamps on directories (not sure why--I expected it to), so I had to dig even deeper to remember pairing up dump and restore.

cd /mnt/backup2
mkdir snaps
( dump -0 -f - /mnt/backup/snaps | restore -v -x -y -f - ) >& ~jzawodn/dump.log

And then I waited about half a day for the copy to complete.

root@wasp:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             276G  212G   50G  82% /mnt/backup
/dev/sdc1             688G  284G  370G  44% /mnt/backup2

Not bad. A quick edit to /etc/rsnapshot.conf to change my snapshot_root from /mnt/backup to /mnt/backup2 and that's all it took.
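For reference, the edited line ends up looking roughly like this (rsnapshot.conf is picky about separating the directive and its value with tabs, not spaces):

# in /etc/rsnapshot.conf -- fields must be tab-separated
snapshot_root	/mnt/backup2/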

Next time I have to go through this, it won't take me nearly as long to devise a scheme to get it done.

Now, does anyone have alternative methods? Or do you know why cpio didn't preserve timestamps correctly?

Thanks to the folks at TechCzar for translating my tech blog posts and including them in their blog network.

Posted by jzawodn at February 28, 2008 08:40 AM

Reader Comments
# Sam said:

On the Mac I use 'ditto' for this. I think that it's available on other Unix systems as well.

on February 28, 2008 09:15 AM
# Rob Steele said:

I'm playing with BackupPC just now (http://backuppc.sourceforge.net/) and its docs say use dd if you can:


If the pool disk requirements grow you might need to copy the entire data directory to a new (bigger) file system. Hopefully you are lucky enough to avoid this by having the data directory on a RAID file system or LVM that allows the capacity to be grown in place by adding disks.

The backup data directories contain large numbers of hardlinks. If you try to copy the pool the target directory will occupy a lot more space if the hardlinks aren't re-established.

The best way to copy a pool file system, if possible, is by copying the raw device at the block level (eg: using dd). Application level programs that understand hardlinks include the GNU cp program with the -a option and rsync -H. However, the large number of hardlinks in the pool will make the memory usage large and the copy very slow. Don't forget to stop BackupPC while the copy runs.

Starting in 3.0.0 a new script bin/BackupPC_tarPCCopy can be used to assist the copy process. Given one or more pc paths (eg: TOPDIR/pc/HOST or TOPDIR/pc/HOST/nnn), BackupPC_tarPCCopy creates a tar archive with all the hardlinks pointing to ../cpool/.... Any files not hardlinked (eg: backups, LOG etc) are included verbatim.

You will need to specify the -P option to tar when you extract the archive generated by BackupPC_tarPCCopy since the hardlink targets are outside of the directory being extracted.

To copy a complete store (ie: /mnt/data/BackupPC) using BackupPC_tarPCCopy you should:

* stop BackupPC so that the store is static.

* copy the cpool, conf and log directory trees using any technique (like cp, rsync or tar) without the need to preserve hardlinks.

* copy the pc directory using BackupPC_tarPCCopy:

su backuppc
cd NEW_TOPDIR
mkdir pc
cd pc
/usr/local/BackupPC/bin/BackupPC_tarPCCopy /mnt/data/BackupPC/pc | tar xvPf -

on February 28, 2008 09:23 AM
# Joe Beda said:

rsync has a flag (-H) to copy hard links. This is what I used when I had to do something similar. I think it ends up keeping a big map of inodes in memory, so you can't stop/restart it, but that doesn't work with the other methods you mentioned either. Rsync does go to great lengths to make sure the target is an exact copy of the original.

on February 28, 2008 09:56 AM
# Stuart Langridge said:

What's wrong with rsync --hard-links, which preserves hardlinks?

on February 28, 2008 10:44 AM
# Chris Adams said:

I generally use dd:

mount -o remount,ro /
dd if=/dev/hda of=/dev/newdrive bs=1024k

or even over the network
dd if=/dev/hda bs=1024k | ssh remote_host "dd of=/dev/newdrive bs=1024k"

With a large block size it'll be *massively* faster than filesystem-level copies unless the volume is almost empty, and tools like parted have made it fairly easy to expand most common filesystems once you have it copied.

Of course, the mere fact that this is so much work is insane - I'm really looking forward to the day when the process is something like "add new to pool, remove old from pool, wait for the filesystem to copy before removing the old device". Sadly even ZFS doesn't do that yet.

on February 28, 2008 10:48 AM
# Jeremy Zawodny said:

I swear that I read the rsync man page three times and never saw the hard links option.

DOH!

on February 28, 2008 11:10 AM
# Cliff Stanford said:

What's wrong with cp -rpv ?

Am I missing something?

Cliff.

on February 28, 2008 12:22 PM
# Jeremy Zawodny said:

Cliff:

The cp man page didn't say anything about preserving hard links when I read it. Symlinks, yes. But not hard links.

on February 28, 2008 12:25 PM
# Roger Binns said:

You need to be very careful when using rsync with the hard links option. I make daily backups using hard links for unchanged files.

When I needed to copy the disk contents to a new one, I found that rsync uses a humongous amount of memory. In fact I had to drop 32-bit Ubuntu for the 64-bit version because it ran out of address space! Then, after 3 days with the 64-bit version, I had to upgrade my machine from 1GB of physical RAM to 3GB because rsync was totally thrashing memory and would have taken forever to complete. The final working set size was around 2.5GB.

on February 28, 2008 12:39 PM
# Wayne Scott said:

Using LVM2 is very nice. You add the 750G drive to your pool and then remove your smaller drive.
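A sketch of what that might look like, assuming the backup filesystem already lives in an LVM volume group (hypothetically named backupvg here, with a logical volume named backup, the old disk as /dev/sdb1 and the new one as /dev/sdc1):

pvcreate /dev/sdc1              # prepare the new 750G disk as a physical volume
vgextend backupvg /dev/sdc1     # add it to the existing volume group
pvmove /dev/sdb1                # migrate all extents off the old disk (resumable if interrupted)
vgreduce backupvg /dev/sdb1     # remove the old disk from the volume group
lvextend -l +100%FREE /dev/backupvg/backup   # then grow the logical volume...
resize2fs /dev/backupvg/backup               # ...and the filesystem into the new space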

on February 28, 2008 02:38 PM
# Martin Levy said:

You should check the man pages for find and cpio. Using a simple "find . -depth -print | cpio -pdm $destination" command will both preserve links AND preserve dates on directories. If you use "cp -rp" then you DON'T preserve the directory times because it's not a depth first traversal.

By removing the -v flag from cpio, you will only be presented with the errors. No need for excess crud on the screen. The -B flag is not needed with -p option (it's only used with -i or -o). BTW: Use -C if you are using -i or -o on a modern system.

The -depth option to find and the cpio command showed up externally in AT&T System V Unix (and internally to AT&T much earlier than that!).

The find/cpio combination is still the cleanest way of copying files from one directory to another. The modern approach of using ssh to run "cpio -i" remotely, enabling clean machine-to-machine copying of a hierarchy, works like a charm! Before ssh, we used rsh (I'm glad we aren't doing that anymore).
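For reference, the remote variant Martin describes might look something like this (remotehost and /dest/dir are placeholders):

cd /source/dir
find . -depth -print | cpio -o -C 65536 | ssh remotehost 'cd /dest/dir && cpio -idum -C 65536'

The local cpio -o writes an archive stream to stdout, ssh carries it across, and the remote cpio -i unpacks it in place, with -d creating directories, -u overwriting unconditionally, and -m preserving modification times.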

Enjoy,

Martin

on February 28, 2008 09:20 PM
# Ask Bjørn Hansen said:

As someone else pointed out rsync can do it, but it uses a good deal of memory with a lot of files. (Not impossibly much for a one-off job, but a lot ... I estimated about 1GB memory for ~10M files/links when I was moving my 1TB rsnapshot "disk" from one volume group to another a few days ago).

Anyway - "cp" can do it too, look for the "preserve" option; something like --preserve=all or --preserve=link.

cd /mnt/old
cp -av --preserve=all . /mnt/new

(I sometimes use -v for big jobs like this, sure it might be slower but at least I can easily see what's going on ...)


- ask

on February 28, 2008 10:14 PM
# Asgeir S. Nilsen said:

My laptop Linux installation is based on LVM, and is currently at its third hard drive.

Procedure is fairly simple:

1. Pop new hard drive in drive bay frame and insert where DVD player normally is.

2. If swap is on the LVM, remove it, as the virtual memory subsystem seems to get confused if you migrate underneath it.

3. Add new disk to volume group.

4. Remove old disk from volume group.

5. Carry on working as normal. In case of kernel panics or crashes, LVM will resume the migration where it left off on next reboot.

6. rsync the boot partition to the new disk and do the regular GRUB magic to make the new disk bootable.

7. Insert new drive in hard drive bay and reboot.

As I said, I've done this twice and not lost any data. The same procedure can also be applied when migrating to a new computer, as long as the new computer can accept the hard drive from the old one and manage to boot from it in some way.

Asgeir

on February 29, 2008 12:42 AM
# Jeremy Johnstone said:

I've found the following options to rsync to work well when doing backups of this nature:

-a = handles almost everything
-H = hard links, the option you missed
-S = handle sparse files better (not always needed, but doesn't hurt)
-v = because I like to see what's going on

I really don't know why -a doesn't include -H. It includes virtually everything else even remotely relevant to an "archive" session, so leaving that one out almost seems like a mistake to me (a mistake in judgment).
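Putting those flags together with the paths from the original post, the command might look something like this (the trailing slashes make rsync copy the contents of snaps rather than the directory itself):

rsync -aHSv /mnt/backup/snaps/ /mnt/backup2/snaps/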

on February 29, 2008 09:37 AM
# Totologie said:

Hi (I'm French, so my English is not very good ;-p)
I have been using BackupPC for a long time, but I don't know the Perl language.
I'm using version 2.12.
I have a space problem... I have to copy my BackupPC data directory to another HD (a bigger one).
With this version the BackupPC_tarPCCopy script doesn't work :(
What is the easiest way to do this?

Thx for your help

on April 16, 2008 01:14 AM
# Zhenlei Cai said:

Stuart and Jeremy Johnstone:

rsync's -H (--hard-links) option uses a lot of memory because a hard link is basically a link to the i-node number of the original file. I-node numbers are not portable across different disks, so rsync must note the i-node of every file on the source disk and keep them all in memory. Say A1 is a hard link to file A: rsync must tell the remote machine this and recreate A1 there as a link to the remote machine's copy of A, which has its own i-node number. Worse, if A1 is copied before A, that link must be back-filled after A is copied.
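A quick way to see what he's describing: hard links share an i-node, and that number only has meaning on the filesystem it lives on.

touch A && ln A A1        # A1 is a hard link to A
ls -li A A1               # both lines show the same i-node number and a link count of 2
stat -c '%i %h %n' A A1   # same information: i-node, link count, name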

on July 6, 2008 04:36 PM
# Steve Mor said:

re: bad timestamps on directories using cpio; I think you can fix that by adding -depth as a find option. Without -depth, the directory gets restored first, then its timestamp is updated as you restore files into it. With -depth, the files are presented by find before the directory itself, so cpio creates the directory with default settings in order to restore the files. Once all restores into that directory are done, find presents the directory itself, and the original directory owner, group, perms, and timestamps are restored.

on September 19, 2008 06:02 PM
# Lorens Kockum said:

A simple `cp -a` using cp (GNU coreutils) 5.97 on my Debian box does the job quite nicely; I just checked. No need for the --preserve=all option: -a implies --preserve=link.

It didn't seem to take too long either, but I would be surprised if it was very much better than rsync. Much easier to remember though :-)

on November 12, 2008 09:44 AM
# piyo said:

I would add to Jeremy Johnstone's rsync command the --numeric-ids option, since I mostly boot up a Linux Live OS like System Rescue CD when consolidating HDDs and VM disk images. So in summary:

rsync -avP -H --numeric-ids /mnt/src /mnt/dst

on May 15, 2009 06:31 PM
# Jonathan Matthews said:

Jeremy Johnstone said:
>> [snip]
>> -a = handles almost everything
>> -H = hard links, the option you missed
>>
>> I really don't know why -a doesn't include -H. It
>> includes virtually everything else even remotely
>> relevant to an "archive" session, so why it doesn't
>> include that almost seems like a mistake to me
>> (mistake in judgment).

IMHO the reasoning is this: if the aim of -a is to create an "archival" (i.e. point-in-time) copy of the data, then maintaining any hardlinks would mean that changes to the /live/ data would be propagated to the archival copy - thus rendering the backup-ish nature of the archive broken. Hence -H is not a good candidate for automatic inclusion via -a.

Jonathan

on July 13, 2009 09:53 AM
# Jonathan Matthews said:

/me checks the -H behaviour and notices that my last comment's totally wrong: it's only maintaining hardlinks between files *inside* the destination - not between source and destination. Coffee. Need coffee.

on July 13, 2009 09:57 AM
# ben said:

Hello
I do not have much time to test, but I found this description on the rsnapshot site:
>
Perhaps jzawodn could test this solution for preserving hard links?
cheers
Ben

on July 18, 2009 02:48 AM
# Who Cares said:

Here is a warning to other users who believed the comment above saying cpio was still the best. I just wasted several hours using cpio in single-user mode, as it seemed plausible that it was the best and the find command gave me an easy way to control which files not to back up. Well, it doesn't even copy files over ~4GB properly. cpio belongs to history (and the boot process); leave it to those places. I wish I'd used rsync.

on September 13, 2009 04:09 AM
# Imran Chaudhry said:

Zhenlei Cai and others are correct about the cause of rsync eating memory with a large dataset. This is what happened to me when rsyncing two datasets with hardlinks, one of ~30G, the other 130G. I left it running overnight, rsyncing from an HDD to a USB2 external HDD, and the machine gradually ate into swap until it became unresponsive. It looks like a good chunk of the data got transferred though (I deduced this by running du -chs on the original data vs. that on the USB HDD). The rsync settings I use are: time rsync -avh --stats --hard-links

Looks like I have to gradually do a bunch of interrupted rsyncs until the data is transferred. If that fails, I think cp -av --preserve=all would be the next best thing.

One other thing is that in my case the data is going onto an encrypted partition by using cryptmount.

on November 20, 2009 09:55 AM
# Gary said:

Regarding your question about cpio not preserving file times... Did you run the command as root? I believe that is required to preserve file access times even if you have full permissions to the directory as a non-root user.

on August 5, 2010 10:34 AM