gzip and hard links. I don't get it. (by Jeremy Zawodny)

I recently was looking to make compressed backups of some files that exist in a tree that's actually a set of hard links (rsnapshot or rsnap style) to a canonical set of files.

In other words, I have a data directory and a data.previous directory. I would like to make a backup of the stuff in data.previous, most of the files being unchanged from data. And I'd like to do this without using lots of disk space.

The funny thing is that gzip is weird about hard links. If you try to gzip a file whose link count is greater than one, it complains.

I was puzzled by this and started to wonder if it actually over-writes the original input file instead of simply unlinking it when it is done reading it and generating the compressed version.

So I did a little experiment.

First I create a file with two links to it.

/tmp/gz$ touch a
/tmp/gz$ ln a b

Then I check to ensure they have the same inode.

/tmp/gz$ ls -li a b
5152839 -rw-r--r-- 2 jzawodn jzawodn 0 2008-12-03 15:38 a
5152839 -rw-r--r-- 2 jzawodn jzawodn 0 2008-12-03 15:38 b

They do. So I compress one of them.

/tmp/gz$ gzip a
gzip: a has 1 other link  -- unchanged

And witness the complaint. The gzip man page says I can force it with the "-f" argument, so I do.

/tmp/gz$ gzip -f a

And, as I'd expect, the new file doesn't replace the old file. It gets a new inode instead.

/tmp/gz$ ls -li a.gz b
5152840 -rw-r--r-- 1 jzawodn jzawodn 22 2008-12-03 15:38 a.gz
5152839 -rw-r--r-- 1 jzawodn jzawodn  0 2008-12-03 15:38 b

This leads me to believe that the gzip error/warning message is really trying to say something like:

gzip: a has 1 other link and compressing it will save no space

But I still don't see the danger. Why can't that simply be an informational message? After all, you still need enough space to store the original and compressed versions since the original (in the normal case) exists until it is done writing the compressed version anyway. (I checked the source code later.)

So what's the rationale here? I don't get it.

Posted by jzawodn at December 03, 2008 03:51 PM

Reader Comments

# dean said:

Forcing the gzip has another side effect besides compressing the file. It also breaks the link. The rationale seems perfectly reasonable - those files were linked for a reason and you should be warned by gzip because unlinking the files is a side effect, possibly an unwanted one. Granted, the warning message could be a little clearer on this point.

on December 3, 2008 08:01 PM

# Mike Moody said:

The message is telling you that not only are you not saving space but you are using more space. If the size of the file is X bytes then a file and a hard link use X bytes. If the file can be compressed by 30% you are now using X + .7X bytes. The message is mainly a warning, not an error.
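The arithmetic here can be checked with a small experiment. This is only a minimal sketch, assuming a POSIX shell with gzip, dd and mktemp available; the file names and the 100000-byte payload are made up for illustration, and this payload compresses far better than the 30% in the example above.

```shell
#!/bin/sh
# Sketch: bytes in use before and after forcing gzip on a hard-linked file.
work=$(mktemp -d)
cd "$work" || exit 1

# Hypothetical test data: 100000 bytes of the letter x (very compressible).
dd if=/dev/zero bs=1000 count=100 2>/dev/null | tr '\0' 'x' > a
ln a b                       # second name for the same inode

before=$(wc -c < a)          # one copy of the data, reachable by two names

gzip -f a                    # a.gz is a new file; b still holds the data

kept=$(wc -c < b)            # the uncompressed data is still on disk
added=$(wc -c < a.gz)        # the compressed copy is extra space
after=$((kept + added))

echo "before: $before bytes, after: $after bytes"
cd / && rm -rf "$work"
```

Whatever the compression ratio, the second number is larger than the first, because the data behind b cannot be freed.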

on December 3, 2008 08:08 PM

# Simon said:

Interesting. My understanding is - as you say - that gzipping a file doesn't modify that file: it creates a new file (which happens to be gzipped) and deletes the original.

So if you have /foo/bar.txt hardlinked to /home/fred/notes.txt, and you gzip notes.txt...apps and users relying on /foo/bar.txt (and perhaps unaware of notes.txt's very existence) would be pretty irked that it had been deleted. Fortunately gzip doesn't delete it, and in fact leaves it untouched.

I don't think disk space comes into it. There are many reasons why you'd compress a bunch of files beyond saving disk space.

on December 3, 2008 08:23 PM

# Jeffrey Friedl said:

It makes sense to me... you have to explicitly tell it that you want to disconnect one reference to a group of files that had been previously connected. And it's not that it won't save space... it actually *increases* disk use (now two versions of the file, instead of one), but it's the worry about inadvertent disconnects that drives this behavior, I'd think.

If it could warn about symbolic links that would suddenly become broken by compressing a file, I'm sure it would...
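The symbolic link case is easy to reproduce. A minimal sketch, assuming a POSIX shell and gzip; notes.txt and shortcut are hypothetical names:

```shell
#!/bin/sh
# Sketch: a symlink is left dangling once gzip replaces its target.
work=$(mktemp -d)
cd "$work" || exit 1

echo "some notes" > notes.txt
ln -s notes.txt shortcut     # symbolic link, not a hard link

gzip notes.txt               # no warning: the link count of notes.txt is 1

# -e follows the symlink, so it fails when the target is gone.
if [ -e shortcut ]; then
    symlink_state="ok"
else
    symlink_state="dangling"
fi
echo "shortcut is $symlink_state"

cd / && rm -rf "$work"
```

gzip has nothing to check here: a symlink does not raise the link count of its target, so the file looks like any other singly linked file.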

on December 3, 2008 08:48 PM

# said:

Just a guess:

An older use of hardlinks was for applications which could perform numerous related functions, and the behavior would be based on the command used to invoke the program by scanning argv[0]. This would save the space of having multiple copies of the same application along with preserving inodes, as once upon a time both were considered expensive. You could also get a performance gain if the application was used often and could be kept cached in memory, as you wouldn't have to keep loading and unloading what was effectively the same program.
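To illustrate the argv[0] trick, here is a minimal sketch in shell, where $0 plays the role of argv[0]. The names pack and unpack are hypothetical, and the script assumes the temporary directory allows executing files.

```shell
#!/bin/sh
# Sketch: one program, two hard-linked names, behavior chosen from $0.
work=$(mktemp -d)
cd "$work" || exit 1

# Hypothetical multi-call script that looks at the name it was run as.
cat > pack <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
    pack)   echo "compressing" ;;
    unpack) echo "decompressing" ;;
    *)      echo "unknown name" ;;
esac
EOF
chmod +x pack
ln pack unpack               # same inode, second name

out_pack=$(./pack)
out_unpack=$(./unpack)
echo "$out_pack / $out_unpack"   # prints "compressing / decompressing"

cd / && rm -rf "$work"
```

Replacing either name with a compressed file would silently split the pair, which is the kind of surprise the warning guards against.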

on December 3, 2008 08:49 PM

# Eric said:

I think we're spoiled by our abundance here. Wikipedia says gzip was written in 1992, when I thought that 500 MB drives were plenty big. And it seems that gzip would probably ape the CLI syntax of compress, written in 1984, whose behavior is to silently ignore files that it can't compress any further. In its time, this error message was a usability improvement!

on December 3, 2008 10:35 PM

# Nathan Neulinger said:

The issue is that gzip tries to NOT change the inode number of the file (good behavior for archiving/backup/etc. systems).

In order to do this with a hardlinked file, you'd wind up changing the content of the file - resulting in one reference (notes.txt) being named .txt but gzipped content, and the other side (bar.txt.gz) renamed. That would be very confusing.

There is no way for gzip (or any other tool for that matter) to even know where that other reference is located without an exhaustive search of the filesystem.
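That exhaustive search looks roughly like the sketch below. It assumes a find that supports -inum (GNU and BSD find do; it is not in POSIX), and it reuses the hypothetical foo/bar.txt and home/fred/notes.txt paths from the earlier comment inside a scratch directory.

```shell
#!/bin/sh
# Sketch: find every name for one inode by walking a directory tree.
work=$(mktemp -d)
cd "$work" || exit 1

mkdir -p foo home/fred
echo "shared" > foo/bar.txt
ln foo/bar.txt home/fred/notes.txt   # second name in another directory

# The inode number is the first field of "ls -i".
inode=$(ls -i foo/bar.txt | awk '{print $1}')

# -xdev keeps the walk on one filesystem, since inode numbers are
# only unique within a filesystem.
names=$(find . -xdev -inum "$inode" | sort)
echo "$names"

count=$(printf '%s\n' "$names" | wc -l)
cd / && rm -rf "$work"
```

On a real system the starting point would have to be the root of the filesystem, which is why this is not something gzip could reasonably do for every file.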

Take it from someone who revoked all permissions on one buried chroot reference to libc.so in the days of hardlinked shared libraries (without realizing it was hardlinked). It makes for a nice time recovering from that mess.

on December 4, 2008 05:45 AM

# Oskar Pearson said:

Nathan - when you gzip a file, the inode number does change, so the rationale doesn't make sense, unfortunately.

(I hope the example below displays correctly - apparently HTML tags aren't allowed)

oskar@zen:~$ touch x
oskar@zen:~$ ls -li x
836734 -rw-r--r-- 1 oskar users 0 Dec 4 21:32 x
oskar@zen:~$ gzip x
oskar@zen:~$ ls -li x.gz
838121 -rw-r--r-- 1 oskar users 22 Dec 4 21:32 x.gz
oskar@zen:~$

And strace proves that it's not re-writing the file as it goes - the source is opened read-only:
oskar@zen:~$ strace -o /tmp/me gzip x
...
open("x", O_RDONLY|O_LARGEFILE) = 3
open("x.gz", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 4
close(3) = 0
close(4) = 0
unlink("x") = 0
oskar@zen:~$

As a side effect, compressing a file will always use more space than the original file took, until the source file can be removed.
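The three steps in the trace (read the source, create a new output file, unlink the source name) can be done by hand. This is a minimal sketch, not what gzip does internally; set -C (noclobber) is used only as a rough stand-in for the O_EXCL flag, and the extra link y is added to show what happens to a second name.

```shell
#!/bin/sh
# Sketch: the read / create / unlink sequence, performed manually.
work=$(mktemp -d)
cd "$work" || exit 1

echo "hello" > x
ln x y                       # extra hard link, as in the original post

set -C                       # noclobber: refuse to overwrite an existing x.gz
gzip -c x > x.gz             # read x, write a brand-new file
set +C
rm x                         # unlink only the name x

y_content=$(cat y)           # y still holds the uncompressed data
restored=$(gzip -dc < x.gz)  # and x.gz decompresses to the same content
if [ -e x ]; then x_gone=no; else x_gone=yes; fi

echo "y: $y_content, x.gz: $restored, x removed: $x_gone"
cd / && rm -rf "$work"
```

Nothing ever writes to the original inode, so any other name for it keeps working.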

I'm also interested in this behaviour. If I know one thing, though, it'll be some crazy history case, defined in posix or convention. Eric is on the right track.

Oskar

on December 4, 2008 01:42 PM

# Timo said:

The space issues are a possibility, but I'm guessing it's because unzipping it will not reverse it properly. Usually zipping and unzipping a file puts you back to the same situation you were in before. With hard links you start with one file with two names. You then zip it and get two files, one zipped and one normal. When you unzip it in this case, you end up with two files, not back to one file with two names.
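The round trip can be sketched as follows, assuming a POSIX shell with gzip and gunzip. The helper functions inode_of and links_of are made up for this example and simply parse ls output.

```shell
#!/bin/sh
# Sketch: gzip followed by gunzip does not restore a broken hard link.
work=$(mktemp -d)
cd "$work" || exit 1

inode_of() { ls -i "$1" | awk '{print $1}'; }   # hypothetical helper
links_of() { ls -l "$1" | awk '{print $2}'; }   # hypothetical helper

echo "data" > a
ln a b
links_before=$(links_of a)   # 2: a and b are one file with two names

gzip -f a                    # a.gz is a new file; b is left alone
gunzip a.gz                  # recreates a as yet another new file

inode_a=$(inode_of a)
inode_b=$(inode_of b)
links_after=$(links_of a)    # 1: a and b are now separate files

echo "links before: $links_before, after: $links_after"
cd / && rm -rf "$work"
```

The contents of a and b still match after the round trip, but a later change to one of them no longer shows up in the other.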

on December 4, 2008 04:35 PM

# Nathan Neulinger said:

Interesting, you're right, it does change the inode number... I must have been confusing it with something else. Well, beats me then... that's the only rationale I can think of that would make any sense for why it would treat them specially.

on December 4, 2008 06:47 PM

# Sachin said:

thanks learned a lot "extra" about gzip

on December 5, 2008 02:30 AM

# Dominik said:

To me, it seems to be clear that the compressed file must have a new inode number and that during the compression process, more space is needed.
Reusing the old inode number and avoiding the extra space would require compressing/rewriting the source file in-place. This, however, is a very bad idea, because if something goes wrong in between, the result is a complete mess.

(An inode-preserving alternative to in-place operation would be to create a temporary copy of the source and then writing the compressed result to the old file, but to me this seems not to be worth the extra effort at all.)

Consequently, the warning message seems perfectly reasonable to me, because the effect of compression (less disk space needed *after* compression) will turn into the opposite (more space needed) with hard-linked files.
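For illustration, the inode-preserving variant with a temporary copy might look like the sketch below. It is not something gzip does; a.tmp is a hypothetical name, and an interruption between the truncation and the end of the write would leave the file damaged, which is the risk described above.

```shell
#!/bin/sh
# Sketch: keep the inode by overwriting the original file's contents.
work=$(mktemp -d)
cd "$work" || exit 1

echo "plain text" > a
ln a b
inode_before=$(ls -i a | awk '{print $1}')

cp a a.tmp                   # temporary copy of the source
gzip -c a.tmp > a            # the shell truncates a and rewrites it in place
rm a.tmp

inode_after=$(ls -i a | awk '{print $1}')
b_decoded=$(gzip -dc < b)    # b now holds gzip data despite its name

echo "inode $inode_before -> $inode_after, b decodes to: $b_decoded"
cd / && rm -rf "$work"
```

The inode survives, but every other name for it now points at compressed bytes without a .gz suffix, which is the confusing outcome mentioned earlier in the thread.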

on December 5, 2008 03:29 AM

# Martin Tsachev said:

If you use gzip -c a > a.gz it won't complain and you get to decide what to do with the unarchived copy afterwards.

Not even sure why you would use gzip on a directory directly and not tar it first.
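A minimal sketch of the tar-first approach, assuming tar and gzip are installed. The directory name data.previous comes from the post; the files a and b inside it are made up.

```shell
#!/bin/sh
# Sketch: archive a hard-linked tree without touching the originals.
work=$(mktemp -d)
cd "$work" || exit 1

mkdir data.previous
echo "one" > data.previous/a
ln data.previous/a data.previous/b   # two names, one inode

# tar reads the files and gzip compresses the stream; nothing is unlinked.
tar -cf - data.previous | gzip > data.previous.tar.gz

links_after=$(ls -l data.previous/a | awk '{print $2}')
file_entries=$(gzip -dc < data.previous.tar.gz | tar -tf - | grep -c '/[ab]$')

echo "links still on a: $links_after, file entries in archive: $file_entries"
cd / && rm -rf "$work"
```

Because gzip only ever sees a pipe here, the link-count check never comes into play and the snapshot tree stays exactly as it was.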

on December 5, 2008 08:42 AM

# chad said:

I guess my question in all of this, is why are you gzip'ing your archive? Doesn't that defeat the purpose of the rsync --link-dest=data.previous --compare-dest=data.previous ?

on December 5, 2008 11:02 AM

# A.T. said:

not that I insist, but... reading this (presumably old) text might help you with peculiarities in gzip handling of hard links http://www.hoobie.net/security/exploits/hacking/gzip.txt

on January 8, 2009 04:45 AM

# John Swindells said:

Thanks for your post, and for the other comments. Having forced gzip to ignore the warning, it's a good idea to go and find the ex-hardlinks and deal with them!

on June 25, 2009 01:20 AM