Tue, 13 Oct 2009

Tarballs explained

This entry was originally posted in a slightly different form on Server Fault.

If you're coming from the Windows world, you're used to tools like zip or rar, which both archive and compress collections of files. In the typical Unix tradition of doing one thing and doing it well, you tend to have two different utilities: an archiving tool and a compression tool. People then use these two tools together to get the same functionality that zip or rar provide.

There are numerous compression formats; the common ones on Linux these days are gzip (which uses the same DEFLATE algorithm as the zlib library) and the newer bzip2, which usually compresses better. Unfortunately, bzip2 uses more CPU and memory to achieve its higher compression ratios. You can use these tools to compress any file, and by convention files compressed with them end in .gz and .bz2 respectively. Use gzip and bzip2 to compress, and gunzip and bunzip2 to decompress.
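
As a rough sketch of that round trip, the commands below compress a throwaway file with gzip and then restore it with gunzip. The scratch directory and the file name numbers.txt are made up for illustration; bzip2 and bunzip2 can be swapped in the same way.

```shell
set -eu
workdir=$(mktemp -d)              # scratch directory so nothing real is touched
cd "$workdir"

seq 1 10000 > numbers.txt         # hypothetical sample file with repetitive content
original_size=$(wc -c < numbers.txt)

gzip numbers.txt                  # replaces numbers.txt with numbers.txt.gz
compressed_size=$(wc -c < numbers.txt.gz)
echo "original: $original_size bytes, compressed: $compressed_size bytes"

gunzip numbers.txt.gz             # restores numbers.txt and removes the .gz file
```

Note that both tools replace the file they are given rather than leaving a copy beside it.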

There are also several archive formats available, including cpio, ar and tar, but people tend to use only tar. An archive format lets you take a number of files and pack them into a single file, and it can also record path and permission information. You can create and unpack a tar file using the tar command. You might hear these operations referred to as "tarring" and "untarring". (The name of the command comes from a shortening of Tape ARchive. Tar was an improvement on the ar format in that you could use it to span multiple physical tapes for backups.)

# tar -cf archive.tar list of files to include

This will create (-c) an archive in the file (-f) called archive.tar. (.tar is the conventional extension for tar archives.) You should now have a single file that contains five files ("list", "of", "files", "to" and "include"). If you give tar a directory, it will recurse into that directory and store everything inside it.
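
To see the recursion for yourself, here is a small sketch that builds the five example files plus a nested directory in a scratch location and archives the lot. The directory name project and the file readme.txt are hypothetical, chosen only for this illustration.

```shell
set -eu
workdir=$(mktemp -d)                      # scratch directory
cd "$workdir"

touch list of files to include            # the five example files, all empty
mkdir -p project/docs                     # hypothetical nested directory
echo "notes" > project/docs/readme.txt

# Archive the five files and the directory; tar descends into project/ itself.
tar -cf archive.tar list of files to include project

# Show what ended up inside the archive.
tar -tf archive.tar
```

The listing should include project/docs/readme.txt even though only "project" was named on the command line.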

# tar -xf archive.tar
# tar -xf archive.tar list of files

This will extract (-x) the previously created archive.tar. You can extract just the files you want from the archive by listing them at the end of the command line. In our example, the second line would extract "list", "of" and "files", but not "to" and "include". You can also use

# tar -tf archive.tar

to get a list of the contents before you extract them.
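
A minimal sketch of selective extraction, assuming a scratch directory and a hypothetical subdirectory called unpacked to extract into, so the result is easy to inspect:

```shell
set -eu
workdir=$(mktemp -d)                      # scratch directory
cd "$workdir"

touch list of files to include
tar -cf archive.tar list of files to include

tar -tf archive.tar                       # look before you extract

mkdir unpacked                            # hypothetical target directory
# Extract only three of the five members, from inside the target directory.
(cd unpacked && tar -xf ../archive.tar list of files)

ls unpacked
```

Only the three named members appear in unpacked; "to" and "include" stay in the archive.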

So now you can combine these two tools to replicate the functionality of zip:

# tar -cf archive.tar directory
# gzip archive.tar

You'll now have an archive.tar.gz file. You can extract it using:

# gunzip archive.tar.gz
# tar -xf archive.tar

We can use pipes to save us having an intermediate archive.tar:

# tar -cf - directory | gzip > archive.tar.gz
# gunzip < archive.tar.gz | tar -xf -

You can use - with the -f option to specify stdin or stdout (tar knows which one based on context).
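
The same trick works for listing: the sketch below writes a compressed archive through a pipe and then reads its table of contents back through another pipe, without an archive.tar ever touching the disk. The file names a.txt and b.txt are invented for the example.

```shell
set -eu
workdir=$(mktemp -d)                      # scratch directory
cd "$workdir"

mkdir directory
echo "alpha" > directory/a.txt            # hypothetical sample files
echo "beta"  > directory/b.txt

# tar writes the archive to stdout (-f -), gzip compresses the stream.
tar -cf - directory | gzip > archive.tar.gz

# gunzip decompresses to stdout, tar reads the archive from stdin (-f -).
gunzip < archive.tar.gz | tar -tf -
```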

We can do slightly better because, in an apparent breaking of the "one job well" idea, tar can compress its output and decompress its input when given the -z argument. (I say apparent because it still uses the gzip and gunzip command-line tools behind the scenes.)

# tar -czf archive.tar.gz directory
# tar -xzf archive.tar.gz

To use bzip2 instead of gzip, use bzip2, bunzip2 and -j instead of gzip, gunzip and -z respectively (tar -cjf archive.tar.bz2 directory). Some versions of tar can detect a bzip2 archive when you use -z and do the right thing, but it is probably worth getting into the habit of being explicit.
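
If you want to convince yourself that nothing is lost along the way, one possible check is to pack a directory with -z, unpack it somewhere else and compare the two trees with diff -r. The directory layout and the name restored are assumptions for this sketch, and the bzip2 step is skipped if bzip2 is not installed.

```shell
set -eu
workdir=$(mktemp -d)                      # scratch directory
cd "$workdir"

mkdir -p directory/sub
echo "alpha" > directory/a.txt            # hypothetical sample files
echo "beta"  > directory/sub/b.txt

tar -czf archive.tar.gz directory         # create and gzip in one step

mkdir restored                            # hypothetical place to unpack into
(cd restored && tar -xzf ../archive.tar.gz)

# diff -r exits non-zero if the trees differ.
diff -r directory restored/directory && echo "gzip round trip OK"

# The same thing with bzip2, if it is available on this machine.
if command -v bzip2 >/dev/null 2>&1; then
    tar -cjf archive.tar.bz2 directory
    tar -tjf archive.tar.bz2
fi
```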


Comments

Good primer :-)

Might be worth mentioning the -v switch as it adds a little visual interactivity. This (IMHO) is good for confidence building in people new to the CLI. It lets them see the process and files (and structure) they selected being tar'd up :-)
Posted by tek at Tue Oct 13 11:09:33 2009
Of course, the - is not necessary in all of these commands. tar xzf archive.tar.gz works just as well.
Posted by Jean-Christophe Dubacq at Tue Oct 13 11:21:48 2009
Perhaps you should mention the newer lzma/xz which is supported directly with newer tar versions.

I have read somewhere that lzma (with compression ratio/parameter "-2" ) is in most cases faster than bzip2 and compresses better. So bzip2 -9 can be replaced by lzma -2
Posted by anonymous at Tue Oct 13 12:35:25 2009
On Debian, there is a great tool to abstract away the differences between archive and compression tools: the "atool" package.

A simple tar command like this:

  tar jcf file.tar.bz2 directory/

becomes:

  apack file.tar.bz2 directory/

Similarly:

  tar jxf file.tar.bz2

becomes:

  aunpack file.tar.bz2

It transparently supports pbzip2 (the multi-core version of bzip2) when using bz2 archives.

But the killer feature in my opinion is that it never unpacks more than one file in the current directory. If a tarball is going to spit out a thousand files without using a directory, aunpack will create one first. Very handy to avoid having to manually clean up your home directory after an accidental "spill".
Posted by Francois Marier at Tue Oct 13 20:37:41 2009
You do not need the "j" option to extract

tar xf file.tar.bz2

will work just fine
Posted by anon at Tue Oct 13 22:00:45 2009
all this could be explained in 4 commands:

man bash

man tar

man gzip

man bzip2

BAM!!
Posted by anon at Wed Oct 14 03:24:14 2009
The stupid part about tar xf is that it infers the compression format from the file name. But that should be more difficult to understand for the Linux crowd than for people coming from Windows.
Posted by . at Wed Oct 14 08:42:14 2009
