This entry was originally posted in slightly different form to Server Fault

If you’re coming from a Windows world, you’re used to using tools like zip or
rar, which compress collections of files. In the typical Unix tradition of
doing one thing and doing one thing well, you tend to have two different
utilities; a compression tool and a archive format. People then use these two
tools together to give the same functionality that zip or rar provide.

There are numerous different compression formats; the common ones used on Linux
these days are gzip (sometimes known as zlib) and the newer, higher performing
bzip2. Unfortunately bzip2 uses more CPU and memory to provide the higher rates
of compression. You can use these tools to compress any file and by convention
files compressed by either of these formats is .gz and .bz2. You can use gzip
and bzip2 to compress and gunzip and bunzip2 to decompress these formats.

There are also several different types of archive formats available, including
cpio, ar and tar, but people tend to only use tar. These allow you to take a
number of files and pack them into a single file. They can also include path
and permission information. You can create and unpack a tar file using the tar
command. You might hear these operations referred to as “tarring” and
“untarring”. (The name of the command comes from a shortening of Tape ARchive.
Tar was an improvement on the ar format in that you could use it to span
multiple physical tapes for backups).

# tar -cf archive.tar list of files to include

This will create (-c) and archive into a file -f called archive.tar. (.tar
is the convention extention for tar archives). You should now have a single
file that contains five files (“list”, “of”, “files”, “to” and “include”). If
you give tar a directory, it will recurse into that directory and store
everything inside it.

# tar -xf archive.tar
# tar -xf archive.tar list of files

This will extract (-x) the previously created archive.tar. You can extract just
the files you want from the archive by listing them on the end of the command
line. In our example, the second line would extract “list”, “of”, “file”, but
not “to” and “include”. You can also use

# tar -tf archive.tar

to get a list of the contents before you extract them.

So now you can combine these two tools to replication the functionality of zip:

# tar -cf archive.tar directory
# gzip archive.tar

You’ll now have an archive.tar.gz file. You can extract it using:

# gunzip archive.tar.gz
# tar -xf archive.tar

We can use pipes to save us having an intermediate archive.tar:

# tar -cf - directory | gzip > archive.tar.gz
# gunzip < archive.tar.gz | tar -xf -

You can use – with the -f option to specify stdin or stdout (tar knows which one based on context).

We can do slightly better, because, in a slight apparent breaking of the
“one job well” idea, tar has the ability to compress its output and decompress
its input by using the -z argument (I say apparent, because it still uses the
gzip and gunzip commandline behind the scenes)

# tar -czf archive.tar.gz directory
# tar -xzf archive.tar.gz

To use bzip2 instead of gzip, use bzip2, bunzip2 and -j instead of gzip, gunzip
and -z respectively (tar -cjf archive.tar.bz2). Some versions of tar can detect
a bzip2 file archive with you use -z and do the right thing, but it is probably
worth getting in the habit of being explicit.

More info:

This entry was originally posted in slightly different form to Server Fault

There are several ways to run Tomcat applications. You can either run
tomcat direcly on port 80, or you can put a webserver in front of tomcat and
proxy connections to it. I would highly recommend using Apache as a
front end. The main reason for this suggestion is that Apache is more
flexible than tomcat. Apache has many modules that would require you to
code support yourself in Tomcat. For example, while Tomcat can do gzip
compression, it’s a single switch; enabled or disabled. Sadly you can
not compress CSS or javascript for Internet Explorer 6. This is easy to
support in Apache, but impossible to do in Tomcat. Things like caching
are also easier to do in Apache.

Having decided to use Apache to front Tomcat, you need to decide how
to connect them. There are several choices: mod_proxy ( more accurately, mod_proxy_http in
Apache 2.2, but I’ll refer to this as mod_proxy), mod_jk and mod_jk2.
Mod_jk2 is not under active development and should not be used. This
leaves us with mod_proxy or mod_jk.

Both methods forward requests from apache to tomcat. mod_proxy uses the HTTP
that we all know an love. mod_jk uses a binary protocol AJP. The main
advantages of mod_jk are:

  • AJP is a binary protocol, so is slightly quicker for both ends to deal with and
    uses slightly less overhead compared to HTTP, but this is minimal.
  • AJP
    includes information like original host name, the remote host and the SSL
    connection. This means that ServletRequest.isSecure() works as expected, and
    that you know who is connecting to you and allows you to do some sort of
    virtualhosting in your code.

A slight disadvantage is that AJP is based on
fixed sized chunks, and can break with long headers, particularly request URLs
with long list of parameters, but you should rarely be in a position of having
8K of URL parameters. (It would suggest you were doing it wrong. 🙂 )

It used to be the case that mod_jk provided basic load balancing
between two tomcats, which mod_proxy couldn’t do, but with the new
mod_proxy_balancer in Apache 2.2, this is no longer a reason to choose between them.

The position is slightly complicated by the existence of mod_proxy_ajp. Between
them, mod_jk is the more mature of the two, but mod_proxy_ajp works in the same
framework as the other mod_proxy modules. I have not yet used mod_proxy_ajp,
but would consider doing so in the future, as mod_proxy_ajp is part of
Apche and mod_jk involves additional configuration outside of Apache.

Given a choice, I would prefer a AJP based connector, mostly due to my second
stated advantage, more than the performance aspect. Of course, if your
application vendor doesn’t support anything other than mod_proxy_http, that
does tie your hands somewhat.

You could use an alternative webserver like lighttpd, which does have
an AJP module. Sadly, my prefered lightweight HTTP server, nginx, does
not support AJP and is unlike ever to do so, due to the design of its
proxying system.

To make things explicitly clear, my blog is copyrighted and licensed
as “All rights reserved”. It even says that at the footer of every page.
That means you may not redistribute any content without my permission.
Yes, this means you, Ross Beazley. I may allow aggregation sites to
redistribute my content, but the only sites where I have given explicit
permission are Planet Debian and Planet BNM. I am unlikely to be upset
if your aggregation site links back to the original entry and does not
carry advertising, and will probably give you permission. If both these
conditions are not met, you do not have permission and will not be
granted permission.

This entry was originally posted in slightly different form to Server Fault

There are two methods of using virtual hosting with HTTP: Name based and IP based.
IP based is the simplest as each virtual host is served from a different
IP address configured on the server, but this requires an IP address for
every host, and we’re meant to be running out. The better solution is to
use the Host: header introduced in HTTP 1.1, which allows the server to
serve the right host to the client from a single IP address.

HTTPS throws a spanner in the works, as the server does not know
which certificate to present to the client during the SSL connection set
up, because the client can’t send the Host: header until the connection
is set up. As a result, if you want to host more than one HTTPS site,
you need to use IP-based virtual hosting.

However, you can run multiple SSL sites from a single IP address using a couple of
methods, each with their own drawbacks.

The first method is to have a SSL certificate that covers both sites. The idea
here is to have a single SSL certificate that covers all the domains you want
to host from a single IP address. You can either do this using a wildcard
certificate that covers both domains or use Subject Alternative Name.

Wildcard certificates would be something *, which would cover, and There are a number
of problems with wildcard certificates. Firstly, every hostname needs to have a
common domain, e.g. with * you can have, but not Secondly, you can’t reliably have more than one subdomain,
i.e. you can have, but not This might work
in earlier versions of Firefox (<= 3.0), but it doesn’t work in 3.5 or any
version of Internet Explorer. Thirdly, wildcard certificates are significantly
more expensive than normal certificates if you want it signed by a root CA.

Subject Alternative Name is a method of using an extension to X509 certificates
that lists alternative hostnames that are valid for that certificate. It
involves adding a “subjectAltName” field to the certificate that lists each
additional host you want covered by the certificate. This should work in most
browsers; certainly every modern mainstream browser. The downside of this
method is that you have to list every domain on the server that will use SSL.
You may not want this information publicly available. You probably don’t want
unrelated domains to be listed on the same certificate. It may also be
difficult to add additional domains at a later date to your certificate.

The second approach is to use something called SNI (Server Name Indication)
which is an extension in TLS that solves the chicken and egg problem of not
knowing which certificate to send to the client because the client hasn’t sent
the Host: header yet. As part of the TLS negotiation, the client sends the
required hostname as one of the options. The only downside to this is client
and server support. The support in browsers tends to be better than in servers.
Firefox has supported it since 2.0. Internet Explorer supports it from 7
onwards, but only on Vista or later. Chrome only supports it on Vista or later
too. Opera 8 and Safari 8.2.1 have support. Other browsers may not support it.

The biggest problem preventing adoption is the server support. Until very
recently neither of the two main webservers supported it. Apache gained SNI
support as of 2.2.12, which was released July 2009. As of writing, IIS does not
support SNI in any version. nginx, lighttpd and Cherokee all support SNI.

Going forward, SNI is the best method for solving the name-based virtual
hosting of HTTPS, but support might be patchy for a year or two yet. If you
must do HTTPS virtual hosting without problems in the near future, IP based
virtual hosting is the only option.

While git is a completely distributed revision control system, sometimes
the lack of a central canonical repository can be annoying. For example,
you might want to make your repository published publically, so other
people can fork your code, or you might want all your developers to push
into (or have code pulled into) a central “golden” tree, that you then
use for automated building and continuous integration. This entry should
explain how to get this all working on Ubuntu 9.04 (Jaunty).

Gitosis is a very useful git repository manager, which adds support like
ACLs in pre-commits and gitweb and git-daemon management. While it’s
possible to set all these things up by hand, gitosis does everything for
you. It is nicely configured via git; to make configuration changes,
you push the config file changes into gitosis repository on the server.

Gitosis is available in Jaunty, but unfortunately there is a bug
in the version in Jaunty, which means it doesn’t work out of the box.
Fortunately there is a fixed version in jaunty-proposed that
fixes the main problem. This does mean that you need to add the
following to your sources.list:

deb jaunty-proposed universe

Run apt-get update && apt-get install gitosis. You should
install 0.2+20080825-2ubuntu0.1 or later. There is another
small bug in the current version too, as a result of git removing the
git-$command scripts out of /usr/bin. Edit
and replace



git update-server-info

With these changes in place, we can now set up our gitosis
repository. On the server you are going to use to host your central
repositories, run:

sudo -H -u gitosis gitosis-init <

The file is a public ssh key. As I mentioned,
gitosis is managed over git, so you need an initial user to clone and
then push changes back into the gitosis repo, so make sure this key
belongs to a keypair you have available to the remote user you’re going
to configure gitosis.

Now, on your local computer, you can clone the gitosis-admin repo

git clone

If you look inside the gitosis-admin directory, you should
find a file called gitosis.conf and a directory called
keydir. The directory is where you can add ssh public keys for
your users. The file is the configuration file for gitosis.

loglevel = INFO

[group gitosis-admin]
writable = gitosis-admin
members = david@david

[group developers]
members = david@david
writable = publicproject privateproject

[group contributors]
members = george@wilber
writable = publicproject

[repo publicproject]
daemon = yes
gitweb = yes

[repo privateproject]
daemon = no
gitweb = no

This sets up two repositories, called publicproject and
privateproject. It enables the public project to be available via the
git protocol and in gitweb if you have that installed. We also create
two groups, developers and contributors. David has access to both
projects, but George only has access to change the publicproject. David
can also modify the gitosis configuration. The users are the names of
ssh keys (the last part of the line in or

Once you’ve changed this file, you can run git add
to add it to the commit, git commit -m "update
gitosis configuration
to commit it to your local repository, and
finally git push to push your commits back up into the central
repository. Gitosis should now update the configuration on the server to
match the config file.

One last thing to do is to enable git-daemon, so people can
anonymously clone your projects. Create /etc/event.d/git-daemon with the
following contents:

start on startup
stop on shutdown

exec /usr/bin/git daemon 
   --user=gitosis --group=gitosis 

You can now start this using start git-daemon

So now, you need to start using your repository. You can either start
with an existing project or an empty directory. Start by running
git init and then git add $file to add each of the
files you want in your project, and finally git commit to
commit them to your local repository. The final task is to add a remote
repository and push your code into it.

git remote add origin
git push origin master:refs/heads/master

In future, you should be able to do git push to push your
changes back into the central repository. You can also clone a project
using git or ssh, providing you have access, using the following
commands. The first is for read-write access over ssh and the second
uses the git protocol for read-only access. The git protocol uses TCP
port 9418, so make sure that’s available externally, if you want the
world to be able to clone your repos.

git clone
git clone git://

Setting up GitWeb is left as an exercise for the reader (and myself
because I am yet to attempt to set that up).

If you’ve not heard of Puppet, it is a
configuration management tool. You write descriptions of how you want
your systems to look and it checks the current setup and works out what
it needs to do to move your system so it matches your description.
The idea is to write how it should look, not how to change the system.

Puppet uses a client (puppetd) that talks to the central server
(puppetmaster) over HTTPS.The default puppetmaster HTTP server is
webbrick, which is a
lightweight Ruby HTTP server. While it’s simple and allows Puppetmaster
to work straight out the box, due to it’s pure Ruby structure and Ruby’s
green thread architecture, it doesn’t scale beyond a simple puppet
setup. After a while, every medium to large Puppet installation needs to
move to the other HTTP server that puppet supports: Mongrel. Mongrel is
a faster HTTP library, but supports a lot less features. In particular
it doesn’t support SSL, which is important with Puppet, as Puppet relies
heavily on client certificate verification for authentication. As a
result, we need to put another webserver in front that can handle the
SSL aspect. As a nice side effect of having to proxy to Puppetmaster is
that we can run several puppetmaster processes and improve on the green
threads problem that Ruby has. In this blog post, I’m going to describe
setting up nginx and mongrel.

The first thing to do is to install the mongrel and
nginx packages.

apt-get install mongrel nginx

We need to run nginx on port 8140 and proxy to
our mongrel servers on different ports, so lets move puppetmaster off
8140 and configure it to use mongrel while we’re at it. Edit
/etc/default/puppetmaster and set the following variables:


This tells the init.d script to use the mongrel server type and to
run four of them. The init.d script is clever enough to start up the
right number of processes and will set them up to use a sequence of
ports for each one, starting at 18140 for the first process, up to 18143
for the last one. The DAEMON_OPTS option tells Puppetmaster how
we’re going to pass the SSL certificate information from nginx so it can
grant or refuse permission.

Now to set up nginx. Put the following in

ssl                     on;
ssl_certificate /var/lib/puppet/ssl/certs/;
ssl_certificate_key /var/lib/puppet/ssl/private_keys/;
ssl_client_certificate  /var/lib/puppet/ssl/certs/ca.pem;
ssl_ciphers             SSLv2:-LOW:-EXPORT:RC4+RSA;
ssl_session_cache       shared:SSL:8m;
ssl_session_timeout     5m;

upstream puppet-production {

In this file we tell nginx where to find the server certificates for
your puppetmaster, so your clients can authenticate your server. We also
tell nginx the CA certificate to authenticate clients with and set up
some SSL details required for Puppet. Finally we create a group of
remote servers for our pack of mongrel puppetmasters, so we can refer to
them later. If you added more or less servers earlier don’t forget to
add or remove them here. You also need to replace with your FQDN. If at a later stage, you find
you need ever more performance, you can easily move some of your
puppetmaster processes to a separate box and update the upstream list to
refer to servers on the remote server.

Finally, we need to set up a couple of HTTP servers. Create
/etc/nginx/sites-enabled/puppetmaster with the following

server {
    listen                  8140;
    ssl_verify_client       on;
    root                    /var/empty;
    access_log              /var/log/nginx/access-8140.log;

    # Variables
    # $ssl_cipher returns the line of those utilized it is cipher for established SSL-connection
    # $ssl_client_serial returns the series number of client certificate for established SSL-connection
    # $ssl_client_s_dn returns line subject DN of client certificate for established SSL-connection
    # $ssl_client_i_dn returns line issuer DN of client certificate for established SSL-connection
    # $ssl_protocol returns the protocol of established SSL-connection

    location / {
        proxy_pass          http://puppet-production;
        proxy_redirect      off;
        proxy_set_header    Host             $host;
        proxy_set_header    X-Real-IP        $remote_addr;
        proxy_set_header    X-Forwarded-For  $proxy_add_x_forwarded_for;
        proxy_set_header    X-Client-Verify  SUCCESS;
        proxy_set_header    X-SSL-Subject    $ssl_client_s_dn;
        proxy_set_header    X-SSL-Issuer     $ssl_client_i_dn;
        proxy_read_timeout  65;

server {
    listen                  8141;
    ssl_verify_client       off;
    root                    /var/empty;
    access_log              /var/log/nginx/access-8141.log;

    location / {
        proxy_pass  http://puppet-production;
        proxy_redirect     off;
        proxy_set_header   Host             $host;
        proxy_set_header   X-Real-IP        $remote_addr;
        proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
        proxy_set_header   X-Client-Verify  FAILURE;
        proxy_set_header   X-SSL-Subject    $ssl_client_s_dn;
        proxy_set_header   X-SSL-Issuer     $ssl_client_i_dn;
        proxy_read_timeout  65;

This creates two servers on port 8140 and 8141 which both proxy all
requests to our group of mongrel servers, adding suitable headers to
pass on the SSL information. The only difference between them is the
X-Client-Verify header. This shows the one problem with using nginx with
puppet. Because the client verification success or failure is not
available as a variable before nginx 0.8.7, we can’t have a single port
for both the usual client connection and the initial unauthenticated
connection where the client requests a certificate to be signed. As a
result, with this setup, you are required to run puppet with --ca-port
the first time you run puppet until the certificate has been
signed with puppetca.

Foruntately with versions of nginx later than 0.8.7, you can use a
simpler setup shown below. This replaces both files shown above with the single
server. Unfortunately, 0.8.7 is not available in any
version of Ubuntu yet, not even Karmic.

server {
  listen 8140;

  ssl                     on;
  ssl_session_timeout     5m;
  ssl_certificate         /var/lib/puppet/ssl/certs/puppetmaster.pem;
  ssl_certificate_key     /var/lib/puppet/ssl/private_keys/puppetmaster.pem;
  ssl_client_certificate  /var/lib/puppet/ssl/ca/ca_crt.pem;

  # choose any ciphers
  ssl_ciphers             SSLv2:-LOW:-EXPORT:RC4+RSA;

  # allow authenticated and client without certs
  ssl_verify_client       optional;

  # obey to the Puppet CRL
  ssl_crl /var/lib/puppet/ssl/ca/ca_crl.pem;

  root                    /var/tmp;

  location / {
    proxy_pass              http://puppet-production;
    proxy_redirect         off;
    proxy_set_header    Host             $host;
    proxy_set_header    X-Real-IP        $remote_addr;
    proxy_set_header    X-Forwarded-For  $proxy_add_x_forwarded_for;
    proxy_set_header    X-Client-Verify  $ssl_client_verify;
    proxy_set_header    X-SSL-Subject    $ssl_client_s_dn;
    proxy_set_header    X-SSL-Issuer     $ssl_client_i_dn;
    proxy_read_timeout  65;

If you are running another webserver on the server, you may want to
delete /etc/nginx/sites-enabled/default which attempts to
create a server listening on port 80, which will conflict with your
existing HTTP server.

If you follow these instructions, you should find yourself with a
better performing puppetmaster and significantly few “connection reset
by peer” and other related error messages.

Sometimes you need to review what exactly doing an svn up
will do. Fortunately, you can do a couple of things to find out. The
first is use svn status -u to find out what files have

/etc/puppet/modules# svn stat -u
       *     1338   exim4_mailserver/files/
       *     1338   dbplc/files/
       *     1338   dbplc/files
       *     1338   dbplc/manifests/portal.pp
M            1386   tomcat/files/server.xml
Status against revision:   1386

Here we can see that four files were changed since their current
revision of 1338. We can also see that tomcat/files/server.xml is up to
date against the repository, but has local modifications.

This is all well and good, but how do we know what the changes are?
Well, svn diff is our friend here. By comparing the checkout
against the repository, we can see what will be updated.

/etc/puppet/modules# svn diff dbplc/manifests/portal.pp -rBASE:HEAD
Index: dbplc/manifests/portal.pp
--- dbplc/manifests/portal.pp (working copy)
+++ dbplc/manifests/portal.pp (revision 1386)
@@ -22,6 +22,7 @@
       owner => "tomcat55",
       group => "adm",
       mode => 644,
+      require => Package["tomcat5.5"],

    apache::config { "portal":

When you’re happy, you can run svn up as normal. I’ve just
used this process to sanity check our Puppet config before updating, as
it hasn’t been updated for a few days.

Recently, I’ve seen a lot of suggestions along the lines of:

find . -name foo -exec ls -l {} ;

This is incredibly inefficient, because for each and every matching
file, ls will be executed. Fortunately, we can do better by
using xargs. xargs takes a list of lines from stdin
and uses them to build up a command line. It takes special care to not
exceed the maximum command line length by splitting the input up into
multiple commands if it is needed. So, with that knowledge, we can
replace our original command with:

find . -name foo | xargs ls -l

There is one slight problem with this command; it isn’t space-safe.
xargs splits arguments on whitespace, so “file name” will be
incorrectly passed to the command as “file” “name”. Fortunately,
xargs has an option to delimit parameters by the null
character, and as our luck would have it, find has a suitable
option to produce output in this format. This means our command is

find . -name foo -print0 | xargs -0 ls -l

mlocate can do something similar:

locate foo -0 | xargs -0 ls -l

GNU find has one more trick up its sleeve. It has a modified version
of -exec that will do the same thing as xargs, so we could have
written our original command as:

find . -name foo -exec ls -l {} +

Every process is sacred; every process is great. If a process is wasted, God gets quite
irate. Please make sure you try to use one of the latter forms and not the
first form, and make a happy deity. 🙂

When you want to copy files from one machine to another, you might
think about using scp to copy them. You might think about using
rsync. If, however, you’re
trying to copy a large amount of data between two machines, here’s a
better, quicker, way to do it is using netcat.

On the receiving machine, run:

# cd /dest/dir && nc -l -p 12345 | tar -xf -

On the sending machine you can now run:

# cd /src/dr && tar -xf - | nc -q 0 remote-server 12345

You should find that everything works nicely, and a lot quicker. If
bandwidth is more constrained than CPU, then you can add “z” or “j” to
the tar options (“tar -xzf -” etc) to compress the data before it sends
it over the network. If you’re on gigabit, I wouldn’t bother with the
compression. If it dies, you’ll have to start from the beginning, but
then you might find you can get away with using rsync if you’ve copied
enough. It’s also
worth pointing out that the recieving netcat will die as soon as the
connection closes, so you’ll need to restart it if you want to copy the
data again using this method.

It’s worth pointing out that this does not have the security that scp or
rsync-over-ssh has, so make sure you trust the end points and everything
in between if you don’t want anyone else to see the data.

Why not use scp? because it’s incredibly slow in comparison. God knows
what scp is doing, but it doesn’t copy data at wire speed. It aint the
encyption and decryption, because that’d just use CPU and when I’ve done
it it’s hasn’t been CPU bound. I can only assume that the scp process
has a lot of handshaking and ssh protocol overhead.

Why not rsync? Rsync doesn’t really buy you that much on the first copy.
It’s only the subsequent runs where rsync really shines. However, rsync
requires the source to send a complete file list to the destination
before it starts copying any data. If you’ve got a filesystem with a
large number of files, that’s an awfully large overhead, especially as
the destination host has to hold it in memory.

Ever wanted to find out how much diskspace each table was taking in a
database? Here’s how:

database=# SELECT
   pg_size_pretty(pg_relation_size(tablename)) AS table_size,
   pg_size_pretty(pg_total_relation_size(tablename)) AS total_table_size
   schemaname = 'public';
 tablename  | table_size | total_table_size
 deferrals  | 205 MB     | 486 MB
 errors     | 58 MB      | 137 MB
 deliveries | 2646 MB    | 10096 MB
 queue      | 7464 kB    | 22 MB
 unknown    | 797 MB     | 2644 MB
 messages   | 1933 MB    | 6100 MB
 rejects    | 25 GB      | 75 GB
(7 rows)

Table size is the size for the current data.
Total table size includes indexes and data that is too large to fix in
the main table store (things like large BLOB fields). You can find more
information in the PostgreSQL

Edit: changed to use pg_size_pretty(), which I thought existed, but couldn’t find in the docs. Brett Parker reminded me it did exist after all and I wasn’t just imagining it.