Tue, 13 Oct 2009

Tarballs explained

This entry was originally posted in slightly different form to Server Fault

If you're coming from a Windows world, you're used to tools like zip or rar, which both archive and compress collections of files. In the typical Unix tradition of doing one thing and doing it well, you tend to have two different utilities: a compression tool and an archive format. People then use these two together to get the same functionality that zip or rar provide.

There are numerous compression formats; the common ones used on Linux these days are gzip (which uses the same DEFLATE algorithm as the zlib library) and the newer bzip2, which compresses better. Unfortunately, bzip2 uses more CPU and memory to achieve its higher compression ratios. You can use these tools to compress any file, and by convention files compressed in these formats are given the extensions .gz and .bz2 respectively. Use gzip and bzip2 to compress, and gunzip and bunzip2 to decompress.
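As a quick illustration of compressing a single file, here is a minimal sketch using gzip. The file name sample.log and its contents are made up for the example; bzip2 and bunzip2 are used in exactly the same way, producing sample.log.bz2 instead.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"

# Build a hypothetical, very repetitive sample file.
i=0
while [ "$i" -lt 2000 ]; do
  echo "some repetitive log line"
  i=$((i + 1))
done > sample.log
before=$(( $(wc -c < sample.log) ))

# gzip replaces sample.log with sample.log.gz
gzip sample.log
after=$(( $(wc -c < sample.log.gz) ))
echo "uncompressed: $before bytes, gzip: $after bytes"

# gunzip puts the original file back and removes the .gz
gunzip sample.log.gz
ls -l sample.log
```

Note that both tools replace the file they are given rather than leaving a copy behind.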

There are also several archive formats available, including cpio, ar and tar, but people tend to use only tar. These allow you to take a number of files and pack them into a single file, and they can also store path and permission information. You can create and unpack a tar file using the tar command. You might hear these operations referred to as "tarring" and "untarring". (The name of the command is a shortening of Tape ARchive. Tar was an improvement on the ar format in that you could use it to span multiple physical tapes for backups.)

# tar -cf archive.tar list of files to include

This will create (-c) an archive in the file (-f) called archive.tar (.tar is the conventional extension for tar archives). You should now have a single file that contains five files ("list", "of", "files", "to" and "include"). If you give tar a directory, it will recurse into that directory and store everything inside it.

# tar -xf archive.tar
# tar -xf archive.tar list of files

This will extract (-x) the previously created archive.tar. You can extract just the files you want from the archive by listing them at the end of the command line. In our example, the second line would extract "list", "of" and "files", but not "to" and "include". You can also use

# tar -tf archive.tar

to get a list of the contents before you extract them.
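Putting the create, list and extract steps together, here is a small sketch you can run in a scratch directory. The file names (a.txt, b.txt, c.txt) and the src and out directories are invented purely for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"
mkdir src out
echo one   > src/a.txt
echo two   > src/b.txt
echo three > src/c.txt

# Create the archive from inside src so the member names are plain file names.
(cd src && tar -cf ../archive.tar a.txt b.txt c.txt)

# List the contents without extracting anything.
tar -tf archive.tar

# Extract only two of the three members, into out/.
(cd out && tar -xf ../archive.tar a.txt c.txt)
ls out
```

The names you give when extracting have to match the names shown by -t, including any leading directory.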

So now you can combine these two tools to replicate the functionality of zip:

# tar -cf archive.tar directory
# gzip archive.tar

You'll now have an archive.tar.gz file. You can extract it using:

# gunzip archive.tar.gz
# tar -xf archive.tar

We can use pipes to save us having an intermediate archive.tar:

# tar -cf - directory | gzip > archive.tar.gz
# gunzip < archive.tar.gz | tar -xf -

You can use - with the -f option to specify stdin or stdout (tar knows which one based on context).

We can do slightly better because, in an apparent breaking of the "one job well" idea, tar can compress its output and decompress its input if you give it the -z argument. (I say apparent, because it still uses the gzip and gunzip command-line tools behind the scenes.)

# tar -czf archive.tar.gz directory
# tar -xzf archive.tar.gz
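
If you want to convince yourself that the piped form and the -z form are equivalent, a quick check along these lines should do it. The directory contents are invented for the example, and -C (which tells tar to change into the given directory before extracting) is an extra option not used elsewhere in this entry.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"
mkdir -p directory/sub
echo hello > directory/file.txt
echo world > directory/sub/other.txt

# Two ways of building the same compressed archive.
tar -cf - directory | gzip > piped.tar.gz
tar -czf flag.tar.gz directory

# Both archives should list the same members.
gunzip < piped.tar.gz | tar -tf - > piped.list
tar -tzf flag.tar.gz > flag.list
cmp piped.list flag.list && echo "same contents"

# Unpack one of them somewhere else to check the round trip.
mkdir restored
tar -xzf flag.tar.gz -C restored
diff -r directory restored/directory && echo "round trip OK"
```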

To use bzip2 instead of gzip, use bzip2, bunzip2 and -j in place of gzip, gunzip and -z respectively (for example, tar -cjf archive.tar.bz2 directory). Some versions of tar can detect a bzip2 archive when you use -z and do the right thing, but it is probably worth getting into the habit of being explicit.


Sun, 11 Oct 2009

mod_proxy or mod_jk

This entry was originally posted in slightly different form to Server Fault

There are several ways to run Tomcat applications. You can either run Tomcat directly on port 80, or you can put a webserver in front of Tomcat and proxy connections to it. I would highly recommend using Apache as a front end. The main reason is that Apache is more flexible than Tomcat: it has many modules for things you would have to code support for yourself in Tomcat. For example, while Tomcat can do gzip compression, it's a single switch: enabled or disabled. Sadly, you cannot send compressed CSS or JavaScript to Internet Explorer 6, so compression has to be switched off selectively for that browser. This is easy to do in Apache, but impossible in Tomcat. Things like caching are also easier to do in Apache.

Having decided to use Apache to front Tomcat, you need to decide how to connect them. There are several choices: mod_proxy (more accurately, mod_proxy_http in Apache 2.2, but I'll refer to this as mod_proxy), mod_jk and mod_jk2. mod_jk2 is not under active development and should not be used. This leaves us with mod_proxy or mod_jk.

Both methods forward requests from Apache to Tomcat. mod_proxy uses the HTTP that we all know and love. mod_jk uses a binary protocol, AJP. The main advantages of mod_jk are:

  • AJP is a binary protocol, so it is slightly quicker for both ends to deal with and has slightly less overhead than HTTP, but the difference is minimal.
  • AJP includes information like the original host name, the remote host and the SSL connection. This means that ServletRequest.isSecure() works as expected, that you know who is connecting to you, and that you can do some sort of virtual hosting in your code.

A slight disadvantage is that AJP is based on fixed-size chunks and can break with long headers, particularly request URLs with a long list of parameters, but you should rarely be in the position of having 8K of URL parameters. (It would suggest you were doing it wrong. :) )

It used to be the case that mod_jk provided basic load balancing between two tomcats, which mod_proxy couldn't do, but with the new mod_proxy_balancer in Apache 2.2, this is no longer a reason to choose between them.

The position is slightly complicated by the existence of mod_proxy_ajp. Of the two AJP connectors, mod_jk is the more mature, but mod_proxy_ajp works in the same framework as the other mod_proxy modules. I have not yet used mod_proxy_ajp, but would consider doing so in the future, as mod_proxy_ajp is part of Apache and mod_jk involves additional configuration outside of Apache.
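
To give a feel for how similar the two mod_proxy flavours look in practice, here is a rough sketch that writes a configuration fragment to a file. The /app path, the localhost back end and the tomcat-proxy.conf file name are placeholders of mine, and the ports are simply Tomcat's default HTTP (8080) and AJP (8009) connector ports; where the fragment belongs depends on how your distribution lays out Apache's configuration.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"

# Hypothetical fragment: /app and localhost are placeholders.
cat > tomcat-proxy.conf <<'EOF'
# Option 1: mod_proxy_http, talking to Tomcat's HTTP connector
ProxyPass        /app http://localhost:8080/app
ProxyPassReverse /app http://localhost:8080/app

# Option 2: mod_proxy_ajp, talking to Tomcat's AJP connector
# ProxyPass        /app ajp://localhost:8009/app
# ProxyPassReverse /app ajp://localhost:8009/app
EOF

cat tomcat-proxy.conf
```

Only the URL scheme and port change between the two options, which is the attraction of staying inside the mod_proxy framework. Either way, mod_proxy itself and the matching mod_proxy_http or mod_proxy_ajp module need to be loaded.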

Given a choice, I would prefer an AJP-based connector, mostly because of the second advantage listed above rather than the performance aspect. Of course, if your application vendor doesn't support anything other than mod_proxy_http, that does tie your hands somewhat.

You could use an alternative webserver like lighttpd, which does have an AJP module. Sadly, my preferred lightweight HTTP server, nginx, does not support AJP and is unlikely ever to do so, due to the design of its proxying system.


Sat, 10 Oct 2009

Blog Copyright

To make things explicitly clear, my blog is copyrighted and licensed as "All rights reserved". It even says that at the footer of every page. That means you may not redistribute any content without my permission. Yes, this means you, Ross Beazley. I may allow aggregation sites to redistribute my content, but the only sites where I have given explicit permission are Planet Debian and Planet BNM. I am unlikely to be upset if your aggregation site links back to the original entry and does not carry advertising, and will probably give you permission. If both these conditions are not met, you do not have permission and will not be granted permission.


Name-Based HTTPS

This entry was originally posted in slightly different form to Server Fault

There are two methods of using virtual hosting with HTTP: name-based and IP-based. IP-based is the simplest, as each virtual host is served from a different IP address configured on the server, but this requires an IP address for every host, and we're meant to be running out. The better solution is to use the Host: header introduced in HTTP/1.1, which allows the server to serve the right host to the client from a single IP address.

HTTPS throws a spanner in the works, as the server does not know which certificate to present to the client during the SSL connection set up, because the client can't send the Host: header until the connection is set up. As a result, if you want to host more than one HTTPS site, you need to use IP-based virtual hosting.

However, you can run multiple SSL sites from a single IP address using a couple of methods, each with their own drawbacks.

The first method is to have a single SSL certificate that covers all the domains you want to host from a single IP address. You can do this either with a wildcard certificate that covers the domains or with Subject Alternative Name.

Wildcard certificates would be something like *.example.com, which would cover www.example.com, mail.example.com and support.example.com. There are a number of problems with wildcard certificates. Firstly, every hostname needs to share a common domain, e.g. with *.example.com you can have www.example.com, but not www.example.org. Secondly, you can't reliably have more than one level of subdomain, i.e. you can have www.example.com, but not www.eu.example.com. This might work in earlier versions of Firefox (<= 3.0), but it doesn't work in 3.5 or any version of Internet Explorer. Thirdly, wildcard certificates are significantly more expensive than normal certificates if you want them signed by a root CA.

Subject Alternative Name is a method of using an extension to X509 certificates that lists alternative hostnames that are valid for that certificate. It involves adding a "subjectAltName" field to the certificate that lists each additional host you want covered by the certificate. This should work in most browsers; certainly every modern mainstream browser. The downside of this method is that you have to list every domain on the server that will use SSL. You may not want this information publicly available. You probably don't want unrelated domains to be listed on the same certificate. It may also be difficult to add additional domains at a later date to your certificate.
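
For experimenting with this, here is a minimal sketch that builds a throwaway self-signed certificate carrying a subjectAltName list and then prints it back. It assumes the openssl command-line tool is installed; the san.cnf, key.pem and cert.pem file names and the two hostnames are placeholders, and a certificate for real use would go through a signing request to your CA rather than being self-signed.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"

# Minimal OpenSSL config; the [san] section lists every hostname to cover.
cat > san.cnf <<'EOF'
[req]
distinguished_name = dn
x509_extensions    = san
prompt             = no

[dn]
CN = www.example.com

[san]
subjectAltName = DNS:www.example.com, DNS:www.example.org
EOF

# Generate an unencrypted key and a self-signed certificate (testing only).
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
    -config san.cnf -keyout key.pem -out cert.pem 2>/dev/null

# Show the hostnames the certificate claims to cover.
openssl x509 -in cert.pem -noout -text | grep -A1 'Subject Alternative Name'
```

Anyone who inspects the certificate sees the same list, which is the privacy drawback mentioned above.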

The second approach is to use something called SNI (Server Name Indication), which is an extension to TLS that solves the chicken-and-egg problem of not knowing which certificate to send to the client because the client hasn't sent the Host: header yet. As part of the TLS negotiation, the client sends the required hostname as one of the options. The only downside to this is client and server support. The support in browsers tends to be better than in servers. Firefox has supported it since 2.0. Internet Explorer supports it from 7 onwards, but only on Vista or later. Chrome only supports it on Vista or later too. Opera 8 and Safari 3.2.1 have support. Other browsers may not support it.

The biggest problem preventing adoption is the server support. Until very recently neither of the two main webservers supported it. Apache gained SNI support as of 2.2.12, which was released July 2009. As of writing, IIS does not support SNI in any version. nginx, lighttpd and Cherokee all support SNI.

Going forward, SNI is the best method for solving the name-based virtual hosting of HTTPS, but support might be patchy for a year or two yet. If you must do HTTPS virtual hosting without problems in the near future, IP based virtual hosting is the only option.
