Writing Robust Bash Shell Scripts

Many people hack together shell scripts quickly to do simple tasks, but these soon take on a life of their own. Unfortunately shell scripts are full of subtle effects which result in scripts failing in unusual ways. It’s possible to write scripts which minimise these problems. In this article, I explain several techniques for writing robust bash scripts.

Use set -u

How often have you written a script that broke because a variable wasn’t set? I know I have, many times.

rm -rf $chroot/usr/share/doc

If you ran the script above and accidentally forgot to give a parameter, you would have just deleted all of your system documentation rather than making a smaller chroot. So what can you do about it? Fortunately bash provides you with set -u, which will exit your script if you try to use an uninitialised variable. You can also use the slightly more readable set -o nounset.

david% bash /tmp/shrink-chroot.sh
david% bash -u /tmp/shrink-chroot.sh
/tmp/shrink-chroot.sh: line 3: $1: unbound variable
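A quick way to see the difference without a script file is to compare the two modes directly (a sketch; $chroot is deliberately left unset):

```shell
#!/usr/bin/env bash
# Without nounset, an unset variable silently expands to the empty string,
# turning "$chroot/usr/share/doc" into "/usr/share/doc":
bash -c 'echo "would delete: $chroot/usr/share/doc"'

# With nounset, the same expansion is a fatal error instead:
bash -uc 'echo "would delete: $chroot/usr/share/doc"' \
    || echo "caught: chroot was unset"
```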

Use set -e

Every script you write should include set -e at the top. This tells bash that it should exit the script if any statement returns a non-true return value. The benefit of using -e is that it prevents errors snowballing into serious issues when they could have been caught earlier. Again, for readability you may want to use set -o errexit.
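A minimal sketch of the difference, running the same two commands with and without errexit:

```shell
#!/usr/bin/env bash
# Without -e, execution continues past the failing command:
bash -c 'false; echo "reached without -e"'

# With -e, the script aborts at 'false' and the echo never runs:
bash -ec 'false; echo "reached with -e"' || echo "script aborted early"
```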

Using -e gives you error checking for free. If you forget to check something, bash will do it for you. Unfortunately it means you can’t check $? afterwards, as bash will never reach the checking code if the return value isn’t zero. There are other constructs you can use instead:

if [ "$?" -ne 0 ]; then echo "command failed"; exit 1; fi

could be replaced with

command || { echo "command failed"; exit 1; }

or

if ! command; then echo "command failed"; exit 1; fi

What if you have a command that is expected to return non-zero, or whose return value you are not interested in? You can use command || true, or, for a longer section of code, you can turn off the error checking temporarily, though I recommend you use this sparingly.

set +e
# commands whose failures you want to ignore go here
set -e
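As a concrete sketch of the || true idiom: grep exits non-zero when it finds nothing, which would otherwise kill an errexit script even though zero matches is a perfectly normal result.

```shell
#!/usr/bin/env bash
set -e
# grep -c prints the match count but exits 1 when the count is 0;
# '|| true' stops errexit from treating that as a fatal error.
count=$(printf 'alpha\nbeta\n' | grep -c "gamma" || true)
echo "matches: $count"    # prints "matches: 0"
```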

On a slightly related note, by default bash takes the error status of the last item in a pipeline, which may not be what you want. For example, false | true is considered to have succeeded. If you would like such a pipeline to fail, use set -o pipefail.
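A sketch of the two behaviours side by side:

```shell
#!/usr/bin/env bash
false | true
echo "default exit status: $?"     # 0 – only the last command counts

set -o pipefail
false | true
echo "pipefail exit status: $?"    # 1 – the failure of 'false' now propagates
```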

Program defensively – expect the unexpected

Your script should take the unexpected into account, like files missing or directories not existing. There are several things you can do to prevent errors in these situations. For example, when you create a directory, if the parent directory doesn’t exist, mkdir will return an error. If you add the -p option then mkdir will create all the parent directories before creating the requested directory. Another example is rm. If you ask rm to delete a non-existent file, it will complain and your script will terminate. (You are using -e, right?) You can fix this by using -f, which will silently continue if the file didn’t exist.
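Both behaviours can be sketched in a few lines (using a throwaway directory from mktemp -d so the example is safe to run):

```shell
#!/usr/bin/env bash
set -e
work=$(mktemp -d)

mkdir -p "$work/a/b/c"        # creates every missing parent directory
mkdir -p "$work/a/b/c"        # and doesn't fail if the path already exists

rm -f "$work/no-such-file"    # exits 0 even though the file is absent,
echo "still running"          # so set -e doesn't abort the script

rm -rf "$work"                # clean up the throwaway directory
```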

Be prepared for spaces in filenames

Someone will always use spaces in filenames or command line arguments and you should keep this in mind when writing shell scripts. In particular you should use quotes around variables.

if [ $filename = "foo" ];

will fail if $filename contains a space. This can be fixed by using:

if [ "$filename" = "foo" ];

When using the $@ variable, you should always quote it, or any arguments containing a space will be expanded into separate words.

david% foo() { for i in $@; do printf "%s\n" "$i"; done }; foo bar "baz quux"
bar
baz
quux
david% foo() { for i in "$@"; do printf "%s\n" "$i"; done }; foo bar "baz quux"
bar
baz quux

I cannot think of a single place where you shouldn’t use "$@" over $@, so when in doubt, use quotes.

If you use find and xargs together, you should use -print0 to separate filenames with a null character rather than newlines. You then need to use -0 with xargs.

david% touch "foo bar"
david% find | xargs ls
ls: ./foo: No such file or directory
ls: bar: No such file or directory
david% find -print0 | xargs -0 ls
./foo bar

Setting traps

Often you write scripts which fail and leave the filesystem in an inconsistent state; things like lock files, temporary files or you’ve updated one file and there is an error updating the next file. It would be nice if you could fix these problems, either by deleting the lock files or by rolling back to a known good state when your script suffers a problem. Fortunately bash provides a way to run a command or function when it receives a unix signal using the trap command.

trap command signal [signal ...]

There are many signals you can trap (you can get a list of them by running kill -l), but for cleaning up after problems there are only three we are interested in: INT, TERM and EXIT. You can also reset traps back to their default by using - as the command.

Signal  Description
INT     Interrupt – this signal is sent when someone kills the script by pressing ctrl-c.
TERM    Terminate – this signal is sent when someone sends the TERM signal using the kill command.
EXIT    Exit – this is a pseudo-signal and is triggered when your script exits, either through reaching the end of the script, an exit command, or by a command failing when using set -e.

Usually, when you write something using a lock file you would use something like:

if [ ! -e $lockfile ]; then
   touch $lockfile
   critical-section
   rm $lockfile
else
   echo "critical-section is already running"
fi

What happens if someone kills your script while critical-section is running? The lockfile will be left there and your script won’t run again until it’s been deleted. The fix is to use:

if [ ! -e $lockfile ]; then
   trap "rm -f $lockfile; exit" INT TERM EXIT
   touch $lockfile
   critical-section
   rm $lockfile
   trap - INT TERM EXIT
else
   echo "critical-section is already running"
fi

Now when you kill the script it will delete the lock file too. Notice that we explicitly exit from the script at the end of the trap command, otherwise the script will resume from the point that the signal was received.
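The same pattern works for temporary files; a sketch using mktemp:

```shell
#!/usr/bin/env bash
set -e
tmpfile=$(mktemp)
# Remove the temp file on ctrl-c, kill, or any exit (including set -e failures):
trap 'rm -f "$tmpfile"; exit' INT TERM EXIT

echo "scratch data" > "$tmpfile"
# ... work with "$tmpfile" here; no explicit rm is needed at the end
```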

Race conditions

It’s worth pointing out that there is a slight race condition in the above lock example between the time we test for the lockfile and the time we create it. A possible solution to this is to use IO redirection and bash’s noclobber mode, which won’t redirect to an existing file. We can use something similar to:

if ( set -o noclobber; echo "$$" > "$lockfile") 2> /dev/null; then
   trap 'rm -f "$lockfile"; exit $?' INT TERM EXIT
   critical-section
   rm -f "$lockfile"
   trap - INT TERM EXIT
else
   echo "Failed to acquire lockfile: $lockfile."
   echo "Held by $(cat $lockfile)"
fi

A slightly more complicated problem is where you need to update a bunch of files and need the script to fail gracefully if there is a problem in the middle of the update. You want to be certain that something either happened correctly or that it appears as though it didn’t happen at all. Say you had a script to add users.

add_to_passwd $user
cp -a /etc/skel /home/$user
chown $user /home/$user -R

There could be problems if you ran out of diskspace or someone killed the process. In this case you’d want the user to not exist and all their files to be removed.

rollback() {
   del_from_passwd $user
   if [ -e /home/$user ]; then
      rm -rf /home/$user
   fi
   exit
}

trap rollback INT TERM EXIT
add_to_passwd $user
cp -a /etc/skel /home/$user
chown $user /home/$user -R
trap - INT TERM EXIT

We needed to remove the trap at the end or the rollback function would have been called as we exited, undoing all the script’s hard work.

Be atomic

Sometimes you need to update a bunch of files in a directory at once, say you need to rewrite URLs from one host to another on your website. You might write:

for file in $(find /var/www -type f -name "*.html"); do
   perl -pi -e 's/www.example.net/www.example.com/' $file
done

Now if there is a problem with the script you could have half the site referring to www.example.com and the rest referring to www.example.net. You could fix this using a backup and a trap, but you also have the problem that the site will be inconsistent during the upgrade too.

The solution to this is to make the changes an (almost) atomic operation. To do this make a copy of the data, make the changes in the copy, move the original out of the way and then move the copy back into place. You need to make sure that both the old and the new directories are moved to locations that are on the same partition so you can take advantage of the property of most unix filesystems that moving directories is very fast, as they only have to update the inode for that directory.

cp -a /var/www /var/www-tmp
for file in $(find /var/www-tmp -type f -name "*.html"); do
   perl -pi -e 's/www.example.net/www.example.com/' $file
done
mv /var/www /var/www-old
mv /var/www-tmp /var/www

This means that if there is a problem with the update, the live system is not affected. Also the time where it is affected is reduced to the time between the two mvs, which should be very minimal, as the filesystem just has to change two entries in the inodes rather than copying all the data around.

The disadvantage of this technique is that you need to use twice as much disk space, and that any process that keeps files open for a long time will still have the old files open rather than the new ones, so you would have to restart those processes if this is the case. In our example this isn’t a problem as apache opens the files on every request. You can check for processes with files open by using lsof. An advantage is that you now have a backup from before you made your changes, in case you need to revert.

46 thoughts on “Writing Robust Bash Shell Scripts”

  1. Got here from “http://fvue.nl/wiki/Bash:_Error_handling”.

    Regarding "chroot=$1", you could also do some parameter expansion like:

    chroot="${1:?Missing Input}"

    Great article on file names and spaces and other characters here:

  2. Very interesting, I’ve learned a lot by reading this. Thanks for the many fine code examples as well. Keep up the good work.

  3. “need to use twice as much disk space”? No more the case, with btrfs + either cp --reflink or btrfs subvolume snapshot.

  4. Hi,

    I just wanted to drop a quick note of thanks here.
    I have this post bookmarked, and refer back to it every time I write a shell script that’s not ‘single use’.

    Thanks for making these tips available!

  5. All of these things are not specific to GNU bash, by the way.

    Absolutely do not use set -u (in production, feel free to use it in testing). Also, dogmatically applying set -e (as in Debian) just leads to people writing worse scripts; IME it’s better to use explicit error handling like in C (and not let newbies write scripts).

    You forgot to mention things like: never use [ or test, always use [[ for secure string handling (but still quote the RHS of a comparison because it’s taken as glob otherwise).

  6. You should use xargs -0r. This prevents running the program with no arguments when there’s no input, which is almost always not what you want. For instance

    find . -name foo\*bar -print0 | xargs -0 ls -l

    will do something quite unexpected when there happens to be no matching file.

  7. Your find/xargs tip is heaps better than my old habit (xargs -i … “{}”).
    In this example (ls): the print0 version has the output sorted so I don’t have to pipe in to sort with special key spec. Also, all the columns line up:

    Sun Feb 16-10:40:57 mcb@ken007:~/bin/gdata 2154$ find . -type f -print0 | xargs -0 ls -la
    -rw------- 1 mcb mcb  127095 May 16  2012 ./gdata-client-1.0.jar
    -rw------- 1 mcb mcb    1421 May 16  2012 ./gdata-client-meta-1.0.jar
    -rw------- 1 mcb mcb 1039559 May 16  2012 ./gdata-core-1.0.jar
    -rw------- 1 mcb mcb  121296 May 16  2012 ./gdata-docs-3.0.jar
    -rw------- 1 mcb mcb    4554 May 16  2012 ./gdata-docs-meta-3.0.jar
    -rw------- 1 mcb mcb   68596 May 16  2012 ./gdata-media-1.0.jar
    -rw------- 1 mcb mcb   51629 May 16  2012 ./gdata-spreadsheet-3.0.jar
    -rw------- 1 mcb mcb    2427 May 16  2012 ./gdata-spreadsheet-meta-3.0.jar
    -rw-r--r-- 1 mcb mcb  548821 Apr  6  2009 ./google-collect-1.0-rc1.jar
    -rwxr-xr-x 1 mcb mcb 2288505 Feb 16 00:06 ./google-docs-upload-1.4.7.jar
    -rw------- 1 mcb mcb 1648200 Feb 16 09:51 ./guava-11.0.2.jar
    -rw-r--r-- 1 mcb mcb  494975 Oct 19  2011 ./java-mail-1.4.4.jar
    -rw------- 1 mcb mcb   33017 Apr 16  2012 ./jsr305.jar
    Sun Feb 16-10:41:08 mcb@ken007:~/bin/gdata 2155$ find . -type f | xargs -i ls -la "{}"
    -rw------- 1 mcb mcb 51629 May 16  2012 ./gdata-spreadsheet-3.0.jar
    -rw------- 1 mcb mcb 121296 May 16  2012 ./gdata-docs-3.0.jar
    -rw------- 1 mcb mcb 1039559 May 16  2012 ./gdata-core-1.0.jar
    -rw------- 1 mcb mcb 127095 May 16  2012 ./gdata-client-1.0.jar
    -rw------- 1 mcb mcb 33017 Apr 16  2012 ./jsr305.jar
    -rw------- 1 mcb mcb 1648200 Feb 16 09:51 ./guava-11.0.2.jar
    -rw------- 1 mcb mcb 4554 May 16  2012 ./gdata-docs-meta-3.0.jar
    -rw-r--r-- 1 mcb mcb 494975 Oct 19  2011 ./java-mail-1.4.4.jar
    -rw------- 1 mcb mcb 68596 May 16  2012 ./gdata-media-1.0.jar
    -rw-r--r-- 1 mcb mcb 548821 Apr  6  2009 ./google-collect-1.0-rc1.jar
    -rwxr-xr-x 1 mcb mcb 2288505 Feb 16 00:06 ./google-docs-upload-1.4.7.jar
    -rw------- 1 mcb mcb 1421 May 16  2012 ./gdata-client-meta-1.0.jar
    -rw------- 1 mcb mcb 2427 May 16  2012 ./gdata-spreadsheet-meta-3.0.jar
    1. If you’re using GNU find, you might also find the following useful:

      find . -type f -exec ls -la {} +

      This effectively does the same thing as the -print0 | xargs -0

  8. echo is bad. Don’t use it. Instead of echo "$value" use printf "%s" "$value". The example with the for loop over "$@" should be fixed.

  9. No, printf(1) is bad because it is an external utility and thus slow.

    Use the print built-in if you can make use of
    basic Korn shell features, i.e. if you know you
    are running under ksh88, pdksh, mksh, ksh93, MKS ksh.
    Avoid scripting for just POSIX then…

    That being said, if you must write for POSIX sh,
    or GNU bash (slower and less free than Korn Shell),
    you should indeed prefer printf over echo.

    1. printf is not an external utility, it is one of bash’s built-in commands. I don’t know how to check whether it’s built-in for POSIX sh as well, but then again on most systems bash is linked to sh. Also, for an article like this, I would personally avoid ksh. It’s too specific.

    2. Printf is a Bash builtin.

      $ type -a printf
      printf is a shell builtin
      printf is /usr/bin/printf
      printf is /bin/printf

  10. “If you ask rm to delete a non-existent file, it will complain and your script will terminate. (You are using -e, right?) You can fix this by using -f, which will silently continue if the file didn’t exist.”


    #! /usr/bin/env bash
    set -e
    rm -rf $FOO/$BAR
    # This will even succeed even if $FOO/$BAR even doesn’t even exist

    1. Exactly this made steam delete all files starting at /
      It even has a #scary comment associated with the -rf but visibly the programmer did not realize why it was scary.

      So please remove the tip of using “-f” to avoid errors.
      Either unset -e for the rm operation or test first.

      I don’t use -e but test return codes instead. My usage of $? can be called as “intense”.

  11. One problem I have run into using this technique for critical sections occurs when the file system is full. In this case, the lock file is created, but the echo command fails to write anything into the lock file, and produces an error. Now the lock file exists, with 0 length, and there is no mechanism to clean it up. When the file system is cleaned up and the script runs again, it will never be able to obtain the lock.

  12. I know I am commenting an older post, but I think it is still relevant.

    First, thank for useful tips.

    Second, it is a pity that you do not keep some of your own recommendations in this article, e.g. «trap “rm -f $lockfile; exit” INT TERM EXIT», «if [ -e /home/$user ]; then» and «for file in $(find /var/www -type f -name “*.html”); do». In all these cases, spaces can make the script behave incorrectly.

    In some specific cases, this might have some security implications. For example, if the attacker can control end of the name of the $lockfile, he can locate it somewhere in “…/foo /etc” (space is intentional). Well, this is somehow artifical example, some more realistic scenarios will come with passing find results to some command.

  13. Good article!

    One thing to note is that you are not safe from race conditions using any option that you describe. You need to create some kind of semaphore. That can be made easily with mkdir, which returns non-zero if the directory cannot be created. You can use it like this

    if mkdir "/tmp/semaphoredir.lock"; then …

    It checks if it can create the dir, creating it atomically.

    Another way is to use /usr/bin/flock, but it is not guaranteed to be installed in all systems.

  14. Hi,

    my issue is :
    #/etc/init.d/tomcat start ; ls && cat /etc/passwd
    here in the above command i want that script should not accept ; && or | so that i can prevent command line injection.

    can you please help in this , this is very important for me. Thanks.

  15. The easiest way to handle the critical section is using mkdir, which both checks for the existence of the directory, and creates one if it doesn’t exist.

    if mkdir boo; then
    echo 1
    echo 2
    fi
