I noticed today that Mark Pilgrim
linked to Eddie, my liberal RSS and Atom parsing library for Java, so I
figured I should make a new release. It’s been a few months since I did
any serious work on the parser, but in the last few days I’ve reduced
the number of failures to fewer than 100 of the 3502 test cases
that come with Mark’s Feedparser library for Python. The
majority of the remaining failures are in the date parsing routines, or are due to
bugs in the Jython library which cause literal dictionaries not to match
classes inherited from PyDictionary.

Improvements in this version include:

  • Massively improved support for different character encodings. With
    Java 6, it also has support for UTF32 feeds.
  • CDF Support.
  • Optional support of TagSoup for sanitizing of HTML in entries.
  • Improved support for different input sources including String,
    InputStream and byte[].
  • Numerous bug fixes, with 97% of test cases passing, up from 90%.

If you use Eddie, drop me an email. I’d like to thank Mark Pilgrim
again for providing the community with a fantastic and comprehensive
suite of test cases, extensive documentation and a first class Python
library.

I was quietly minding my own business, fixing some encoding bugs in
Eddie, my liberal RSS and Atom parser, when I noticed that Java 6
included support for UTF-32, one of the encodings whose tests were
failing. I installed the Ubuntu packages
and decided to run a quick benchmark using my unit tests.

First up was the Sun Java 5 JVM. I’d been running the unit tests all
night, but this time I timed them and got these results:

Ran 3502 tests
Passed 3322 tests
Failed 180 tests

real    1m10.293s
user    0m40.375s
sys     0m3.632s

Next I tried the Sun Java 6 JVM, using the same jar files, and
got:

Ran 3502 tests
Passed 3326 tests
Failed 176 tests

real    0m56.059s
user    0m39.198s
sys     0m4.212s

One thing to note was that it spent a couple of seconds noticing new
jars to read, so I decided to run it again and got:

Ran 3502 tests
Passed 3326 tests
Failed 176 tests

real    0m45.317s
user    0m34.770s
sys     0m3.516s

Wow, I’d gone from 70 seconds to 45 seconds using the new runtime,
and interestingly enough, passed 4 more tests in the process. I’m assuming
they are the UTF-32 tests, although I haven’t checked yet. The other thing
for me to try is recompiling the code to see if that has any additional
benefits.

Update: I got around to checking what Java 6 fixed and
it turned out it was the additional support for the koi-u and
cspc862latinhebrew encodings. After I fixed the UTF-32 support in Eddie,
it passed an additional 16 tests. Down to just 160 out of 3502. I just
wish they would add support for some of the stranger encodings. Maybe
this will happen when it’s open source.

Java has a nice IO subsystem. In particular, it has been designed
such that input streams can optionally support a feature where a
programmer can mark a position in the stream and at a later
stage return to that point to read the data again. Programmers can check
for this support by calling InputStream.markSupported().
Unfortunately I’ve had the need for this support, but many streams
don’t provide it; FileInputStream, for example, returns false.
(ByteArrayInputStream and BufferedInputStream do support it.) Fortunately it’s
fairly trivial to wrap an InputStream in another class which
will add this support. Here is my quick adaptor, which seems to work
for most cases.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;

public class MarkableInputStream extends InputStream {
    private InputStream inputstream;
    private int maxpos = 0;
    private int curpos = 0;
    private int mark = 0;
    private ArrayList<Integer> buffer = new ArrayList<Integer>();
    public MarkableInputStream(InputStream is) {
        inputstream = is;
    }

    @Override
    public int read() throws IOException {
        int data;
        if(curpos == maxpos) {
            data = inputstream.read();
            buffer.add(data); maxpos++;curpos++;
        } else {
            data = buffer.get(curpos++);
        }
        return data;
    }

    @Override
    public synchronized void mark(int readlimit) {
        mark = curpos;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void reset() throws IOException {
        curpos = mark;
    }
}

You can use it like:

if (!istream.markSupported()) {
   istream = new MarkableInputStream(istream);
}

This could probably be improved on, most notably by not using an
ArrayList. I’m not sure what performance penalty that adds. It
should be possible to use a normal array as the readlimit
parameter to mark() says how many bytes the stream should
record before throwing old data away in favour of new input. The class
above will record all data from the start of the stream, so could result
in a significant amount of memory usage. Hope you find it useful.
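
For comparison, here is a minimal sketch of the same idea using the standard library's BufferedInputStream, which honours the readlimit passed to mark() and so does not keep the whole stream in memory. The class and method names (MarkResetDemo, markable, peek) are made up for this illustration and are not part of Eddie.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MarkResetDemo {
    // Wrap the stream only if it cannot already mark/reset.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Look at up to n bytes, then rewind so the next reader
    // sees the stream from the same position again.
    static String peek(InputStream in, int n) {
        try {
            in.mark(n);                      // remember this position for up to n bytes
            byte[] head = new byte[n];
            int read = in.read(head, 0, n);  // may return fewer than n bytes, or -1 at EOF
            in.reset();                      // go back to the marked position
            return new String(head, 0, Math.max(read, 0), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Peeking like this is handy for sniffing an XML declaration or byte order mark before handing the stream to a parser.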

Imagine you’ve got some text you’ve been told is ASCII and you’ve
told Java that it’s ASCII using:

Reader reader = new InputStreamReader(inputstream, "ASCII");

Imagine your surprise when it happily reads in non-ASCII bytes, say
from UTF-8 or ISO-8859-1 text, and silently turns each one into a replacement character.

import java.io.*;

public class Example1 {

   public static void main(String[] args) {
      try{
         FileInputStream is = new FileInputStream(args[0]);
         BufferedReader reader
            = new BufferedReader(new InputStreamReader(is, args[1]));
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      } catch (Exception e) {
         System.out.println(e);
      }
   }
}

beebo david% java Example1 utf8file.txt ascii
I��t��rn��ti��n��liz��ti��n
beebo david% java Example1 utf8file.txt utf8
Iñtërnâtiônàlizætiøn

So, I hear you ask, how do you get Java to be strict about the
conversion? Well, the answer
is to look up a Charset object, ask it for a CharsetDecoder object and
then set the onMalformedInput option to
CodingErrorAction.REPORT. The resulting code is:

import java.io.*;
import java.nio.charset.*;

public class Example2 {

   public static void main(String[] args) {
      try{
         FileInputStream is = new FileInputStream(args[0]);
         Charset charset = Charset.forName(args[1]);
         CharsetDecoder csd = charset.newDecoder();
         csd.onMalformedInput(CodingErrorAction.REPORT);
         BufferedReader reader
            = new BufferedReader(new InputStreamReader(is, csd));
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      } catch (Exception e) {
         System.out.println(e);
      }
   }
}

This time when we run it, we get:

beebo david% java Example2 utf8file.txt ascii
java.nio.charset.MalformedInputException: Input length = 1
beebo david% java Example2 utf8file.txt utf8
Iñtërnâtiônàlizætiøn
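
The same strict decoding can be done on bytes already in memory, which is convenient when trying several candidate encodings in turn. This is only a sketch: StrictDecode and decodeOrNull are names invented for the example, and returning null on failure is just one possible convention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    // Returns the decoded text, or null if the bytes are not valid
    // in the named charset.
    static String decodeOrNull(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException and UnmappableCharacterException both land here.
            return null;
        }
    }
}
```

A caller could walk a list of likely encodings and keep the first one for which decodeOrNull returns a non-null result.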

On a slightly related note, if anyone knows how to get Java to decode
UTF32, VISCII, TCVN-5712, KOI8-U or KOI8-T, I would love to know.

Update: (2007-01-26) Java 6 has support for UTF32
and KOI8-U.
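
A quick way to find out which of these encodings a given JVM can decode is to ask Charset directly. The sketch below assumes nothing beyond the standard library; CharsetCheck and supported are illustrative names, and the result depends entirely on the runtime you execute it on.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class CharsetCheck {
    // Maps each requested charset name to whether this JVM supports it.
    static Map<String, Boolean> supported(String... names) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : names) {
            boolean ok;
            try {
                ok = Charset.isSupported(name);
            } catch (IllegalArgumentException e) {
                ok = false; // the name itself is not a legal charset name
            }
            result.put(name, ok);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(supported("UTF-32", "VISCII", "TCVN-5712", "KOI8-U", "KOI8-T"));
    }
}
```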

Class::DBI is a very nice database abstraction layer for Perl. It allows you to define
your tables and columns and it magically provides you with classes with
accessors/mutators for those columns. With something like
Class::DBI::Pg, you don’t even need to tell it your columns; it
asks the database on startup. It’s all very cool mojo and massively
decreases the development time on anything database related in Perl.

Unfortunately, as far as I can tell, it has a massive performance
problem in its design. One of the features of Class::DBI is lazy
population of data. It won’t fetch data from the database until you try
to use one of the accessors. This isn’t normally a problem, except with
retrieve_all(). Basically this function returns a list with one
object for every row in your table. Unfortunately, due to the lazy
loading of data, retrieve_all() calls SELECT id FROM table;
and then every time you use an object it calls
SELECT * FROM table WHERE id = n;. For a small table, this isn’t too bad, but
for a large table, it’s a killer.
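
To make the cost concrete, here is a toy model in Java. It is not Class::DBI and talks to no database: FakeTable is a hypothetical stand-in that only counts how many queries each access pattern would issue.

```java
import java.util.ArrayList;
import java.util.List;

class QueryCount {
    // Hypothetical stand-in for a table; it only counts queries.
    static final class FakeTable {
        final int rows;
        int queries = 0;

        FakeTable(int rows) { this.rows = rows; }

        List<Integer> selectIds() {       // SELECT id FROM table;
            queries++;
            List<Integer> ids = new ArrayList<>();
            for (int i = 1; i <= rows; i++) ids.add(i);
            return ids;
        }

        String selectRow(int id) {        // SELECT * FROM table WHERE id = n;
            queries++;
            return "row " + id;
        }

        List<String> selectAll() {        // SELECT * FROM table;
            queries++;
            List<String> all = new ArrayList<>();
            for (int i = 1; i <= rows; i++) all.add("row " + i);
            return all;
        }
    }

    // Lazy pattern: one query for the ids, then one more per row touched.
    static int lazyQueries(int rows) {
        FakeTable t = new FakeTable(rows);
        for (int id : t.selectIds()) t.selectRow(id);
        return t.queries;
    }

    // Eager pattern: a single query fetches everything.
    static int eagerQueries(int rows) {
        FakeTable t = new FakeTable(rows);
        t.selectAll();
        return t.queries;
    }
}
```

For a table of N rows the lazy pattern issues N + 1 queries against a single query for the eager one, so the round trips grow with the table.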

I did a little benchmark today to see just how much slower it is than
plain DBI. I wrote two functions which iterate over a table, assigning
one value to a variable (forcing Class::DBI to fetch the data). The
table in question contains 635 rows. The code I used was:

use strict;
use warnings;

use Benchmark qw(:all) ;

use Foo;

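use DBI;

# Placeholder connection details (assumed values; substitute your own).
my ($db, $host, $user, $passwd) = ('mydb', 'localhost', 'user', 'secret');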

sub class_dbi {
   for my $foo (Foo->retrieve_all()) {
      my $bar = $foo->bar;
   }
}

sub dbi {
   my $dbh = DBI->connect("dbi:Pg:dbname=$db;host=$host",$user, $passwd);
   my $sth = $dbh->prepare("SELECT * FROM foos;");
   $sth->execute();
   while(my $row = $sth->fetchrow_hashref()) {
      my $bar = $row->{bar};
   }
}

cmpthese(100, {
      'Class::DBI' => 'class_dbi();',
      'DBI' => 'dbi();',
   });

The results:

brick david% perl benchmark.pl
           s/iter Class::DBI        DBI
Class::DBI   10.3         --       -97%
DBI         0.351      2845%         --

Class::DBI is more than 28 times slower than using DBI directly. I’m
hoping that someone will now tell me “Oh you just do blah”, otherwise
I’m going to have to rewrite some of my code. One thing to learn from
this is that reduction in development time can often cost you more in
other areas, and it’s often runtime performance.

Update: It appears that the bug is that
Class::DBI::Pg doesn’t set the Essential list of columns, so
Class::DBI fetches only the primary column. You can fix this by adding the
following to your database modules:

__PACKAGE__->columns(Essential => __PACKAGE__->columns);

Remember you’ll need to do that for each of your modules; you won’t
be able to do it in your superclass, as you won’t have discovered your
columns yet. This has increased performance, but not massively. New
timings (with the addition of using Class::DBI through an iterator):

              s/iter Class::DBI it    Class::DBI           DBI
Class::DBI it   6.35            --           -2%          -94%
Class::DBI      6.23            2%            --          -94%
DBI            0.350         1714%         1680%            --

Update 2: It appears that further speed gains can be
made by not using Class::DBI::Plugin::DateTime::Pg to convert the three
timestamp columns in my table into DateTime objects:

              s/iter Class::DBI it    Class::DBI           DBI
Class::DBI it   1.26            --          -11%          -72%
Class::DBI      1.12           12%            --          -69%
DBI            0.350          260%          220%            --

A few hours ago I got stressed about the lack of leg room under my
desk and ended up spending the next few hours tidying and moving all of my
computers to under the next desk. I also made the mistake of starting to
remove keys from my keyboard to clean something sticky and found myself
surrounded by keys and a keyless keyboard. It’s now nice and shiny,
which is more than can be said for the rest of the flat, which is now
overrun with all the crap that was around my desk.

Another thing that could do with a tidy up is Eddie, my
Java liberal feed parsing library. After the initial coding sprint, I’ve had time
to sit back and look at the design of the library and clean up anything
that sticks out. As mentioned in a previous entry, one of the things
that has bothered me is that whenever you
need to call an object method, you need to be certain that the object is
not null. This means you end up with code like:

if (string != null && string.equals("string")) {

This quickly becomes tiresome and the test for null distracts from
the meaning of the code. Fortunately I was reminded of an improvement
for string objects. Ideally, we should all be writing comparison
conditionals like rvalue == lvalue (an rvalue is, roughly, an expression
you can’t assign to). The most common rvalue is a literal value like a
string constant. The
advantage of getting into the habit of writing code like this is that
you’ll discover at compile time when you accidentally write =
rather than ==. Because you can’t assign to an rvalue, the
compiler will complain. What makes this interesting from a Java string
point of view is that you can call methods on string literals. The
advantage of calling .equals() on a string literal, rather than
on a variable, is that the string literal is never going to be null, so you
can remove the test for null and simplify the code:

if("string".equals(string)) {
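
As a small sketch of the difference (NullSafeEquals and its method names are invented for the example), the literal-first form copes with null, and java.util.Objects.equals, available from Java 7, does the same when neither side is a literal:

```java
import java.util.Objects;

class NullSafeEquals {
    // The literal is never null, so this cannot throw NullPointerException.
    static boolean isString(String s) {
        return "string".equals(s);
    }

    // For two variables, Objects.equals handles null on either side.
    static boolean same(String a, String b) {
        return Objects.equals(a, b);
    }
}
```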

I know it’s not everyone’s cup of tea, but I prefer it to testing for
null every time I look at a string. The other thing is that I’ve been
reading Hardcore Java by Robert Simmons at work. Considering I’ve only
got a few pages in so far, I’ve received a surprisingly large number of
ideas to improve my code.

The one that sticks in my head is using assert for checking
pre- and postconditions on your functions. Using asserts has a number of
advantages over throwing exceptions, including the fact that they are
skipped at runtime unless assertions are enabled with -ea, so they cost
next to nothing in a production release. In Eddie, while handling a
<feed> element I determine the version of Atom that we are
parsing. This had a number of nested if/else if/else blocks. At
the end of the function, I wanted to make sure I had set the version
string to something, so had the following code:

if (!this.feed.has("format")) {
   throw new SAXParseException("Failed to detect Atom format", this.locator);
}

However, using assertions I can write this as

assert(this.feed.has("format")) : "Failed to detect Atom format";
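
Here is a self-contained sketch of the same postcondition pattern. The AtomVersion class, the detectFormat method and the format labels are made up for illustration and are not Eddie's actual code.

```java
class AtomVersion {
    // Hypothetical mapping from a feed's version attribute to a format label.
    static String detectFormat(String version) {
        String format = null;
        if ("0.3".equals(version)) {
            format = "atom03";
        } else if ("1.0".equals(version)) {
            format = "atom10";
        } else {
            format = "atom";
        }
        // Postcondition: every branch above must have set a format.
        // Only checked when the JVM is started with -ea.
        assert format != null : "Failed to detect Atom format";
        return format;
    }
}
```

Because assertions are off unless the JVM is run with -ea (or -enableassertions), they suit internal sanity checks like this one; anything that validates external input still wants a real exception.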

I highly recommend the Hardcore Java book if you want to improve your Java
programming. It includes sections on the new features of Java 1.5 and
using collections. I’ve made a couple of other cleanups, including going through member
variable access specifiers to make sure they are right and making
several public methods and variables
private. I also have a couple of ideas about refactoring some
of the code to clean it up. Redesigning and refactoring code is almost
more fun than writing it in the first place. You get to be in
competition with yourself, challenging yourself to write better code,
and end up with cleaner code in the process.

A couple of things I want to do in the near future are to use a profiler
and code coverage tools. If anyone has recommendations for either of
these tools that integrate nicely with Eclipse, I’d love to know.

Just when you thought Perl couldn’t get more unreadable, someone[0] comes up with something like this:

print join ", ", map ord, split //, $foo;

This mess of Perl might be easier to understand if I put the brackets
in:

print join (", ", map( ord, split( //, $foo)));

What this does is split $foo into a list of characters. It then uses
map to run ord() on each item in the list to return a new list
containing the numeric character values. We then join these again with
“, ” to make the output easier to read.

david% perl -e 'print join ", ", map ord, split //, "word";'
119, 111, 114, 100

The map function is familiar to functional programmers and is very
powerful, but beware it can reduce the clarity of your code.
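
For comparison, a rough Java equivalent of the same split/map/join pipeline using streams (Java 8 or later); CharCodes and codes are names chosen for this sketch:

```java
import java.util.stream.Collectors;

class CharCodes {
    // Turns each character into its numeric value and joins them with ", ".
    static String codes(String s) {
        return s.chars()                          // one int per UTF-16 code unit
                .mapToObj(c -> Integer.toString(c))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(codes("word")); // prints 119, 111, 114, 100
    }
}
```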

[0] Me

I’d like to announce the initial release of Eddie, a feed parser
library written in Java. It’s taken me over 100 hours, but it now
passes 90% of the FeedParser unit tests, including all the RSS and Atom
tests. It’s GPLed, with an exception allowing you to use it in any open
source program. Get it at my website.
I still need to add documentation, and character set and encoding support. I also
need to separate the testing infrastructure from the rest of the code.

This is the first time I’ve done any Java programming in anger, and I
have to say I’m surprised to discover I quite like it. In many ways it
seems a very quick language to program in. It seems almost like
programming in a scripting language, but more strongly typed. This is
probably due to not having to worry about memory management. Certainly I
don’t think I could have written this quite so quickly in C++.

Having said that, there are a couple of things that I don’t like about
Java. Every object is a pointer. This is useful at times, but it means
that every time you want to call a method on an object you have to test
whether it is null or you run the risk of getting the dreaded
NullPointerException. Java also doesn’t have keywords for
and, or and not. I know not everyone likes
these, but I keep finding myself trying to use them.

I’m sure there are other things I hated, but I can’t remember them
now. I think I’ll end up doing more Java programming in the future.

I’ve recently had cause to parse some date values in Java. As a
result I’ve produced a class which can manage to parse an awful lot of
date formats. I thought I’d better document it in case someone found it
useful. Certainly there doesn’t appear to be anything elsewhere which
shows you how to parse lots of formats. I have found the order of
date_formats to be very brittle, so I don’t recommend you
change it without an awful lot of test cases.

Anyway, without further ado, I present to you the Pathological
Date Parser for Java:

// Copyright 2006 David Pashley <david@davidpashley.com>
// Licensed under the GPL version 2
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class Date {
    private Calendar date;

    static String[] date_formats = {
            "yyyy-MM-dd'T'kk:mm:ss'Z'",        // ISO
            "yyyy-MM-dd'T'kk:mm:ssz",          // ISO
            "yyyy-MM-dd'T'kk:mm:ss",           // ISO
            "EEE, d MMM yy kk:mm:ss z",        // RFC822
            "EEE, d MMM yyyy kk:mm:ss z",      // RFC2822
            "EEE MMM  d kk:mm:ss zzz yyyy",    // ASC
            "EEE, dd MMMM yyyy kk:mm:ss",   //Disney Mon, 26 January 2004 16:31:00 ET
            "-yy-MM",
            "-yyMM",
            "yy-MM-dd",
            "yyyy-MM-dd",
            "yyyy-MM",
            "yyyy-D",
            "-yyMM",
            "yyyyMMdd",
            "yyMMdd",
            "yyyy",
            "yyD"

    };
    public Date(String d) {
        SimpleDateFormat formatter = new SimpleDateFormat();
        d = d.replaceAll("([-+]\\d\\d:\\d\\d)", "GMT$1"); // Correct W3C times
        d = d.replaceAll(" ([ACEMP])T$", " $1ST"); // Correct Disney timezones
        for (int i = 0; i < date_formats.length; i++) {
           try {
              formatter.applyPattern(date_formats[i]);
              formatter.parse(d);
              date = formatter.getCalendar();
              break;
           } catch(Exception e) {
              // Oh well. We tried
           }
        }

    }
}

The only date formats I can’t get it to parse are
<4-digit year>-<day of year> and <2-digit year><day of year>
(e.g. 2003-335 and 03335 for
2003-12-01). If you can add support for those and other date formats
I’ll gladly take patches.
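
For what it’s worth, the java.time API that arrived much later, in Java 8, handles ordinal dates directly. The sketch below is separate from the class above: OrdinalDates and its method names are invented, and treating two-digit years as 2000 to 2099 is purely an assumption of this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class OrdinalDates {
    // Parses <4-digit year>-<day of year>, e.g. "2003-335".
    static LocalDate parseLong(String s) {
        return LocalDate.parse(s, DateTimeFormatter.ISO_ORDINAL_DATE);
    }

    // Parses <2-digit year><day of year>, e.g. "03335".
    // Assumption: two-digit years fall in 2000-2099.
    static LocalDate parseShort(String s) {
        int year = 2000 + Integer.parseInt(s.substring(0, 2));
        int dayOfYear = Integer.parseInt(s.substring(2));
        return LocalDate.ofYearDay(year, dayOfYear);
    }

    public static void main(String[] args) {
        System.out.println(parseLong("2003-335"));  // prints 2003-12-01
        System.out.println(parseShort("03335"));    // prints 2003-12-01
    }
}
```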

Have you ever wanted to call a member function in your class, but not
known what it will be at compile time? I’m writing a SAX parser and
would like a function for every element name. I could write a massive
switch statement in the startElement function, but this will
quite quickly become unmanageable for a large schema. The alternative is
to look to see if a particular member function exists and call it.

To do this little bit of magic we need to use Java’s reflection API. The
first thing to do is to get a Class object for our class. We
can do that by calling:

Class klass = this.getClass();

We can then look up the method we are looking for using
Class.getMethod, but this function requires an array of types
that the method we are looking for takes as parameters, so we get the
right version of an overloaded method. We can do this with:

Class[] arguments = { Integer.class, String.class, URL.class };
Method method = klass.getMethod("foo", arguments);

Now we have our method, we can call it using the
Method.invoke() call. This takes an object as the first
parameter, for which we can use this, and an array of
Objects for the parameters.

Object[] values = {bar, baz, quux};
method.invoke(this, values);

But what happens if our class has no member method called
foo()? Well, Class.getMethod() will throw a
NoSuchMethodException, so we can just wrap a
try/catch block around the code to deal with unhandled
functions. It’s worth pointing out that Class.getMethod() also
throws SecurityException and Method.invoke() throws
IllegalAccessException, IllegalArgumentException and
InvocationTargetException, so you’ll want to catch
Exception too.

We can chain some of these calls together and the result for my SAX
parser is:

public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
   try {
       Class[] argTypes = { String.class, String.class, String.class,
               Attributes.class };
       Object[] values = { uri, localName, qName, atts };
       this.getClass().getMethod("startElement_" + localName, argTypes)
               .invoke(this, values);
   } catch (NoSuchMethodException e) {
       log.debug("unhandled element " + localName);
   } catch (Exception e) {
       e.printStackTrace();
   }
}

With this arrangement, when I want to handle a new element in my code I
can just make a function like:

public void startElement_foo(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
   ...
}
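
The same dispatch-by-name trick can be tried without a SAX parser. In this cut-down sketch the Dispatcher class and its handle_ methods are hypothetical; note that Class.getMethod() only finds public methods, so the handlers have to be declared public.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

class Dispatcher {
    // Handlers are found by name, so adding a new one needs no other changes.
    public String handle_title(String text) { return "title: " + text; }
    public String handle_link(String text)  { return "link: " + text; }

    String dispatch(String element, String text) {
        try {
            // Look up a public method called handle_<element> taking one String.
            Method m = getClass().getMethod("handle_" + element, String.class);
            return (String) m.invoke(this, text);
        } catch (NoSuchMethodException e) {
            return "unhandled element " + element;
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Dispatcher d = new Dispatcher();
        System.out.println(d.dispatch("title", "Eddie")); // prints title: Eddie
        System.out.println(d.dispatch("foo", "bar"));     // prints unhandled element foo
    }
}
```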