Sun, 28 Jan 2007

Lazy Class Infrastructure

Do you ever feel you should implement equals(), hashCode() and toString, but just can't be bothered to do it for every class? Well, if you aren't bothered by speed, you can use Jakarta Commons Lang to do it for you. Just add this to your class:

import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.commons.lang.builder.EqualsBuilder;
import org.apache.commons.lang.builder.HashCodeBuilder;

class Foo {
   public int hashCode() {
      return HashCodeBuilder.reflectionHashCode(this);
   }
   public boolean equals(Object other) {
      return EqualsBuilder.reflectionEquals(this,other);
   }
   public String toString() {
      return ToStringBuilder.reflectionToString(this);
   }
}

And that's it. Your class will just do the right thing. As you can probably guess from the function names, it uses reflection, so may be suboptimal. If you need performance, you can use tell it to use particular members, but I think I'll leave that up to a future article. I also recommend you don't use this technique if you are using something like Hibernate, which does things behind the scenes on member access; you may find it does undesirable things. :)

[] | # Read Comments (0) |

Comments

Fri, 26 Jan 2007

Eddie 0.2 RSS and Atom Parser

I noticed today that Mark Pilgrim linked to Eddie, my liberal RSS and Atom parsing library for Java, so I figured I should make a new release. It's been a few months since I did any serious work on the parser, but in the last few days I've reduced the number of test case failures to less than 100 out of 3502 test cases which come as part of Mark's Feedparser parser for python. The majority of the failures are in the date parsing routines and due to bugs in the Jython library which cause literal dictionaries not to match with classes inherited fro PyDictionary.

Improvements in this version include:

  • Massively improved support for different character encodings. With Java 6, it also has support for UTF32 feeds.
  • CDF Support.
  • Optional support of TagSoup for sanitizing of HTML in entries.
  • Improved support for different input sources including String, InputStream and byte[].
  • Numerous bug fixes, with 97% of test cases passing, up from 90%

If you use Eddie, drop me an email. I'd like to thank Mark Pilgrim again for providing the community with a fantastic and comprehensive suite of test cases, extensive documentation and a first class Python library.

[] | # Read Comments (0) |

Comments

Thu, 25 Jan 2007

Speedy Java 6

I was quietly minding my own business, fixing some encoding bugs in Eddie, my liberal RSS and Atom parser, when I noticed that Java 6 included support for UTF-32, which is one of the encoding tests that was failing. I downloaded and installed the Ubuntu packages and installed it, and decided to run a quick benchmark using my unit tests.

First up was the Sun Java 5 JVM. I'd been running the unit tests all night, but timed it this time,and got these results:

Ran 3502 tests
Passed 3322 tests
Failed 180 tests

real    1m10.293s
user    0m40.375s
sys     0m3.632s

Next I tried the Sun Java 6 JVM, using the same jar files and got;

Ran 3502 tests
Passed 3326 tests
Failed 176 tests

real    0m56.059s
user    0m39.198s
sys     0m4.212s

One thing to note was that it spend a couple of seconds noticing new jars to read, so I decided to run it again and got:

Ran 3502 tests
Passed 3326 tests
Failed 176 tests

real    0m45.317s
user    0m34.770s
sys     0m3.516s

Wow, I'd gone from 70 seconds to 45 seconds using the new runtime, and interestingly enough, past 4 more tests in the process. I'm assuming they are the UTF-32 tests, although I have't checked yet. The other thing for me to try is recompiling the code to see if that has any additional benefits.

Update: Got around to checking what Java 6 fixed and it turned out it was the additional support for koi-u and cspc862latinhebrew encodings. After I fixed the UTF32 support in Eddie, it passed an additional 16 tests. Down to just 160 out of 3502. I just wish they would add support for some of the stranger encodings. Maybe this will happen when it's open source.

[] | # Read Comments (1) |

Comments

Sun, 18 Jun 2006

Markable InputStreams

Java has a nice IO subsystem. In particular, it has been designed such that input streams can optionally support a feature where a programmer can mark a position in the stream and at a later stage return to that point to read the data again. Programmers can check for this support by calling InputStream.markSupported(). Unfortunately I've had the need for this support, but I haven't managed to find a stream which supports this. Not even ByteArrayInputStream sees to support it. Fortunately it's fairly trivial to wrap an InputStream in another class which will add this support. Here is my quick adaptor, which seems to work for most cases.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;

public class MarkableInputStream extends InputStream {
    private InputStream inputstream;
    private int maxpos = 0;
    private int curpos = 0;
    private int mark = 0;
    private ArrayList<Integer> buffer = new ArrayList<Integer>();
    public MarkableInputStream(InputStream is) {
        inputstream = is;
    }

    @Override
    public int read() throws IOException {
        int data;
        if(curpos == maxpos) {
            data = inputstream.read();
            buffer.add(data); maxpos++;curpos++;
        } else {
            data = buffer.get(curpos++);
        }
        return data;
    }

    @Override
    public synchronized void mark(int readlimit) {
        mark = curpos;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void reset() throws IOException {
        curpos = mark;
    }
}

You can use it like:

if (!istream.markSupported()) {
   istream = new MarkableInputStream(istream);
}

This could probably be improved on, most notably by not using an ArrayList. I'm not sure what performance penalty that adds. It should be possible to use a normal array as the readlimit parameter to mark() says how many bytes the stream should record before throwing old data away in favour of new input. The class above will record all data from the start of the stream, so could result in a significant amount of memory usage. Hope you find it useful.

[] | # Read Comments (1) |

Comments

Sat, 17 Jun 2006

Invalid Characters in Encoding with JAVA

Imagine you've got some text you've been told is ASCII and you've told java that it's ASCII using:

Reader reader = new InputStreamReader(inputstream, "ASCII");

Imagine your surprise when it happily reads in non-ascii values, say UTF-8 or ISO8859-1, and converts them to a random character.

import java.io.*;

public class Example1 {

   public static void main(String[] args) {
      try{
         FileInputStream is = new FileInputStream(args[0]);
         BufferedReader reader 
            = new BufferedReader(new InputStreamReader(is, args[1]));
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      } catch (Exception e) {
         System.out.println(e);
      }
   }
}
beebo david% java Example1 utf8file.txt ascii
I��t��rn��ti��n��liz��ti��n
beebo david% java Example1 utf8file.txt utf8
Iñtërnâtiônàlizætiøn

So, I hear you ask, how do you get Java to be strict about the conversion. Well, answer is to lookup a Charset object, ask it for a CharsetDecoder object and then set the onMalformedInput option to CodingErrorAction.REPORT. The resulting code is:

import java.io.*;
import java.nio.charset.*;

public class Example2 {

   public static void main(String[] args) {
      try{
         FileInputStream is = new FileInputStream(args[0]);
         Charset charset = Charset.forName(args[1]);
         CharsetDecoder csd = charset.newDecoder();
         csd.onMalformedInput(CodingErrorAction.REPORT);
         BufferedReader reader 
            = new BufferedReader(new InputStreamReader(is, csd));
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      } catch (Exception e) {
         System.out.println(e);
      }
   }
}

This time when we run it,we get:

beebo david% java Example2 utf8file.txt ascii
java.nio.charset.MalformedInputException: Input length = 1
beebo david% java Example2 utf8file.txt utf8
Iñtërnâtiônàlizætiøn

On a slightly related note, if anyone knows how to get Java to decode UTF32, VISCII, TCVN-5712, KOI8-U or KOI8-T, I would love to know.

Update: (2007-01-26) Java 6 has support for UTF32 and KOI8-U.

[] | # Read Comments (0) |

Comments

Thu, 01 Jun 2006

Tidying Up

A few hours ago I got stressed about the lack of leg room under my desk and ended up spending the next few tidying and moving all of my computers to under the next desk. I also made the mistake of starting to remove keys from my keyboard to clean something sticky and found myself surrounded by keys and a keyless keyboard. It's now nice and shiny, which is more than can be said for the rest of the flat, which is now overrun with all the crap that was around my desk.

Another thing that could do with a tidy up is Eddie, my Java liberal feed parsing library. After the initial coding sprint, I've had time to sit back and look at the design of the library and clean up any thing that sticks out. As mentioned in a previous entry, one of the things that has bothered me is that when ever you need to call an object method, you need to be certain that the object is not null. The means you end up with code like:

if (string != null && strong.equals("string")) {

This quickly becomes tiresome and the test for null distracts from the meaning of the code. Fortunately I was reminded of an improvement for string objects. Ideally, we should all be writing comparison conditionals like rvalue == lvalue. (an rvalue mostly is an expresion you can't assign to). The most common rvalue is a literal value like a string constant. The advantage of getting into the habit of writing code like this is that you'll discover at compile time when you accidentally write = rather than ==. Because you can't assign to an rvalue, the compiler will complain. What makes this interesting from a java string point of view is that you can call methods on string literals. Comparing a variable to a string literal, rather than calling .equals() on a variable is that the string literal is not going to be null, so you can remove the test for null and simplify the code:

if("string".equals(string)) {

I know it's not everyone's cup of tea, but I prefer it to testing for null every time I look at a string. The other thing is that I've been reading Hardcore Java by Robert Simmons at work. Considering I've only got a few pages in so far. I've received a surprisingly large number of ideas to improve my code.

The one that sticks in my head is using assert for doing post and pre conditions on your functions. Using asserts have number of advantages over throwing exceptions, including the fact they get optimised away when you do a production release. In Eddie, during a <feed> element I determine the version of Atom that we are parsing. This had a number of nested if/else if/else blocks. At the end of the function, I wanted to make sure I had set the version string to something, so had the following code:

if (!this.feed.has("format")) {
   throw new SAXParseException("Failed to detect Atom format", this.locator);
}

However, using assertions I can write this as

assert(this.feed.has("format")) : "Failed to detect Atom format";

I highly recommend the Hardcore java book if you want to improve your java programming. It includes sections on the new features of Java 1.5 and using collections. I've made a couple of other cleanups including going through member variable access specifiers to make sure they are right and making several public methods and variables and making them priavte. I also have a couple of ideas about refactoring some of the code to clean it up. Redesigning and refactoring code is almost more fun than writing it in the first place. You get to be in competition with yourself, challenging yourself to write better code and end up with cleaner code in the process.

A couple of things I want to do in the near future is use a profiler and code coverage tools. If anyone has recommendations for either of these tools that integrates nicely with eclipse, I'd love to know.

[] | # Read Comments (5) |

Comments

Wed, 31 May 2006

Eddie RSS and Atom Feed Parser

I'd like to announce the initial release of Eddie, a feed parser library written in Java. It's taken me over 100 hours, but it now correctly parses 90% of the FeedParser unit tests, including all the rss and atom tests. It's GPLed, with an exception allowing you to use it in any open sourced program. Get it at my website. Need to add documentation and character set and encoding support. Also need to separate the testing infrastructure from the rest of the code.

This is the first time I've done any java programming in anger, and I have to say I'm surprised to discover I quite like it. In many ways it seems a very quick language to program in. It seems almost like programming in a scripting language, but stronger typed. This is probably due to not having to worry about memory management. Certainly I don't think I could have written this quite so quickly in C++.

Having said that, there are a couple things that I don't like about Java. Everything is a pointer. This is useful at times, but it means that every time you want to call a method on an object you have to test whether it is null or you run the risk of getting the dreaded NullPointerException. Java also doesn't have keywords for and, or and not. I know not everyone likes these, but I keep finding myself trying to use them.

I'm sure there are other things I hated, but I can't remember them now. I think I'll end up doing more java programming in the future.

[] | # Read Comments (1) |

Comments

Thu, 25 May 2006

Pathological Date Parser in Java

I've recently had cause to parse some date values in Java. As a result I've produced a class which can manage to parse an awful lot of date formats. I thought I'd better document it in case someone found it useful. Certainly there doesn't appear to be anything elsewhere which shows you how to parse lots of formats. I have found the order of date_formats to be very brittle, so I don't recommend you change it without an awful lot of test cases.

Anyway, without further to do, I present to you, the Pathological Date Parser for Java

// Copyright 2006 David Pashley <david@davidpashley.com>
// Licensed under the GPL version 2
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class Date {
    private Calendar date;

    static String[] date_formats = {
            "yyyy-MM-dd'T'kk:mm:ss'Z'",        // ISO
            "yyyy-MM-dd'T'kk:mm:ssz",          // ISO
            "yyyy-MM-dd'T'kk:mm:ss",           // ISO
            "EEE, d MMM yy kk:mm:ss z",        // RFC822
            "EEE, d MMM yyyy kk:mm:ss z",      // RFC2882
            "EEE MMM  d kk:mm:ss zzz yyyy",    // ASC
            "EEE, dd MMMM yyyy kk:mm:ss",   //Disney Mon, 26 January 2004 16:31:00 ET
            "-yy-MM",
            "-yyMM",
            "yy-MM-dd",
            "yyyy-MM-dd",
            "yyyy-MM",
            "yyyy-D",
            "-yyMM",
            "yyyyMMdd",
            "yyMMdd",
            "yyyy", 
            "yyD"
    
    };
    public Date(String d) {
        SimpleDateFormat formatter = new SimpleDateFormat();
        d = d.replaceAll("([-+]\\d\\d:\\d\\d)", "GMT$1"); // Correct W3C times
        d = d.replaceAll(" ([ACEMP])T$", " $1ST"); // Correct Disney timezones
        for (int i = 0; i < date_formats.length; i++) {
           try {
              formatter.applyPattern(date_formats[i]);
              formatter.parse(d);
              date = formatter.getCalendar();
              break;
           } catch(Exception e) {
              // Oh well. We tried
           }
        }
        
    }
}

The only date formats I can't get it to parse are <4-digit year>-<day of year> and <2digit year><day of year> (e.g. 2003-335 and 03335 for 2003-12-01). If you can add support for those and other date formats I'll gladly take patches.

[] | # Read Comments (4) |

Comments

Sat, 20 May 2006

Choosing Member Functions at Runtime in Java

Have you ever wanted to call a member function in your class, but not known what it will be at compile time? I'm writing a SAX parser and would like a function for every element name. I could write a massive switch statement in the startElement function, but this will quite quickly become unmanagable for a large schema. The alternative is to look to see if a particular member function exists and call it.

To do this little bit of magic we need to use Java's introspection API. The first thing to do is to get a Class object for our class. We can do that by calling:

Class klass = this.getClass();

We can then look up the method we are looking for using Class.getMethod, but this function requires an array of types that the method we are looking for takes as parameters, so we get the right version of an overloaded method. We can do this with:

Class[] arguments = { Int.class, String.class, URL.class};
Method method = klass.getMethod("foo", arguments);

Now we have our method, we can call it using the Method.invoke() call. This takes an object as the first parameter, which we can use this, and an array of Objects for the parameters.

Object[] values = {bar, baz, quux};
method.invoke(this, values);

But what happens if our class has no member method called foo()? Well, Class.getMethod() will throw a NoSuchMethodException, so we can just throw a try/catch block around the code to deal with unhandled functions. It's worth pointing out that Class.getMethod() also throws SecurityException and Method.invoke() throws IllegalAccessException, IllegalArgumentException and InvocationTargetException, so you'll want to catch Exception too.

We can chain some of these calls together and the result for my SAX parser is:

public void startElement(String uri, String localName, String qName, Attributes atts) 
            throws SAXException {
   try {
       Class[] argTypes = { String.class, String.class, String.class,
               Attributes.class };
       Object[] values = { uri, localName, qName, atts };
       this.getClass().getMethod("startElement_" + localName, argTypes)
               .invoke(this, values);
   } catch (NoSuchMethodException e) {
       log.debug("unhandled element " + localName);
   } catch (Exception e) {
       e.printStackTrace();
   }
}

With this arrangement, when I want to handle a new element in my code I can just make a function like:

public void startElement_foo(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
   ... 
}
[] | # Read Comments (4) |

Comments