Xerces is an XML library available for several languages, and it is a very common library in Java.

I recently came across a problem with code intermittently throwing a NullPointerException inside the library:

[sourcecode lang=”text”]java.lang.NullPointerException
at org.apache.xerces.dom.ParentNode.nodeListItem(Unknown Source)
at org.apache.xerces.dom.ParentNode.item(Unknown Source)
at com.example.xml.Element.getChildren(Element.java:377)
at com.example.xml.Element.newChildElementHelper(Element.java:229)
at com.example.xml.Element.newChildElement(Element.java:180)

[/sourcecode]You may also find the NullPointerException in ParentNode.nodeListGetLength() and other locations in ParentNode.

Debugging this was not helped by the fact that the xercesImpl.jar is stripped of line numbers, so I couldn’t find the exact issue. After some searching, it appeared that the issue was down to the fact that Xerces is not thread-safe. ParentNode caches iterations through the NodeList of children to speed up performance and stores them in the Node’s Document object. In multi-threaded applications, this can lead to race conditions and NullPointerExceptions.  And because it’s a threading issue, the problem is intermittent and hard to track down.

The solution is to synchronise your code on the DOM, and this means the Document object, everywhere you access the nodes. I’m not certain exactly which methods need to be protected, but I believe it needs to be at least any function that will iterate a NodeList. I would start by protecting every access and testing performance, and removing some if needed.

[sourcecode lang=”java”]/**
* Returns the concatenation of all the text in all child nodes
* of the current element.
*/
public String getText() {
StringBuilder result = new StringBuilder();

synchronized ( m_element.getOwnerDocument()) {
NodeList nl = m_element.getChildNodes();
for (int i = 0; i < nl.getLength(); i++) {
Node n = nl.item(i);

if (n != null && n.getNodeType() == org.w3c.dom.Node.TEXT_NODE) {
result.append(((CharacterData) n).getData());
}
}
}

return result.toString();
}[/sourcecode]Notice the “synchronized ( m_element.getOwnerDocument()) {}” block around the section that deals with the DOM. The NPE would normally be thrown on the nl.getLength() or nl.item() calls.
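
If the same locking has to be repeated in many places, one option is to copy the nodes into an ordinary List while holding the lock and then work on the copy. The sketch below is only an illustration of that idea: the DomSnapshot class and its method names are hypothetical, and it uses the JDK's built-in parser so that it runs on its own. It does not make the DOM thread-safe; every other piece of code that touches the document still has to take the same lock.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSnapshot {
    /** Copies the child elements into a plain List while holding the Document lock. */
    static List<Element> childElements(Element parent) {
        List<Element> result = new ArrayList<Element>();
        synchronized (parent.getOwnerDocument()) {
            NodeList nl = parent.getChildNodes();
            for (int i = 0; i < nl.getLength(); i++) {
                Node n = nl.item(i);
                if (n != null && n.getNodeType() == Node.ELEMENT_NODE) {
                    result.add((Element) n);
                }
            }
        }
        // The list itself is a private copy, so iterating it needs no lock.
        return result;
    }

    /** Parses a string with the JDK's default parser (helper for the demo only). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><a/>text<b/></root>");
        System.out.println(childElements(doc.getDocumentElement()).size());
    }
}
```

The elements in the returned list still belong to the shared document, so reading their own children or attributes needs the same synchronized block.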

Since putting in the synchronized blocks, we’ve gone from having 78 NPEs between 2:30am and 3:00am to having zero in the last 12 hours, so I think it’s safe to say this has drastically reduced the problem.

Because I couldn’t find the information anywhere else, if you want to
use maven with Grails 1.2 snapshot, use:

mvn org.apache.maven.plugins:maven-archetype-plugin:2.0-alpha-4:generate
-DarchetypeGroupId=org.grails
-DarchetypeArtifactId=grails-maven-archetype
-DarchetypeVersion=1.2-SNAPSHOT     -DgroupId=uk.org.catnip
-DartifactId=armstrong
-DarchetypeRepository=http://snapshots.maven.codehaus.org/maven2

One really nice feature of maven is the dependency resolution
that it does. The dependency plugin also has an analyze goal that can
detect a number of problems with your dependencies. It can detect
libraries you use but haven’t declared in your POM, which only work
because they arrive as transitive dependencies. These can cause build
problems when you remove the library that was dragging in the undeclared
dependency. It can also work out which dependencies you still declare
but are no longer using.

mojo-jojo david% mvn dependency:analyze
[INFO] Scanning for projects...
...
[INFO] [dependency:analyze]
[WARNING] Used undeclared dependencies found:
[WARNING]    commons-collections:commons-collections:jar:3.2:compile
[WARNING]    commons-validator:commons-validator:jar:1.3.1:compile
[WARNING]    org.apache.myfaces.core:myfaces-api:jar:1.2.6:compile
[WARNING] Unused declared dependencies found:
[WARNING]    javax.faces:jsf-api:jar:1.2_02:compile
...

One of the biggest problems with developing servlets under a
container like Tomcat is the amount of time taken to build your code,
deploy it to the container and restart it to pick up any changes. Maven
and the Jetty plugin allow you to cut down on this cycle considerably.
The first step is to allow you to start your application in maven by
running:

mvn jetty:run

We do this by configuring the jetty plugin inside our
pom.xml:

<plugin>
   <groupId>org.mortbay.jetty</groupId>
   <artifactId>maven-jetty-plugin</artifactId>
   <version>6.1.10</version>
</plugin>

Now when you run mvn jetty:run your application will start
up. But we can improve on this. The Jetty plugin can be configured to
scan your project every so often and rebuild it and reload it if
anything changes. We do this by changing our pom.xml to read:

<plugin>
   <groupId>org.mortbay.jetty</groupId>
   <artifactId>maven-jetty-plugin</artifactId>
   <version>6.1.10</version>
   <configuration>
      <scanIntervalSeconds>10</scanIntervalSeconds>
   </configuration>
</plugin>

Now when you save a file in your IDE, by the time you’ve switched to
your web browser, Jetty is already running your updated code. Your
development cycle is almost up to the same speed as Perl or PHP.

You can find more information at the plugin page.

Do you ever feel you should implement equals(),
hashCode() and toString(), but just can’t be bothered to
do it for every class? Well, if you aren’t bothered by speed, you can
use Jakarta Commons Lang to do it for you. Just add this to your class:

import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.commons.lang.builder.EqualsBuilder;
import org.apache.commons.lang.builder.HashCodeBuilder;

class Foo {
   public int hashCode() {
      return HashCodeBuilder.reflectionHashCode(this);
   }
   public boolean equals(Object other) {
      return EqualsBuilder.reflectionEquals(this,other);
   }
   public String toString() {
      return ToStringBuilder.reflectionToString(this);
   }
}

And that’s it. Your class will just do the right thing. As you can
probably guess from the function names, it uses reflection, so it may be
suboptimal. If you need performance, you can tell it to use
particular members, but I think I’ll leave that up to a future article.
I also recommend you don’t use this technique if you are using something
like Hibernate, which does things behind the scenes on member access;
you may find it does undesirable things. 🙂

I noticed today that Mark Pilgrim
linked to Eddie, my liberal RSS and Atom parsing library for Java, so I
figured I should make a new release. It’s been a few months since I did
any serious work on the parser, but in the last few days I’ve reduced
the number of test case failures to fewer than 100 out of the 3502 test
cases which come as part of Mark’s Feedparser parser for Python. The
majority of the failures are in the date parsing routines or are due to
bugs in the Jython library which cause literal dictionaries not to match
with classes inherited from PyDictionary.

Improvements in this version include:

  • Massively improved support for different character encodings. With
    Java 6, it also has support for UTF32 feeds.
  • CDF Support.
  • Optional support of TagSoup for sanitizing of HTML in entries.
  • Improved support for different input sources including String,
    InputStream and byte[].
  • Numerous bug fixes, with 97% of test cases passing, up from 90%

If you use Eddie, drop me an email. I’d like to thank Mark Pilgrim
again for providing the community with a fantastic and comprehensive
suite of test cases, extensive documentation and a first class Python
library.

I was quietly minding my own business, fixing some encoding bugs in
Eddie, my liberal RSS and Atom parser, when I noticed that Java 6
included support for UTF-32, which is one of the encoding tests that was
failing. I downloaded and installed the Ubuntu packages, and decided to
run a quick benchmark using my unit tests.

First up was the Sun Java 5 JVM. I’d been running the unit tests all
night, but timed them this time, and got these results:

Ran 3502 tests
Passed 3322 tests
Failed 180 tests

real    1m10.293s
user    0m40.375s
sys     0m3.632s

Next I tried the Sun Java 6 JVM, using the same jar files, and
got:

Ran 3502 tests
Passed 3326 tests
Failed 176 tests

real    0m56.059s
user    0m39.198s
sys     0m4.212s

One thing to note was that it spent a couple of seconds noticing new
jars to read, so I decided to run it again and got:

Ran 3502 tests
Passed 3326 tests
Failed 176 tests

real    0m45.317s
user    0m34.770s
sys     0m3.516s

Wow, I’d gone from 70 seconds to 45 seconds using the new runtime,
and interestingly enough, passed 4 more tests in the process. I’m assuming
they are the UTF-32 tests, although I haven’t checked yet. The other thing
for me to try is recompiling the code to see if that has any additional
benefits.

Update: Got around to checking what Java 6 fixed and
it turned out it was the additional support for koi-u and
cspc862latinhebrew encodings. After I fixed the UTF32 support in Eddie,
it passed an additional 16 tests. Down to just 160 out of 3502. I just
wish they would add support for some of the stranger encodings. Maybe
this will happen when it’s open
source.

Java has a nice IO subsystem. In particular, it has been designed
such that input streams can optionally support a feature where a
programmer can mark a position in the stream and at a later
stage return to that point to read the data again. Programmers can check
for this support by calling InputStream.markSupported().
Unfortunately I’ve had the need for this support, but the streams I’ve
been given often don’t provide it (ByteArrayInputStream and
BufferedInputStream do, but many others, such as FileInputStream, do
not). Fortunately it’s fairly trivial to wrap an InputStream in another
class which will add this support. Here is my quick adaptor, which seems
to work for most cases.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;

public class MarkableInputStream extends InputStream {
    private InputStream inputstream;
    private int maxpos = 0;
    private int curpos = 0;
    private int mark = 0;
    private ArrayList<Integer> buffer = new ArrayList<Integer>();
    public MarkableInputStream(InputStream is) {
        inputstream = is;
    }

    @Override
    public int read() throws IOException {
        int data;
        if(curpos == maxpos) {
            data = inputstream.read();
            buffer.add(data); maxpos++;curpos++;
        } else {
            data = buffer.get(curpos++);
        }
        return data;
    }

    @Override
    public synchronized void mark(int readlimit) {
        mark = curpos;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void reset() throws IOException {
        curpos = mark;
    }
}

You can use it like:

if (!istream.markSupported()) {
   istream = new MarkableInputStream(istream);
}

This could probably be improved on, most notably by not using an
ArrayList. I’m not sure what performance penalty that adds. It
should be possible to use a normal array as the readlimit
parameter to mark() says how many bytes the stream should
record before throwing old data away in favour of new input. The class
above will record all data from the start of the stream, so could result
in a significant amount of memory usage. Hope you find it useful.
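
For comparison, the standard library's BufferedInputStream already takes the array-based approach: it supports mark() and reset() and only promises to keep readlimit bytes after the mark. The following is a minimal sketch of using it instead of the class above; the BufferedMarkDemo class, the unmarkable() helper and peekThenReadAll() are hypothetical names made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class BufferedMarkDemo {
    /** A stream without mark support, standing in for a file or socket stream. */
    static InputStream unmarkable(byte[] data) {
        final InputStream src = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() throws IOException {
                return src.read();
            }
        };
    }

    /** Peeks at the first few bytes, rewinds, then reads the whole stream. */
    static String peekThenReadAll(InputStream in, int peek) {
        try {
            if (!in.markSupported()) {
                in = new BufferedInputStream(in);
            }
            StringBuilder seen = new StringBuilder();
            in.mark(peek);                        // remember at least `peek` bytes
            for (int i = 0; i < peek; i++) {
                seen.append((char) in.read());
            }
            in.reset();                           // back to the marked position
            seen.append('|');
            int b;
            while ((b = in.read()) != -1) {
                seen.append((char) b);
            }
            return seen.toString();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(peekThenReadAll(unmarkable(data), 2)); // prints "he|hello"
    }
}
```

Unlike the ArrayList version, reset() here is only guaranteed to work while no more than readlimit bytes have been read since the mark, so the limit has to be chosen with care.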

Imagine you’ve got some text you’ve been told is ASCII and you’ve
told Java that it’s ASCII using:

Reader reader = new InputStreamReader(inputstream, "ASCII");

Imagine your surprise when it happily reads in non-ASCII bytes, say
from UTF-8 or ISO8859-1 text, and silently turns each one into the
replacement character (U+FFFD) instead of reporting an error.

import java.io.*;

public class Example1 {

   public static void main(String[] args) {
      try{
         FileInputStream is = new FileInputStream(args[0]);
         BufferedReader reader
            = new BufferedReader(new InputStreamReader(is, args[1]));
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      } catch (Exception e) {
         System.out.println(e);
      }
   }
}

beebo david% java Example1 utf8file.txt ascii
I��t��rn��ti��n��liz��ti��n
beebo david% java Example1 utf8file.txt utf8
Iñtërnâtiônàlizætiøn

So, I hear you ask, how do you get Java to be strict about the
conversion? Well, the answer is to look up a Charset object, ask it for
a CharsetDecoder object and then set the onMalformedInput option to
CodingErrorAction.REPORT. The resulting code is:

import java.io.*;
import java.nio.charset.*;

public class Example2 {

   public static void main(String[] args) {
      try{
         FileInputStream is = new FileInputStream(args[0]);
         Charset charset = Charset.forName(args[1]);
         CharsetDecoder csd = charset.newDecoder();
         csd.onMalformedInput(CodingErrorAction.REPORT);
         BufferedReader reader
            = new BufferedReader(new InputStreamReader(is, csd));
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      } catch (Exception e) {
         System.out.println(e);
      }
   }
}

This time when we run it, we get:

beebo david% java Example2 utf8file.txt ascii
java.nio.charset.MalformedInputException: Input length = 1
beebo david% java Example2 utf8file.txt utf8
Iñtërnâtiônàlizætiøn
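
The same check can be done on a byte array in memory, which is handy for validating input before parsing it. This is a rough sketch rather than a drop-in utility: the StrictDecode class and its method names are made up for the example. As far as I can tell from the CharsetDecoder documentation, a decoder obtained from newDecoder() already reports errors by default, so the explicit REPORT calls mainly document the intent.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    /** Decodes the bytes, throwing instead of substituting replacement characters. */
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Returns true if the bytes are valid in the given charset. */
    static boolean isValid(byte[] bytes, Charset cs) {
        try {
            decodeStrict(bytes, cs);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Iñt" encoded as UTF-8; the ñ becomes two bytes above 0x7F.
        byte[] utf8 = "I\u00f1t".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValid(utf8, StandardCharsets.US_ASCII)); // false
        System.out.println(isValid(utf8, StandardCharsets.UTF_8));    // true
        // The lenient String constructor substitutes U+FFFD instead of failing.
        String lenient = new String(utf8, StandardCharsets.US_ASCII);
        System.out.println(lenient.indexOf('\uFFFD') >= 0);           // true
    }
}
```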

On a slightly related note, if anyone knows how to get Java to decode
UTF32, VISCII, TCVN-5712, KOI8-U or KOI8-T, I would love to know.

Update: (2007-01-26) Java 6 has support for UTF32
and KOI8-U.

A few hours ago I got stressed about the lack of leg room under my
desk and ended up spending the next few hours tidying and moving all of my
computers to under the next desk. I also made the mistake of starting to
remove keys from my keyboard to clean something sticky and found myself
surrounded by keys and a keyless keyboard. It’s now nice and shiny,
which is more than can be said for the rest of the flat, which is now
overrun with all the crap that was around my desk.

Another thing that could do with a tidy up is Eddie, my
Java liberal feed parsing library. After the initial coding sprint, I’ve had time
to sit back and look at the design of the library and clean up anything
that sticks out. As mentioned in a previous entry, one of the things
that has bothered me is that whenever you need to call an object method,
you need to be certain that the object is not null. This means you end
up with code like:

if (string != null && string.equals("string")) {

This quickly becomes tiresome and the test for null distracts from
the meaning of the code. Fortunately I was reminded of an improvement
for string objects. Ideally, we should all be writing comparison
conditionals like rvalue == lvalue (an rvalue is, roughly, an expression
you can’t assign to). The most common rvalue is a literal value like a
string constant. The advantage of getting into the habit of writing code
like this is that you’ll discover at compile time when you accidentally
write = rather than ==: because you can’t assign to an rvalue, the
compiler will complain. What makes this interesting from a Java string
point of view is that you can call methods on string literals. The
advantage of calling .equals() on a string literal, with the variable as
the argument, is that the string literal is never going to be null, so
you can remove the test for null and simplify the code:

if("string".equals(string)) {
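
As a small illustration of why this works, here is a sketch with made-up names (NullSafeEquals, isAtom, same). The literal is never null, so calling equals() on it is always safe, and equals(null) simply returns false. For the case where neither side is a literal, java.util.Objects.equals (Java 7 and later) does the null checks for you.

```java
import java.util.Objects;

class NullSafeEquals {
    /** Literal first: never throws, even when name is null. */
    static boolean isAtom(String name) {
        return "atom".equals(name);
    }

    /** Both sides may be null; two nulls count as equal. */
    static boolean same(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        System.out.println(isAtom("atom"));  // true
        System.out.println(isAtom(null));    // false, and no NullPointerException
        System.out.println(same(null, null)); // true
    }
}
```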

I know it’s not everyone’s cup of tea, but I prefer it to testing for
null every time I look at a string. The other thing is that I’ve been
reading Hardcore Java by Robert Simmons at work. Considering I’ve only
got a few pages in so far, I’ve received a surprisingly large number of
ideas to improve my code.

The one that sticks in my head is using assert for pre- and
postconditions on your functions. Using asserts has a number of
advantages over throwing exceptions, including the fact that they are
disabled by default at runtime (you enable them with the -ea JVM flag),
so they cost next to nothing in a production release. In Eddie, when
handling a <feed> element, I determine the version of Atom that we are
parsing. This had a number of nested if/else if/else blocks. At
the end of the function, I wanted to make sure I had set the version
string to something, so had the following code:

if (!this.feed.has("format")) {
   throw new SAXParseException("Failed to detect Atom format", this.locator);
}

However, using assertions I can write this as

assert(this.feed.has("format")) : "Failed to detect Atom format";
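
To show the postcondition idea in a form that runs on its own, here is a simplified sketch. It is not Eddie's actual code: the AtomVersion class, the detect() method and the returned format strings are invented for the example, and the namespace checks are only illustrative.

```java
class AtomVersion {
    /** Maps a feed namespace to a format name; the names are illustrative. */
    static String detect(String namespace) {
        String format = null;
        if ("http://www.w3.org/2005/Atom".equals(namespace)) {
            format = "atom10";
        } else if ("http://purl.org/atom/ns#".equals(namespace)) {
            format = "atom03";
        }
        // Postcondition: only checked when the JVM is started with -ea.
        assert format != null : "Failed to detect Atom format";
        return format;
    }

    public static void main(String[] args) {
        System.out.println(detect("http://www.w3.org/2005/Atom")); // atom10
    }
}
```

Run with java -ea AtomVersion to have the assertion checked. Without the flag the check is skipped and an unknown namespace would quietly return null, so an assert suits internal sanity checks better than validation of untrusted input.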

I highly recommend the Hardcore Java book if you want to improve your Java
programming. It includes sections on the new features of Java 1.5 and
using collections. I’ve made a couple of other cleanups, including going
through member variable access specifiers to make sure they are right and
making several public methods and variables private. I also have a
couple of ideas about refactoring some
of the code to clean it up. Redesigning and refactoring code is almost
more fun than writing it in the first place. You get to be in
competition with yourself, challenging yourself to write better code
and end up with cleaner code in the process.

A couple of things I want to do in the near future are to use a profiler
and code coverage tools. If anyone has recommendations for either of
these tools that integrate nicely with Eclipse, I’d love to know.