Imagine you’ve got some text you’ve been told is ASCII, and you’ve
told Java that it’s ASCII using:
Reader reader = new InputStreamReader(inputstream, "ASCII");
Imagine your surprise when it happily reads in non-ASCII bytes, say
from UTF-8 or ISO-8859-1 text, and silently converts each of them to the
replacement character (U+FFFD).
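The silent replacement is easy to reproduce without any file handling; here is a minimal sketch (the class name Example0 and the hard-coded bytes are my own, assuming a JVM with the `new String(byte[], Charset)` constructor, i.e. Java 6 or later):

```java
import java.nio.charset.Charset;

public class Example0 {
    public static void main(String[] args) {
        // 0xC3 0xB1 is the UTF-8 encoding of 'ñ' -- malformed as ASCII
        byte[] bytes = {'a', (byte) 0xC3, (byte) 0xB1, 'b'};
        // The String constructor decodes with CodingErrorAction.REPLACE,
        // so each bad byte quietly becomes U+FFFD
        String s = new String(bytes, Charset.forName("ASCII"));
        System.out.println(s);
        System.out.println((int) s.charAt(1)); // 65533, i.e. U+FFFD
    }
}
```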
import java.io.*;

public class Example1 {
    public static void main(String[] args) {
        try {
            FileInputStream is = new FileInputStream(args[0]);
            BufferedReader reader
                = new BufferedReader(new InputStreamReader(is, args[1]));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
beebo david% java Example1 utf8file.txt ascii
I��t��rn��ti��n��liz��ti��n
beebo david% java Example1 utf8file.txt utf8
Iñtërnâtiônàlizætiøn
So, I hear
you ask, how do you get Java to be strict about the conversion? Well, the answer
is to look up a Charset object, ask it for a CharsetDecoder, and
then set the onMalformedInput option to
CodingErrorAction.REPORT. The resulting code is:
import java.io.*;
import java.nio.charset.*;

public class Example2 {
    public static void main(String[] args) {
        try {
            FileInputStream is = new FileInputStream(args[0]);
            Charset charset = Charset.forName(args[1]);
            CharsetDecoder csd = charset.newDecoder();
            csd.onMalformedInput(CodingErrorAction.REPORT);
            BufferedReader reader
                = new BufferedReader(new InputStreamReader(is, csd));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
This time when we run it, we get:
beebo david% java Example2 utf8file.txt ascii
java.nio.charset.MalformedInputException: Input length = 1
beebo david% java Example2 utf8file.txt utf8
Iñtërnâtiônàlizætiøn
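The same strict decoder can be exercised directly on an in-memory buffer, without any file I/O; a small sketch (class name and test bytes are my own):

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class StrictDecode {
    public static void main(String[] args) {
        // A decoder that reports malformed input instead of replacing it
        CharsetDecoder strict = Charset.forName("ASCII").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        byte[] utf8 = {'I', (byte) 0xC3, (byte) 0xB1}; // UTF-8 for "Iñ"
        try {
            // decode() consumes the whole buffer in one operation
            strict.decode(ByteBuffer.wrap(utf8));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            // MalformedInputException, a subclass of CharacterCodingException
            System.out.println("rejected: " + e);
        }
    }
}
```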
On a slightly related note, if anyone knows how to get Java to decode
UTF-32, VISCII, TCVN-5712, KOI8-U or KOI8-T, I would love to know.
Update (2007-01-26): Java 6 has support for UTF-32
and KOI8-U.
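Which charsets a given JVM actually supports can be checked with Charset.isSupported; a small probe (the list of names is my own, and availability varies by JDK version and vendor):

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // Names to probe; isSupported returns false for legal but
        // unavailable names rather than throwing
        String[] names = {"UTF-32", "KOI8-U", "VISCII", "KOI8-T"};
        for (String name : names) {
            System.out.println(name + ": " + Charset.isSupported(name));
        }
    }
}
```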