Freitag, 24. April 2009

Java Charset encoding UTF-8

One of the most annoying stuff you will definitely come accross is character encoding.
When you initialize a String with the default constructor, the JVM uses uses the Charset.defaultCharset() for the encoding. Anothor constructor allows you to specify any Charset, that is available on your box.
/*
s1 and s2 deliver the same results
*/

String s1 = new String("hello");
String s2 = new String("hello".getBytes(), Charset.defaultCharset());

// specific encoding, and yes you this ü char there with intent
String s3 = new String("grün".getBytes(), "UTF-8");


Conversion


Charsets have a limited set of characters, that have to be used to encode a, probably much larger amount of charaters. Therefore in UTF-8 the Umlaut ü is encoded as ü. In Java there are serveral ways to convert from one encoding to an other. Let's assume you have an UTF-8 encoded String with the value 'grün' (See the last line of listing 1). To get rid of those escaped charaters like ü you have to encode it with p.e. ISO-8859-1.

System.out.println(new String ( s.getBytes("ISO-8859-1"), "UTF-8"));


As far as I know isn't there any method that can detect the encoding of a String. You can test if a String is UTF-8 encoded, which means, the String only contains characters, that are valid in UTF-8. So it is probably be a valid ISO-8859-* String too.

[unfinished]

Keine Kommentare:

Kommentar veröffentlichen