How to get the code points of encoded characters?
Given a stream of bytes (representing characters) and the encoding of the stream, how would one get the code points of the characters?
InputStreamReader r = new InputStreamReader(bla, Charset.forName("UTF-8"));
int whatIsThis = r.read();
What is returned by read() in the above snippet? Is it the Unicode code point?
Reader.read() returns a value that can be cast to char, or -1 if no more data is available.
A char is (implicitly) a 16-bit UTF-16BE encoded code unit. This encoding can represent Basic Multilingual Plane characters with one char. The supplementary range is represented using two-char sequences.
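To make the one-char versus two-char distinction concrete, here is a minimal sketch. The class name CodeUnitDemo and its helper methods are made up for illustration; only String.length() and String.codePointCount() are standard library calls.

```java
class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' (U+0041, in the BMP) followed by U+20000 (supplementary),
        // the latter written as its surrogate pair D840 DC00.
        String s = "A\uD840\uDC00";
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
    }
}
```

The BMP character takes one char, the supplementary character takes two, so the string has three code units but only two code points.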
Character contains methods for converting UTF-16 code units to Unicode code points. A code point that requires two chars will satisfy isHighSurrogate and isLowSurrogate when you pass in the two consecutive values from the sequence. The codePointAt methods can be used to extract code points from code unit sequences. There are similar methods for working from code points to UTF-16 code units.
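A short sketch of those Character methods in use. The class SurrogateDemo and the helper decodeAt are hypothetical names; the Character calls themselves (isHighSurrogate, isLowSurrogate, toCodePoint, codePointAt, toChars) are standard.

```java
class SurrogateDemo {
    // Decodes the code point that starts at index i, pairing surrogates by hand.
    // An unpaired surrogate is returned unchanged.
    static int decodeAt(char[] units, int i) {
        char c = units[i];
        if (Character.isHighSurrogate(c)
                && i + 1 < units.length
                && Character.isLowSurrogate(units[i + 1])) {
            return Character.toCodePoint(c, units[i + 1]);
        }
        return c;
    }

    public static void main(String[] args) {
        char[] units = {'A', '\uD840', '\uDC00'};
        // Manual pairing and the built-in codePointAt agree.
        System.out.println(Integer.toHexString(decodeAt(units, 1)));              // 20000
        System.out.println(Integer.toHexString(Character.codePointAt(units, 1))); // 20000
        // The reverse direction: code point to UTF-16 code units.
        char[] back = Character.toChars(0x20000);
        System.out.println(back.length); // 2
    }
}
```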
An example implementation of a code point stream reader:
import java.io.*;

/** Reads Unicode code points from a Reader, pairing surrogates as needed. */
public class CodePointReader implements Closeable {
    private final Reader charSource;
    // Look-ahead: the next UTF-16 code unit, or -1 at end of input.
    private int codeUnit;

    public CodePointReader(Reader charSource) throws IOException {
        this.charSource = charSource;
        codeUnit = charSource.read();
    }

    public boolean hasNext() { return codeUnit != -1; }

    public int nextCodePoint() throws IOException {
        try {
            char high = (char) codeUnit;
            if (Character.isHighSurrogate(high)) {
                // A high surrogate must be followed by a low surrogate.
                int next = charSource.read();
                if (next == -1) { throw new IOException("malformed character"); }
                char low = (char) next;
                if (!Character.isLowSurrogate(low)) {
                    throw new IOException("malformed sequence");
                }
                return Character.toCodePoint(high, low);
            } else {
                return codeUnit;
            }
        } finally {
            // Always refill the look-ahead, even if an exception was thrown.
            codeUnit = charSource.read();
        }
    }

    public void close() throws IOException { charSource.close(); }
}
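If the input is small enough to hold in memory, an alternative to a streaming reader is to decode everything into a StringBuilder and use the standard codePoints() method. This is a different technique from the class above, sketched under that assumption; the names CodePointsFromStream and readCodePoints are made up for illustration. Note that codePoints() passes unpaired surrogates through unchanged rather than throwing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class CodePointsFromStream {
    // Decodes the whole stream as UTF-8 and returns its code points.
    static int[] readCodePoints(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        // codePoints() joins valid surrogate pairs into single values.
        return sb.codePoints().toArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = {0x41, -16, -96, -128, -128}; // 'A' followed by U+20000
        for (int cp : readCodePoints(new ByteArrayInputStream(utf8))) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
        // prints U+41 then U+20000
    }
}
```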
It does not read Unicode code points, but UTF-16 code units. For code points up to 0xFFFF there is no difference, but code points above 0xFFFF are represented by two code units each, because a 16-bit value cannot exceed 0xFFFF.
So in this case:
byte[] a = {-16, -96, -128, -128}; // UTF-8 for 𠀀 (U+20000)
ByteArrayInputStream is = new ByteArrayInputStream(a);
InputStreamReader r = new InputStreamReader(is, Charset.forName("UTF-8"));
int whatIsThis = r.read();
int whatIsThis2 = r.read();
System.out.println(whatIsThis);  // 55360 (0xD840), a high surrogate: not a valid standalone code point
System.out.println(whatIsThis2); // 56320 (0xDC00), a low surrogate: not a valid standalone code point
Combining the two surrogate values gives 0x20000:
((55360 - 0xD800) * 0x400) + (56320 - 0xDC00) + 0x10000 == 0x20000
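The arithmetic above can be checked against the standard library. In this sketch the class SurrogateMath and the method combine are hypothetical names; Character.toCodePoint, Character.highSurrogate and Character.lowSurrogate are the standard equivalents.

```java
class SurrogateMath {
    // Same formula as above: combine a high and a low surrogate into a code point.
    static int combine(int high, int low) {
        return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int manual = combine(55360, 56320);
        int library = Character.toCodePoint((char) 55360, (char) 56320);
        System.out.println(Integer.toHexString(manual));  // 20000
        System.out.println(manual == library);            // true
        // Going the other way recovers the two surrogates.
        System.out.println((int) Character.highSurrogate(0x20000)); // 55360
        System.out.println((int) Character.lowSurrogate(0x20000));  // 56320
    }
}
```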
More on how UTF-16 works: http://en.wikipedia.org/wiki/UTF-16