What's a good heuristic to determine if a set of bytes is encoded as UTF-8 in Java?
I have a stream of bytes which can be UTF-8 data or can be a binary image. I should be able to make an educated guess as to what it is by checking the first 100 bytes or so.
However, I have not figured out how to do this in Java. I've tried doing something like:
new String(bytes, "UTF-8").substring(0, 100).matches(".*[^\\p{Print}]") to see if the first 100 characters contain non-printable characters, but that doesn't seem to work.
Is there a better way to do this?
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final Charset charset = Charset.forName("UTF-8");
final CharsetDecoder decoder = charset.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
try {
    final String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
    Log.d(TAG, s);
} catch (CharacterCodingException e) {
    // Decoding failed, so treat the data as binary and don't log it.
}
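The decoder approach above can be wrapped into a small helper that inspects only the first 100 bytes, as the question suggests. Here is a rough sketch (the class and method names are my own invention); it decodes with endOfInput set to false so that a multi-byte character cut off at the sample boundary is reported as underflow rather than as malformed input:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Sniffer {

    // Returns true if the first sampleLength bytes decode cleanly as UTF-8.
    static boolean looksLikeUtf8(byte[] bytes, int sampleLength) {
        int len = Math.min(bytes.length, sampleLength);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // A UTF-8 sequence never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(len);
        // endOfInput = false: a multi-byte character cut off at the sample
        // boundary yields UNDERFLOW rather than a malformed-input error.
        CoderResult result = decoder.decode(ByteBuffer.wrap(bytes, 0, len), out, false);
        return !result.isError();
    }

    public static void main(String[] args) {
        byte[] text = "héllo wörld".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(looksLikeUtf8(text, 100));   // true
        System.out.println(looksLikeUtf8(binary, 100)); // false
    }
}
```

Note that this is still a heuristic: short binary samples can happen to be structurally valid UTF-8, so a clean decode is evidence, not proof.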
In well-formed UTF-8, a byte with its high bit set is always part of a multi-byte sequence: the lead byte of an N-byte sequence has its top N bits set followed by a zero bit (110xxxxx for two bytes, 1110xxxx for three, 11110xxx for four), and every continuation byte that follows it starts with the bits 10.
These characteristics should be easy enough to check for.
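One way to check those bit patterns by hand might look like the sketch below (the class and method names are hypothetical). It validates lead/continuation structure only; it does not reject overlong encodings or surrogate code points, which a strict decoder would also catch:

```java
public class Utf8Structure {

    // Heuristic structural check: every lead byte must be followed by the
    // right number of 10xxxxxx continuation bytes.
    static boolean hasUtf8Structure(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            int continuations;
            if (b < 0x80)                continuations = 0; // 0xxxxxxx: ASCII
            else if ((b & 0xE0) == 0xC0) continuations = 1; // 110xxxxx
            else if ((b & 0xF0) == 0xE0) continuations = 2; // 1110xxxx
            else if ((b & 0xF8) == 0xF0) continuations = 3; // 11110xxx
            else return false; // stray continuation byte, or 0xF8..0xFF
            if (i + continuations >= bytes.length) {
                return true; // sequence truncated by the sample boundary
            }
            for (int j = 1; j <= continuations; j++) {
                if ((bytes[i + j] & 0xC0) != 0x80) {
                    return false; // expected a 10xxxxxx continuation byte
                }
            }
            i += continuations + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasUtf8Structure("héllo".getBytes(
                java.nio.charset.StandardCharsets.UTF_8)));        // true
        System.out.println(hasUtf8Structure(new byte[]{(byte) 0x80})); // false
    }
}
```

Giving a sequence truncated at the end of the sample the benefit of the doubt avoids misclassifying valid text just because the 100-byte window splits a character.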