What's a good heuristic to determine if a set of bytes is encoded as UTF-8 in Java?

I have a stream of bytes which can be UTF-8 data or can be a binary image. I should be able to make an educated guess as to what it is by checking the first 100 bytes or so.

However, I have not figured out how to do this in Java. I've tried doing something like:

    new String(bytes, "UTF-8").substring(0, 100).matches(".*[^\\p{Print}]")

to see if the first 100 characters contain non-printable characters, but that doesn't seem to work.

Is there a better way to do this?

+2




3 answers


    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    final Charset charset = Charset.forName("UTF-8");
    final CharsetDecoder decoder = charset.newDecoder();
    // Throw on malformed input instead of silently substituting replacement chars
    decoder.onMalformedInput(CodingErrorAction.REPORT);

    try {
        final String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
        Log.d(TAG, s); // Android Log takes a tag and a message; TAG is your log tag
    } catch (CharacterCodingException e) {
        // decoding failed -- treat the bytes as binary, don't log them
    }
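The same idea can be wrapped in a small helper that only inspects the leading bytes, as the question asks. This is a sketch, not part of the original answer; the class and method names (`Utf8Check`, `looksLikeUtf8`) are made up for illustration. One caveat: truncating at a fixed byte limit can cut a multi-byte sequence in half and produce a false negative on valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: true if the first `limit` bytes decode cleanly as UTF-8.
    static boolean looksLikeUtf8(byte[] bytes, int limit) {
        int len = Math.min(limit, bytes.length);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes, 0, len));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Valid UTF-8 text
        System.out.println(looksLikeUtf8("hello".getBytes(StandardCharsets.UTF_8), 100)); // true
        // 0xC3 must be followed by a continuation byte (10xxxxxx), 0x28 is not one
        System.out.println(looksLikeUtf8(new byte[] {(byte) 0xC3, 0x28}, 100)); // false
    }
}
```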

      



+3




In well-formed UTF-8, any byte with its high bit set is part of a multi-byte sequence. The lead byte of an N-byte sequence has its top N bits set and the next bit clear (so a two-byte sequence starts with bits 110, a three-byte sequence with 1110, and so on), and every continuation byte has its top bit set and the next bit clear (bits 10).



These characteristics should be easy enough to find.
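The bit patterns described above can be checked directly, without a decoder. Below is a minimal sketch (class and method names are invented for illustration); it also restricts lead bytes to the 0xC2-0xF4 range, which is slightly stricter than the pure bit-pattern rule, since values outside it are never valid UTF-8 leads.

```java
public class Utf8Heuristic {
    // Structural check: every high-bit byte must fit the UTF-8 lead/continuation pattern.
    static boolean isValidUtf8Prefix(byte[] bytes, int limit) {
        int len = Math.min(limit, bytes.length);
        int i = 0;
        while (i < len) {
            int b = bytes[i] & 0xFF;
            int following;
            if (b < 0x80) {                       // 0xxxxxxx: ASCII
                following = 0;
            } else if (b >= 0xC2 && b <= 0xDF) {  // 110xxxxx: 2-byte lead
                following = 1;
            } else if (b >= 0xE0 && b <= 0xEF) {  // 1110xxxx: 3-byte lead
                following = 2;
            } else if (b >= 0xF0 && b <= 0xF4) {  // 11110xxx: 4-byte lead
                following = 3;
            } else {
                return false;                     // stray continuation byte or invalid lead
            }
            for (int j = 1; j <= following; j++) {
                if (i + j >= len) return true;    // sequence cut off by the limit: give benefit of the doubt
                if ((bytes[i + j] & 0xC0) != 0x80) return false; // must be 10xxxxxx
            }
            i += following + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is a valid 2-byte sequence (e-acute)
        System.out.println(isValidUtf8Prefix(new byte[] {(byte) 0xC3, (byte) 0xA9}, 100)); // true
        // 0x41 is not a continuation byte
        System.out.println(isValidUtf8Prefix(new byte[] {(byte) 0xC3, 0x41}, 100)); // false
    }
}
```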

+4




I suggest using ICU4J, whose CharsetDetector class can guess the encoding of a byte stream.

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

0








