Parsing Unicode bytes in Java

I am just reading some data from a file as a stream of bytes and I just ran into some unicode strings that I am not sure how to best handle.

Each character uses two bytes, with only the first seemingly containing the actual data, so for example the "trust" string is stored in the file as:

0x74 0x00(t) 0x72 0x00(r) ...and so on

      

I would normally just use a regex to replace the zeros with nothing and so remove the gaps. However, the spaces between words within the file are stored as 0x00 0x00, so a simple String "replaceAll" messes this up a bit.

I've tried playing with String character sets such as "ISO-8859-1" and "UTF-8/16", but every time I end up with the gaps still in the string.

I created a simple regex to remove the double zero hex values:

new String(bytes).replaceAll("[\\00]{2,}", "");

      

But this obviously only works for double zero and I would really like to replace single zeros with nothing and double zeros with an actual ASCII / Unicode space character.

I could have sworn that one of the Java String encoding options deals with this kind of thing, but I could be wrong. So should I be working on a regex to strip out the zeros, or does Java actually provide a mechanism for this?

Thanks



2 answers


This is "UTF-16LE". 0x00 0x00 actually encodes a NUL character in UTF-16, so that's what you get.



This encoding can represent about a million different characters using 2 or 4 bytes per character. The first 256 characters are encoded with 0x00 as the second byte. If the text contains only those characters, that byte can look wasted, but it is needed for all the other characters. For example, the euro currency symbol € (U+20AC) is encoded as 0xAC 0x20.
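To see the byte layout described above, here is a minimal sketch that encodes a few characters as UTF-16LE and prints the resulting bytes. The class name Utf16LeBytes and the helper hex are made up for illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeBytes {
    // Hypothetical helper: encode text as UTF-16LE and format each byte as 0xNN.
    static String hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes such as 0xAC print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("t"));      // 0x74 0x00
        System.out.println(hex("\u20AC")); // 0xAC 0x20 (euro sign)
        System.out.println(hex("\u0000")); // 0x00 0x00 (NUL)
    }
}
```

The low byte comes first in every pair, which is why plain ASCII text looks like it has a zero byte after each character.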



I am just reading some data from a file as a stream of bytes and I just ran into some unicode strings that I am not sure how to best handle.

Convert them to strings using the appropriate encoding, in this case UTF-16LE (little-endian UTF-16, with the low byte first and the high byte second):



String str = new String(bytes, "UTF-16LE");
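If the 0x00 0x00 pairs in the file really are meant as word separators, as the question suggests, one possible follow-up is to decode first and then swap the resulting NUL characters for spaces. This is only a sketch under that assumption; the class name Utf16LeDecode, the method decode, and the sample bytes are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeDecode {
    // Hypothetical helper: decode UTF-16LE bytes, then map NUL characters to spaces.
    // Assumes the file uses 0x00 0x00 (NUL) where a space is intended.
    static String decode(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_16LE);
        return text.replace('\u0000', ' ');
    }

    public static void main(String[] args) {
        // "to", a NUL separator, then "go", each character stored low byte first.
        byte[] bytes = { 0x74, 0x00, 0x6F, 0x00, 0x00, 0x00, 0x67, 0x00, 0x6F, 0x00 };
        System.out.println(decode(bytes)); // to go
    }
}
```

Passing a StandardCharsets constant instead of the "UTF-16LE" string avoids the checked UnsupportedEncodingException, and replacing the character after decoding means no regex on raw bytes is needed.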

      
