Problems understanding Java InputStream read method

Let's say I have an input stream. Correct me if I'm wrong, but all the data in an InputStream is stored as bytes, for example the following text: "Why not with ♥?" Now I'm wondering how this text is converted to a byte array, because I don't understand how (for example) ♥ is stored. If I call

InputStream in = os.getInputStream();
byte[] b = new byte[1];
// read(b) returns the number of bytes read, or -1 at the end of the stream
while (in.read(b) != -1) {
    System.out.write(b, 0, 1);
}

      

then my byte array (of length 1) is filled with the next byte on each iteration.

in.read(b)

      

returns an integer value that is later converted to a character. Looking at the Java documentation, I find something like this:

Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255.

My first thought is: does that mean only 256 different characters are possible? There must be a mistake in my reasoning, because it doesn't seem to matter which characters are used in my source.

So, can anyone help me clear this up? Thanks a lot.

+3




2 answers


The process of converting characters to bytes (and vice versa) is called "character encoding", and it can be done in different ways. The rules for these transformations are captured in what Java calls a Charset. Java supports many of them: ASCII, UTF_8, UTF_16, ISO_8859_1, etc. The standard ones are available as constants in StandardCharsets.

Some encodings map each character to exactly one byte. ISO_8859_1 (AKA Latin-1) is one of them. The drawback is obvious: such an encoding can represent at most 256 characters (Western European Latin characters for ISO_8859_1).



Others, such as UTF_8, use one, two, or more bytes per character, depending on the character. ASCII characters (a-z, A-Z, digits, etc.) are encoded in a single byte, while others (accented letters, Chinese, Cyrillic and other scripts) use two or more bytes. The disadvantage is that encoding and decoding are more complex, but the advantage is huge: every Unicode character can be represented.
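To make the difference concrete, here is a minimal sketch that encodes a few characters with two charsets and prints the resulting bytes in hexadecimal. The class name EncodingDemo and the helper toHex are made up for this illustration; only String.getBytes(Charset) and StandardCharsets come from the JDK. The ♥ is written as the escape \u2665 so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: formats bytes as two-digit hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF turns the signed Java byte into its unsigned value 0..255
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String heart = "\u2665";  // the ♥ character
        String eAcute = "\u00E9"; // the é character

        System.out.println(toHex("A".getBytes(StandardCharsets.UTF_8)));         // 41
        System.out.println(toHex(eAcute.getBytes(StandardCharsets.UTF_8)));      // C3 A9
        System.out.println(toHex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println(toHex(heart.getBytes(StandardCharsets.UTF_8)));       // E2 99 A5
        // ♥ does not exist in ISO-8859-1, so getBytes substitutes a '?' (0x3F)
        System.out.println(toHex(heart.getBytes(StandardCharsets.ISO_8859_1)));  // 3F
    }
}
```

So the ♥ from the question is three separate bytes in UTF-8, each of which fits in the 0 to 255 range that read() returns.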

Just keep in mind that a byte and a character are two different things, and that in general there is no one-to-one mapping between them. Use InputStreamReader to read characters and OutputStreamWriter to write characters. Always specify the encoding explicitly; otherwise the default encoding of your system is used, which may not match that of another system.
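As a rough sketch of that advice, the following wraps an InputStream in an InputStreamReader with an explicit charset and reads characters instead of bytes. ReaderDemo and readAll are names invented for this example, and a ByteArrayInputStream stands in for whatever stream you actually have; it also assumes the data really is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Hypothetical helper: decodes the whole stream as UTF-8 and returns the text.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            // Reader.read() returns one char (a UTF-16 code unit) as an int, or -1 at the end
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a real stream: the question's text, encoded as UTF-8
        byte[] data = "Why not with \u2665?".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(data));
        // prints "17 bytes, 15 chars": ♥ is one char but three bytes
        System.out.println(data.length + " bytes, " + text.length() + " chars");
    }
}
```

The reader does the grouping of bytes into characters, so the loop never sees a partial ♥.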

+5




A character that requires 2 bytes carries a flag in its first byte that tells anyone who needs to know (including text editors) that another byte follows.

In your case, you read the whole first byte, including the flag, and then read the second. It is the text editor or console that puts them back together.
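Assuming the encoding is UTF-8, that flag is the pattern of leading bits in the first byte. The sketch below (Utf8LeadByte and sequenceLength are hypothetical names) reads the pattern to work out how many bytes a character occupies. It is a simplified check and does not reject the few lead-byte values that UTF-8 forbids, such as 0xC0. Note that in UTF-8 the ♥ actually takes three bytes rather than two.

```java
import java.nio.charset.StandardCharsets;

class Utf8LeadByte {
    // Returns how many bytes the UTF-8 sequence starting with this byte occupies,
    // or 0 if the byte is a continuation byte (10xxxxxx) or matches no lead pattern.
    static int sequenceLength(int firstByte) {
        int b = firstByte & 0xFF;         // treat the value as unsigned 0..255
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx: one more byte follows
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx: two more bytes follow
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: three more bytes follow
        return 0;
    }

    public static void main(String[] args) {
        byte[] heart = "\u2665".getBytes(StandardCharsets.UTF_8); // E2 99 A5
        for (byte b : heart) {
            // prints "E2 -> 3", then "99 -> 0", then "A5 -> 0"
            System.out.println(String.format("%02X", b & 0xFF) + " -> " + sequenceLength(b));
        }
    }
}
```

This is why the byte-by-byte loop in the question still prints ♥ correctly: the three bytes arrive at the console unchanged and in order, and the console decodes them as one character.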

0

