How do I count the bytes of a string correctly?

A string containing special characters such as ç takes two bytes for each special character, but neither the string's length() method nor the length of the byte array returned by getBytes() counts the special characters as two bytes.

How do I correctly count the number of bytes in a string?

Example:

The word endereço should return a length of 9 instead of 8.



1 answer


The word endereço should return me length 9 instead of 8.

If you want a size of 9 bytes for the string "endereço", which is 8 characters long (7 ASCII characters and 1 non-ASCII character), I suppose you want to use the UTF-8 charset, which uses 1 byte for ASCII characters and more for the others.

but neither the string's length() method nor the length of the byte array returned by getBytes() counts the special characters as two bytes.


The String length() method does not answer the question "how many bytes are used?" It answers "how many UTF-16 code units (or, more simply, how many char values) are there in the string?"

String length() Javadoc:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.
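To make the difference between code units, code points and bytes concrete, here is a small sketch. The class and method names are only illustrative, and the literals use \u escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class LengthVersusBytes {

    // Number of UTF-16 code units (Java chars); this is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o";   // "endereço"
        String emoji = "\uD83D\uDE00";   // U+1F600, a single code point outside the BMP

        // prints "8 8 9"
        System.out.println(codeUnits(word) + " " + codePoints(word) + " " + utf8Bytes(word));
        // prints "2 1 4"
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji) + " " + utf8Bytes(emoji));
    }
}
```

None of the three numbers is "the" length of the string; which one you need depends on whether you are indexing chars, counting characters, or sizing a buffer for a given encoding.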


The byte[] getBytes() method with no argument encodes a String into a byte array. You can use the length property of the returned array to find out how many bytes the encoded string uses, but the result depends on the charset used for the encoding. The no-argument getBytes() method does not let you specify that charset: it uses the platform's default charset.
Thus, it may not give the expected result if the default charset of the underlying OS is not the one you want to use to encode strings into bytes.
Also, the way the string is encoded into bytes may change from one platform to another, depending on where the application is deployed. This may not be desirable.
Finally, if the string cannot be encoded in the default charset, the behavior is unspecified.
Therefore, this method should be used with great care or not at all.

byte[] getBytes() Javadoc:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

In the example with the string "endereço", if getBytes() returns an array of size 8 and not 9, it means that your OS does not use UTF-8 as its default charset but a fixed-width, 1-byte-per-character encoding such as ISO 8859-1 or one of its derived encodings, such as windows-1252 on Windows.

To find out the default charset of the Java virtual machine the application is running on, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().


Solution



The byte[] getBytes() method comes with two other very useful overloads:

  • byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException

  • byte[] java.lang.String.getBytes(Charset charset)

Unlike the getBytes() method with no argument, these methods allow you to specify the charset to use when encoding to bytes.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException

Javadoc:

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

byte[] java.lang.String.getBytes(Charset charset)

Javadoc:

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
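As a rough illustration of what "replaces" means here and of the CharsetEncoder alternative the Javadoc points to, the sketch below encodes "endereço" as US-ASCII, which cannot represent ç. The class and method names are made up for the example.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncoding {

    // Lenient: getBytes(Charset) silently substitutes the charset's replacement bytes.
    static byte[] lenientAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Strict: a CharsetEncoder configured to REPORT throws instead of substituting.
    static int strictAsciiLength(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // encode() returns a ByteBuffer; remaining() is the number of encoded bytes.
        return encoder.encode(CharBuffer.wrap(s)).remaining();
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"

        byte[] lenient = lenientAscii(word);
        // 8 bytes, and the byte at index 6 (where ç was) is '?'
        System.out.println(lenient.length + " bytes, byte 6 = " + (char) lenient[6]);

        try {
            strictAsciiLength(word);
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode as US-ASCII: " + e);
        }
    }
}
```

With the lenient call the byte count looks plausible (8) even though data was lost, which is why a byte length is only meaningful for a charset that can actually represent the string.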

You can use one or the other (although there are some subtle differences between them) to encode your string into a byte array with UTF-8 or any other charset, and thus get its size for that specific charset.
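One practical difference between the two overloads is how an unknown charset is handled. The following is a minimal sketch with hypothetical helper names; returning -1 for an unknown name is just a choice made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class OverloadComparison {

    // Charset overload: the charset is a constant checked at compile time,
    // so there is no checked exception to handle.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Name overload: the name is resolved at run time and may be unknown.
    // Returns -1 (an arbitrary convention for this sketch) in that case.
    static int lengthIn(String s, String charsetName) {
        try {
            return s.getBytes(charsetName).length;
        } catch (UnsupportedEncodingException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(utf8Length(word));                  // 9
        System.out.println(lengthIn(word, "UTF-8"));           // 9
        System.out.println(lengthIn(word, "no-such-charset")); // -1
    }
}
```

When the charset is one of the constants in StandardCharsets, the Charset overload is usually the more convenient of the two.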

For example, to get a UTF-8 byte array using getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

      

And you get the length of 9 bytes as you wish.

Here is a more detailed example that shows the default charset and the byte length with the platform's default charset, with UTF-8 and with UTF-16:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringBytesExample {

    public static void main(String[] args) throws UnsupportedEncodingException {

        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with the platform's default charset
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}

      

Output on my Windows based machine:

default charset = windows-1252
getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
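The UTF-16 result of 18 may look surprising for 8 chars at 2 bytes each. The extra 2 bytes are the byte-order mark that Java's "UTF-16" charset writes when encoding. A small sketch with illustrative names shows this, using the BOM-less UTF-16BE and UTF-16LE charsets for comparison:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Sizes {

    // Byte length of s in the given charset.
    static int bytesIn(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // True when the encoded bytes start with the big-endian byte-order mark FE FF.
    static boolean startsWithBigEndianBom(byte[] bytes) {
        return bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço", 8 chars

        System.out.println(bytesIn(word, StandardCharsets.UTF_16));   // 18 = 2 (BOM) + 8 * 2
        System.out.println(bytesIn(word, StandardCharsets.UTF_16BE)); // 16, no BOM
        System.out.println(bytesIn(word, StandardCharsets.UTF_16LE)); // 16, no BOM
        System.out.println(startsWithBigEndianBom(word.getBytes(StandardCharsets.UTF_16))); // true
    }
}
```

If you need the size of the UTF-16 payload without the mark, UTF_16BE or UTF_16LE is the charset to measure with.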
