readLine and extended ASCII table encoding

Good day.

I have an ASCII file with Spanish words. They contain only characters between A and Z, plus Ñ, ASCII code 165 ( http://www.asciitable.com/ ). I read this file with the following code:

InputStream is = ctx.getAssets().open(filenames[lang_code][w]);
InputStreamReader reader1 = new InputStreamReader(is, "UTF-8");
BufferedReader reader = new BufferedReader(reader1, 8000);

try {
    while ((line = reader.readLine()) != null) {
        workOn(line);
        // do a lot of things with line
    }
    reader.close();
    is.close();
} catch (IOException e) { e.printStackTrace(); }

      

What I'm calling workOn() here is a function that should extract the character codes from the string. It looks something like this:

    private static void workOn(String s) {
        byte b;
        for (int w = 0; w < s.length(); w++) {
            b = (byte) s.charAt(w);
            // etc etc etc
        }
    }

      

Unfortunately, I cannot identify b as an ASCII code when it represents the letter Ñ. The value of b is correct for any plain ASCII letter, but it is -3 when dealing with Ñ, which read as an unsigned value is 253, or the extended-ASCII character ². Nothing like Ñ...

What's going on here? How do I get this simple ASCII code?

What drives me crazy is that I can't find the correct encoding. Even if I look at the UTF-8 table ( http://www.utf8-chartable.de/ ), Ñ is 209 decimal, 253 decimal is ý, and 165 decimal is ¥. Again, none of these match the values I am seeing.

So ... help me please! :(

1 answer


Are you sure the file you are reading is really UTF-8 encoded? In UTF-8, all byte values above 127 are reserved for multibyte sequences and never appear on their own.

My first guess was that the file is encoded using "code page 437", which is the original IBM PC character set. In that character set, Ñ is represented by decimal 165.

Many modern systems use ISO-8859-1, which is equivalent to the first 256 characters of the Unicode character set. In both of these, Ñ is decimal 209. In a comment, the author clarified that the byte actually in the file is 209.

If the file were really UTF-8 encoded, Ñ would be represented as a two-byte sequence and would be neither 165 nor 209.
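To make the byte values above concrete, here is a minimal sketch that encodes Ñ with each charset and prints the unsigned byte values. The class and method names (EncodingBytes, unsignedBytes) are made up for illustration, and the IBM437 charset is only queried if the running JVM happens to provide it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBytes {
    // Hypothetical helper: encode s with cs and return each byte as 0..255.
    static int[] unsignedBytes(String s, Charset cs) {
        byte[] raw = s.getBytes(cs);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // mask so the value is 0..255, not -128..127
        }
        return out;
    }

    public static void main(String[] args) {
        // Written as an escape so the encoding of this source file does not matter.
        String enye = "\u00D1"; // Ñ

        // ISO-8859-1: a single byte, 209
        System.out.println(Arrays.toString(unsignedBytes(enye, StandardCharsets.ISO_8859_1)));
        // UTF-8: two bytes, 195 and 145
        System.out.println(Arrays.toString(unsignedBytes(enye, StandardCharsets.UTF_8)));
        // Code page 437 is not a guaranteed charset, so check before using it.
        if (Charset.isSupported("IBM437")) {
            System.out.println(Arrays.toString(unsignedBytes(enye, Charset.forName("IBM437"))));
        }
    }
}
```

Comparing these values with the byte actually found in the file (for example in a hex editor) is one way to tell which encoding it uses.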



Based on the assumption that the file is ISO-8859-1 encoded, you should be able to fix this problem using:

InputStreamReader reader1 = new InputStreamReader(is, "ISO-8859-1");

      

This will convert the bytes to the correct Unicode characters, and you should find the Ñ character represented by decimal 209.
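As a rough illustration of where the -3 most likely comes from, the sketch below decodes the same bytes both ways. It assumes the file holds the single byte 209 (0xD1) for Ñ, as stated in the comment mentioned above. The names DecodeDemo, SAMPLE and decode are invented for this example. A lone 0xD1 is not valid UTF-8, so the decoder substitutes the replacement character U+FFFD, and casting that char to byte keeps only the low 8 bits (0xFD, which is -3 as a signed byte).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Assumed file content: "AÑO" stored as ISO-8859-1, so Ñ is the single byte 0xD1 (209).
    static final byte[] SAMPLE = { 'A', (byte) 0xD1, 'O' };

    // Hypothetical helper: decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String wrong = decode(SAMPLE, StandardCharsets.UTF_8);
        String right = decode(SAMPLE, StandardCharsets.ISO_8859_1);

        // Malformed UTF-8 input becomes U+FFFD; the byte cast truncates it to 0xFD.
        System.out.println((int) wrong.charAt(1));   // 65533
        System.out.println((byte) wrong.charAt(1));  // -3

        // With the matching charset the char is U+00D1, i.e. 209.
        System.out.println((int) right.charAt(1));   // 209
        // A byte is signed in Java, so the cast still yields a negative number...
        System.out.println((byte) right.charAt(1));  // -47
        // ...use an int (or mask with 0xFF) to get the 0..255 code.
        System.out.println(right.charAt(1) & 0xFF);  // 209
    }
}
```

Note that even with the right charset, workOn() would see -47 rather than 209 as long as it stores the code in a byte. Holding the value in an int avoids that.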
