Xcode 4.6 (4H127) clang warns "illegal character encoding in string literal" for ISO-8859-1 encoded o-umlaut (0xF6)

This code compiled in a previous release of Xcode. I updated Xcode and now compilation fails, so I am guessing there is something wrong with my code. The question mark in the code below stands for an o-umlaut (ö) encoded as ISO-8859-1 (0xF6); we have been using this upper (or extended) ASCII. I'm guessing the problem has something to do with clang switching to UTF-8 as its input encoding?

$ xcrun -sdk macosx10.8 -run clang -v
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
Target: x86_64-apple-darwin12.2.0

$ cat test.c
#include <stdio.h>
int main( int argc, char** argv )
{
    fprintf( stderr, "?\n" );
    return 0;
}

$ xcrun -sdk macosx10.8 -run clang -o test test.c 
test.c:4:23: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
    fprintf( stderr, "<F6>\n" );
                      ^~~~
1 warning generated.

      



1 answer


So, it seems that the clang in the latest Xcode (4.6) expects UTF-8 encoded source and complains about upper (or extended) ASCII. The first 256 code points of the Universal Character Set (UCS) match ISO-8859-1, but a raw ISO-8859-1 byte such as 0xF6 in your source is not a valid UTF-8 sequence. I haven't checked the release notes to confirm that the new clang requires UTF-8, but I changed my source to contain a properly UTF-8 encoded small o-umlaut and it compiled.

0xF6 (246) is the UCS code point for a small o-umlaut. However, to encode it properly in UTF-8 you cannot simply place 0xF6 in a single byte in your file. The correct UTF-8 encoding is two bytes: 0xC3 0xB6. See details below. So open up your favorite hex editor and replace the single byte 0xF6 with the two bytes 0xC3 0xB6.

Here is a great hex editor: Hex Fiend

So, what if your problem character isn't o-umlaut? I've included a list of a few common characters, and you can follow the steps after the table to work out the UTF-8 encoding of any other character:

| Char | ISO-8859-1 |   UTF-8   |
| ---- | ---------- | --------- |
|  ©   |    0xA9    | 0xC2 0xA9 |
|  ®   |    0xAE    | 0xC2 0xAE |
|  Ä   |    0xC4    | 0xC3 0x84 |
|  Å   |    0xC5    | 0xC3 0x85 |
|  Æ   |    0xC6    | 0xC3 0x86 |
|  Ç   |    0xC7    | 0xC3 0x87 |
|  É   |    0xC9    | 0xC3 0x89 |
|  Ñ   |    0xD1    | 0xC3 0x91 |
|  Ö   |    0xD6    | 0xC3 0x96 |
|  Ü   |    0xDC    | 0xC3 0x9C |
|  ß   |    0xDF    | 0xC3 0x9F |
|  à   |    0xE0    | 0xC3 0xA0 |
|  á   |    0xE1    | 0xC3 0xA1 |
|  â   |    0xE2    | 0xC3 0xA2 |
|  ä   |    0xE4    | 0xC3 0xA4 |
|  å   |    0xE5    | 0xC3 0xA5 |
|  æ   |    0xE6    | 0xC3 0xA6 |
|  ç   |    0xE7    | 0xC3 0xA7 |
|  è   |    0xE8    | 0xC3 0xA8 |
|  é   |    0xE9    | 0xC3 0xA9 |
|  ê   |    0xEA    | 0xC3 0xAA |
|  ë   |    0xEB    | 0xC3 0xAB |
|  ì   |    0xEC    | 0xC3 0xAC |
|  í   |    0xED    | 0xC3 0xAD |
|  î   |    0xEE    | 0xC3 0xAE |
|  ï   |    0xEF    | 0xC3 0xAF |
|  ñ   |    0xF1    | 0xC3 0xB1 |
|  ò   |    0xF2    | 0xC3 0xB2 |
|  ó   |    0xF3    | 0xC3 0xB3 |
|  ô   |    0xF4    | 0xC3 0xB4 |
|  ö   |    0xF6    | 0xC3 0xB6 |
|  ù   |    0xF9    | 0xC3 0xB9 |
|  ú   |    0xFA    | 0xC3 0xBA |
|  û   |    0xFB    | 0xC3 0xBB |
|  ü   |    0xFC    | 0xC3 0xBC |
|  ÿ   |    0xFF    | 0xC3 0xBF |

      

Only lower ASCII (7-bit characters) can be encoded as a single byte in UTF-8. See http://en.wikipedia.org/wiki/UTF-8 for details.

8-11 bit code points are encoded in UTF-8 as:

110xxxxx  10xxxxxx

      

In this case, the source contains the single byte 0xF6 followed by a byte that does not begin with the bits 1 and 0 (the next byte is 0x5C, the backslash of "\n"), so the sequence is not valid UTF-8.



The correct encoding of this UCS code point (246 or 0xF6) in UTF-8 is 0xC3 0xB6, which looks like this:

11000011  10110110

      

Encoding 0xF6 means taking its lower 6 bits and placing them in the second byte, while its upper 2 bits go into the first byte. Example:

0xF6
11110110
   11    <-SPLIT->  110110
     \                 \
110xxxxx           10xxxxxx

      

Since 0xF6 is only 8 bits, the first three x bits in the first byte can be set to 0. So you get:

11000011  10110110

      

Or:

0xC3 0xB6

      

Hope this helps you correctly encode whatever file you have. I seem to run into this issue a lot with open source code. Many times the offending character is in a comment (an author's name), in which case you can simply change it to whatever you want. Sometimes you do not have permission to modify the source, in which case you need to correct the encoding and submit a fix to the maintainer.
