How do I search for a non-ASCII character in a C++ string?

string s="x1→(y1⊕y2)∧z3";

for(auto i=s.begin(); i!=s.end();i++){
    if(*i=='→'){
       ...
    }
} 

      

The char comparison is definitely wrong. What is the correct way to do this? I am using VS2013.

+3




2 answers


First, you need some basic understanding of how programs handle Unicode. If you don't have that yet, you should read up on it; I really like this post on Joel on Software.

You actually have 2 problems:

Problem #1: Getting the string into your program

Your first problem is getting this actual string into string s. Depending on the encoding of the source file, MSVC may corrupt any non-ASCII characters in that string.

  • Either save the C++ file as UTF-16 (which Windows confusingly calls Unicode), and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving the file as UTF-8 with a BOM will also work. With any other encoding, your L"..." character literals will contain the wrong characters.

    Note that other platforms may define wchar_t as 4 bytes instead of 2. Thus, handling of characters above U+FFFF will not be portable.

  • In all other cases, you cannot just write these characters in the source file. The most portable way is to encode your string literals as UTF-8, using \x escape codes for all non-ASCII characters. For example: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". The literal is split before b because a \x escape consumes every hex digit that follows it, so "\x95b" would be read as a single escape.

    And yes, it's as unreadable and unwieldy as it gets. The root problem is that MSVC doesn't support UTF-8. See this question for an overview: How to Create a UTF-8 String Literal in Visual C++ 2008.

    But also consider how often such strings will actually appear in your source code.
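As a rough sketch of the escaped-literal approach, the snippet below builds the example expression from \x escapes. The function name escaped_expression is made up for illustration; the byte values are the UTF-8 encodings of U+2192 and U+2295.

```cpp
#include <string>

// Hypothetical helper returning "x1→(a⊕b)" encoded as UTF-8.
// U+2192 (arrow) is E2 86 92 and U+2295 (circled plus) is E2 8A 95.
// The literal is split before "b" because a \x escape swallows every
// following hex digit; "\x95b" would be parsed as one escape.
inline std::string escaped_expression() {
    return "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)";
}
```

The result is 12 bytes long even though it shows only 8 characters, which is worth remembering whenever size() or an index is used on it.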

Problem #2: Finding the character



(If you're using UTF-16, you can just search for the character L'→', since that character is represented as a single wchar_t. For characters above U+FFFF, you'll have to use the wide version of the workaround below.)
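A minimal sketch of the wide-string search, assuming the wstring route is taken. The helper name find_arrow is hypothetical, and the arrow is written as the universal character name \u2192 so that the example does not depend on how the source file is saved.

```cpp
#include <string>

// Hypothetical helper: index of the first arrow (U+2192) in a wide string,
// or std::wstring::npos if there is none. The arrow fits in one wchar_t
// whether wchar_t is 2 or 4 bytes wide.
inline std::wstring::size_type find_arrow(const std::wstring& s) {
    return s.find(L'\u2192');
}
```

For L"x1\u2192(y1\u2295y2)\u2227z3" this returns 2, and here the index counts wchar_t units rather than bytes.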

There is no way to define a single char representing the arrow character. However, you can specify it as a string: "\xe2\x86\x92". (That is a string of 3 chars for the arrow, plus the \0 terminator.)

Now you can search for this string in your expression:

s.find("\xe2\x86\x92");

      

The design of UTF-8 ensures that this never matches in the middle of some other character, but keep in mind that the result is a byte offset, not a character index.
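If a character position is needed rather than a byte offset, one possible sketch is to count the bytes that start a code point. The helper code_point_index is an illustrative name, and it assumes the string is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points that begin before byte_offset.
// Assumes s is valid UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx, so every byte that does not match it starts a new code point.
inline std::size_t code_point_index(const std::string& s,
                                    std::size_t byte_offset) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset && i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Passing the value returned by s.find(...) as byte_offset gives the position counted in code points, which is still not the same thing as user-visible characters when combining marks are involved.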

+2




My comment is too long, so I present it as an answer.

The problem is that everyone is concentrating on the problem of the different encodings that Unicode can use (UTF-8, UTF-16, UCS2, etc.). But your problems are just starting here.

There is also the problem of combining characters, which can really mess up any search you attempt.



Let's say you are looking for the character "é". You look it up in Unicode as U+00E9 and search for that, but it is not guaranteed to be the only way this character is represented. The document may also contain the sequence U+0065 U+0301. It is actually exactly the same character.

Yes, not just a "symbol that looks the same", but it is exactly the same, so any software and even some programming libraries will freely convert from one to another without even telling you.

So, if you want a search that is reliable, you need something that handles not just the different Unicode encodings, but also the equivalence between precomposed characters and their combining sequences.
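To make the point concrete, here is a deliberately narrow sketch that looks for "é" in both of its forms in a UTF-8 string. It is not Unicode normalization; for the general case you would normalize both the text and the search key with a Unicode library such as ICU. The helper name find_e_acute is invented for this example.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: byte offset of the first "é" in UTF-8 text, whether
// precomposed or written as "e" plus a combining acute accent, or npos.
// This handles one character only and is NOT general normalization.
inline std::string::size_type find_e_acute(const std::string& s) {
    const std::string::size_type composed = s.find("\xc3\xa9");     // U+00E9
    const std::string::size_type decomposed = s.find("e\xcc\x81");  // U+0065 U+0301
    // npos is the largest size_type value, so min picks the earliest hit
    // and still yields npos when neither form is present.
    return std::min(composed, decomposed);
}
```

A plain s.find("\xc3\xa9") would miss the second form entirely, which is exactly the failure described above.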

+1








