How to check if a UTF-8 string starts with the character 'a'
I have a UTF-8 string stored as a null-terminated const char*. I would like to know whether the first letter of this string is a by itself. The following code
bool f(const char* s) {
    return s[0] == 'a';
}
is incorrect, since the first letter (grapheme cluster) of the string can be à, made of two Unicode scalar values: a and ̀ (U+0300 COMBINING GRAVE ACCENT). Therefore, this very simple question seems extremely difficult to answer unless you know how grapheme clusters are formed.
However, many libraries parse UTF-8 files (like YAML files) and therefore should be able to answer this question. Yet these libraries do not seem to depend on any Unicode library.
So my questions are:
- How do I write code that checks if a string starts with the letter a?
- Assuming there is no simple answer to the first question, how do parsers (such as YAML parsers) manage to parse files without being able to answer that question?
It just doesn't matter.
Consider: Is this string valid JSON?
"̀"
(This is the byte sequence 22 cc 80 22.)
You seem to be arguing that it is not: a JSON string must start with " (QUOTATION MARK), but this one instead starts with "̀ (QUOTATION MARK + COMBINING GRAVE ACCENT).
The only reasonable answer is that you are thinking at the wrong level: text serialization formats are defined in terms of code points. Grapheme clusters matter only for natural language processing and text editing.
And this is certainly accepted as valid JSON:
>>> import json
>>> json.loads(bytes.fromhex('22cc8022'))
'̀'
How do I write code that checks if a string starts with the letter a?
There is no simple answer to this question. To answer it, you would need to check the Unicode Canonical_Combining_Class (ccc) property of the code point that follows the a. If it is nonzero, that code point is a combining character.
Of course, standard C doesn't have an API for this.
How do parsers (such as YAML parsers) manage to parse files without being able to answer this question?
This is not a question they have to answer. Why? Because they never ask it.
If a YAML parser reads a key, it reads up to the terminating character (for example :). A Unicode combining character cannot combine with such a character, and the YAML spec doesn't care whether there is a combining character on the other side of the :. If the parser sees :, it knows that it has reached the end of the name, and everything before that is the key.
If it reads a text string, it similarly continues reading until it reads the terminating character or sequence of characters.
Parsing text in most text formats is based on regular expression matching (or something similar) against some termination condition. That is, a string is any run of characters from a certain set (alternatively, all characters except a certain set), up to the terminating character(s).
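To illustrate the point about termination conditions, here is a minimal sketch in C. The helper name key_length is hypothetical, and a real YAML parser handles much more (quoting, the space after the colon, and so on); the sketch only shows that scanning for an ASCII terminator works at the byte level without any knowledge of grapheme clusters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the length in bytes of the key that
   precedes the first ':' in line, or (size_t)-1 if there is no ':'.
   In UTF-8, every byte of a multi-byte sequence is >= 0x80, so the
   ASCII byte ':' (0x3A) found by strchr can only be a real colon. */
size_t key_length(const char *line) {
    const char *colon = strchr(line, ':');
    return colon ? (size_t)(colon - line) : (size_t)-1;
}
```

A key such as a followed by a combining grave accent is simply three bytes of key; the scan never needs to decide whether those bytes form one grapheme cluster or two.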
s[0] == 'a' is the correct test to see whether the first character is a. If the string contains the decomposed form of à, that is two characters: a and a combining grave accent. Until Apple decided to implement NFD all over the place, this was mostly not a problem, because people who wanted à to be considered a character/letter in its own right would enter it as one, and people who wanted it as an a with an accent attached would enter it as two. Yes, this is contrary to Unicode's intent for canonical equivalence, but Unicode's intent for canonical equivalence is largely contrary to the expectations and intentions of users (not to mention existing text and text processing models).
If you really want to check that the first character is an a with no combining marks following it, this should work:
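/* Requires <wchar.h> and <limits.h> (glibc also wants _XOPEN_SOURCE defined for wcwidth).
   mbrtowc decodes according to the current locale, so a UTF-8 locale must be active. */
wchar_t tmp = WEOF;
mbrtowc(&tmp, s+1, MB_LEN_MAX, &(mbstate_t){0});
if (tmp && wcwidth(tmp)==0) {
    /* character following 'a' is a combining mark */
}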
It depends on the POSIX function wcwidth, but you can find portable versions of it, or write your own based on Unicode tables (indeed, you could write a simpler function that only checks combining status rather than East Asian Width as well).
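As a rough idea of what such a hand-written check could look like, the sketch below inspects the bytes after the a directly. The function name starts_with_plain_a is hypothetical, and the sketch assumes valid, null-terminated UTF-8. It deliberately covers only the Combining Diacritical Marks block (U+0300 to U+036F); a complete check would need the full Unicode tables.

```c
#include <stdbool.h>

/* Sketch: true if s starts with 'a' and the next code point is NOT in
   the Combining Diacritical Marks block (U+0300..U+036F).
   Assumption: s is valid, null-terminated UTF-8. Combining marks
   outside that block are not detected. */
bool starts_with_plain_a(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] != 'a') return false;
    /* U+0300..U+036F encode as two bytes: CC 80..CC BF and CD 80..CD AF.
       p[2] is only read when p[1] is non-zero, so we stay in bounds. */
    if (p[1] == 0xCC && p[2] >= 0x80 && p[2] <= 0xBF) return false;
    if (p[1] == 0xCD && p[2] >= 0x80 && p[2] <= 0xAF) return false;
    return true;
}
```

Unlike the mbrtowc version, this does not depend on the current locale, at the cost of recognizing only one block of combining marks.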
To answer your second question about parsers, they have no reason to know or care about the issue you are concerned about. File formats like YAML, JSON, etc. are not subject to canonical equivalence (at least not at the parsing level; the content stored in the file, which applications will interpret, may be affected by it). A string that is a different sequence of Unicode characters, even if it is canonically equivalent, is a different string that compares unequal.
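A small sketch of that last point, assuming plain byte-wise comparison (the names precomposed, decomposed, and same_bytes are made up for illustration): the two canonically equivalent spellings of à are different byte sequences, so they compare unequal.

```c
#include <stdbool.h>
#include <string.h>

/* Precomposed U+00E0 versus 'a' followed by U+0300, both in UTF-8. */
const char precomposed[] = "\xc3\xa0";
const char decomposed[]  = "a\xcc\x80";

/* Byte-wise equality: no normalization is applied. */
bool same_bytes(const char *x, const char *y) {
    return strcmp(x, y) == 0;
}
```

Here same_bytes(precomposed, decomposed) is false, even though both strings render as à.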
Here is code that checks whether a UTF-8 string starts with the letter "a", also counting the precomposed accented forms of A and a:
bool f(const char* s) {
    if (s[0] == 'a') return true;
    if (strlen(s) >= 2 && s[0] == '\xc3') {
        char s1 = s[1];
        if (s1 == '\x80') return true; // LATIN CAPITAL LETTER A WITH GRAVE
        if (s1 == '\x81') return true; // LATIN CAPITAL LETTER A WITH ACUTE
        if (s1 == '\x82') return true; // LATIN CAPITAL LETTER A WITH CIRCUMFLEX
        if (s1 == '\x83') return true; // LATIN CAPITAL LETTER A WITH TILDE
        if (s1 == '\x84') return true; // LATIN CAPITAL LETTER A WITH DIAERESIS
        if (s1 == '\x85') return true; // LATIN CAPITAL LETTER A WITH RING ABOVE
        if (s1 == '\xa0') return true; // LATIN SMALL LETTER A WITH GRAVE
        if (s1 == '\xa1') return true; // LATIN SMALL LETTER A WITH ACUTE
        if (s1 == '\xa2') return true; // LATIN SMALL LETTER A WITH CIRCUMFLEX
        if (s1 == '\xa3') return true; // LATIN SMALL LETTER A WITH TILDE
        if (s1 == '\xa4') return true; // LATIN SMALL LETTER A WITH DIAERESIS
        if (s1 == '\xa5') return true; // LATIN SMALL LETTER A WITH RING ABOVE
    }
    return false;
}