Problems with Cyrillic, search by string

I need to write a character counter. If I search for s in the string, count equals 3, but if I search for a Cyrillic character (н), something is wrong. I tried searching by its code, 237, which I found in the ASCII table http://ascii.org.ru/ascii.pdf.

How can I fix this?

#include <stdio.h>
#include <string.h>

int main () {
  char str[] = "This is a string. ";
  char * pch;
  int count = 0;

  pch = strchr(str, 's');

  while (pch != NULL) {
    count++;
    pch = strchr(pch + 1, 's');
  }
  printf("%i", count);
  return 0;
}

      


2 answers


I would suggest switching to wchar_t and the wide-char functions (wcschr() etc.).

The character data in the program will then be stored in 32-bit (Linux) or 16-bit (Windows) units instead of 8-bit ones, so each Cyrillic character occupies a single element whatever the locale's encoding.

Also, if you need to work with UTF-8 (multibyte strings), you must convert the data to wchar_t with mbstowcs().



Complete example:

#include <stdio.h>
#include <wchar.h>

int main () {
  wchar_t str[] = L"This is a string. ";  /* put your Cyrillic text here */
  wchar_t * pch;
  int count = 0;

  /* L'н' is a single wchar_t element, unlike 'н' in a UTF-8 source file */
  pch = wcschr(str, L'н');

  while (pch != NULL) {
    count++;
    pch = wcschr(pch + 1, L'н');
  }
  wprintf(L"%i", count);
  return 0;
}

      



You must save the C file in a Cyrillic (single-byte) encoding.

If the file is saved in Unicode, e.g. UTF-8, н (code point U+043D) will be a two-byte character:

0xd0 0xbd    (208 189)

not

0xed         (237)

In effect, when your compiler parses the source file and encounters the line

pch = strchr(str, 'н');

it sees a multi-character constant, roughly (the exact value is implementation-defined)

pch = strchr(str, 0xd0bd);

not

pch = strchr(str, 0xed);

      

Depending on the editor, you can usually change the encoding of the file, e.g. in Vim:

set fenc=cyrillic
set fenc=iso-8859-5
etc.

      

Then

pch = strchr(pch + 1, 'н');

should work properly. You can also search for the byte value 237 directly, but the file must still be in a Cyrillic encoding (237 is н in Windows-1251, for example), since your input string has the same encoding as the source file.

Besides, looking at wchar_t is probably the best approach. But again, it all depends on the context.
