Saving int Unicode code point to UTF-8 file

Context

Debian, 64-bit. I am trying to write an int such as 233 to a file so that it shows up as the text "é".

Question

I can't figure out how I could write the equivalent of a utf8 char like "é" or any UTF-8 char that's significantly wider than the char type. The file must be human-readable in order to send it over the network.

My goal is to write an int to a file and get its utf8 equivalent.

I don't know what I'm doing.

Code:

FILE * dd = fopen("/myfile.txt","w");
fprintf(dd, "%s", 233); /* The file should print "é" */
fclose(dd);

      

Thanks.

UPDATE:

Following Biffen's comment, here is another attempt, which writes "E9" (the hex value of "é"):

int p = 233;
char r[5];
sprintf(r,"%x",p);
printf("%s\n",r);
fwrite(r,1,strlen(r),dd);
fclose(dd);

      

How do I convert it to "é"?

Update: final working code (using ICU):

UFILE * dd = u_fopen("/myfile.txt","wb", NULL, NULL);
UChar32 c = 233;
u_fputc(c,dd);
u_fclose(dd);

      

+3




4 answers


The standard library has codecvt for encoding conversions, but as far as I remember GCC, for example, does not implement it (yet).

Edit: I missed the c tag. codecvt is C++.

The algorithm for converting a Unicode code point to a sequence of UTF-8 bytes is not overly complicated, so you could implement it quite easily. Here's a page describing the procedure and here's another good resource.
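As a rough illustration of that procedure, here is a minimal sketch that encodes one code point into a caller-supplied buffer. The function name utf8_encode and its return convention are assumptions made for this example, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-8 into out
   (room for 4 bytes). Returns the number of bytes written, or 0 if
   cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not encodable */
        return 0;
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* above U+10FFFF */
}
```

For the code point 233 from the question, this fills the buffer with the two bytes 0xC3 0xA9, which can then be passed to fwrite together with the returned length.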



But if you know you will be doing a lot of Unicode work, I would recommend using a library. ICU is a popular choice.

+1




You seem to expect printf() to know about UTF-8, which it does not.

You can implement UTF-8 encoding yourself; it's a very simple encoding after all.

The solution might look like this:

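#include <stdint.h>
#include <stdio.h>

/* Write one Unicode code point to f as UTF-8 (1 to 4 bytes).
   Note: surrogates (U+D800..U+DFFF) are not rejected here. */
void put_utf8(FILE *f, uint32_t codepoint)
{
    if (codepoint <= 0x7f) {
        /* 1 byte: 0xxxxxxx */
        fprintf(f, "%c", (char) (codepoint & 0x7f));
    }
    else if (codepoint <= 0x7ff) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        fprintf(f, "%c%c", (char) (0xc0 | (codepoint >> 6)),
                           (char) (0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff) {
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        fprintf(f, "%c%c%c", (char) (0xe0 | (codepoint >> 12)),
                             (char) (0x80 | ((codepoint >> 6) & 0x3f)),
                             (char) (0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0x10ffff) {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
           (U+10FFFF is the highest Unicode code point) */
        fprintf(f, "%c%c%c%c", (char) (0xf0 | (codepoint >> 18)),
                               (char) (0x80 | ((codepoint >> 12) & 0x3f)),
                               (char) (0x80 | ((codepoint >> 6) & 0x3f)),
                               (char) (0x80 | (codepoint & 0x3f)));
    }
    else {
        /* invalid codepoint: nothing is written */
    }
}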

      



You would use it like this:

FILE *f = fopen("mytext.txt", "wb");
put_utf8(f, 233);
fclose(f);

      

This writes the two bytes 0xC3 and 0xA9 to f.

For details on UTF-8, see Wikipedia.
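One way to check that the file really contains 0xC3 0xA9 is to read it back in binary mode and print the bytes in hex. The helpers read_bytes and print_hex below are hypothetical names introduced only for this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: reads up to max bytes of the file at path into
   buf. Returns the number of bytes read, or -1 if the file cannot be
   opened. */
long read_bytes(const char *path, unsigned char *buf, size_t max)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, max, f);
    fclose(f);
    return (long) n;
}

/* Prints n bytes as two-digit uppercase hex values separated by spaces,
   followed by a newline. */
void print_hex(const unsigned char *buf, long n)
{
    for (long i = 0; i < n; i++)
        printf("%02X%s", buf[i], i + 1 < n ? " " : "\n");
}
```

Calling read_bytes("mytext.txt", buf, sizeof buf) after the put_utf8 example and passing the result to print_hex should show the two bytes of "é", assuming nothing else was written to the file.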

+5




One way to do it:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void){
    wchar_t utfchar = 233;
    setlocale(LC_CTYPE, "");
    wprintf(L"%lc\n", utfchar);
}

      

To print to a file instead, use the corresponding wide-character variant of fprintf, fwprintf.

+3




You can install the libunistring-dev package for GNU libunistring, then include <unistr.h> and use, e.g., u32_to_u8 to convert a UCS-4 string to a UTF-8 string. See the libunistring documentation. You might also use <unistdio.h>.

+1

