Workaround for glibc printf truncation error in multibyte locales?

Certain GNU/Linux (Debian-based) distributions still ship a bug in the GNU libc that causes the printf family of functions to return a bogus -1 when a specified precision truncates a multibyte character. The bug was fixed in glibc 2.17 and is present in releases up to and including 2.16. Debian has a bug filed for this, but the developers don't seem to intend to backport the fix to the glibc 2.13 used by Wheezy.

Below is the text from https://sourceware.org/bugzilla/show_bug.cgi?id=6530 .

Here's a simple test case for this bug, provided by Jonathan Nieder:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    int n;

    /* Pick up the locale from the environment (LANG/LC_CTYPE). */
    setlocale(LC_CTYPE, "");
    /* The precision makes glibc treat the argument as multibyte
       text, and \277 is not valid UTF-8. */
    n = printf("%.11s\n", "Author: \277");
    perror("printf");
    fprintf(stderr, "return value: %d\n", n);
    return 0;
}


In the C locale, it does the right thing:

$ LANG=C ./test
Author: �
printf: Success
return value: 10


But not in a UTF-8 locale, since \277 is not a valid UTF-8 sequence:

$ LANG=en_US.utf8 ./test
printf: Invalid or incomplete multibyte or wide character


It's worth noting that sprintf will also overwrite the first character of the output array with '\0' in this case.
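For illustration, here is a minimal sketch of that failure mode. It assumes an affected glibc, that an en_US.utf8 locale is installed, and that the buffer-clobbering behavior is as described above:

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    char buf[32];

    setlocale(LC_CTYPE, "en_US.utf8");  /* assumes this locale exists */
    strcpy(buf, "sentinel");

    /* "\xc3\xa9" is the two-byte UTF-8 sequence for 'é'; a precision
       of 9 cuts the 10-byte string in the middle of that sequence. */
    int n = snprintf(buf, sizeof buf, "%.9s", "Author: \xc3\xa9");

    printf("return value: %d\n", n);    /* -1 on affected glibc */
    printf("buf[0] is NUL: %s\n", buf[0] == '\0' ? "yes" : "no");
    return 0;
}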

I am currently trying to modify a MUD codebase to support UTF-8, and unfortunately the code is riddled with sprintf calls that use an arbitrary precision to limit the amount of text sent to the output buffers.

The problem is made worse by the fact that most programmers don't expect sprintf to return -1 in this context, which can lead to uninitialized reads and bad memory accesses cascading from there (I have already caught a few cases in valgrind).
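To make that failure pattern concrete, here is a hypothetical version of the kind of call site involved; the names queue_output, out, name, and width are illustrative, not from the real codebase:

#include <stdio.h>
#include <string.h>

/* Hypothetical output path of the kind described above. */
static void queue_output(char *out, const char *name, int width)
{
    char buf[4096];

    /* On an affected glibc this returns -1 whenever the precision
       truncates a multibyte character. */
    int len = sprintf(buf, "%.*s", width, name);

    /* len is used unchecked: if it is -1, the conversion to size_t
       below becomes enormous and memcpy reads far past buf. */
    memcpy(out, buf, (size_t)len);
}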

Has anyone come up with a concise workaround for this bug in their code that does not involve rewriting every single call that uses an arbitrary precision in its format string? I'm fine with truncated UTF-8 characters being written to my output buffer, since it's fairly trivial to clean them up in my output processing before writing to the socket, and it seems like overkill to put much effort into a problem that will eventually go away on its own in a few years.



1 answer


I am assuming, and it seems to be confirmed by the comments on the question, that you are not otherwise using the locale-specific functionality of the C library. In that case, you are probably better off not changing the locale to UTF-8 and leaving it in the single-byte locale your code already assumes.

When you need to process UTF-8 strings as UTF-8, you can use specialized code. It's not hard to write your own UTF-8 processing routines; you can even download the Unicode Character Database and do some fairly complex character classification. If you prefer a third-party library for handling UTF-8 strings, there is ICU, which you mentioned in your comments. It is a fairly heavyweight library, although a previous question recommends several lighter-weight alternatives.
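For example, here is a minimal sketch of such a hand-rolled routine, aimed at your "%.*s" problem: it clamps a byte budget to a UTF-8 sequence boundary so the precision never splits a character. The name utf8_truncate is illustrative, not an existing API:

#include <stddef.h>
#include <stdio.h>

/* Return the largest length <= max that does not split a UTF-8
   sequence in s. Stops early at invalid or incomplete sequences. */
static size_t utf8_truncate(const char *s, size_t max)
{
    size_t len = 0;

    while (len < max && s[len] != '\0') {
        unsigned char c = (unsigned char)s[len];
        size_t step, i;

        if      ((c & 0x80) == 0x00) step = 1;  /* ASCII           */
        else if ((c & 0xE0) == 0xC0) step = 2;  /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0) step = 3;  /* 3-byte sequence */
        else if ((c & 0xF8) == 0xF0) step = 4;  /* 4-byte sequence */
        else break;                   /* stray continuation byte   */

        if (len + step > max)
            break;                    /* whole sequence won't fit  */
        for (i = 1; i < step; i++)    /* continuation bytes there? */
            if (((unsigned char)s[len + i] & 0xC0) != 0x80)
                break;
        if (i < step)
            break;                    /* incomplete sequence       */
        len += step;
    }
    return len;
}

int main(void)
{
    const char *name = "Author: \xc3\xa9";  /* ends in a 2-byte char */

    /* Clamp the precision to a character boundary before formatting,
       so printf never has to truncate mid-character itself. */
    printf("%.*s\n", (int)utf8_truncate(name, 9), name);
    return 0;
}

With something like this in place, the existing call sites only need their precision argument wrapped, rather than a full rewrite.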



It may also be possible to switch the locale back and forth as needed so that you can keep using the C library's functionality. However, you will want to test the performance impact of this, since switching locales can be an expensive operation.
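Here is a minimal sketch of that approach using the per-thread uselocale() from POSIX.1-2008 instead of the process-global setlocale(); the wrapper name is illustrative:

#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>

/* Run one printf call in the C locale, then restore the caller's
   locale. uselocale() only affects the current thread. */
static int printf_in_c_locale(const char *fmt, const char *arg)
{
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    locale_t old;
    int n;

    if (c_loc == (locale_t)0)
        return -1;                  /* could not build the locale  */
    old = uselocale(c_loc);         /* switch this thread to "C"   */
    n = printf(fmt, arg);           /* byte-oriented, no EILSEQ    */
    uselocale(old);                 /* restore the previous locale */
    freelocale(c_loc);
    return n;
}

In a real program you would create the locale_t once and reuse it; calling newlocale() on every formatted write would defeat the purpose.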

