IMultiLanguage2::ConvertStringFromUnicode - how to avoid the escape sequence prefix?

Question

I am using IMultiLanguage2::ConvertStringFromUnicode to convert from UTF-16. For some languages (Japanese, Chinese, Korean), the output starts with an escape sequence (for example, 0x1B, 0x24, 0x29, 0x43 for code page 50225 (ISO-2022 Korean)). WideCharToMultiByte exhibits the same behavior.

I am creating a MIME message, so the encoding is specified in the header itself, and the escape prefix is displayed as-is.

Is there a way to convert without a prefix?

Thanks!

winapi unicode character-encoding utf-16

Dmitry Streblechenko May 28 '15 @ 4:20 am

2 answers

Remy Lebeau · Answer 1 · 2015-05-29T03:56:24+0000

I really don't see a problem here. This is a valid byte sequence in ISO 2022:

Escape sequences to designate character sets take the form ESC I [I ...] F, where there are one or more intermediate I bytes from the range 0x20-0x2F and a final F byte from the range 0x40-0x7F. (The range 0x30-0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
...
Code: ESC $ ) F
Hex: 1B 24 29 F
Abbr: G1DM4
Name: G1-designate multibyte 94-set F
Effect: Selects a 94^n-character set to be used for G1.
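
To make the quoted structure concrete, here is a minimal Python sketch that splits a designation sequence into its intermediate and final bytes. The names `Designation` and `parse_designation` are hypothetical, and the upper bound of 0x7E for the final byte is an assumption on my part (0x7F is DEL); treat it as an illustration of the layout rather than a complete ISO 2022 parser.

```python
from typing import NamedTuple, Optional

ESC = 0x1B


class Designation(NamedTuple):
    intermediates: bytes  # one or more bytes in 0x20-0x2F
    final: int            # the byte that names the character set
    private_use: bool     # True when the final byte is in 0x30-0x3F


def parse_designation(data: bytes) -> Optional[Designation]:
    """Parse a leading 'ESC I [I ...] F' sequence, or return None."""
    if not data or data[0] != ESC:
        return None
    pos = 1
    # Collect the intermediate bytes.
    while pos < len(data) and 0x20 <= data[pos] <= 0x2F:
        pos += 1
    # At least one intermediate byte and a following final byte are required.
    if pos == 1 or pos >= len(data):
        return None
    final = data[pos]
    # Assumed final-byte range; 0x30-0x3F is flagged as private use.
    if not 0x30 <= final <= 0x7E:
        return None
    return Designation(data[1:pos], final, final < 0x40)


# The prefix from the question: 1B 24 29 43, that is ESC $ ) C.
print(parse_designation(bytes([0x1B, 0x24, 0x29, 0x43])))
```

For the bytes in the question, the intermediates are `$` and `)` and the final byte is `C`, which matches the G1DM4 row quoted above.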

Since F is 0x43 (C), this byte sequence tells the decoder to switch to ISO-2022-KR:

Character encodings using the ISO/IEC 2022 mechanism include:
...
ISO-2022-KR. An encoding for the Korean language.
ESC $ ) C to switch to KS X 1001-1992, previously named KS C 5601-1987 (2 bytes per character) [designated to G1]

In this case, you must specify iso-2022-kr as the charset in the MIME Content-Type header or in the RFC 2047 encoded-word. But an ISO 2022 decoder still needs to be able to switch character sets dynamically while decoding, so the data needs to include an initial sequence that switches to the Korean character set.

Is there a way to convert without a prefix?

Not with IMultiLanguage2 and WideCharToMultiByte(), no. They have no idea how you are going to use their output, so it makes sense that they include the initial switch sequence to the Korean character set: a decoder that has no access to charset information from MIME (or another source) will still know which character set to use first.

When you put the data in a MIME message, you will have to strip the character set switch sequence manually when you set the MIME charset to iso-2022-kr. If you don't want to strip it by hand, you'll have to find (or write) a Unicode encoder that doesn't output this initial switch sequence.
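
As a rough sketch of what stripping the sequence by hand could look like, the Python snippet below removes a leading ESC $ ) C from already-encoded bytes. The name `strip_kr_designator` is hypothetical and the payload bytes in the usage line are placeholders, not the output of any real encoder. Whether removing the sequence is safe depends on what will decode the data.

```python
# The ISO-2022-KR designation sequence from the question: 1B 24 29 43.
KR_DESIGNATOR = b"\x1b$)C"


def strip_kr_designator(encoded: bytes) -> bytes:
    """Drop a leading ESC $ ) C if present; leave other input unchanged."""
    if encoded.startswith(KR_DESIGNATOR):
        return encoded[len(KR_DESIGNATOR):]
    return encoded


# Placeholder payload: the designator followed by arbitrary bytes.
print(strip_kr_designator(KR_DESIGNATOR + b"\x0eGQ\x0f"))
```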

Dmitry Streblechenko · Answer 2 · 2015-05-29T07:12:09+0000

It was a red herring: it turned out that the escape sequence was needed. The problem was that my code was trimming names and addresses using Delphi's Trim() function, which removes all leading and trailing characters less than or equal to a space (0x20), and that includes the escape character (0x1B).

Switching to my own trim function, which only removes spaces, fixes the problem.
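
The difference between the two trimming strategies can be sketched in Python (the original code was Delphi; both function names here are hypothetical, and the payload bytes are placeholders). The first function imitates a trim that drops every leading or trailing byte at or below 0x20, the second removes only spaces.

```python
def trim_control_and_space(data: bytes) -> bytes:
    """Imitate a trim that removes leading/trailing bytes <= 0x20."""
    start, end = 0, len(data)
    while start < end and data[start] <= 0x20:
        start += 1
    while end > start and data[end - 1] <= 0x20:
        end -= 1
    return data[start:end]


def trim_spaces_only(data: bytes) -> bytes:
    """Remove only leading/trailing space characters (0x20)."""
    return data.strip(b" ")


# Space-padded placeholder: ESC $ ) C, then SO, two payload bytes, SI.
padded = b" \x1b$)C\x0eGQ\x0f "

print(trim_control_and_space(padded))  # ESC (0x1B) and the trailing SI are lost
print(trim_spaces_only(padded))        # escape sequence survives intact
```

With the aggressive trim, the leading 0x1B is removed, so what remains starts with the literal characters `$)C`, which is consistent with the prefix appearing as visible text in the message.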
