Delphi - converting string from UTF-8
I'm having a problem converting a UTF-8 encoded string into something Delphi can use. The app is written in XE8 and is deployed on Windows and OS X. It uses the LimeLM API, as a DLL on Windows and as a dylib on OS X. Everything works fine on Windows; the problem is converting strings returned from the dylib on OS X. I understand that all strings passed to and returned from the dylib must be UTF-8 encoded. The LimeLM function returns a PWideChar, which I believe is UTF-8 encoded. But no matter which function I use to convert the value to something useful in Delphi, all I get is garbage.
Here's the function:
class function TurboActivate.GetFeatureValue(featureName: String): String;
var
  value : PWideChar;
  FieldName : PWideChar;
  tmpStr : String;
begin
  {$IFDEF MSWINDOWS}
  FieldName := PWideChar(featureName);
  {$ENDIF}
  {$IFDEF MACOS}
  FieldName := PWideChar(UTF8Encode(featureName));
  {$ENDIF}
  value := GetFeatureValue(FieldName, nil);
  if (value = '') then
  begin
    raise ETurboActivateException.Create('Failed to get feature value. The feature doesn''t exist.');
  end;
  {$IFDEF MSWINDOWS}
  Result := value;
  {$ENDIF}
  {$IFDEF MACOS}
  tmpStr := UTF8ToString(value);
  ShowMessage(tmpStr);
  tmpStr := UTF8ToWideString(value);
  ShowMessage(tmpStr);
  tmpStr := UTF8ToUnicodeString(value);
  ShowMessage(tmpStr);
  tmpStr := UTF8ToAnsi(value);
  ShowMessage(tmpStr);
  Result := tmpStr;
  {$ENDIF}
end;
There is definitely a value to decode:
value = '散汤湡獤杀潯汧浥楡潣m䌴䅓㜭䙇ⵊ䵙㑗㈭呖ⵆ䥉儵䈭呎́'#4
but tmpStr always contains '?????????? c ?????? /'.
Any help would be greatly appreciated.
value = '散汤湡獤杀潯汧浥楡潣m䌴䅓㜭䙇ⵊ䵙㑗㈭呖ⵆ䥉儵䈭呎́'#4
This indicates that you are interpreting 8-bit text, presumably UTF-8 encoded, as if it were UTF-16 encoded. Typically, when a UTF-16 string shows Chinese characters, it is either genuine Chinese text that has been interpreted correctly, or 8-bit text that has been misinterpreted.
When you correctly interpret this text as UTF-8, it is:
cedlands@googlemail.com 4CSA-7GFJ-YMW4-2VTF-II5Q-BNTA♥♦
I got it with this code:
Writeln(TEncoding.UTF8.GetString(
TEncoding.Unicode.GetBytes('散汤湡獤杀潯汧浥楡潣m䌴䅓㜭䙇ⵊ䵙㑗㈭呖ⵆ䥉儵䈭呎́'#4)));
Note that if you look at the byte array returned by TEncoding.Unicode.GetBytes('散汤湡獤杀潯汧浥楡潣m䌴䅓㜭䙇ⵊ䵙㑗㈭呖ⵆ䥉儵䈭呎́'#4), you will see that it contains a null byte. So the actual null-terminated string ends right after the email address.
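The misinterpretation can also be reproduced in the other direction. The following is only a sketch: it starts from the email address plus its null terminator, reads the UTF-8 bytes as if they were UTF-16, and then reverses the mistake. The program name and variable names are made up for illustration.

```delphi
program MojibakeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Utf8Bytes: TBytes;
  Misread, Recovered: string;
begin
  // 23 ASCII characters plus the null terminator = 24 bytes, an even count.
  Utf8Bytes := TEncoding.UTF8.GetBytes('cedlands@googlemail.com'#0);

  // Wrong: treat every pair of UTF-8 bytes as one UTF-16LE code unit.
  Misread := TEncoding.Unicode.GetString(Utf8Bytes);
  Writeln(Length(Misread));  // 12 code units

  // First code unit is $6563: 'c' ($63) in the low byte, 'e' ($65) in the high byte.
  Writeln(IntToHex(Integer(Ord(Misread.Chars[0])), 4));

  // Reverse the mistake: back to the raw bytes, then decode them as UTF-8.
  Recovered := TEncoding.UTF8.GetString(TEncoding.Unicode.GetBytes(Misread));
  Writeln(Recovered);
end.
```

The last code unit of Misread is $006D, that is 'm' followed by the null byte. A reader that expects UTF-16 does not treat that as a terminator, because it is looking for a full 16-bit zero.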
The problems start here:
value : PWideChar;
....
value := GetFeatureValue(FieldName, nil);
GetFeatureValue actually returns PAnsiChar, and the payload is UTF-8 encoded, if I'm interpreting you correctly.
So, you need to make the following changes:
- Change the return type of GetFeatureValue to PAnsiChar.
- Change the type of value to PAnsiChar.
- Convert value to a string with UnicodeFromLocaleChars or TEncoding.GetString.
It might look like this:
var
Bytes: TBytes;
....
SetLength(Bytes, StrLen(value));
Move(value^, Pointer(Bytes)^, Length(Bytes));
str := TEncoding.UTF8.GetString(Bytes);
For the data in question, this sets str to cedlands@googlemail.com. As mentioned above, the data contains a null terminator, but that byte fails to terminate the string when the data is mistakenly interpreted as UTF-16. That is, the text 4CSA-7GFJ-YMW4-2VTF-II5Q-BNTA♥♦ comes from reading past the end of the buffer.
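The conversion step could be wrapped in a small helper, sketched below under the assumption that the library hands back a null-terminated UTF-8 buffer as a PAnsiChar. Utf8PtrToString is a hypothetical name, not part of Delphi or LimeLM, and the buffer is simulated here with a UTF8String instead of a real library call.

```delphi
program Utf8PtrDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.AnsiStrings;

// Hypothetical helper: converts a null-terminated UTF-8 buffer to a Delphi string.
function Utf8PtrToString(P: PAnsiChar): string;
var
  Bytes: TBytes;
begin
  Result := '';
  // Nothing to convert for a nil pointer or an empty buffer.
  if (P = nil) or (P^ = #0) then
    Exit;
  // StrLen counts bytes up to, but not including, the null terminator.
  SetLength(Bytes, System.AnsiStrings.StrLen(P));
  Move(P^, Pointer(Bytes)^, Length(Bytes));
  Result := TEncoding.UTF8.GetString(Bytes);
end;

var
  Raw: UTF8String;
begin
  // Stand-in for the buffer the library would return (assumption).
  Raw := UTF8String('cedlands@googlemail.com');
  Writeln(Utf8PtrToString(PAnsiChar(Raw)));  // prints cedlands@googlemail.com
end.
```

With a helper like this, the check for a missing feature becomes a test for nil or an empty result rather than a comparison of a PWideChar against ''.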