UTF16 in text format
I have a UTF-16 hexadecimal representation such as "0633064406270645" which is "سلام" in Arabic.
I would like to convert it to its text equivalent. Is there a straight forward way to do this in PostgreSQL?
I can convert the UTF code point as shown below; unfortunately it looks like UTF16 is not supported. Any ideas on how to do this in PostgreSQL, worst case scenario I write a function?
SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8');
"سلام"
SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16');
ERROR: invalid source encoding name "UTF16"
********** Error **********
source to share
That's right, Postgres doesn't support UTF-16.
However, it does support Unicode escape sequences :
SELECT U&'\0633\0644\0627\0645'
But keep in mind that Unicode code points and UTF-16 code units are only equivalent in the Basic Multilingual Plan . In other words, if you have UTF-16 characters that span multiple 16-bit blocks of code, you will need to translate them to the corresponding code point yourself.
source to share
convert_from
(or PostgreSQL in general) does not support UTF-16
, but you can refer to one of the optional languages.
Example in plperlu
(to create a function, you need database superuser privileges and CREATE LANGUAGE plperlu
, if not already created):
CREATE FUNCTION decode_utf16(text) RETURNS text AS $$
require Encode;
return Encode::decode("UTF-16BE", pack("H*", $_[0]));
$$ immutable language plperlu;
=> select decode_utf16('0633064406270645');
decode_utf16
--------------
سلام
source to share
PostgreSQL does not natively support UTF-16. I suggest you convert your data to UTF-8 before posting it to the DB. If it's too late (invalid data already exists in your DB), you can use these serving functions to convert data from UTF-16 (logic copied from wikipedia ):
-- convert from bytea, containing UTF-16-BE data
CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?')
RETURNS text
LANGUAGE sql
IMMUTABLE
STRICT
AS $function$
WITH source(unit) AS (
SELECT (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1)
FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
FROM source
)
SELECT string_agg(CASE
WHEN unit >= 56320 AND unit <= 57343 THEN CASE
WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
ELSE invalid_replacement
END
WHEN unit >= 55296 AND unit <= 56319 THEN CASE
WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
ELSE invalid_replacement
END
ELSE chr(unit)
END, '')
FROM codes
$function$;
-- convert from bytea, containing UTF-16-LE data
CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?')
RETURNS text
LANGUAGE sql
IMMUTABLE
STRICT
AS $function$
WITH source(unit) AS (
SELECT get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8)
FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
FROM source
)
SELECT string_agg(CASE
WHEN unit >= 56320 AND unit <= 57343 THEN CASE
WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
ELSE invalid_replacement
END
WHEN unit >= 55296 AND unit <= 56319 THEN CASE
WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
ELSE invalid_replacement
END
ELSE chr(unit)
END, '')
FROM codes
$function$;
-- convert from bytea, containing UTF-16 data (with or without BOM)
CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?')
RETURNS text
LANGUAGE sql
IMMUTABLE
STRICT
AS $function$
SELECT CASE COALESCE(octet_length(utf16_data), 0)
WHEN 0 THEN ''
WHEN 1 THEN invalid_replacement
ELSE CASE substring(utf16_data FOR 2)
WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement)
ELSE convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement)
END
END
$function$;
With these functions, you can convert from all UTF-16 types:
SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')),
convert_from_utf16le(decode('330644062706450652D862DF', 'hex')),
convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')),
convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex'));
-- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16
------------------------+----------------------+--------------------+-------------------
-- سلام𤭢 | سلام𤭢 | سلام𤭢 | سلام𤭢
source to share