Removing Unicode characters during query in Hive
I want to clear unicode data from Hive table. Below is the data,
select ('http://10.0.0.1/���m��v������)�a�^�����kn:4�+9x�2c��m�{��')
My required output is to find if my column has Unicode characters and remove it. The conclusion here should be,
http://10.0.0.1/
or completely null. Any of them are fine. If the string contains any unicode character, then it is quite correct to make it null.
Below are my results,
select REGEXP_REPLACE('http://10.0.0.1/���m��v������)�a�^�����kn:4�+9x�2c��m�{��', '\\[[:xdigit:]]{4}', '')
and
select REGEXP_REPLACE('http://10.0.0.1/���m��v������)�a�^�����kn:4�+9x�2c��m�{��', '[||chr(128)||'-'||chr(255)||]', '')
Executed as Single statement. Failed [40000 : 42000] Error while compiling statement: FAILED: ParseException line 1:193 mismatched input '<EOF>' expecting ) near ')' in function specification
Elapsed time = 00:00:00.220
STATEMENT 1: SELECT Statement failed.
Can someone help me clean them on my desk?
thank
Edit:
Places where he works
select REGEXP_REPLACE('"http://r.rxthdr.com/w?i=s�F�""�HY|�K�>�0����D����W8뤒�O0�Q�D�1��Vc~�j[Q��f��{u�Be�S>n���Ò���&��F9���C�i��8:Ú"�_@ĪO��K?�Ēc�6��=��v[�����D�$%��:�a�40ݩ�&O��K��""�0�a<x��TcX���b��TN�}�x�o��UY$K�I�Օ""��(+�M���E�=K�A�I�A���q#l�(�yt�5��h}��~[��YOA��G�=ïˆï¿½{���. �Q���Ø;x=�s�0:�', '(?s).*\\P{ASCII}.*', '')
The places where he doesn't work
select REGEXP_REPLACE('c4k0j,}W""d+2|4y0hkCkRh+.{pq80{?X8O>b<:ph.3!{T', '(?s).*\\P{ASCII}.*', '')
select REGEXP_REPLACE('z|""},}69]6N2|c_;5.su={IU+|8ubq1<r$!Xxy#?Bhkv20:jXNgRh+5fwj:ndfWBJ}e)>','(?s).*\\P{ASCII}.*', '')
The first one in the image has a unicode character. But when inserted, it becomes a point.
Could you help me with this?
source to share
you can use
select REGEXP_REPLACE(YOUR_STRING_HERE, '\\P{ASCII}.*', '')
It will remove the entire string to the end from the first non-ASCII char it finds.
Regex for hive supports Unicode property classes and \p{ASCII}
matches any ASCII char. Opposite Unicode property classes are formed by turning p
to uppercase. So, \p{ASCII}
matches any char that is not ASCII. .*
matches any 0+ characters as much as possible, since *
is a greedy quantifier.
Note that there .
is no line break by default. If you need to remove line breaks, add (?s)
at the beginning of the template:
'(?s)\\P{ASCII}.*'
source to share