Removing Unicode characters during query in Hive

Question

I want to remove Unicode characters from data in a Hive table. Here is the data:

select ('http://10.0.0.1/ï¿½ï¿½ï¿½mï¿½ï¿½vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½)ï¿½aï¿½^ï¿½ï¿½ï¿½ï¿½ï¿½kn:4ï¿½+9xï¿½2cï¿½ï¿½mï¿½{ï¿½ï¿½')

I need to detect whether my column contains Unicode characters and remove them. The output here should be

http://10.0.0.1/

or NULL entirely. Either is fine: if the string contains any Unicode character, making the whole value NULL is acceptable.

Below are my attempts:

 select REGEXP_REPLACE('http://10.0.0.1/ï¿½ï¿½ï¿½mï¿½ï¿½vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½)ï¿½aï¿½^ï¿½ï¿½ï¿½ï¿½ï¿½kn:4ï¿½+9xï¿½2cï¿½ï¿½mï¿½{ï¿½ï¿½', '\\[[:xdigit:]]{4}', '')

and

 select REGEXP_REPLACE('http://10.0.0.1/ï¿½ï¿½ï¿½mï¿½ï¿½vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½)ï¿½aï¿½^ï¿½ï¿½ï¿½ï¿½ï¿½kn:4ï¿½+9xï¿½2cï¿½ï¿½mï¿½{ï¿½ï¿½', '[||chr(128)||'-'||chr(255)||]', '')

Executed as Single statement.  Failed [40000 : 42000] Error while compiling statement: FAILED: ParseException line 1:193 mismatched input '<EOF>' expecting ) near ')' in function specification 
Elapsed time = 00:00:00.220 

STATEMENT 1: SELECT Statement failed.

Can someone help me clean these values in my table?

Thanks

Edit:

Cases where it works:

select REGEXP_REPLACE('"http://r.rxthdr.com/w?i=sï¿½Fï¿½""ï¿½HY|ï¿½Kï¿½>ï¿½0ï¿½ï¿½ï¿½ï¿½Dï¿½ï¿½ï¿½ï¿½W8ë¤’ï¿½O0ï¿½Qï¿½Dï¿½1ï¿½ï¿½Vc~ï¿½j[Qï¿½ï¿½fï¿½ï¿½{uï¿½Beï¿½S>nï¿½ï¿½ï¿½Òï¿½ï¿½ï¿½&ï¿½ï¿½F9ï¿½ï¿½ï¿½Cï¿½iï¿½ï¿½8:Ú"ï¿½_@ÄªOï¿½ï¿½K?ï¿½Ä’cï¿½6ï¿½ï¿½=ï¿½ï¿½v[ï¿½ï¿½ï¿½ï¿½ï¿½Dï¿½$%ï¿½ï¿½:ï¿½aï¿½40Ý©ï¿½&Oï¿½ï¿½Kï¿½ï¿½""ï¿½0ï¿½a<xï¿½ï¿½TcXï¿½ï¿½ï¿½bï¿½ï¿½TNï¿½}ï¿½xï¿½oï¿½ï¿½UY$Kï¿½Iï¿½Õ•""ï¿½ï¿½(+ï¿½Mï¿½ï¿½ï¿½Eï¿½=Kï¿½Aï¿½Iï¿½Aï¿½ï¿½ï¿½q#lï¿½(ï¿½ytï¿½5ï¿½ï¿½h}ï¿½ï¿½~[ï¿½ï¿½YOAï¿½ï¿½Gï¿½=ïˆï¿½{ï¿½ï¿½ï¿½. ï¿½Qï¿½ï¿½ï¿½Ø;x=ï¿½sï¿½0:ï¿½', '(?s).*\\P{ASCII}.*', '')

Cases where it doesn't work:

 select REGEXP_REPLACE('c4k0j,}W""d+2|4y0hkCkRh+.{pq80{?X8O>b<:ph.3!{T', '(?s).*\\P{ASCII}.*', '')

 select REGEXP_REPLACE('z|""},}69]6N2|c_;5.su={IU+|8ubq1<r$!Xxy#?Bhkv20:jXNgRh+5fwj:ndfWBJ}e)>','(?s).*\\P{ASCII}.*', '')

The first one has a Unicode character in the image, but when pasted here it turns into a dot.

Could you help me with this?

sql regex unicode hadoop hive

haimen Jul 26 17 at 21:29

1 answer

Wiktor Stribiżew · Accepted Answer · 2017-07-26T21:52:34+0000

You can use

select REGEXP_REPLACE(YOUR_STRING_HERE, '\\P{ASCII}.*', '')

It removes everything from the first non-ASCII character it finds to the end of the string.

Hive regex supports Unicode property classes, and \p{ASCII} matches any ASCII character. The opposite class is formed by turning the p to uppercase, so \P{ASCII} matches any character that is not ASCII. .* matches zero or more characters, as many as possible, since * is a greedy quantifier.

Note that . does not match line break characters by default. If you need to remove line breaks as well, add (?s) at the beginning of the pattern:

'(?s)\\P{ASCII}.*'
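
The question also allows turning the whole value into NULL when it contains any non-ASCII character. A minimal sketch of both options side by side follows. The table name url_log and the column name url are hypothetical placeholders, not names from the question, and it relies on RLIKE matching when the pattern is found anywhere in the string:

```sql
-- Hypothetical table and column names: url_log, url
SELECT
  url,
  -- Option 1: keep only the part before the first non-ASCII character
  regexp_replace(url, '(?s)\\P{ASCII}.*', '') AS url_truncated,
  -- Option 2: return NULL if the value contains any non-ASCII character
  CASE WHEN url RLIKE '\\P{ASCII}' THEN NULL ELSE url END AS url_or_null
FROM url_log;
```

Option 2 leaves pure-ASCII values untouched, which may be preferable when a partially truncated URL is of no use.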
