Hive RegexSerDe does not give correct output

I am trying to parse the input line below using Hive RegexSerDe, but I am not getting the expected result. I don't know whether the problem is in my regex or in RegexSerDe. The regex works as expected in an online regex tester, but not in Hive RegexSerDe. Can someone please help me understand what is wrong here?

I am using Apache Hive 0.9.0.


1 :: Toy Story (1995) :: Adventure | Animation | Children | Comedy | Fantasy

My expected output:

1 Toy Story 1995 Adventure | Animation | Children | Comedy | Fantasy

My Hive query:

CREATE TABLE myMovie3(  
id STRING,  
name STRING,  
year STRING,  
category STRING)  
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'  
WITH SERDEPROPERTIES ("input.regex" = "^(.*?)::(.*)\(([0-9]*)\)::(.*)$","output.format.string" = "%1$s %2$s %3$s %4$s") 


The actual output I got from the regex:

hive> select * from mymovie3;  
1   Toy Story (1995)




1 answer

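The regex is the cause. It is fine in a normal regex context, but RegexSerDe is a Java class and receives the pattern through a quoted string literal, in which the backslash is itself an escape character. The backslashes therefore need to be escaped: write \\( and \\) instead of \( and \).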
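The answer's original snippet did not survive on this page. A likely reconstruction, assuming the only change needed is doubling each backslash in "input.regex" and leaving the rest of the question's statement as it was:

```sql
-- Sketch of the corrected DDL: identical to the statement in the question,
-- except that \( and \) are written as \\( and \\) so that the regex engine
-- receives a literal backslash before each parenthesis.
CREATE TABLE myMovie3(
  id STRING,
  name STRING,
  year STRING,
  category STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(.*?)::(.*)\\(([0-9]*)\\)::(.*)$",
  "output.format.string" = "%1$s %2$s %3$s %4$s"
);
```

With the sample line, which has spaces around each "::", the captured fields will probably keep that surrounding whitespace. If that matters, one option (an untested variation, not part of the original answer) is to match the separators as \\s*::\\s* and make the name group lazy.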