Using JSON-SerDe in Hive Tables
I am trying to use the JSON SerDe described at http://code.google.com/p/hive-json-serde/wiki/GettingStarted. I created the table like this:
CREATE TABLE my_table (field1 string, field2 int,
field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;
I added the JSON SerDe jar with:
ADD JAR /path-to/hive-json-serde.jar;
and loaded the data with:
LOAD DATA LOCAL INPATH '/home/hduser/pradi/Test.json' INTO TABLE my_table;
The data loads successfully. But when I query the table with
Select * from my_table ;
I get only one row back:
data1 100 more data1 123.001
Test.json contains:
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
Where is the problem? Why do I get only one row instead of 4 when I query the table, even though /user/hive/warehouse/my_table contains all 4 rows?
hive> add jar /home/hduser/pradeep/hive-json-serde-0.2.jar;
Added /home/hduser/pradeep/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradeep/hive-json-serde-0.2.jar
hive> CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
> field3 string, field4 double)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> WITH SERDEPROPERTIES (
> "field1"="$.field1",
> "field2"="$.field2",
> "field3"="$.field3",
> "field4"="$.field4"
> );
OK
Time taken: 0.088 seconds
hive> LOAD DATA LOCAL INPATH '/home/hduser/pradi/test.json' INTO TABLE my_table;
Copying data from file:/home/hduser/pradi/test.json
Copying file: file:/home/hduser/pradi/test.json
Loading data to table default.my_table
OK
Time taken: 0.426 seconds
hive> select * from my_table;
OK
data1 100 more data1 123.001
Time taken: 0.17 seconds
I have already posted the contents of the test.json file above, so you can see that the query returns only one row:
data1 100 more data1 123.001
I changed the JSON file to employee.json, which contains:
{"firstName": "Mike", "lastName": "Chepesky", "employeeNumber": 1840192}
and also changed the table, but it shows NULL values when I query the table:
hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar
hive> create EXTERNAL table employees_json (firstName string, lastName string, employeeNumber int )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds
hive> load data local inpath '/home/hduser/pradi/employees.json' into table employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds
hive> select * from employees_json;
OK
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Time taken: 0.194 seconds
It is hard to tell what is happening without logs (see Getting Started). Just a guess: can you check whether it works with WITH SERDEPROPERTIES, like this:
CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
"field1"="$.field1",
"field2"="$.field2",
"field3"="$.field3",
"field4"="$.field4"
);
There is also a fork you can try from ThinkBigAnalytics.
UPDATE: It turns out the input in Test.json is invalid JSON, hence the records are dropped.
See the answer at fooobar.com/questions/1425186/... for details.
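To check the input before loading it, each line of the file can be parsed on its own, since the table expects one JSON object per line. The following is only a minimal Python sketch of that check; the function name `find_invalid_lines` is hypothetical and not part of Hive or the SerDe.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, message) for every line that is not a JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are ignored in this sketch
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            # exc.msg is the parser's short description of the problem
            problems.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            # e.g. a whole array on one line instead of one object per line
            problems.append((number, "not a JSON object"))
    return problems
```

It can be called with an open file, for example `find_invalid_lines(open("Test.json"))`, and an empty result means every non-blank line parsed as an object.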
- First of all, validate your JSON file at http://jsonlint.com/. After that, put one record per line and remove the []. The comma at the end of the line is required.
[{"field1": "data1", "field2": 100, "field3": "more data1", "field4": 123.001}, {"field1": "data2", "field2": 200, "field3": "more data2", "field4": 123.002}, {"field1": "data3", "field2": 300, "field3": "more data3", "field4": 123.003}, {"field1": "data4", "field2": 400, "field3": "more data4", "field4": 123.004}]
- In my test, I added hive-json-serde-0.2.jar from the Hadoop cluster; I think hive-json-serde-0.1.jar should work as well.
ADD JAR hive-json-serde-0.2.jar;
- Create the table:
CREATE TABLE my_table (field1 string, field2 int, field3 string, field4 double) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
- Load the JSON data file. Here I am loading it from the Hadoop cluster, not from the local file system:
LOAD DATA INPATH 'Test2.json' INTO TABLE my_table;
To parse JSON as described on the Hive wiki (cwiki/Confluence), we need to follow a few steps:
- Load hive-hcatalog-core.jar:
hive> add jar /path/hive-hcatalog-core.jar;
- Create the table:
create table tablename (colname1 datatype, .....) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as ORCFILE;
- The column names in the table definition and the key names in test.json must be the same; otherwise the query shows NULL values. Hope this helps.
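The name check in the last step can be done before loading the file. Below is a minimal Python sketch, assuming one JSON object per line; `unmatched_keys` is a hypothetical helper, and the exact, case-sensitive comparison is an assumption about how a given SerDe matches names. Hive itself stores column names in lower case, so mixed-case keys such as firstName are worth checking.

```python
import json


def unmatched_keys(columns, json_line):
    """Return the columns that have no identically named key in the record.

    Assumption: names are compared exactly (case-sensitive); a particular
    SerDe may be more lenient than this.
    """
    record = json.loads(json_line)
    return [name for name in columns if name not in record]
```

For the employee record from the question, `unmatched_keys(["firstname", "lastname", "employeenumber"], line)` returns all three column names, because the keys in the file are written in camelCase.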
I have a question: what if we have data like the following?
{"Is_student": true, "color": ["red", "yellow", "orange"], "jump": 19.5, "nick": "poop"}
{"Is_student": false, "color": ["red", "yellow", "black"], "jump": 129.5, "nick": "stars"}
{"Is_student": false, "color": ["pink", "gold"], "jump": 222.56, "nick": "Fiat"}
One record has 3 colors and another has 2. How can we define the table schema for this using the JSON SerDe?
I tried it, but it gives output like this:
json2.is_student - json2.color - json2.jump - json2.nick
true ["red", "yellow", "orange"] 19.5 poop
false ["red", "yellow", "black"] 129.5 stars
false ["pink", "gold"] 222.56
where the schema looks like this:
CREATE TABLE json2 (
Is_student boolean,
color array<string>,
jump double,
nick string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;