Using JSON-SerDe in Hive Tables

I am trying to use the JSON SerDe from the link below: http://code.google.com/p/hive-json-serde/wiki/GettingStarted

         CREATE TABLE my_table (field1 string, field2 int, 
                                     field3 string, field4 double)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;

      

I added the JSON SerDe jar with

          ADD JAR /path-to/hive-json-serde.jar;

      

and loaded the data with

LOAD DATA LOCAL INPATH  '/home/hduser/pradi/Test.json' INTO TABLE my_table;

      

The data loads successfully.

But when I query the data with

Select * from my_table ;

I get only one row from the table:

data1 100 more data1 123.001

Test.json contains

{"field1":"data1","field2":100,"field3":"more data1","field4":123.001} 

{"field1":"data2","field2":200,"field3":"more data2","field4":123.002} 

{"field1":"data3","field2":300,"field3":"more data3","field4":123.003} 

{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

      

Where is the problem? Why do I get only one row instead of 4 when I query the table? /user/hive/warehouse/my_table contains all 4 rows!


hive> add jar /home/hduser/pradeep/hive-json-serde-0.2.jar;
Added /home/hduser/pradeep/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradeep/hive-json-serde-0.2.jar

hive> CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
>                                 field3 string, field4 double)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> WITH SERDEPROPERTIES (
>   "field1"="$.field1",
>   "field2"="$.field2",
>   "field3"="$.field3",
>   "field4"="$.field4"
> );
OK
Time taken: 0.088 seconds

hive> LOAD DATA LOCAL INPATH  '/home/hduser/pradi/test.json' INTO TABLE my_table;
Copying data from file:/home/hduser/pradi/test.json
Copying file: file:/home/hduser/pradi/test.json
Loading data to table default.my_table
OK
Time taken: 0.426 seconds

hive> select * from my_table;
OK
data1   100     more data1      123.001
Time taken: 0.17 seconds

      

I have already posted the contents of the test.json file above, so you can see that the query returns only one row:

data1   100     more data1      123.001

      


I changed the JSON file to employee.json, which contains

{"firstName": "Mike", "lastName": "Chepesky", "employeeNumber": 1840192}

I also changed the table, but it shows NULL values when I query it:

hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar

hive> create EXTERNAL table employees_json (firstName string, lastName string,        employeeNumber int )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds


hive> load data local inpath '/home/hduser/pradi/employees.json' into table     employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds


hive> select * from employees_json;
OK
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
Time taken: 0.194 seconds

      



4 answers


It's a bit difficult to tell what is happening without logs (see Getting Started in case of doubt). Just a thought: can you try whether it works with WITH SERDEPROPERTIES, like this:

CREATE EXTERNAL TABLE my_table (field1 string, field2 int, 
                                field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
  "field1"="$.field1",
  "field2"="$.field2",
  "field3"="$.field3",
  "field4"="$.field4" 
);

      

There is also a fork you can try from ThinkBigAnalytics.



UPDATE: It turns out the input in Test.json is invalid JSON, hence the records are dropped.

See the answer at fooobar.com/questions/1425186/... for details.
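A quick way to check the input before loading it is to parse every line on its own. This is only a minimal sketch in Python: the function name `check_json_lines` is made up for illustration, and it assumes the SerDe expects exactly one JSON object per line. It also reports blank lines, although whether blank lines are what breaks this particular file is an assumption, since the answer above does not say which part of the input was invalid.

```python
import json


def check_json_lines(text):
    """Return (line_number, problem) pairs for lines that are not a JSON object.

    Hypothetical helper: assumes the table expects one JSON object per line.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # Blank lines are reported so they can be inspected or removed.
            problems.append((number, "blank line"))
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems


sample = (
    '{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}\n'
    '\n'
    '{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}\n'
)
for number, problem in check_json_lines(sample):
    print(f"line {number}: {problem}")  # prints: line 2: blank line
```

To check a real file, pass its contents, for example `check_json_lines(open("Test.json").read())`.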



  • First of all, validate your JSON file at http://jsonlint.com/. After that, put one record per line and remove the [ ]. The comma at the end of each line is required.

    [{"field1": "data1", "field2": 100, "field3": "more data1", "field4": 123.001}, {"field1": "data2", "field2": 200, "field3": "more data2", "field4": 123.002}, {"field1": "data3", "field2": 300, "field3": "more data3", "field4": 123.003}, {"field1": "data4", "field2": 400, "field3": "more data4", "field4": 123.004}]

  • In my test, I added hive-json-serde-0.2.jar from the Hadoop cluster; I think hive-json-serde-0.1.jar should work too.

    ADD JAR hive-json-serde-0.2.jar;

  • Create a table

    CREATE TABLE my_table (field1 string, field2 int, field3 string, field4 double) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

  • Load the JSON data file. Here I am loading it from the Hadoop cluster, not from the local file system.

    LOAD DATA INPATH 'Test2.json' INTO TABLE my_table;
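The "one record per line, no [ ]" step above can be scripted instead of edited by hand. The following is a rough sketch in Python, not part of the original test: `array_to_lines` and its `trailing_comma` flag are hypothetical names. The flag exists only because this answer says a comma at the end of each line is required; whether a given SerDe version actually needs it is not verified here, so it defaults to off.

```python
import json


def array_to_lines(array_text, trailing_comma=False):
    """Turn a top-level JSON array into text with one record per line.

    Hypothetical helper. trailing_comma=True appends "," to every line,
    following the claim in the answer above (an unverified assumption).
    """
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    suffix = "," if trailing_comma else ""
    # Compact separators keep each record on a single short line.
    lines = [json.dumps(r, separators=(",", ":")) + suffix for r in records]
    return "\n".join(lines) + "\n"


source = '[{"field1": "data1", "field2": 100}, {"field1": "data2", "field2": 200}]'
print(array_to_lines(source), end="")
```

The result can be written to a file such as Test2.json and loaded with the LOAD DATA statement shown above.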





To parse JSON according to the cwiki/Confluence documentation, we need to follow several steps:

  • Load hive-hcatalog-core.jar:

  • hive> add jar /path/hive-hcatalog-core.jar;

  • create table tablename (colname1 datatype, .....) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as ORCFILE;

  • The column names in the table definition and the key names in test.json must be the same; otherwise the query shows NULL values. Hope it will be helpful.
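To see whether the column names and the JSON keys really line up, something like the sketch below can help. It is written in Python with a hypothetical helper name (`compare_columns`) and simply compares a list of column names with the keys of one record. Names that differ only in letter case are listed separately, because whether such a difference leads to NULL values depends on the SerDe in use, which is an assumption this sketch does not settle.

```python
import json


def compare_columns(columns, json_line):
    """Classify table columns against the keys of one JSON record.

    Hypothetical helper: "exact" means identical names, "case_only" means
    the names differ only in letter case, "missing" means no such key.
    """
    keys = set(json.loads(json_line))
    lowered = {k.lower() for k in keys}
    report = {"exact": [], "case_only": [], "missing": []}
    for column in columns:
        if column in keys:
            report["exact"].append(column)
        elif column.lower() in lowered:
            report["case_only"].append(column)
        else:
            report["missing"].append(column)
    return report


record = '{"firstName": "Mike", "lastName": "Chepesky", "employeeNumber": 1840192}'
print(compare_columns(["firstname", "lastname", "employeenumber"], record))
```

For the employee record from the question, all three lowercase column names end up under "case_only".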



I have a question: what if we have the following data?

{"Is_student": true, "color": ["red", "yellow", "orange"], "jump": 19.5, "nick": "poop"}

{"Is_student": false, "color": ["red", "yellow", "black"], "jump": 129.5, "nick": "stars"}

{"Is_student": false, "color": ["pink", "gold"], "jump": 222.56, "nick": "Fiat"}

One record has 3 colors and another has 2 colors. How can we define the table schema using the JSON SerDe?

I tried it, but it gives output like this:

json2.is_student - json2.color - json2.jump - json2.nick

true ["red", "yellow", "orange"] 19.5 poop

false ["red", "yellow", "black"] 129.5 stars

false ["pink", "gold"] 222.56

where the schema looks like this:

CREATE TABLE json2(
    Is_student boolean,
    color array<string>,
    jump double,
    nick string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile;

      
