Uploading large CSVs to Hadoop via Hue only stores a 64MB block

I'm using the Cloudera QuickStart VM 5.1.0-1.

I am trying to upload my 3GB CSV to Hadoop via Hue. What I have tried so far:
- Upload the CSV to HDFS, into a folder called datasets located at /user/hive/datasets
- Use the Metastore Manager to load it into the default database

Everything appears to work: the table loads with the correct columns. The problem shows up when I query the table from Impala. Running:

SHOW TABLE STATS new_table;

I see the size is only 64MB instead of the actual CSV size, which should be 3GB.

Also, if I run a count(*) via Impala, the row count is only 70,000 versus the actual 7 million.

Any help would be deeply appreciated.

Thanks in advance.

+3




3 answers


I had the same problem. This is an issue with the way Hue imports the file via the web interface, which has a 64MB limit.

I imported large datasets by using the Hive CLI with the -f flag pointing at a text file containing the DDL code.

Example:



hive -f beer_data_loader.hql



beer_data_loader.hql:

  CREATE DATABASE IF NOT EXISTS beer  
  COMMENT "Beer Advocate Database";


CREATE TABLE IF NOT EXISTS beer.beeradvocate_raw(  
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
    )
 COMMENT "Beer Advocate Data Raw"
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
 STORED AS parquet;


CREATE EXTERNAL TABLE IF NOT EXISTS beer.beeradvocate_temp(  
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
    )
 COMMENT "Beer Advocate External Loading Table"
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
 LOCATION '/user/name/beeradvocate.data';


INSERT OVERWRITE TABLE beer.beeradvocate_raw SELECT * FROM beer.beeradvocate_temp;  
DROP TABLE beer.beeradvocate_temp; 


+4




Looks like a bug in Hue, but I found a workaround. The file gets truncated if you select the "Import data from file" checkbox when creating the table. Leave it unchecked so that an empty table is created. Then select the newly created table in the Metastore Manager and use the "Import Data" option from the Actions menu to populate it. That should load all of the rows.
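
Another way to avoid the Hue upload path entirely is to copy the full-size file into HDFS from the shell (e.g. with hdfs dfs -put) and then load it with Hive. A sketch, assuming the file is already in HDFS and using placeholder path and table names:

```sql
-- Load a file that already sits in HDFS at its full 3GB size into an
-- existing Hive table. LOAD DATA INPATH moves the file from the given
-- HDFS location into the table's warehouse directory, so no data passes
-- through Hue's 64MB-limited importer.
LOAD DATA INPATH '/user/hive/datasets/mydata.csv'
OVERWRITE INTO TABLE new_table;
```

After this, a count(*) from Impala (after INVALIDATE METADATA) should reflect the full row count.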



+2




This error (HUE-2501) occurred when importing a file larger than 64MB with a header row.

Peter's workaround is good; the bug has been fixed in Hue 3.8, shipped as of CDH 5.3.2.

0

