How to extract the contents of bz2 files into HDFS - Hadoop

I have a tar archive (about 40 GB) with many subfolders in which my data is stored. Structure: Folders -> Sub Folders -> json.bz2. TAR file:

Total size: ~40 GB
Number of inner .bz2 files (arranged in folders): 50,000
Size of one .bz2 file: ~700 KB
Size of one extracted JSON file: ~6 MB


I need to upload the JSON files to an HDFS cluster. I tried extracting the archive to a local directory, but I ran out of free space. I am now planning to copy the archive directly into HDFS and unpack it there, but I don't know whether this is a good way to solve the problem. Since I am new to Hadoop, any pointers would be helpful.
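For reference, here is a minimal sketch of the streaming alternative I am considering: read the tar member by member, decompress each .bz2 in memory, and hand the JSON bytes to an upload callback, so neither the whole archive nor the extracted files ever need local disk space. The function name `stream_tar_to_hdfs` and the injected `upload` callable are my own placeholders, not an existing API; `upload` would wrap something like WebHDFS or `hdfs dfs -put -` in practice.

```python
import bz2
import tarfile

def stream_tar_to_hdfs(src, upload):
    """Stream every .bz2 member out of the tar, decompress it in
    memory, and pass the JSON bytes to `upload(hdfs_path, data)`.

    `src` is a path or file object for the tar archive; `upload` is a
    hypothetical callable that writes the bytes to HDFS (for example
    via the WebHDFS REST API or `hdfs dfs -put -` in a subprocess).
    """
    kwargs = {"name": src} if isinstance(src, str) else {"fileobj": src}
    # "r|*" opens the archive in streaming mode: members are read
    # sequentially, so the 40 GB file is never loaded at once.
    with tarfile.open(mode="r|*", **kwargs) as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".bz2")):
                continue
            compressed = archive.extractfile(member).read()  # ~700 KB each
            data = bz2.decompress(compressed)                # ~6 MB of JSON
            # Keep the folder structure, drop the .bz2 suffix.
            upload(member.name[: -len(".bz2")], data)
```

With this approach only one compressed member (plus its decompressed JSON) is in memory at a time, which is why it avoids the local-disk problem.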
