Splitting a .gz file at specified file sizes in Java using a byte[] array

Question

I wrote code to split a .gz file into pieces of a user-defined size using a byte[] array. However, the for loop does not read or write the last part of the parent file, which is smaller than the array. Could you help me with this?

package com.bitsighttech.collection.packaging;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;

public class FileSplitterBytewise
{
private static Logger logger = Logger.getLogger(FileSplitterBytewise.class);
private static final long KB = 1024;
private static final long MB = KB * KB;

private FileInputStream fis;
private FileOutputStream fos;   
private DataInputStream dis;
private DataOutputStream dos;

public boolean split(File inputFile, String splitSize)  
{  

    int expectedNoOfFiles =0;       

    try  
    {  
        double parentFileSizeInB = inputFile.length();

        Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
        Matcher m = p.matcher(splitSize);
        m.matches();

        String FileSizeString = m.group(1);
        String unit = m.group(2);
        double FileSizeInMB = 0;

        try {
            if (unit.toLowerCase().equals("kb"))
                FileSizeInMB = Double.parseDouble(FileSizeString) / KB;         
            else if (unit.toLowerCase().equals("mb"))
                FileSizeInMB = Double.parseDouble(FileSizeString);          
            else if (unit.toLowerCase().equals("gb"))
                FileSizeInMB = Double.parseDouble(FileSizeString) * KB;         
        } catch (NumberFormatException e) {
            logger.error("invalid number [" + FileSizeInMB  + "] for expected file size");
        }

        double fileSize = FileSizeInMB * MB;
        int fileSizeInByte = (int) Math.ceil(fileSize);
        double noOFFiles = parentFileSizeInB/fileSizeInByte;            
        expectedNoOfFiles =  (int) Math.ceil(noOFFiles);                    
        int splinterCount = 1;
        fis = new FileInputStream(inputFile);
        dis = new DataInputStream(new BufferedInputStream(fis));
        fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
        dos = new DataOutputStream(new BufferedOutputStream(fos));  

        byte[] data = new byte[(int) fileSizeInByte];

        while ( splinterCount <= expectedNoOfFiles ) {                  

            int i;          
            for(i = 0; i<data.length-1; i++)
            {
                data[i] = dis.readByte();
            }               
            dos.write(data);
            splinterCount ++; 
            }
    }       
    catch(Exception e)  
    {  
        logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
        return false;
    }  


    logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
    return true;
}    

public static void main(String args[]) 
{
    String FilePath1 = "F:\\az.gz";     
    File  file= new File(FilePath1);
    FileSplitterBytewise fileSplitter = new FileSplitterBytewise();
    String splitlen = "1 MB";

    fileSplitter.split(file, splitlen);

}
  }

java split gzip

manil 14 Mar 12 at 9:20 am

2 answers

sarnold · Answer 1 · 2012-03-14T09:38:03+0000

I suggest writing more methods. You have a complicated block of string-processing code in split(); it would be best to make a method that takes the user-friendly string as input and returns the number you are looking for. (It would also make this part of the routine much easier to test; right now you cannot test it at all.)

Once it is split out and you write test cases, you will likely find that the error message you generate when the string does not contain kb, mb or gb is extremely confusing: it blames the number 0 for the error, rather than saying that the string does not have the expected units.
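
As a rough sketch of that suggestion, the size parsing could live in its own small method that either returns a byte count or fails with a message naming the real problem. The class and method names (SizeParser, parseSize) are hypothetical, and accepting optional whitespace between the number and the unit is an assumption, not something the question specifies.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not code from the question.
final class SizeParser {
    private static final long KB = 1024L;
    // Digits, optional whitespace (an assumption), then KB, MB or GB in any case.
    private static final Pattern SIZE = Pattern.compile("(\\d+)\\s*([KkMmGg][Bb])");

    private SizeParser() {}

    /** Converts a string such as "1 MB" into a number of bytes. */
    static long parseSize(String text) {
        if (text == null) {
            throw new IllegalArgumentException("size must not be null");
        }
        Matcher m = SIZE.matcher(text.trim());
        if (!m.matches()) {
            // Report the offending input and the expected format, not a default value.
            throw new IllegalArgumentException(
                "Invalid size [" + text + "]: expected a whole number followed by KB, MB or GB");
        }
        // Long.parseLong throws NumberFormatException (an IllegalArgumentException)
        // if the digit run does not fit in a long.
        long value = Long.parseLong(m.group(1));
        long multiplier;
        switch (m.group(2).toLowerCase(Locale.ROOT)) {
            case "kb": multiplier = KB; break;
            case "mb": multiplier = KB * KB; break;
            default:   multiplier = KB * KB * KB; break; // "gb" is the only remaining match
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(value, multiplier);
    }
}
```

With this in place, SizeParser.parseSize("1 MB") yields 1048576, while an input such as "1 G" is rejected with a message that mentions the missing unit.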

Using int to store the file size means your program will never handle files larger than two gigabytes. You should stick with long or double. (double feels wrong for something that is actually restricted to integer values, but I can't quickly think of why it would fail.)
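
A minimal sketch of the part-count arithmetic done entirely in long, which avoids both the int limit and floating point (a double represents integers exactly only up to 2^53). The names PartCount and partsNeeded are hypothetical.

```java
// Hypothetical helper for illustration; not code from the question.
final class PartCount {
    private PartCount() {}

    /** Returns how many parts of partBytes are needed to hold totalBytes. */
    static long partsNeeded(long totalBytes, long partBytes) {
        if (totalBytes < 0 || partBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative, part size positive");
        }
        // Integer ceiling division: one extra part for any leftover bytes.
        return totalBytes / partBytes + (totalBytes % partBytes == 0 ? 0 : 1);
    }
}
```

For example, a 2,500,000-byte file split into 1 MB (1,048,576-byte) parts needs 3 parts, the last one being shorter than the others.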

byte[] data = new byte[(int) fileSizeInByte];

Allocating a few gigabytes like this will wreck your performance: it is a potentially huge memory allocation, and one whose size might be considered under an adversary's control (depending on your security model, this may or may not be a big deal). Don't try to work with the whole file in one piece.

You seem to be reading and writing files one byte at a time. This guarantees very slow performance. While doing some performance testing for another question earlier, I found that my machine could read (from a hot cache) 2000x faster using 131KB blocks than using two-byte blocks. Single-byte blocks will be even worse, and a cold cache will be significantly worse for such small sizes.
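
To illustrate block-wise I/O, here is a small sketch that copies a stream through a fixed-size buffer instead of calling readByte() in a loop. The class name BlockCopy is hypothetical, and the 128 KiB (131,072-byte) buffer is an assumption chosen to echo the figure above, not a measured optimum.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not code from the question.
final class BlockCopy {
    // 128 KiB = 131,072 bytes; an assumed size, tune it for your workload.
    private static final int BUFFER_SIZE = 128 * 1024;

    private BlockCopy() {}

    /** Copies everything from in to out in large blocks and returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int n;
        // read() reports how many bytes it actually delivered, or -1 at end of stream.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the bytes that were read
            total += n;
        }
        return total;
    }
}
```

The buffer stays the same size no matter how large the file or the requested part size is, so memory use does not depend on user input.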

        fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);

You only ever open one file output stream. Your post probably should have said "only the first part works", because it looks like you haven't yet tried it on a file that produces three or more pieces.

catch(Exception e)

At this point you have the opportunity to detect errors in your program, and you ignore them completely. Sure, you log an error message, but you cannot debug your program with the data you log. You should log, at a minimum, the exception type and its message, and possibly even the full stack trace. This combination of data is extremely useful when trying to solve problems, especially a few months from now when you have forgotten the details of how the code works.
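
One way this could look, sketched with java.util.logging so that the example needs no external library; the question itself uses log4j, where logger.error(message, e) plays a similar role. The names SplitErrorLogging, describe and logFailure are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not code from the question.
final class SplitErrorLogging {
    private static final Logger LOG = Logger.getLogger(SplitErrorLogging.class.getName());

    private SplitErrorLogging() {}

    /** Builds a message that names the exception type and its message. */
    static String describe(String fileName, Exception e) {
        return "Unable to split the file [" + fileName + "]: "
            + e.getClass().getName() + ": " + e.getMessage();
    }

    /** Logs the failure together with the exception itself. */
    static void logFailure(String fileName, Exception e) {
        // Passing the exception as the last argument lets the handler print the stack trace.
        LOG.log(Level.SEVERE, describe(fileName, e), e);
    }
}
```

Calling logFailure(inputFile.getName(), e) from the catch block would then record what failed and where, instead of only the fact that something failed.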

Peter Lawrey · Answer 2 · 2012-03-14T10:25:33+0000

Could you help me with this?

I would make these changes:

drop the DataInputStream / DataOutputStream wrappers; you don't need them.
use in.read(data) to read a whole block instead of one byte at a time. Reading one byte at a time is much slower!
read the whole array: your loop stops at data.length - 1, so you read one byte fewer than the array holds.
stop when you reach the end of the file; its length may not be a multiple of the part size.
write only as much as you read: if your blocks are 1 MB and 100 KB remain, you should read and write only 100 KB at the end.
close your files when you are done, especially since you use buffered streams.
your "split" writes everything to the same file (so nothing is actually split). You need to create, write and close the output files inside the loop.
don't use fields where you could (and should) use local variables.
keep the length as a long, in bytes.
the pattern ignores invalid input, and it does not match the units you test for. For example, your pattern allows 1 G or 1 k, but they will be processed as 1 MB.
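
Taken together, the points above could be sketched roughly as follows. This is an illustration under stated assumptions rather than a drop-in replacement: the class name ByteSplitter, the outputDir parameter, the 64 KiB buffer and the choice to prefix part names with the input file name are all hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch for illustration; not code from the question.
final class ByteSplitter {
    private static final int BUFFER_SIZE = 64 * 1024; // assumed buffer size

    private ByteSplitter() {}

    /** Splits input into files of at most partBytes bytes and returns the part count. */
    static long split(File input, File outputDir, long partBytes) throws IOException {
        if (partBytes <= 0) {
            throw new IllegalArgumentException("partBytes must be positive");
        }
        long total = input.length();
        long parts = total / partBytes + (total % partBytes == 0 ? 0 : 1);
        byte[] buffer = new byte[BUFFER_SIZE];
        // try-with-resources closes (and flushes) every stream, even on failure.
        try (InputStream in = new BufferedInputStream(new FileInputStream(input))) {
            for (long part = 1; part <= parts; part++) {
                File target = new File(outputDir,
                    input.getName() + "_part_" + part + "_of_" + parts);
                // A new output file is created, written and closed for each part.
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
                    long remaining = partBytes;
                    while (remaining > 0) {
                        int want = (int) Math.min(buffer.length, remaining);
                        int n = in.read(buffer, 0, want);
                        if (n == -1) {
                            break; // end of input: the last part is simply shorter
                        }
                        out.write(buffer, 0, n); // write only what was read
                        remaining -= n;
                    }
                }
            }
        }
        return parts;
    }
}
```

The parts are raw byte ranges of the original file, so they need to be concatenated in order before the result can be decompressed as a .gz file again.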
