Reading large CSV files in Java

I am trying to read a CSV file with 1,000,000 lines in Java. I am using the OpenCSV library, and it works fine on a smaller file of 30,000 lines, finishing in less than half a second. But when I try to read the million-line file, it never finishes.

To confirm that it really hangs, I did my own version of a binary search: I first tried to read 500,000 lines, then 250,000, and so on. I found that it reads 145k lines easily (in 0.5-0.7 seconds), but reading 150k never finishes.

I searched SO thoroughly and found several solutions that I applied in my code, like using BufferedReader, BufferedInputStream, etc., but none of them solved it. It still hangs somewhere between lines 145k and 150k.

This is the relevant part of my code (replacing 150000 with 145000 makes the program finish in under 1 second):

try {
    // BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream("myFile.csv"));
    CSVReader csvReader = new CSVReader(new InputStreamReader(
            new BufferedInputStream(new FileInputStream("myFile.csv"), 8192 * 32)));
    try {
        int count = 0;
        String[] line;
        long timeStart = System.nanoTime();
        while ((line = csvReader.readNext()) != null) {
            count++;
            if (count >= 150000) {
                break;
            }
        }
        long timeEnd = System.nanoTime();
        System.out.println("Count: " + count);
        System.out.println("Time: " + (timeEnd - timeStart) * 1.0 / 1000000000 + " sec");
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (FileNotFoundException e) {
    System.out.println("File not found");
}

As you can see, I tried setting a large buffer size. I've also tried various combinations of Readers, InputStreams, etc., and nothing really changed.

I am wondering how I can do this. Is there a way to read, say, 100k lines at a time and then continue reading the next 100k?

Also, I'm open to any other solution that doesn't necessarily involve the OpenCSV library. I just used it for simplicity to parse the CSV file.


2 answers


I just looked at the OpenCSV implementation, and I don't see anything that would explain this behavior merely because the file is large and contains many records.

But OpenCSV can handle multi-line data; from its website:

Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines).

I think in your case there is a record, somewhere around the 150,000th, that contains a stray quote. The default quote character is ". It could be a record like:



value,value,"badvalue,value
value,value,value,value

      

In this case, the parser used by OpenCSV is put into a pending state, meaning the record being read continues on the next line. The call to CSVReader.readNext() then reads as many lines as needed to complete that CSV record. If there is no matching closing quote character, it will read and read and read until it runs out of input or hits some other error.
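As an illustration of that pending-state behavior, here is a minimal, self-contained sketch (assuming the com.opencsv package of opencsv 3.x or later; older versions use au.com.bytecode.opencsv) that feeds a record with an unbalanced quote to CSVReader:

import com.opencsv.CSVReader;
import java.io.StringReader;

public class PendingQuoteDemo {
    public static void main(String[] args) throws Exception {
        // The second line opens a quote that is never closed, so the parser
        // treats everything after it as part of one quoted field.
        String csv = "a,b,c\n"
                   + "x,\"bad,y\n"
                   + "p,q,r\n";
        try (CSVReader reader = new CSVReader(new StringReader(csv))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println(row.length + " fields: " + String.join(" | ", row));
            }
        }
    }
}

The first readNext() returns three fields; the second keeps consuming lines while looking for the closing quote. On this small string it ends quickly (merging the remaining lines into one record, or erroring at end of input, depending on the version); on a million-line file it looks like a hang.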

To find the record, you can read the file the same way you do now, count the records, and print the current count as you go. That gives you the number of the last valid record before it freezes/hangs as it does now.
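A minimal sketch of that counting step (using the same hypothetical file name as in the question):

import com.opencsv.CSVReader;
import java.io.FileReader;

public class FindLastGoodRecord {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("myFile.csv"))) {
            int count = 0;
            while (reader.readNext() != null) {
                count++;
                System.out.println("Read record " + count);  // the last number printed
            }                                                // is the last good record
        }
    }
}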

Then I would write a new program that just reads the file line by line (not using CSVParser, just plain lines) and skips the number of lines you know are good. Then print the next 10 or so lines, and you have data to analyze.
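For example (a sketch; lastGoodRecord is the hypothetical number reported by the counting run, and this treats records as physical lines, which is close enough for locating the bad spot):

import java.io.BufferedReader;
import java.io.FileReader;

public class DumpSuspectLines {
    public static void main(String[] args) throws Exception {
        int lastGoodRecord = 149000;  // hypothetical value from the counting run
        try (BufferedReader reader = new BufferedReader(new FileReader("myFile.csv"))) {
            for (int i = 0; i < lastGoodRecord; i++) {
                reader.readLine();                      // skip the known-good lines
            }
            for (int i = 0; i < 10; i++) {
                System.out.println(reader.readLine());  // dump suspects for analysis
            }
        }
    }
}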



Maybe the problem is not how many lines are in the CSV file, but their content. Perhaps there is some data in the lines between 145k and 150k that causes your application to never finish.



You can test this by copying the first 145k lines of your file into a new CSV file, repeating them until the new file reaches 1m lines. If your application can handle this new file, the problem is in the data, not in the row count.
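A quick sketch of building such a test file (hypothetical file names; it collects the first 145k known-good lines and repeats them until the copy reaches one million lines):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class BuildTestFile {
    public static void main(String[] args) throws Exception {
        List<String> goodLines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("myFile.csv"))) {
            for (int i = 0; i < 145000; i++) {
                goodLines.add(in.readLine());   // collect the known-good lines
            }
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("testFile.csv"))) {
            for (int written = 0; written < 1000000; written++) {
                out.println(goodLines.get(written % goodLines.size()));  // repeat up to 1m lines
            }
        }
    }
}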







