Working with large CSV files in Ruby

I want to parse two CSV files from the MaxMind GeoIP2 database, do a column-based join, and merge the result into one output file.

I used the Ruby CSV standard library, but it is very slow. I think it is loading the entire file into memory:

require 'csv'

# Read both files fully into memory, then parse the whole strings at once.
block_file    = File.read(block_path)
block_csv     = CSV.parse(block_file, :headers => true)
location_file = File.read(location_path)
location_csv  = CSV.parse(location_file, :headers => true)

CSV.open(output_path, "wb",
         :write_headers => true,
         :headers => ["geoname_id", "Y", "Z"]) do |csv|

  # Nested-loop join on geoname_id.
  block_csv.each do |block_row|
    puts "#{block_row['geoname_id']}"

    location_csv.each do |location_row|
      if block_row['geoname_id'] == location_row['geoname_id']
        puts " match :"
        csv << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
        break
      end
    end
  end
end

      

Is there any other Ruby library that supports chunked processing?

block_csv is 800 MB and location_csv is 100 MB.



1 answer


Just use CSV.open(block_path, 'r', :headers => true).each do |line| ... end instead of File.read and CSV.parse. It will parse the file line by line.
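
For example, a minimal sketch of that streaming read, reusing block_path and the geoname_id column from the question:

require 'csv'

# Stream the block file one row at a time; nothing is held in memory
# beyond the current row.
CSV.open(block_path, 'r', :headers => true).each do |line|
  puts line['geoname_id']
end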



In the current version you explicitly tell it to read the entire file with File.read and then parse that whole string with CSV.parse, so it does exactly what you told it to: load everything into memory.
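
Streaming alone still re-scans location_csv for every block row. One way to also avoid that, sketched here as an assumption beyond what the answer covers (Y and Z are placeholder column names from the question), is to load only the smaller location file into a Hash keyed by geoname_id and stream the large block file once:

require 'csv'

# Build a lookup table from the smaller (~100 MB) location file.
locations = {}
CSV.foreach(location_path, :headers => true) do |row|
  locations[row['geoname_id']] = row
end

# Stream the large (~800 MB) block file once and join against the hash.
CSV.open(output_path, 'wb',
         :write_headers => true,
         :headers => ['geoname_id', 'Y', 'Z']) do |out|
  CSV.foreach(block_path, :headers => true) do |block_row|
    next unless locations.key?(block_row['geoname_id'])
    out << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
  end
end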







