Ruby CSV duplicate row parsing

Question

I have some CSV data that I need to process, and I am having trouble finding a way to match duplicates.

The data looks something like this:

line    id    name   item_1    item_2    item_3    item_4
1      251   john    foo       foo       foo       foo
2      251   john    foo       bar       bar       bar
3      251   john    foo       bar       baz       baz
4      251   john    foo       bar       baz       pat

Lines 1-3 are duplicates in this case.

line    id    name   item_1    item_2    item_3    item_4
5      347   bill    foo       foo       foo       foo
6      347   bill    foo       bar       bar       bar

In this case only line 5 is a duplicate.

line    id    name   item_1    item_2    item_3    item_4
7      251   mary    foo       foo       foo       foo
8      251   mary    foo       bar       bar       bar
9      251   mary    foo       bar       baz       baz

Here lines 7 and 8 are duplicates.

So basically, if a later line adds a new "item" for the same person, the previous line is a duplicate. I want to end up with one row for each person, no matter how many items they have.

I am using Ruby 1.9.3 like this:

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')

CSV.open("output-file", "wb") do |csv|
    #write the first row (header) to the output file
    csv << people[0]
    people.each do |p|
        ... logic to test for dupe ...
        csv << p.unique
    end
end

ruby parsing csv ruby-1.9

sysconfig 07 Mar 12 at 13:24

3 answers

It looks like you are trying to get a list of unique items associated with each person, where the person is identified by id and name. If that's correct, you can do something like this:

peoplehash = {}
maxitems = 0
people.each do |id, name, *items|
    # concat appends in place; `(x ||= []) += items` is a syntax error in Ruby
    (peoplehash[[id, name]] ||= []).concat(items)
end
peoplehash.keys.each do |k|
    peoplehash[k].uniq!
    peoplehash[k].sort!
    maxitems = [maxitems, peoplehash[k].size].max
end

This will give you a structure like:

{
    ["251", "john"] => ["bar", "baz", "foo", "pat"],
    ["347", "bill"] => ["bar", "foo"]
}

and maxitems will tell you the length of the longest items array, which you can use for whatever you need.
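
As a self-contained illustration of the grouping idea above, here is a minimal sketch. It assumes the rows arrive as arrays of strings (as CSV.read returns them), that the header row has already been removed, and that the columns are id, name, then items. The names items_by_person, max_items and output_rows are hypothetical, and padding rows to a common width is only one possible use of the maximum.

```ruby
# Rows as CSV.read would return them (all strings), header already removed.
# Assumption: columns are id, name, then any number of items.
rows = [
  ["251", "john", "foo", "foo", "foo", "foo"],
  ["251", "john", "foo", "bar", "bar", "bar"],
  ["251", "john", "foo", "bar", "baz", "baz"],
  ["251", "john", "foo", "bar", "baz", "pat"],
  ["347", "bill", "foo", "foo", "foo", "foo"],
  ["347", "bill", "foo", "bar", "bar", "bar"]
]

# Collect every item seen for each [id, name] pair.
items_by_person = Hash.new { |hash, key| hash[key] = [] }
rows.each do |id, name, *items|
  items_by_person[[id, name]].concat(items)
end

# Drop repeated items and sort what is left.
items_by_person.each_value do |items|
  items.uniq!
  items.sort!
end

max_items = items_by_person.values.map(&:size).max

# One possible use of max_items: pad every row with nil to the same width.
output_rows = items_by_person.map do |(id, name), items|
  [id, name] + items + [nil] * (max_items - items.size)
end
```

Each element of output_rows can then be written with csv << row inside a CSV.open block.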

glenn mcdonald 07 Mar 12 at 18:15

You can use 'uniq'

irb(main):009:0> row= ['ruby', 'rails', 'gem', 'ruby']
irb(main):010:0> row.uniq
=> ["ruby", "rails", "gem"]
or 

row.uniq!
=> ["ruby", "rails", "gem"]

irb(main):017:0> row
=> ["ruby", "rails", "gem"]

irb(main):018:0> row = [1,      251,   'john',    'foo',       'foo',       'foo',       'foo']
=> [1, 251, "john", "foo", "foo", "foo", "foo"]
irb(main):019:0> row.uniq
=> [1, 251, "john", "foo"]
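
One caveat with calling uniq on a whole row: it also compares the id and name columns against the items, so an item that happens to equal one of them is dropped. A minimal sketch of applying uniq to the item columns only, assuming the first two columns are id and name (the sample rows here are made up for illustration):

```ruby
row = ["251", "john", "foo", "bar", "bar", "bar"]

# Split off the identifying columns so only the items are de-duplicated.
id, name, *items = row
deduped = [id, name] + items.uniq

# Hypothetical row where an item has the same value as the name column.
tricky = ["251", "john", "john", "foo"]
whole_row_uniq = tricky.uniq                      # the item "john" is lost
tricky_id, tricky_name, *tricky_items = tricky
items_only_uniq = [tricky_id, tricky_name] + tricky_items.uniq
```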

suvankar 07 Mar 12 at 14:07

Derek harmel · Accepted Answer · 2012-03-07T14:12:39+0000

First, there is a small bug in your code. Instead of:

csv << people[0]

you will need to do the following if you don't want to change the loop code (otherwise the header row stays in the array and the loop processes it a second time):

csv << people.shift
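
The difference between the two calls can be seen on a small in-memory array. This is a sketch with made-up rows standing in for what CSV.read returns:

```ruby
people = [
  ["id", "name", "item_1"],   # header row
  ["251", "john", "foo"],
  ["347", "bill", "foo"]
]

header = people[0]            # reads the header but leaves it in the array
size_after_index = people.size

header = people.shift         # removes the header from the array and returns it
size_after_shift = people.size
```

After people[0] the array still holds three rows, so people.each would visit the header as well; after people.shift only the two data rows remain.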

Now, the following solution will only add the first occurrence of the person, discarding any subsequent duplicates as defined by id (as I am assuming the IDs are unique).

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
ids = [] # or you could use a Set

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # If the id of the current record is already in the ids array, we've
    # already seen this person
    next if ids.include?(p[0])

    # Add the new id to the front of the ids array. Since in the example you
    # gave the duplicate records directly follow the original, this will be
    # slightly faster than adding the id to the end; above we still check
    # the entire array to be safe
    ids.unshift p[0]
    csv << p
  end
end
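
The comment above mentions that a Set could be used instead of an array. A minimal sketch of that variant follows; it parses an inline string with CSV.parse and builds the output with CSV.generate so that it runs without any files. The sample data and the name seen_ids are assumptions, and the id is assumed to be the first column as in the code above.

```ruby
require 'csv'
require 'set'

# Inline sample standing in for input-file.csv (hypothetical data).
input = <<CSV_DATA
id,name,item_1,item_2
251,john,foo,foo
251,john,foo,bar
347,bill,foo,foo
347,bill,foo,bar
CSV_DATA

people = CSV.parse(input)
seen_ids = Set.new

output = CSV.generate do |csv|
  csv << people.shift              # header row
  people.each do |person|
    # Set#add? returns nil when the id is already in the set,
    # so every row after the first for a given id is skipped.
    next unless seen_ids.add?(person[0])
    csv << person
  end
end
```

A Set lookup does not scan all stored ids the way Array#include? does, which may matter for large inputs.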

Note that there is a better solution if your duplicate records always directly follow the original: you only need to remember the ID of the last original record and compare it with the current record's ID, rather than checking for inclusion in the whole array. The difference may be negligible if your input file does not contain many entries.

It would look like this:

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
previous_id = nil

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    next if p[0] == previous_id
    previous_id = p[0]
    csv << p
  end
end
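
Both versions above keep the first row seen for each id. The examples in the question suggest that the last row for a person is the one carrying all of the items, and the sample data reuses id 251 for both john and mary. Under those assumptions (which are a reading of the question, not something the answer states), a sketch that keeps the last row per id and name pair could look like this. The name last_row_by_person is hypothetical, and the rows are written out as arrays of strings with the header already removed.

```ruby
rows = [
  ["251", "john", "foo", "foo", "foo", "foo"],
  ["251", "john", "foo", "bar", "bar", "bar"],
  ["251", "john", "foo", "bar", "baz", "baz"],
  ["251", "john", "foo", "bar", "baz", "pat"],
  ["347", "bill", "foo", "foo", "foo", "foo"],
  ["347", "bill", "foo", "bar", "bar", "bar"],
  ["251", "mary", "foo", "foo", "foo", "foo"],
  ["251", "mary", "foo", "bar", "bar", "bar"],
  ["251", "mary", "foo", "bar", "baz", "baz"]
]

last_row_by_person = {}
rows.each do |row|
  # Key on id and name together; a later row for the same person
  # replaces the earlier one.
  last_row_by_person[[row[0], row[1]]] = row
end

# Hashes keep insertion order (Ruby 1.9+), so people stay in the
# order in which they first appeared.
kept = last_row_by_person.values
```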
