Find duplicates in an array of hashes on specific keys
I have an array of hashes (parsed from CSV rows, actually) and I need to find and store all the rows that share the same values for two specific keys (user, section). Here's some sample data:
[
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "exec", section: 456 },
  { user: 3, role: "staff", section: 789 }
]
So I would need to return an array containing only the rows where the same user/section pair appears more than once:
[
  { user: 1, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 2, role: "exec", section: 456 }
]
The double loop solution I am trying is as follows:
duplicates = []
enrollments.each_with_index do |a, ai|
  enrollments.each_with_index do |b, bi|
    next if ai == bi
    # columns 2 and 6 of the raw CSV row hold the user and section values
    duplicates << b if a[2] == b[2] && a[6] == b[6]
  end
end
but since the CSV is 145k lines it takes forever.
How can I get the output that I need more efficiently?
In terms of efficiency, you can try the following:
grouped = csv_arr.group_by{|row| [row[:user],row[:section]]}
filtered = grouped.values.select { |a| a.size > 1 }.flatten
The first statement groups the entries by the :user and :section keys. Result:
{[1, 123]=>[{:user=>1, :role=>"staff", :section=>123}, {:user=>1, :role=>"exec", :section=>123}],
[2, 456]=>[{:user=>2, :role=>"staff", :section=>456}, {:user=>2, :role=>"exec", :section=>456}],
[3, 123]=>[{:user=>3, :role=>"staff", :section=>123}],
[3, 789]=>[{:user=>3, :role=>"staff", :section=>789}]}
The second statement selects only the values of groups with more than one member, then flattens the result to give you:
[{:user=>1, :role=>"staff", :section=>123},
{:user=>1, :role=>"exec", :section=>123},
{:user=>2, :role=>"staff", :section=>456},
{:user=>2, :role=>"exec", :section=>456}]
This should improve your speed; memory is the open question. I can't say what effect a large input will have, since that depends on your machine, its resources, and the file size.
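Putting the two lines together, a minimal runnable sketch using the sample data from the question:

```ruby
# Group-by approach from above, run against the question's sample data.
enrollments = [
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec",  section: 123 },
  { user: 2, role: "exec",  section: 456 },
  { user: 3, role: "staff", section: 789 }
]

# Group rows by the [user, section] pair, keep only groups with more
# than one member, and flatten back into a single array of rows.
grouped    = enrollments.group_by { |row| [row[:user], row[:section]] }
duplicates = grouped.values.select { |rows| rows.size > 1 }.flatten

pp duplicates
```

This does a single pass to build the groups plus a pass over the (much smaller) set of groups, instead of comparing every row against every other row.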
You don't need a double loop to do this check in memory. You can keep an array of the unique values seen so far and check each new CSV row against it:
require 'csv'

found = []
unique_enrollments = []

CSV.foreach('/path/to/csv') do |row|
  # do whatever you're doing to parse this row into the hash you show in your question:
  # => { user: 1, role: "staff", section: 123 }
  # you might have to do `next if row.header_row?` if the first row is the header
  enrollment = parse_row_into_enrollment_hash(row)
  unique_tuple = [enrollment[:user], enrollment[:section]]
  unless found.include? unique_tuple
    found << unique_tuple
    unique_enrollments << enrollment
  end
end
Now you have unique_enrollments. With this approach you parse the CSV line by line, so you never hold the whole file in memory, and you build a smaller array of [user, section] tuples that serves both to check uniqueness and to collect the unique rows.
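One caveat: Array#include? scans the whole found array on every row, which over 145k rows is itself quadratic. Ruby's stdlib Set gives constant-time membership checks instead. A sketch of the same dedup loop, with a small inline array standing in for the parsed CSV rows:

```ruby
require "set"

# Same single-pass dedup as above, but `found` is a Set, so the
# membership check is O(1) per row instead of a linear scan.
rows = [
  { user: 1, role: "staff", section: 123 },
  { user: 1, role: "exec",  section: 123 },
  { user: 2, role: "staff", section: 456 }
]

found = Set.new
unique_enrollments = []

rows.each do |enrollment|
  unique_tuple = [enrollment[:user], enrollment[:section]]
  # Set#add? returns nil if the element was already present
  if found.add?(unique_tuple)
    unique_enrollments << enrollment
  end
end
```

unique_enrollments now holds the first row seen for each [user, section] pair.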
You can optimize this even further by not storing unique_enrollments in a large array, and instead building your model and saving it straight to the db:
unless found.include? unique_tuple
  found << unique_tuple
  Enrollment.create enrollment
end
With that setup you save memory by not keeping a large array of enrollments. The drawback is that if something blows up mid-run, you can't roll back. If instead you keep the unique_enrollments array and save at the end, you can do:
Enrollment.transaction do
  unique_enrollments.each &:save!
end
And now you can roll back if any of those saves blows up. Also, wrapping multiple db calls in one transaction is much faster. I would go with this approach.
Edit: using the unique_enrollments array, you can iterate over it at the end and write a new CSV:
CSV.open('path/to/new/csv', 'w') do |csv|
  csv << ['user', 'role', 'section'] # write the header
  unique_enrollments.each do |enrollment|
    csv << enrollment.values # just the values, not the keys
  end
end
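Note that this answer deduplicates rows, while the question asked for the rows whose user/section pair appears more than once. A two-pass counting approach answers that directly while still reading row by row; in this sketch a CSV string stands in for the file (with a real file you would call CSV.foreach twice instead of CSV.parse):

```ruby
require "csv"

# Two-pass sketch: count each [user, section] pair, then keep only
# the rows whose pair occurred more than once.
data = <<~ROWS
  user,role,section
  1,staff,123
  2,staff,456
  3,staff,123
  1,exec,123
  2,exec,456
  3,staff,789
ROWS

# Pass 1: tally the pairs. Only the counts hash is held in memory.
counts = Hash.new(0)
CSV.parse(data, headers: true) do |row|
  counts[[row["user"], row["section"]]] += 1
end

# Pass 2: collect (or stream out) only the duplicated rows.
duplicates = []
CSV.parse(data, headers: true) do |row|
  duplicates << row.to_h if counts[[row["user"], row["section"]]] > 1
end
```

The memory footprint is one counter per distinct [user, section] pair, rather than the full row set.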