Find duplicates in an array of hashes on specific keys
I have an array of hashes (parsed from CSV rows, actually) and I need to find and store all the rows that share the same values for two specific keys (user, section). Here's some sample data:
[
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "exec", section: 456 },
  { user: 3, role: "staff", section: 789 }
]
So I would need to return an array containing only the rows where the same user/section pair appears more than once:
[
  { user: 1, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 2, role: "exec", section: 456 }
]
The double loop solution I am trying is as follows:
duplicates = []
enrollments.each_with_index do |a, ai|
  enrollments.each_with_index do |b, bi|
    next if ai == bi
    # columns 2 and 6 of the raw CSV row hold the user and section values
    duplicates << b if a[2] == b[2] && a[6] == b[6]
  end
end
but since the CSV is 145k lines it takes forever.
How can I get the output that I need more efficiently?
In terms of efficiency, you can try the following:
grouped = csv_arr.group_by{|row| [row[:user],row[:section]]}
filtered = grouped.values.select { |a| a.size > 1 }.flatten
The first statement groups the entries by the :user and :section keys. Result:
{[1, 123]=>[{:user=>1, :role=>"staff", :section=>123}, {:user=>1, :role=>"exec", :section=>123}],
[2, 456]=>[{:user=>2, :role=>"staff", :section=>456}, {:user=>2, :role=>"exec", :section=>456}],
[3, 123]=>[{:user=>3, :role=>"staff", :section=>123}],
[3, 789]=>[{:user=>3, :role=>"staff", :section=>789}]}
The second statement selects only the values of groups with more than one member, then flattens the result to give you:
[{:user=>1, :role=>"staff", :section=>123},
{:user=>1, :role=>"exec", :section=>123},
{:user=>2, :role=>"staff", :section=>456},
{:user=>2, :role=>"exec", :section=>456}]
This should improve your speed; memory is the open question. I can't say what effect a large input will have, since that depends on your machine, its resources, and the file size.
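Putting the two lines together, a minimal runnable sketch using the sample data from the question:

```ruby
# Group-by approach from above, run against the question's sample data.
enrollments = [
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec",  section: 123 },
  { user: 2, role: "exec",  section: 456 },
  { user: 3, role: "staff", section: 789 }
]

# Group rows by the [user, section] pair, keep only groups with more
# than one member, and flatten back into a single array of rows.
grouped    = enrollments.group_by { |row| [row[:user], row[:section]] }
duplicates = grouped.values.select { |rows| rows.size > 1 }.flatten

pp duplicates
```

This does a single pass to build the groups plus a pass over the (much smaller) set of groups, instead of comparing every row against every other row.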
You don't need a double loop to do this check in memory. You can keep an array of the unique values seen so far and check each new CSV row against it:
require 'csv'

found = []
unique_enrollments = []

CSV.foreach('/path/to/csv') do |row|
  # do whatever you're doing to parse this row into the hash you show in your question:
  # => { user: 1, role: "staff", section: 123 }
  # you might have to do `next if row.header_row?` if the first row is the header
  enrollment = parse_row_into_enrollment_hash(row)
  unique_tuple = [enrollment[:user], enrollment[:section]]
  unless found.include? unique_tuple
    found << unique_tuple
    unique_enrollments << enrollment
  end
end
Now you have unique_enrollments. With this approach you parse the CSV line by line, so you never hold the whole file in memory, and you build a smaller array of [user, section] tuples that serves both to check uniqueness and to collect the unique rows.
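One caveat: Array#include? scans the whole found array on every row, which over 145k rows is itself quadratic. Ruby's stdlib Set gives constant-time membership checks instead. A sketch of the same dedup loop, with a small inline array standing in for the parsed CSV rows:

```ruby
require "set"

# Same single-pass dedup as above, but `found` is a Set, so the
# membership check is O(1) per row instead of a linear scan.
rows = [
  { user: 1, role: "staff", section: 123 },
  { user: 1, role: "exec",  section: 123 },
  { user: 2, role: "staff", section: 456 }
]

found = Set.new
unique_enrollments = []

rows.each do |enrollment|
  unique_tuple = [enrollment[:user], enrollment[:section]]
  # Set#add? returns nil if the element was already present
  if found.add?(unique_tuple)
    unique_enrollments << enrollment
  end
end
```

unique_enrollments now holds the first row seen for each [user, section] pair.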
You can optimize this even further by not storing unique_enrollments in a large array, and instead building your model and saving it straight to the db:
unless found.include? unique_tuple
  found << unique_tuple
  Enrollment.create enrollment
end
With that setup you save memory by not keeping a large array of enrollments. The drawback is that if something blows up mid-run, you can't roll back. If instead you keep the unique_enrollments array and save at the end, you can do:
Enrollment.transaction do
  unique_enrollments.each &:save!
end
And now you can roll back if any of those saves blows up. Also, wrapping multiple db calls in one transaction is much faster. I would go with this approach.
Edit: using the unique_enrollments array, you can iterate over it at the end and write a new CSV:
CSV.open('path/to/new/csv', 'w') do |csv|
  csv << ['user', 'role', 'section'] # write the header
  unique_enrollments.each do |enrollment|
    csv << enrollment.values # just the values, not the keys
  end
end
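Note that this answer deduplicates rows, while the question asked for the rows whose user/section pair appears more than once. A two-pass counting approach answers that directly while still reading row by row; in this sketch a CSV string stands in for the file (with a real file you would call CSV.foreach twice instead of CSV.parse):

```ruby
require "csv"

# Two-pass sketch: count each [user, section] pair, then keep only
# the rows whose pair occurred more than once.
data = <<~ROWS
  user,role,section
  1,staff,123
  2,staff,456
  3,staff,123
  1,exec,123
  2,exec,456
  3,staff,789
ROWS

# Pass 1: tally the pairs. Only the counts hash is held in memory.
counts = Hash.new(0)
CSV.parse(data, headers: true) do |row|
  counts[[row["user"], row["section"]]] += 1
end

# Pass 2: collect (or stream out) only the duplicated rows.
duplicates = []
CSV.parse(data, headers: true) do |row|
  duplicates << row.to_h if counts[[row["user"], row["section"]]] > 1
end
```

The memory footprint is one counter per distinct [user, section] pair, rather than the full row set.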