What's the most efficient way to split a large hash into N smaller hashes in Ruby?

Problem


I am working on a problem that involves sharding. As part of it, I need to find the fastest way to split a large Ruby hash (> 200,000 entries) into two or more parts.

  • Are there any non-O(n) approaches?

  • Is there a non-Ruby (i.e. C/C++) implementation?

Please don't answer with examples using the trivial approach of converting the hash to an array and reassembling N distinct hashes.

My concern is that Ruby is too slow to get the job done.


Initial approach


This was the first solution I tried. What was attractive about it:

  • it didn't need an explicit loop over the hash
  • it didn't need to manage a counter to distribute the items evenly between the shards
  • it looked short and neat

OK, it's not O(n), but it relies on methods in the standard library, which I figured would be faster than writing my own Ruby code.


pivot = s.size / 2
slices = s.each_slice(pivot)           # Enumerable#each_slice: arrays of [key, value] pairs
s1 = Hash[*slices.entries[0].flatten]  # rebuild the first half as a hash
s2 = Hash[*slices.entries[1].flatten]  # rebuild the second half as a hash


The best solution

Both Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I said I didn't want - it iterated over all of the members and evaluated a conditional as it went - but since he had taken the time to do the evaluation, I figured I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys are not numbers, so I can't take his approach verbatim).

# Splits hash s into two shards of roughly equal size.
def split_shard(s)
    shard1 = {}
    shard2 = {}

    t = Benchmark.measure do
        n = 0

        pivot = s.size / 2

        # Walk the hash once, sending the first half of the pairs to shard1
        # and the rest to shard2.
        s.each_pair do |k,v|
            if n < pivot
                shard1[k] = v
            else
                shard2[k] = v
            end

            n += 1
        end
    end

    $b += t.real   # accumulated wall-clock time across all calls
    $e += s.size   # accumulated number of entries processed
    return shard1, shard2
end
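For context, here is a minimal sketch of how a method like this might be driven; the initialisation of the $b and $e globals and the big_hashes collection are my assumptions, not part of the original post:

require 'benchmark'

$b = 0.0   # accumulated split time, in seconds
$e = 0     # accumulated number of entries processed

# big_hashes stands in for whatever produces the large hashes.
big_hashes.each do |h|
  shard1, shard2 = split_shard(h)
  # ... hand shard1 and shard2 off to their respective shards ...
end

puts "split #{$e} entries in #{$b.round(2)} seconds"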

      


Results


In both cases, a large number of hashes were split into shards. The total cardinality across all hashes in the test data set was 1,680,324.

My original solution, which should have been faster because it uses standard library methods and minimizes the amount of Ruby code (no loop, no conditional), runs in a little over 9 seconds.

Mark's approach runs in just 5 seconds.

That's a significant win.


Takeaways


Don't be fooled by "intuition" - measure the performance of competing algorithms.

Don't worry about Ruby's performance as a language - my original concern was that doing ten million of these operations might take a significant amount of time in Ruby, but it really doesn't.

Thanks to Mark and Mike, who both get points from me for their help.

Thanks!



2 answers


This is probably not fast enough for your needs (which sound like they would need a C extension), but perhaps you could use Hash#select?
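For illustration only (my sketch, not code from the answer), a minimal Hash#select based split might look like the following, assuming numeric keys as in the benchmark further down. Note that Hash#select returns a Hash on Ruby 1.9+, but an array of [key, value] pairs on 1.8, which would then need Hash[] to rebuild.

# Minimal sketch of the Hash#select idea (assumes numeric keys).
s = {}
1.upto(200_000) { |i| s[i] = i }

s1 = s.select { |k, _v| k <= 100_000 }   # first half by key
s2 = s.select { |k, _v| k >  100_000 }   # second half by key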

I agree with Mike Woodhouse's idea. Is it possible for you to build your shards in the same place where the original 200k-item hash is being built? If the items are coming out of a database, you could split your query into multiple non-overlapping queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
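As a rough illustration of that idea (again my sketch, not from the answer), the shards could be built directly from chunked queries; the db connection object, the items table and its columns are invented for the example:

# Hypothetical sketch: build the shards directly from the database in
# chunks instead of splitting one big hash afterwards.
shards = []
chunk_size = 10_000
offset = 0

loop do
  rows = db.execute("SELECT key, value FROM items ORDER BY key LIMIT #{chunk_size} OFFSET #{offset}")
  break if rows.empty?

  shards << rows.each_with_object({}) { |(k, v), h| h[k] = v }
  offset += chunk_size
end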

Addendum

Hi Chris, I just compared your approach with Hash#select:

"standard" is required



s = {}
1.upto(200_000) { |i| s[i] = i}

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k,v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end

      

It looks like Hash#select is much faster, even though it goes through the whole big hash for each of the sub-hashes:

# ruby test.rb 
      user     system      total        real
  0.560000   0.010000   0.570000 (  0.571401)
  0.320000   0.000000   0.320000 (  0.323099)

      

Hope it helps.



I don't see how you can achieve this using an unmodified "vanilla" Hash - I would expect that you'd need to get into the internals to turn the split into some kind of bulk copy operation. How good is your C?

I would be more inclined to look at partitioning instead of creating the Hash in the first place, especially if the only reason the 200K-item hash exists is to be subdivided.

EDIT: After thinking about it at the gym ...

The problem with finding any existing solution is that someone else must have (a) had the pain, (b) had the technical ability to solve it and (c) felt community-friendly enough to release it into the wild. Oh, and for your OS platform.



How about using a B-Tree instead of a Hash? Keep the data sorted by key and it can be traversed with memcpy(). B-Tree retrieval is O(log n), which isn't much of a hit against the Hash most of the time.
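To make that concrete, here is my own rough sketch (not from the answer) of what a sorted split looks like: a binary search for the pivot key followed by two contiguous slice copies, which is the bulk-copy behaviour being described. Note that bsearch_index needs Ruby 2.3 or later.

# Rough illustration: pairs kept sorted by key, split at a pivot key.
# The contiguous slices play the role of the memcpy()-style bulk copy.
sorted = (1..200_000).map { |i| [i, i] }   # [key, value] pairs in key order

pivot_key = 100_000
idx = sorted.bsearch_index { |(k, _v)| k > pivot_key } || sorted.size

left  = sorted[0...idx]    # one contiguous copy
right = sorted[idx..-1]    # another contiguous copy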

I found something here that might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.

Those C/C++ skills would be needed, though. (Mine are hopelessly rusty.)







