Uniq for Enumerator :: Lazy

I am handling something with a lot of duplicate lines:

# => [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
puts xs

# => [ [1, "A"], [2, "B"] ]
puts xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }

      

But xs is huge. I am trying to load it lazily, but Enumerator # Lazy has no uniq method .

How do I achieve this lazily?

+3


source to share


3 answers


module EnumeratorLazyUniq
  refine Enumerator::Lazy do
    require 'set'
    def uniq
      set = Set.new
      select { |e|
        val = block_given? ? yield(e) : e
        !set.include?(val).tap { |exists|
          set << val unless exists
        }
      }
    end
  end
end

using EnumeratorLazyUniq
xs = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ].to_enum.lazy

us = xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
puts us.to_a.inspect
# => [[1, "A"], [2, "B"]]
# Works with a block

puts us.class
# => Enumerator::Lazy
# Yep, still lazy.

ns = [1, 4, 6, 1, 2].to_enum.lazy
puts ns.uniq.to_a.inspect
# => [1, 4, 6, 2]
# Works without a block

      



This is a simple implementation using Set

; this means that any uniq'd values ​​(i.e. things like [1, "A"]

, but not the elements of a stream themselves, like [1, "A", 23626]

) will take up memory.

+7


source


I decided to run the test using the two methods I suggested and the @Amadan method. The results speak for themselves.

Control code

require 'benchmark'

module EnumeratorLazyUniq
  refine Enumerator::Lazy do
    require 'set'
    def uniq
      set = Set.new
      select { |e|
        val = block_given? ? yield(e) : e
        !set.include?(val).tap { |exists|
          set << val unless exists
        }
      }
    end
  end
end

using EnumeratorLazyUniq

def amadan(xs)
  xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
end

      

require 'set'

def cary_set(arr)
  first = Set.new
  arr.each_with_object([]) do |(a0, a1, *_), b|
    unless first.include?(a0)
      first << a0 
      b << [a0, a1]
    end
  end
end

      

def cary_hash(arr)
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0]=[a0, a1] unless h.key?(a0) }.values
end

      

Test data



n_uniq = 10_000
n_copies = 100
tot = n_uniq * n_copies

xs = tot.times.map { |i| [i % n_uniq, 0, 1] }

      

Performance test

Benchmark.bm do |x|
  x.report("cary_set ") { cary_set(xs)  }
  x.report("cary_hash") { cary_hash(xs) }
  x.report("amadan   ") { amadan(xs)    }
end

      

results

Unique elements: 200,000
Number of copies of each unique element: 5

Array size: 1,000,000
       user     system      total        real
cary_set   0.980000   0.030000   1.010000 (  1.018618)
cary_hash  0.980000   0.010000   0.990000 (  0.982508)
amadan     0.590000   0.010000   0.600000 (  0.597249)


Unique elements: 100,000
Number of copies of each unique element: 10
Array size: 1,000,000

       user     system      total        real
cary_set   0.920000   0.030000   0.950000 (  0.942539)
cary_hash  0.630000   0.020000   0.650000 (  0.642367)
amadan     0.470000   0.000000   0.470000 (  0.478658)


Unique elements: 50,000
Number of copies of each unique element: 20
Array size: 1,000,000

       user     system      total        real
cary_set   0.910000   0.020000   0.930000 (  0.932277)
cary_hash  0.570000   0.000000   0.570000 (  0.575439)
amadan     0.410000   0.010000   0.420000 (  0.417695)

      

Unique elements: 1000000
Number of copies of each unique element: 10
Array size: 10000000

       user     system      total        real
cary_set  12.660000   0.270000  12.930000 ( 12.962183)
cary_hash  7.730000   0.060000   7.790000 (  7.797486)
amadan     6.640000   0.060000   6.700000 (  6.707706)

      

+2


source


Why not something simple?

code

require 'set'

def extract(arr)
  first = Set.new
  arr.each_with_object([]) do |(a0, a1, *_), b|
    unless first.include?(a0)
      first << a0 
      b << [a0, a1]
    end
  end
end

      

Example

arr = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
extract(arr)
  #=> [[1, "A"], [2, "B"]]

      

Alternative

A variation of this:

def extract(arr)
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0]=[a0, a1] unless h.key?(a0) }.values
end

      

I would expect the performance to be about the same, but the hash uses more memory due to values

+1


source







All Articles