Scala - Example of Fold operation on HashMap ** not foldLeft

I'm interested in using Scala because it seems like a good way to parallelize operations. I need to develop a machine learning algorithm that uses vector multiplication (as many do). I know how to make an algorithm, but what I want to do is a rare implementation of vectors from HashMaps. Almost all vectors are stored as HashMaps [Int, Double], where the index of this double in the vector is an integer that is a key.

Using Pythonish psuedocode, <7, 6, 5, 4> ==> {1: 7, 2: 6, 3: 5, 4: 4}

I want to define a dot product function using either fold, reduce, map ... etc, but I do NOT want to use foldLeft, reduceLeft ... because I would like this to potentially be paralyzed because my vectors can get up to 6000+ sizes and for dot products the order doesn't matter.

I have read MANY examples of foldLeft and reduceLeft, but I still have to figure out how to use HashMap.fold or HashMap.reduce.

I understand functional programming quite well, but I don't understand the error messages in Scala. Here's a template for what I want more or less.

object NGramAnalysis {
  def main(args: Array[String]) {
    val mapped = HashMap(1->1.2, 5->2.4)
    println(mapped.fold( .... What goes here ... )
  }
}

      

Conclusion I want a solid example using HashMap.fold NOT foldLeft and same for HashMap.reduce

Thanks in advance. I struggled for a while.

+3


source to share


2 answers


First of all, the difference between fold

and reduce

is that it fold

takes an extra argument that is used as the initial value, whereas it reduce

takes the first item in the collection as that initial value and if the collection is empty, it throws an exception. So, fold

somewhat more general than reduce

that, so I will refer to both of these functions as fold

from now on.

For correct operation, the fold

elements in your collection must form a semigroup, that is, there must be a binary operation, which must also be associative, i.e. must satisfy the following identity: (a `op` b) `op` c == a `op` (b `op` c)

. Associativity is necessary because it fold

does not indicate the order of the application's operations, which is especially important in a parallel context. This operation is used to make a fold:

a1 `op` a2 `op` a3 `op` ... `op` an

      

If reduce

run in parallel, it can split the collection and shrink the first half in one thread and the second half in another thread; then their results are combined using the same operation. This will only work correctly if the operation is associative.

As I said, the method fold

takes two arguments: an initial value and an [associative] binary operator. For example, to concatenate a list of strings in parallel, you can do this:

val strings = Seq("a", "b", "c", "d", ...)
strings.par.fold("")(_ ++ _)  // or strings.par.reduce(_ ++ _) if you know that strings is non-empty

      

So, to implement a dotted product, you need to think about the collection that you will add / decrease and the binary operator that will perform this reduction.



This is a trivial implementation of the dot product for two collections:

(c1 zip c2).par.map {
  case (e1, e2) => e1 * e2
}.reduce(_ + _)

      

That is, we put these collections together to multiply their elements in pairs using the operator *

, and then we decrease the result using the operator +

. Of course, *

and +

must be defined on the elements c1

and c2

.

However, it is HashMap

not ordered, so the order of its iteration is undefined. There is no guarantee that it will zip

attach to elements with the same key, which makes the wrong idea about the exact product. You need to do something like this:

c1.par.map {
  case (k, v) => v * c2(k)
}.reduce(_ + _)

      

Here we are not looping the collection, but instead we search the second map using all the keys from the first map.

+3


source


I just want to add a simple example implementation as @Vladimir Matveev closed the background.

Here the vector is supported by HashMap

. The applicable factory method ensures that there is a default value for all non-specified indices 0

.

The idea is pretty simple. We combine keys so that we have all the keys shown on any card. Then we multiply the corresponding values ​​and add them up. Since we have a default value for non-existent keys, this works great.



class SparseVector private (val entries: Map[Int, Double]) {

  def **(vec: SparseVector) = 
    (entries.keySet ++ vec.entries.keySet).par
      .map(index => entries(index) * vec.entries(index)).sum

  //alternative suggested by @wingedsubmariner
  def **(vec: SparseVector) = 
    (entries.keySet ++ vec.entries.keySet).par
     .aggregate(0.0)((sum, index) => sum + entries(index) * vec.entries(index), (_ + _))
}

object SparseVector {
  def apply(entries: HashMap[Int, Double]) =
    new SparseVector(entries.withDefaultValue(0.0))
}

      

Methods map

, sum

and aggregate

have parallel implementation.

0


source







All Articles