Scala - Example of a fold operation on a HashMap (NOT foldLeft)
I'm interested in Scala because it seems like a good way to parallelize operations. I need to implement a machine learning algorithm that uses vector multiplication (as many do). I know how to write the algorithm, but what I want is a sparse implementation of vectors backed by HashMaps. Each vector is stored as a HashMap[Int, Double], where the key is the integer index of that Double in the vector.
Using Pythonish pseudocode: <7, 6, 5, 4> ==> {1: 7, 2: 6, 3: 5, 4: 4}
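For concreteness, a minimal Scala sketch of that representation (the object and value names are just illustrative):

```scala
import scala.collection.immutable.HashMap

object VectorRepr {
  // The dense vector <7, 6, 5, 4> stored sparsely:
  // key = 1-based index (as in the question), value = the entry there.
  val vector: HashMap[Int, Double] =
    HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0)
}
```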
I want to define a dot product function using fold, reduce, map, etc., but NOT foldLeft or reduceLeft, because I would like it to be parallelizable: my vectors can reach sizes of 6000+, and for dot products the order doesn't matter.
I have read MANY examples of foldLeft and reduceLeft, but I still can't figure out how to use HashMap.fold or HashMap.reduce.
I understand functional programming quite well, but I don't understand Scala's error messages. Here's a template for what I want, more or less:
import scala.collection.immutable.HashMap

object NGramAnalysis {
  def main(args: Array[String]) {
    val mapped = HashMap(1 -> 1.2, 5 -> 2.4)
    println(mapped.fold( /* .... What goes here ... */ ))
  }
}
Conclusion: I want a solid example using HashMap.fold, NOT foldLeft, and the same for HashMap.reduce.
Thanks in advance. I've struggled with this for a while.
First of all, the difference between fold and reduce is that fold takes an extra argument that is used as the initial value, whereas reduce takes the first item in the collection as the initial value, and throws an exception if the collection is empty. So fold is somewhat more general than reduce, and I will refer to both of these functions as fold from now on.
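The difference shows up clearly on an empty collection; a small sketch (the object name is illustrative):

```scala
import scala.collection.immutable.HashMap
import scala.util.Try

object FoldVsReduce extends App {
  val empty = HashMap.empty[Int, Double]

  // fold returns its initial value on an empty collection...
  println(empty.values.fold(0.0)(_ + _)) // 0.0

  // ...while reduce has no initial value and throws, so wrap it in Try.
  println(Try(empty.values.reduce(_ + _)).isFailure) // true
}
```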
For fold to work correctly, the elements of your collection must form a semigroup: there must be a binary operation on them, and it must be associative, i.e. satisfy the identity (a `op` b) `op` c == a `op` (b `op` c). (Strictly speaking, for fold the initial value should also be a neutral element of the operation, making it a monoid.) Associativity is necessary because fold does not specify the order in which the operation is applied, which matters especially in a parallel context. This operation is used to perform the fold:
a1 `op` a2 `op` a3 `op` ... `op` an
If reduce runs in parallel, it can split the collection, reduce the first half in one thread and the second half in another thread, and then combine their results using the same operation. This only works correctly if the operation is associative.
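To see why associativity matters, consider subtraction, which is not associative; a hypothetical parallel split could then change the result:

```scala
object AssociativityDemo extends App {
  val xs = List(1, 2, 3, 4)

  // Sequential left-to-right reduction: ((1 - 2) - 3) - 4
  val sequential = xs.reduceLeft(_ - _) // -8

  // One possible parallel grouping: (1 - 2) - (3 - 4)
  val split = (xs(0) - xs(1)) - (xs(2) - xs(3)) // 0

  // The two disagree, so '-' is unsafe for a parallel fold/reduce;
  // an associative operator like '+' gives the same result either way.
  println(sequential == split) // false
}
```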
As I said, the fold method takes two arguments: an initial value and an [associative] binary operator. For example, to concatenate a list of strings in parallel, you can do this:
val strings = Seq("a", "b", "c", "d", ...)
strings.par.fold("")(_ ++ _) // or strings.par.reduce(_ ++ _) if you know that strings is non-empty
So, to implement a dot product, you need to decide which collection you will map/reduce over and which binary operator will perform the reduction.
This is a trivial implementation of the dot product for two collections:
(c1 zip c2).par.map {
  case (e1, e2) => e1 * e2
}.reduce(_ + _)
That is, we zip the two collections together, multiply their elements pairwise using the * operator, and then reduce the result using the + operator. Of course, * and + must be defined on the elements of c1 and c2.
However, a HashMap is not ordered, so its iteration order is undefined. There is no guarantee that zip will pair up elements with the same key, which makes that approach to the dot product incorrect. You need to do something like this:
c1.par.map {
  case (k, v) => v * c2(k)
}.reduce(_ + _)
Here we do not zip the collections; instead, we look up every key from the first map in the second map. (Note that c2 must provide a default value, e.g. via withDefaultValue(0.0), for keys it does not contain; otherwise the lookup will throw.)
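Putting it together, a runnable sketch of this keyed dot product (sequential here; adding .par would additionally require the scala-parallel-collections module on Scala 2.13+; the values are made up):

```scala
import scala.collection.immutable.HashMap

object DotProductDemo extends App {
  // withDefaultValue(0.0) makes lookups of absent keys return 0.0
  // instead of throwing NoSuchElementException.
  val c1 = HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0).withDefaultValue(0.0)
  val c2 = HashMap(1 -> 1.0, 3 -> 2.0, 4 -> 9.0).withDefaultValue(0.0)

  // Look up every key of c1 in c2 and sum the pairwise products.
  val dot = c1.map { case (k, v) => v * c2(k) }.sum

  println(dot) // 7*1 + 6*0 + 5*2 = 17.0
}
```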
I just want to add a simple example implementation, since @Vladimir Matveev has already covered the background.
Here the vector is backed by a HashMap. The factory method in the companion object ensures that all unspecified indices have the default value 0.
The idea is pretty simple: we take the union of the key sets, so that we have every key that appears in either map. Then we multiply the corresponding values and sum them up. Since nonexistent keys yield the default value, this works fine.
import scala.collection.immutable.HashMap

class SparseVector private (val entries: Map[Int, Double]) {
  def **(vec: SparseVector) =
    // .toSeq before mapping: mapping over a Set would collapse equal
    // products into a single element and skew the sum
    (entries.keySet ++ vec.entries.keySet).toSeq.par
      .map(index => entries(index) * vec.entries(index)).sum

  // alternative suggested by @wingedsubmariner; note that two methods with
  // the same name and signature cannot coexist, so pick one of the two:
  // def **(vec: SparseVector) =
  //   (entries.keySet ++ vec.entries.keySet).par
  //     .aggregate(0.0)((sum, index) => sum + entries(index) * vec.entries(index), _ + _)
}

object SparseVector {
  def apply(entries: HashMap[Int, Double]) =
    new SparseVector(entries.withDefaultValue(0.0))
}
The map, sum and aggregate methods all have parallel implementations.
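For completeness, a self-contained usage sketch (the class repeated so the snippet compiles on its own, with .par dropped so it runs without the parallel-collections dependency on newer Scala versions; the vectors are made up):

```scala
import scala.collection.immutable.HashMap

class SparseVector private (val entries: Map[Int, Double]) {
  def **(vec: SparseVector): Double =
    // .toSeq so equal products are not collapsed by the Set
    (entries.keySet ++ vec.entries.keySet).toSeq
      .map(index => entries(index) * vec.entries(index)).sum
}

object SparseVector {
  def apply(entries: HashMap[Int, Double]): SparseVector =
    new SparseVector(entries.withDefaultValue(0.0))
}

object SparseVectorDemo extends App {
  val a = SparseVector(HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0))
  val b = SparseVector(HashMap(2 -> 2.0, 3 -> 3.0, 9 -> 4.0))

  // Union of keys is {1, 2, 3, 9}; missing entries default to 0.0:
  // 7*0 + 6*2 + 5*3 + 0*4 = 27.0
  println(a ** b) // 27.0
}
```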