How to improve zipWith performance in Haskell
I wrote some code for clustering with Data.Clustering.Hierarchical, but it is slow.
I've tried profiling and modifying the code, but I don't understand why zipWith
takes so long, even after changing the list to a vector.
import Data.Clustering.Hierarchical
import qualified Data.Vector.Primitive as DV
import System.Random
import Control.Monad

main = do
    vectorList <- genTestdata
    let cluster = dendrogram SingleLinkage vectorList getVectorDistance
    putStrLn $ show cluster

genZero x
    | x < 5     = x
    | otherwise = 0

genVector :: IO (DV.Vector Int)
genVector = do
    listRandom <- mapM (\x -> randomRIO (1,30)) [1..20]
    let intOut = DV.fromList $ map genZero listRandom
    return intOut

genTestdata = do
    r <- sequence $ map (\x -> liftM (\y -> (x,y)) genVector) [1..1000]
    return r

getExp2 v1 v2 = d*d
  where
    d = v1 - v2

getExp v1 v2
    | v1 == v2  = 0
    | otherwise = getExp2 v1 v2

tfoldl d = DV.foldl1' (+) d

changeDataType :: Int -> Double
changeDataType d = fromIntegral d

getVectorDistance :: (a, DV.Vector Int) -> (a, DV.Vector Int) -> Double
getVectorDistance v1 v2 = fromIntegral $ tfoldl dat
  where
    l1 = snd v1
    l2 = snd v2
    dat = DV.zipWith getExp l1 l2
To build it use: ghc -prof -fprof-auto -rtsopts -O2 log_cluster.hs
Run with log_cluster.exe +RTS -p
The profiling result on my machine follows; note the entry for getVectorDistance.dat:
> log_cluster.exe +RTS -p -RTS
total time  =        8.43 secs   (8433 ticks @ 1000 us, 1 processor)
total alloc = 1,614,252,224 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc
getVectorDistance.dat  Main     49.4   37.8   <------
tfoldl                 Main      5.7    0.0
getExp                 Main      4.5    0.0
getExp2                Main      0.5    1.5
Taking the advice in my comment, here are the timings for running the same code:
user:~/explorations$ ghc -O2 log_cluster.hs -rtsopts
[1 of 1] Compiling Main ( log_cluster.hs, log_cluster.o )
Linking log_cluster ...
user:~/explorations$ time ./log_cluster
101000
real 0m0.127s
user 0m0.120s
sys 0m0.000s
and when building with profiling:
user:~/explorations$ ghc -prof -fprof-auto -O2 log_cluster.hs -rtsopts
[1 of 1] Compiling Main ( log_cluster.hs, log_cluster.o )
Linking log_cluster ...
user:~/explorations$ time ./log_cluster
101000
real 0m2.937s
user 0m2.920s
sys 0m0.000s
So the profiled build is about 25 times slower, which is quite a significant overhead.
At this point I'm presuming that your program actually is slow, which is why you built it for profiling in the first place. If a non-profiled build is too slow, you may need to use more advanced profiling techniques.
Of course, this is a bit hypothetical, since the code you provided doesn't compile, so I had to fill in some gaps.
Edit: to be clear, my position is that adding SCC annotations
(whether manually or automatically) limits the optimizations that ghc can perform. The more liberally they are applied, the greater the difference between profiled and non-profiled code. This can lead to misleading profiles, since functions that appear to be bottlenecks in the profiled code may not be bottlenecks in the non-profiled code. I believe that is what is going on here.
The OP quite reasonably asks how to find the bottleneck when the profiling results are skewed. I expect that DV.zipWith
actually is the bottleneck in this example, in the sense that it is the only function doing significant work (but see note 1 regarding the test-data generation code); however, a manual inspection of the core (generated by compiling with -ddump-simpl -ddump-to-file -dsuppress-coercions
) shows that getVectorDistance
compiles to a nice non-allocating loop, with the intermediate vector entirely fused away. I doubt this can be significantly improved without heroic measures. (see note 2)
In general, the best way to use profiling is to start at the top and drill down. You can manually add a few SCC
annotations near the top level, or use -fprof-auto-exported
on only a few key top-level modules to get a rough idea. From there you can proceed further by adding annotations to more modules, manually adding a few more SCC
annotations, or, if you're lucky, switching to -fprof-auto
. Unfortunately, just using -fprof-auto-exported
doesn't help in this example unless you also add the module header module Main (main, getVectorDistance)
.
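As a small sketch of what a manual annotation looks like (the function and cost-centre label here are invented for illustration, not taken from the code above): time and allocation inside the annotated expression show up under that label in the .prof report when the program is built with -prof.

```haskell
module Main where

-- A manually placed cost centre: when built with -prof, work done
-- inside the annotated expression is attributed to the "sumSquares"
-- label in the profiling report.
sumSquares :: [Int] -> Int
sumSquares xs = {-# SCC "sumSquares" #-} go 0 xs
  where
    go acc []     = acc
    go acc (y:ys) = go (acc + y * y) ys

main :: IO ()
main = print (sumSquares [1, 2, 3])  -- prints 14
```

Because the SCC pragma wraps only one expression, the rest of the module stays fully optimizable, which is the point of annotating sparingly near the top level.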
An alternative is to use a different profiling method. For example, you can use ghc-events-analyze to profile your code. This involves manually adding some trace statements and post-processing the event log, but it usually doesn't interfere with compiler optimizations. In pure code it can sometimes be tricky to figure out where to put the traces so that they are evaluated at the right time; the chronograph package can handle this (it doesn't support the ghc-events-analyze format yet, but I'll add that shortly).
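A minimal sketch of the trace-statement approach, using traceEventIO from base (the "clustering" label and the dummy workload are invented for this example): ghc-events-analyze pairs "START label" / "STOP label" user events from the RTS event log to time the enclosed work.

```haskell
module Main where

import Debug.Trace (traceEventIO)

-- Emit user events into the RTS event log.  Build with -eventlog and
-- run with +RTS -l to produce an .eventlog file, then post-process it
-- with ghc-events-analyze.
main :: IO ()
main = do
  traceEventIO "START clustering"
  let s = sum [1 .. 1000000 :: Int]   -- stand-in for the real work
  print s
  traceEventIO "STOP clustering"
```

Without the -eventlog flag the traceEventIO calls are cheap no-ops, so the instrumented code can stay in place.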
I expect this is a condensed example from your complete code. Hopefully one of these methods helps you find a bottleneck that can be improved more easily.
Note 1: the data-generation code can almost certainly be sped up, if it looks like this in your complete program. System.Random
is known to be slow; use mwc-random or mersenne-random instead. I'm also a little suspicious of the use of DV.fromList
, although that may be fine.
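To illustrate both suggestions at once, here is a sketch of genVector rewritten against mwc-random (this assumes the mwc-random and vector packages; creating the generator inside genVector is only for self-containment, in real code you would create it once and pass it around):

```haskell
module Main where

import qualified Data.Vector.Primitive as DV
import System.Random.MWC (createSystemRandom, uniformR)

-- Generate the 20-element test vector directly with DV.replicateM,
-- avoiding both the slow System.Random generator and the intermediate
-- list that DV.fromList consumes.
genVector :: IO (DV.Vector Int)
genVector = do
  gen <- createSystemRandom
  DV.replicateM 20 (uniformR (1, 30) gen)

main :: IO ()
main = do
  v <- genVector
  print (DV.length v)  -- prints 20
```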
Note 2: when compiled with -prof -fprof-auto
, the core is not so good. Instead of a single fused loop over the two vectors, a new vector is created first, then a second loop traverses that new vector to compute the sum. So you get extra allocation, extra memory pressure, and two traversals instead of one. This is why the profiled version is significantly slower, and why I think the profile is misleading: the time attributed to DV.zipWith
is significantly inflated.
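If that skew matters in practice, one option is to write the loop by hand so that no intermediate vector exists even when fusion is disabled by profiling. A sketch (getVectorDistance' is a hypothetical replacement, not the original code; note that (v1 - v2)^2 is already 0 when v1 == v2, so the getExp special case is unnecessary):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.Vector.Primitive as DV

-- One strict accumulating loop over both vectors: no intermediate
-- vector is ever allocated, with or without -prof.
getVectorDistance' :: (a, DV.Vector Int) -> (a, DV.Vector Int) -> Double
getVectorDistance' (_, l1) (_, l2) = fromIntegral (go 0 0)
  where
    n = min (DV.length l1) (DV.length l2)
    go :: Int -> Int -> Int
    go !acc !i
      | i >= n    = acc
      | otherwise =
          let d = DV.unsafeIndex l1 i - DV.unsafeIndex l2 i
          in  go (acc + d * d) (i + 1)

main :: IO ()
main = print (getVectorDistance' ((), DV.fromList [1, 2, 3])
                                 ((), DV.fromList [1, 1, 1]))  -- prints 5.0
```

This is the "heroic measures" direction: it trades the clean zipWith formulation for a loop whose shape survives profiling unchanged, so the profile numbers become comparable to the optimized build.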