How to improve zipWith performance in Haskell

I wrote some code for clustering with Data.Clustering.Hierarchical but it is slow.

I've tried profiling and modifying the code, but I don't know why zipWith takes so long, even after changing the list to a vector.

import Data.Clustering.Hierarchical
import qualified Data.Vector.Primitive as DV
import System.Random
import Control.Monad


main :: IO ()
main = do
    vectorList <- genTestdata
    let cluster = dendrogram SingleLinkage vectorList getVectorDistance
    print cluster

-- Map values of 5 or more to 0.
genZero x
    | x < 5     = x
    | otherwise = 0

genVector :: IO (DV.Vector Int)
genVector = do
    listRandom <- replicateM 20 (randomRIO (1, 30))
    return $ DV.fromList (map genZero listRandom)

genTestdata :: IO [(Int, DV.Vector Int)]
genTestdata = forM [1 .. 1000] $ \x -> do
    y <- genVector
    return (x, y)

-- Squared difference of two components.
getExp2 v1 v2 = d * d
    where
        d = v1 - v2

-- Skip the subtraction and multiplication when the components are equal.
getExp v1 v2
    | v1 == v2 = 0
    | otherwise = getExp2 v1 v2

-- Sum a vector with a strict left fold.
tfoldl d = DV.foldl1' (+) d

changeDataType :: Int -> Double
changeDataType d = fromIntegral d

getVectorDistance :: (a, DV.Vector Int) -> (a, DV.Vector Int) -> Double
getVectorDistance v1 v2 = fromIntegral $ tfoldl dat
    where
        l1 = snd v1
        l2 = snd v2
        dat = DV.zipWith getExp l1 l2


To build it use: ghc -prof -fprof-auto -rtsopts -O2 log_cluster.hs

Run with log_cluster.exe +RTS -p

The profiling result on my machine follows - note the result for getVectorDistance.dat:

> log_cluster.exe +RTS -p -RTS

total time  =        8.43 secs   (8433 ticks @ 1000 us, 1 processor)
total alloc = 1,614,252,224 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc
getVectorDistance.dat  Main     49.4   37.8  <------
tfoldl                 Main      5.7    0.0
getExp                 Main      4.5    0.0
getExp2                Main      0.5    1.5

1 answer


Taking the advice in my comment, here are the timings for running the same code:

user:~/explorations$ ghc -O2 log_cluster.hs -rtsopts
[1 of 1] Compiling Main             ( log_cluster.hs, log_cluster.o )
Linking log_cluster ...
user:~/explorations$ time ./log_cluster
101000

real    0m0.127s
user    0m0.120s
sys     0m0.000s


and when building with profiling:

user:~/explorations$ ghc -prof -fprof-auto -O2 log_cluster.hs -rtsopts
[1 of 1] Compiling Main             ( log_cluster.hs, log_cluster.o )
Linking log_cluster ...
user:~/explorations$ time ./log_cluster
101000

real    0m2.937s
user    0m2.920s
sys     0m0.000s


So the profiled build is about 25 times slower, which is quite a significant overhead.

At this point I'm assuming that your program is slow only because you're building it for profiling. If the non-profiled build is also too slow for you, then you'll need to use more advanced profiling techniques.

Of course this is somewhat hypothetical, since the code you provided doesn't compile, so I had to fill in some gaps.
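For reference, the gaps can be filled along these lines (a sketch of one possibility - the exact code isn't shown here, and it won't necessarily reproduce the 101000 figure above; elements is the function from Data.Clustering.Hierarchical that returns the leaves of a dendrogram):

-- One way to force the clustering result without printing the whole
-- dendrogram: sum the labels of all the leaves.
main :: IO ()
main = do
    vectorList <- genTestdata
    let cluster = dendrogram SingleLinkage vectorList getVectorDistance
    print $ sum (map fst (elements cluster))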



Edit: To be clear, my position is that adding SCC annotations (whether manually or automatically) limits the optimizations that ghc can perform. The more liberally they are applied, the greater the difference between the profiled and the non-profiled code. This can lead to misleading profiles, because functions that appear to be bottlenecks in the profiled code may behave quite differently when compiled without profiling. I believe that is what is going on here.

The OP quite reasonably asks how to find the bottleneck if the profiling results are skewed. I expect that DV.zipWith really is the bottleneck in this example, since it's the only function that does significant work (but see below about the test data generation code); however, a manual inspection of the core (generated by compiling with -ddump-simpl -ddump-to-file -dsuppress-coercions) shows that getVectorDistance compiles to a nice unboxed loop, with the intermediate vector completely fused away. I doubt this can be improved significantly without heroic measures. (see note 2)
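For intuition, the fused core behaves roughly like this hand-written single pass (a sketch of the shape of the loop, not the actual Core output):

{-# LANGUAGE BangPatterns #-}

-- One indexed loop over both vectors, accumulating the sum of squared
-- differences; no intermediate vector is ever allocated.
distLoop :: DV.Vector Int -> DV.Vector Int -> Int
distLoop v1 v2 = go 0 0
    where
        n = min (DV.length v1) (DV.length v2)
        go !i !acc
            | i >= n    = acc
            | otherwise =
                let d = DV.unsafeIndex v1 i - DV.unsafeIndex v2 i
                in go (i + 1) (acc + d * d)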

In general, the best way to use profiling is to start at the top and drill down. You can manually add a few SCC annotations near the top level, or use -fprof-auto-exported (preferably applied to only a few key top-level modules) to get a rough idea. From there you can drill down, either by enabling it for more modules, by manually adding a few more SCC annotations, or, if you're lucky, by switching to -fprof-auto. Unfortunately, just using -fprof-auto-exported doesn't help in this example unless you also add the export list module Main (main, getVectorDistance).
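Concretely, that looks something like this (a sketch; the cost-centre name is arbitrary):

-- An explicit export list, so that -fprof-auto-exported instruments
-- getVectorDistance as well as main:
module Main (main, getVectorDistance) where

-- Alternatively, a manual cost centre on the top-level function:
getVectorDistance :: (a, DV.Vector Int) -> (a, DV.Vector Int) -> Double
getVectorDistance v1 v2 =
    {-# SCC "getVectorDistance" #-}
    fromIntegral $ DV.foldl1' (+) (DV.zipWith getExp (snd v1) (snd v2))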

An alternative is to use a different profiling method, for example ghc-events-analyze. This involves manually adding some trace statements and post-processing the event log, but it usually interferes much less with compiler optimizations. In pure code it can sometimes be tricky to figure out where to put the trace statements so that they are evaluated at the right time; the chronograph package can handle this (it doesn't support the ghc-events-analyze format yet, but I'll add that shortly).

I expect this is a condensed example extracted from your complete code. Hopefully one of these methods will help you find bottlenecks that can be improved more easily.

Note 1: the data generation code can almost certainly be sped up, if it resembles what your complete program does. System.Random is known to be slow; use mwc-random or mersenne-random instead. I'm also slightly suspicious of the use of DV.fromList, although it may be fine.
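For example, the generator could be rewritten with mwc-random along these lines (a sketch against the System.Random.MWC API; DV.replicateM also avoids the intermediate list that DV.fromList consumes):

import qualified System.Random.MWC as MWC

genVector :: MWC.GenIO -> IO (DV.Vector Int)
genVector gen = DV.replicateM 20 (fmap genZero (MWC.uniformR (1, 30) gen))

genTestdata :: IO [(Int, DV.Vector Int)]
genTestdata = do
    gen <- MWC.createSystemRandom  -- create one generator and reuse it
    forM [1 .. 1000] $ \x -> do
        y <- genVector gen
        return (x, y)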

Note 2: when compiled with -prof -fprof-auto, the core is not so good. Instead of a single fused loop over the two vectors, a new vector is created first, and then a second loop traverses that new vector to calculate the sum. So there are extra allocations, extra memory pressure, and two traversals instead of one. This is why the profiled version is significantly slower, and why I think the profile is misleading: the times reported for DV.zipWith are significantly inflated.
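In other words, under profiling getVectorDistance behaves roughly like this two-pass version (again a sketch of the shape, not the actual Core):

-- Materialise the intermediate vector, then traverse it a second time.
distUnfused :: DV.Vector Int -> DV.Vector Int -> Int
distUnfused v1 v2 = DV.foldl1' (+) tmp
    where
        -- DV.force makes the intermediate result a real, allocated vector
        tmp = DV.force (DV.zipWith getExp v1 v2)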
