Why does this Haskell program do a weird job when compiled with -threaded?

Question

Why does this Haskell program do a weird job when compiled with -threaded?

Consider the following toy that brutely forces a password hash by applying character substitutions to dictionary words. The dictionary is moved sequentially or in parallel, started at compile time by a symbol PARMAP

.

import qualified Control.Parallel.Strategies as Strat
import qualified Crypto.Hash.SHA256 as SHA256
import qualified Data.ByteString as BS
import qualified Data.ByteString.Base16 as BS.Base16
import qualified Data.ByteString.Char8 as BS.Char8
import Data.Char (isLower, toUpper)
import Data.Maybe (listToMaybe)

variants :: String -> [String]
variants "" = [""]
variants (c:s) = [c':s' | s' <- variants s, c' <- subst c]
  where subst 'a' = "aA@"
        subst 'e' = "eE3"
        subst 'i' = "iI1"
        subst 'l' = "lL1"
        subst 'o' = "oO0"
        subst 's' = "sS$5"
        subst 'z' = "zZ2"
        subst x | isLower x = [x, toUpper x]
        subst x = [x]

myMap :: (a -> [a]) -> [a] -> [[a]]
#ifdef PARMAP
myMap = Strat.parMap (Strat.evalList Strat.rseq)
#else
myMap = map
#endif

bruteForce :: [String] -> BS.ByteString -> Maybe String
bruteForce dictionary hash = listToMaybe $ concat $ myMap match dictionary
  where match word = [var | var <- variants word,
                      SHA256.hash (BS.Char8.pack var) == hash]

main :: IO ()
main = do
  dictionary <- lines `fmap` (readFile "dictionary.txt")
  hash <- (fst . BS.Base16.decode . BS.Char8.pack) `fmap` getLine
  case bruteForce dictionary hash of
    Just preimage -> putStrLn preimage
    Nothing -> return ()

I will compile this program with and without PARMAP

and -threaded

.

$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -o brute.seq
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -DPARMAP -o brute.par
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -o brute.seq+th
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -DPARMAP -o brute.par+th

To run this program, I make a little dictionary and take the last word from it.

$ shuf -n 300 /usr/share/dict/american-english >dictionary.txt
$ tail -n 1 dictionary.txt 
desalinates
$ echo -n 'De$aL1n@teS' | sha256sum
3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072  -

I am running this on a dual core processor. No other processor intensive processes are running on this computer.

The sequential version of the map performs as expected.

$ TIMEFORMAT='real %R   user %U   sys %S'
$ time ./brute.seq <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.797   user 39.574   sys 0.156
$ time ./brute.seq+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.313   user 44.159   sys 0.088
$ time ./brute.seq+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.990   user 44.835   sys 0.876

The parallel version of the map, compiled without -threaded

, has the same performance.

$ time ./brute.par <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.847   user 39.742   sys 0.056

When I combine a parallel card with -threaded

, but don't use 2 cores yet, things start to look strange.

$ time ./brute.par+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 58.636   user 73.661   sys 17.717

When I use 2 cores, things get even weirder. Performance now varies greatly from launch to launch, which previous versions have not shown. Sometimes it is twice as fast as single core brute.par+th

, which is what I expect, since the algorithm is awkwardly parallel. But sometimes it is even slower than on a single core.

$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 28.589   user 51.615   sys 2.304
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 59.149   user 106.255   sys 4.664
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 49.428   user 87.193   sys 3.816

Now I have two questions that may be related.

Why is 1-core brute.par+th

so slower than 1-core brute.par

? What is he doing in these streams? What does it do in kernel mode in a whopping 17 seconds?
Why does the performance of a dual-core brute.par+th

vary so wildly that it doesn't reliably double the performance of a 1-core brute.par+th

?

Im using GHC 7.4.1 and parallel-3.2.0.2. I know I should probably use newer versions, but for now this is what I'm comfortable with.

I tried to compile with -rtsopts

and disable threaded GC with +RTS -qg

, no effect.

I tried ThreadScope but it changed like crazy and was unable to load the event log even when I was using a much smaller dictionary.

+3

performance multithreading parallel-processing haskell ghc

Vasiliy Faronov June 24. 15 at 21:40

source to share

1 answer

Josh · Accepted Answer · 2015-06-24T23:15:26+0000

The unexpected performance can be explained by the fact that "Crypto.Hash.SHA256" is calling unsafe FFI code . GHC does not guarantee that other Haskell threads will not block when this code is called. If the threads that spawn are blocked by GHC, it will cause a lot of conflicts in your program, leading to long run times and inconsistent runtime results.

Why does this Haskell program do a weird job when compiled with -threaded?

More articles: