Why does this Haskell program do a weird job when compiled with -threaded?
Consider the following toy that brutely forces a password hash by applying character substitutions to dictionary words. The dictionary is moved sequentially or in parallel, started at compile time by a symbol PARMAP
.
import qualified Control.Parallel.Strategies as Strat
import qualified Crypto.Hash.SHA256 as SHA256
import qualified Data.ByteString as BS
import qualified Data.ByteString.Base16 as BS.Base16
import qualified Data.ByteString.Char8 as BS.Char8
import Data.Char (isLower, toUpper)
import Data.Maybe (listToMaybe)
variants :: String -> [String]
variants "" = [""]
variants (c:s) = [c':s' | s' <- variants s, c' <- subst c]
where subst 'a' = "aA@"
subst 'e' = "eE3"
subst 'i' = "iI1"
subst 'l' = "lL1"
subst 'o' = "oO0"
subst 's' = "sS$5"
subst 'z' = "zZ2"
subst x | isLower x = [x, toUpper x]
subst x = [x]
myMap :: (a -> [a]) -> [a] -> [[a]]
#ifdef PARMAP
myMap = Strat.parMap (Strat.evalList Strat.rseq)
#else
myMap = map
#endif
bruteForce :: [String] -> BS.ByteString -> Maybe String
bruteForce dictionary hash = listToMaybe $ concat $ myMap match dictionary
where match word = [var | var <- variants word,
SHA256.hash (BS.Char8.pack var) == hash]
main :: IO ()
main = do
dictionary <- lines `fmap` (readFile "dictionary.txt")
hash <- (fst . BS.Base16.decode . BS.Char8.pack) `fmap` getLine
case bruteForce dictionary hash of
Just preimage -> putStrLn preimage
Nothing -> return ()
I will compile this program with and without PARMAP
and -threaded
.
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -o brute.seq $ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -DPARMAP -o brute.par $ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -o brute.seq+th $ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -DPARMAP -o brute.par+th
To run this program, I make a little dictionary and take the last word from it.
$ shuf -n 300 /usr/share/dict/american-english >dictionary.txt
$ tail -n 1 dictionary.txt
desalinates
$ echo -n 'De$aL1n@teS' | sha256sum
3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072 -
I am running this on a dual core processor. No other processor intensive processes are running on this computer.
The sequential version of the map performs as expected.
$ TIMEFORMAT='real %R user %U sys %S'
$ time ./brute.seq <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.797 user 39.574 sys 0.156
$ time ./brute.seq+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.313 user 44.159 sys 0.088
$ time ./brute.seq+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.990 user 44.835 sys 0.876
The parallel version of the map, compiled without -threaded
, has the same performance.
$ time ./brute.par <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.847 user 39.742 sys 0.056
When I combine a parallel card with -threaded
, but don't use 2 cores yet, things start to look strange.
$ time ./brute.par+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 58.636 user 73.661 sys 17.717
When I use 2 cores, things get even weirder. Performance now varies greatly from launch to launch, which previous versions have not shown. Sometimes it is twice as fast as single core brute.par+th
, which is what I expect, since the algorithm is awkwardly parallel. But sometimes it is even slower than on a single core.
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 28.589 user 51.615 sys 2.304
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 59.149 user 106.255 sys 4.664
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 49.428 user 87.193 sys 3.816
Now I have two questions that may be related.
- Why is 1-core
brute.par+th
so slower than 1-corebrute.par
? What is he doing in these streams? What does it do in kernel mode in a whopping 17 seconds? - Why does the performance of a dual-core
brute.par+th
vary so wildly that it doesn't reliably double the performance of a 1-corebrute.par+th
?
Im using GHC 7.4.1 and parallel-3.2.0.2. I know I should probably use newer versions, but for now this is what I'm comfortable with.
I tried to compile with -rtsopts
and disable threaded GC with +RTS -qg
, no effect.
I tried ThreadScope but it changed like crazy and was unable to load the event log even when I was using a much smaller dictionary.
source to share
The unexpected performance can be explained by the fact that "Crypto.Hash.SHA256" is calling unsafe FFI code . GHC does not guarantee that other Haskell threads will not block when this code is called. If the threads that spawn are blocked by GHC, it will cause a lot of conflicts in your program, leading to long run times and inconsistent runtime results.
source to share