Why are Clojure frequencies not faster than Python.Counter collections?

Note - Clojure is a beginner here.

I was expecting Clojure's instance implementation to be significantly faster than Python. But it turns out that Python is faster! What is the explanation for this? How can one reason about where Python will be faster and where Clojure will be faster?

I am using CPython 2.7.8 and Clojure 1.6.0 with 64 bit OpenJDK server with VM server 1.7.0_75-b13.

Python code:

from string import ascii_lowercase
import timeit

DATA = list(ascii_lowercase)*100000

def frequencies(items):
    counter = {}
    for item in items:
        counter[item] = counter.setdefault(item, 0) + 1

    return counter

print(timeit.timeit(lambda: frequencies(DATA), number=1))

      

Output:

0.528199911118

      

Clojure code:

(ns test
  (:gen-class))

(defn -main
  [& args]
  (let
    [data
     (doall (apply concat
                   (repeat 100000 (map char (range (int \a) (+ (int \z) 1))))))]
    (time (frequencies data))))

      

Outputs:

"Elapsed time: 861.668743 msecs"

      

UPDATE # 1

I did some optimization:

(ns test
  (:gen-class))

(defn frequencies2
  [coll]
  (into {} (reduce (fn [^java.util.HashMap counts x]
             (.put counts x
                   (inc (or (.get counts x) 0))) counts)
           (java.util.HashMap. {}) coll)))    

(defn -main
  [& args]
  (let
    [data
     (doall (apply concat
                   (repeat 10000 (map char (range (int \a) (inc (int \z)))))))]
    (time (dotimes [_ 15] (frequencies data)))
    (time (dotimes [_ 15] (frequencies2 data)))))

      

It outputs:

"Elapsed time: 1524.498547 msecs"
"Elapsed time: 476.387626 msecs"

      

So I add two questions:

  • Why doesn't the clojure.core

    implementation use type hints?
  • How can I optimize performance? Can I add a type hint for hashmap values ​​that are integers?
+3


source to share


2 answers


Benchmarking something in the JVM is a tricky exercise. The JVM will optimize your code when it runs, but it is not easy to predict when this will happen or will control it. To get anything more than the most general performance hint between the two functions (like Clojure), you need to use a custom comparison library. Criterium is the most used library in the Clojure community for this.

Reasoning about performance is pretty tricky, especially between two very different platforms. I think benchmarking and measuring a lot of code would be the best way to develop intuition between the two languages. Digging into the underlying data structures and understanding their performance characteristics will help you. As you saw in frequencies2

, you can achieve better performance using a modified HashMap than with Clojure's persistent maps. However, if you go down this route, you lose all unfailing kindness.

The Clojure version is missing type hints for several reasons.

  • Frequencies are a general purpose function, so they can deal with any value.
  • The type hint is really a performance value for interacting with Java classes . From Clojure Programming , page 367

    The type of hints for function arguments or returns is not a signature declaration: they do not affect the types that a function can accept or return. Their only effect is to allow Clojure to call Java methods and access Java fields using compile-time generated code, rather than the much slower use of run-time reflection to find methods or fields that match the form of interaction. Thus, if the prompt does not report interaction operations, they are not actually processed. [...] This conflicts with the signature declarations that Clojure provides, but only for primitive arguments and return types.



If you are exclusively working with Java primitives in a function, you can use the declarationsfor optimization. Again from Clojure Programming , page 438

When Clojure compiles a function, it generates a corresponding class that implements clojure.lang.IFn, one of the Java w1210 interfaces. IFn defines several invoke methods; this is what gets called under the covers when a Clojure function is called.

All arguments and return values ​​are objects within function boundaries (no decoding).These call methods all take arguments and return values ​​of the root type java.lang.Object. This allows Clojure s dynamic input parameters (i.e., your function implementations define the range of valid argument types, not static type declarations that are enforced by the language), but has the side effect of forcing the JVM to inject any primitives passed as arguments or returned the results of these functions. So, if we call a Clojure function with a primitive argument - for example, long - that argument will be put into a Long object to match the type signature of the Clojure functions underlying the calling method. Likewise, if the result of a function is a primitive value, the underlying type of the returned object ensures that such primitives are boxed beforehow the recipient will receive the result. [...]

(defn round ^long [^double a] (Math/round a))
;= #'user/round
(seq (.getDeclaredMethods (round foo)))
;= (#<Method public java.lang.Object user$round.invoke(java.lang.Object)> 
#<Method public final long user$round.invokePrim(double)>)

      

If you wanted to optimize this further, and you were dealing exclusively with primitive Java integers, then you could use type declarations ^int

for your arguments or return function values. However, I don't think it will be useful for your current code. Another way to go down is to parallelize the counts and concatenate them at the end. You can also take a look at http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/ for more ideas, although at this point you are indeed writing Java in funny domain syntax ...

+4


source


Using

(set! *warn-on-reflection* true)
(set! *unchecked-math* :warn-on-boxed)    ;; clojure 1.7

      

to get warnings from the compiler. Your update version is simple enough with two warnings:

Boxed math warning, /home/.../foo/src/foo/core.clj:68:28 - call: public static java.lang.Number clojure.lang.Numbers.unchecked_inc(java.lang.Object).
Boxed math warning, /home/.../foo/src/foo/core.clj:68:28 - call: public static java.lang.Number clojure.lang.Numbers.inc(java.lang.Object).

      

Here's a version with additional hints and exit criteria:



(defn frequencies2
  []
  (into {} (reduce (fn [^java.util.HashMap counts x]
                     (let [^int v (or (.get counts x) 0)]
                       (.put counts x
                             (inc v))) counts)
                   (HashMap.) data)))

      

criterium:

> (bench (frequencies2))
Evaluation count : 720 in 60 samples of 12 calls.
             Execution time mean : 91.375085 ms
    Execution time std-deviation : 1.415710 ms
   Execution time lower quantile : 89.957446 ms ( 2.5%)
   Execution time upper quantile : 95.135782 ms (97.5%)
                   Overhead used : 2.313579 ns

Found 3 outliers in 60 samples (5.0000 %)
    low-severe   1 (1.6667 %)
    low-mild     2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers

      

Please note that the original version frequencies

is much slower: "Elapsed time: 525.264668 msecs"

+1


source







All Articles