Why is the ThreeSum "map" so slow?
I expected this Python implementation for ThreeSum to be slow:
def count(a):
"""ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
N = len(a)
cnt = 0
for i in range(N):
for j in range(i+1, N):
for k in range(j+1, N):
if sum([a[i], a[j], a[k]]) == 0:
cnt += 1
return cnt
But I was shocked that this version looks rather slow:
def count_python(a):
"""ThreeSum using itertools"""
return sum(map(lambda X: sum(X)==0, itertools.combinations(a, r=3)))
Can you recommend a faster Python implementation? Both implementations seem so slow ... Thanks
...
ANSWER TO ANSWER: This is how the versions of all the different versions provided in this thread are O (N ^ 3) versions worked out on my machine (for educational purposes not used in real life):
56 sec RUNNING count_slow ...
28 seconds RUNNING count_itertools, written by Ashwini Chaudhary ...
14 sec RUNNING count_fixed, written by roippi ...
11 seconds RUNNING count_itertools (faster), written by Veedrak ...
08 sec RUNNING count_enumerate, written by roippi .. ...
* Note: Veedrak's solution needs to be changed to get the correct counter output:
sum (1 for x, y, z in itertools.combinations (a, r = 3) if x + y == - z)
source to share
Delivery of the second answer. From various comments it seems that you are primarily concerned about why this particular O (n ** 3) algorithm is slow when porting from java. Let the dive in.
def count(a):
"""ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
N = len(a)
cnt = 0
for i in range(N):
for j in range(i+1, N):
for k in range(j+1, N):
if sum([a[i], a[j], a[k]]) == 0:
cnt += 1
return cnt
One problem with main that immediately pops up is that you are doing something that your Java code almost certainly doesn't: materializing a 3-item list just lets you add three numbers!
if sum([a[i], a[j], a[k]]) == 0:
Ugh! Just write that
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you are adding 50% + overhead just by doing it. Clap.
Another problem is that you are using indexing, in which you have to use iteration. In python, try to avoid writing code like this:
for i in range(len(some_list)):
do_something(some_list[i])
And instead, just write:
for x in some_list:
do_something(x)
And if you explicitly need the index you are at (as you do in your code) use enumerate
:
for i,x in enumerate(some_list):
#etc
This is, in general, a stylish thing (although it goes deeper than with duck typing and iterator protocol), but it is also a matter of performance. To see the value a[i]
, this call is converted to a.__getitem__(i)
, then python has to dynamically resolve the method lookup __getitem__
, call it and return the value. Everytime. It's not a crazy amount of overhead - at least on built-in types - but it adds up if you do it a lot in a loop. On the other hand, referring to it a
as iterable wraps around a lot of this overhead.
So, with this change in mind, you can rewrite your function again:
def count_enumerate(a):
cnt = 0
for i, x in enumerate(a):
for j, y in enumerate(a[i+1:], i+1):
for z in a[j+1:]:
if x + y + z == 0:
cnt += 1
return cnt
Let's take a look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And it's about as fast as it gets. You can shave off a percentage or so by wrapping everything in understanding instead of doing cnt += 1
, but this is pretty minor.
I've played itertools
around with several implementations , but I really can't get them to run faster than this explicit version of the loop. This makes sense if you think about it - for each iteration, the version has itertools.combinations
to repeat which one all three variables belong to, whereas explicit loops become "spoofing" and relaying variables in outer loops much less frequently.
Reality check time though: after all is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM. There is too much abstraction built into python that gets in the way quickly. If you want speed (and you can't fix your algorithm - see my other answer), either use something like numpy
to spend all your time in C or use a different python implementation.
postscript: pypy
For fun, I ran count_fixed
into a list of 1000 items, both in cPython and pypy.
CPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
PyPy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I can add some java testing later to compare :-)
source to share
Algorithmically, both versions of your function are O (n ** 3) - so asymptotically neither is superior. You will find that the version of itertools is somewhat faster in practice, as it spends more time in C rather than python bytecode. You can get it a few more percentage points by removing it entirely map
(especially if you're using py2), but it will still be "slow" compared to what you got from running in the JVM.
Note that there are many python implementations other than cPython - for loop code, pypy
it tends to be much faster than cPython. So I won't write python-as-language as slow, necessarily, but of course I would say the python reference implementation is not known for its blazing loop speed. Give other python flavors a shot if you need something.
According to your algorithm, optimization will get you down to O (n ** 2). Create a set of integers s
and create all pairs (a,b)
. You know that you can "zero" (a+b)
if and only if -(a+b) in (s - {a,b})
.
Thanks to @Veedrak: unfortunately construction s - {a,b}
is a slow O (len (s)) operation, so just check if -(a+b)
either is equal to a
or b
. If so, you know that there is no third c
one that can fulfill a+b+c == 0
, since all the numbers in your input are different.
def count_python_faster(a):
s = frozenset(a)
return sum(1 for x,y in itertools.combinations(a,2)
if -(x+y) not in (x,y) and -(x+y) in s) // 3
Notice the division by three at the end; this is because every successful combination has a triple count. It can be avoided, but it doesn't really speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
source to share
You didn't mention which version of Python you are using.
In Python 3.x, a generator expression is about 10% faster than either of the two implementations you have implemented. Using a random array of 100 numbers in the range [-100,100] for a
:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
return sum((sum(x) == 0 for x in itertools.combinations(a,r=3)))
count_generator(a) -> 7.63 ms
But other than that, it's the number of combination shifts that dominate the runtime - O (N ^ 3).
I have to add the times shown above for loops of 10 calls each, averaged over 10 loops. And yes, my laptop is slow too :)
source to share