Is displaying one function slower than twice, displaying two separate functions?
The following example seems to imply a runtime optimization which I don't understand
Can anyone explain this behavior and how it might apply to more general cases?
Example
Consider the following simple (example) functions
def y(x): # str output
y = 1 if x else 0
return str(y)
def _y(x): # no str
y = 1 if x else 0
return y
Suppose I want to apply a function y
to all items in a list
l = range(1000) # test input data
Result
OperationsA map
will have to go through all the elements in the list. It seems counter intuitive that the decomposition of a function into double is map
vastly superior to a single map
function
%timeit map(str, map(_y, l))
1000 loops, best of 3: 206 µs per loop
%timeit map(y, l)
1000 loops, best of 3: 241 µs per loop
In general, this also applies to non-standard library nested functions such as
def f(x):
return _y(_y(x))
%timeit map(_y, map(_y, l))
1000 loops, best of 3: 235 µs per loop
%timeit map(f, l)
1000 loops, best of 3: 294 µs per loop
Is this a python service issue where it map
compiles low-level python code where possible, and hence throttles when it has to interpret a nested function?
source to share
The difference is that it is map()
implemented in C code, and calling other C-implemented functions is cheap, and calling Python code is expensive. Also, making other calls from Python code is expensive too:
>>> timeit.timeit('f(1)', 'def f(x): return str(x)')
0.21682000160217285
>>> timeit.timeit('str(1)')
0.140916109085083
and third, you are passing objects to the function in map()
(so no further lookups are done), but y()
must look up the name str
every time. Global searches are relatively expensive compared to local searches; binding the global to a function argument to make it local might help a little with this:
>>> timeit.timeit('f(1)', 'def f(x, _str=str): return _str(x)')
0.19425392150878906
Much closer to the version str(1)
, although this also had to use the global; it can still interrupt the hands-down call if you also gave the time test to the locals:
>>> timeit.timeit('_str(1)', '_str = str')
0.10266494750976562
Thus, to execute Python bytecode, an additional object, a stack frame, must be created for each call. This picture frame object must be managed on a dedicated Python call stack when other code is called. Also, your function y
looks up the name str
as global every time , while the call map(str, ...)
keeps a single reference to that object and uses it over and over again.
By moving the call str()
out of the function y
and allowing it map()
to be called str()
directly through a single reference, you've removed stack handling and global lookups and made things a little faster.
As a diagram, it map(y, l)
runs for each input value:
- Create a stack frame for
y
, execute body- find
str
as global- push a
y
stack of frames onto a stack - execute
str(...)
- pop stack frame from stack
- push a
- return result
- find
while map(str, map(_y, l))
doing
- Create stack for
_y
- return result
- execute
str(...)
The same goes for setting up a function f()
:
>>> def f(x):
... return _y(_y(x))
...
>>> timeit.timeit("map(_y, map(_y, l))", 'from __main__ import _y, testdata as l', number=10000)
2.691640853881836
>>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
3.104063034057617
Calling map()
on _y
is twice as fast as inserting a call _y(_y(x))
into another function, which then has to perform global searches and underline the Python stack again; in your example, f()
each iteration map()
should create 3 stack frames and push and push them onto the stack and turn off the stack, while in the setup map(_y, map(_y, ...))
, only 2 frames are created for each iteration element:
- Create a stack frame for
f
, execute body- find
_y
as global- push a
f
stack of frames onto a stack - Create a stack frame for
_y
, execute body - pop stack frame from stack
- push a
- find
_y
as global (yes, again)- push a
f
stack of frames onto a stack - Create a stack frame for
_y
, execute body - pop stack frame from stack
- push a
- return result
- find
against
- Create a stack frame for
_y
, execute body- return result
- Create a stack frame for
_y
, execute body- return result
Again, using locals can change the difference a bit:
>>> def f(x, _y=_y):
... return _y(_y(x))
...
>>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
2.981696128845215
but this extra Python frame object is still picking a single call map(f, ...)
.
TL; DR : Your function y()
gets O (N) extra global name lookups and O (N) extra stackframe objects to stack on and off the Python stack, compared to the double map()
version.
If speed matches this match, try to avoid creating Python stack frames and globals in a tight loop.
source to share