Is displaying one function slower than twice, displaying two separate functions?

The following example seems to imply a runtime optimization which I don't understand

Can anyone explain this behavior and how it might apply to more general cases?

Example

Consider the following simple (example) functions

def y(x): # str output
    y = 1 if x else 0
    return str(y)

def _y(x): # no str
    y = 1 if x else 0
    return y

      

Suppose I want to apply a function y

to all items in a list

l = range(1000) # test input data

      

Result

Operations

A map

will have to go through all the elements in the list. It seems counter intuitive that the decomposition of a function into double is map

vastly superior to a single map

function

%timeit map(str, map(_y, l))
1000 loops, best of 3: 206 µs per loop

%timeit map(y, l)
1000 loops, best of 3: 241 µs per loop

      

In general, this also applies to non-standard library nested functions such as

def f(x):
    return _y(_y(x))

%timeit map(_y, map(_y, l))
1000 loops, best of 3: 235 µs per loop
%timeit map(f, l)
1000 loops, best of 3: 294 µs per loop

      

Is this a python service issue where it map

compiles low-level python code where possible, and hence throttles when it has to interpret a nested function?

+3


source to share


1 answer


The difference is that it is map()

implemented in C code, and calling other C-implemented functions is cheap, and calling Python code is expensive. Also, making other calls from Python code is expensive too:

>>> timeit.timeit('f(1)', 'def f(x): return str(x)')
0.21682000160217285
>>> timeit.timeit('str(1)')
0.140916109085083

      

and third, you are passing objects to the function in map()

(so no further lookups are done), but y()

must look up the name str

every time. Global searches are relatively expensive compared to local searches; binding the global to a function argument to make it local might help a little with this:

>>> timeit.timeit('f(1)', 'def f(x, _str=str): return _str(x)')
0.19425392150878906

      

Much closer to the version str(1)

, although this also had to use the global; it can still interrupt the hands-down call if you also gave the time test to the locals:

>>> timeit.timeit('_str(1)', '_str = str')
0.10266494750976562

      

Thus, to execute Python bytecode, an additional object, a stack frame, must be created for each call. This picture frame object must be managed on a dedicated Python call stack when other code is called. Also, your function y

looks up the name str

as global every time , while the call map(str, ...)

keeps a single reference to that object and uses it over and over again.

By moving the call str()

out of the function y

and allowing it map()

to be called str()

directly through a single reference, you've removed stack handling and global lookups and made things a little faster.

As a diagram, it map(y, l)

runs for each input value:

  • Create a stack frame for y

    , execute body
    • find str

      as global
      • push a y

        stack of frames onto a stack
      • execute str(...)

      • pop stack frame from stack
    • return result

while map(str, map(_y, l))

doing

  • Create stack for _y

    • return result
  • execute str(...)



The same goes for setting up a function f()

:

>>> def f(x):
...     return _y(_y(x))
...
>>> timeit.timeit("map(_y, map(_y, l))", 'from __main__ import _y, testdata as l', number=10000)
2.691640853881836
>>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
3.104063034057617

      

Calling map()

on _y

is twice as fast as inserting a call _y(_y(x))

into another function, which then has to perform global searches and underline the Python stack again; in your example, f()

each iteration map()

should create 3 stack frames and push and push them onto the stack and turn off the stack, while in the setup map(_y, map(_y, ...))

, only 2 frames are created for each iteration element:

  • Create a stack frame for f

    , execute body
    • find _y

      as global
      • push a f

        stack of frames onto a stack
      • Create a stack frame for _y

        , execute body
      • pop stack frame from stack
    • find _y

      as global (yes, again)
      • push a f

        stack of frames onto a stack
      • Create a stack frame for _y

        , execute body
      • pop stack frame from stack
    • return result

against

  • Create a stack frame for _y

    , execute body
    • return result
  • Create a stack frame for _y

    , execute body
    • return result

Again, using locals can change the difference a bit:

>>> def f(x, _y=_y):
...     return _y(_y(x))
...
>>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
2.981696128845215

      

but this extra Python frame object is still picking a single call map(f, ...)

.


TL; DR : Your function y()

gets O (N) extra global name lookups and O (N) extra stackframe objects to stack on and off the Python stack, compared to the double map()

version.

If speed matches this match, try to avoid creating Python stack frames and globals in a tight loop.

+3


source







All Articles