str.split behavior for short strings when maxsplit is passed

I recently ran into some intriguing behavior of the str.split method in Python 2.7, specifically for short strings (less than about 25 characters, see below). It comes down to the contrast in performance between:

# Without maxsplit
my_string.split('A')


and

# With maxsplit=1
my_string.split('A', 1)


The second method is slower for short strings and I am very curious as to why.
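
For reference, here is a quick interactive check of what the two calls return; with maxsplit=1 the split stops after the first occurrence of the separator:

>>> 'aAbAc'.split('A')
['a', 'b', 'c']
>>> 'aAbAc'.split('A', 1)
['a', 'bAc']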

Test

This came from a little timeit call that a colleague of mine ran:

# Without maxsplit
$ python -m timeit -s "json_line='a|b|c'" "part_one='|'.split(json_line)[0]"
1000000 loops, best of 3: 0.274 usec per loop
# With maxsplit
$ python -m timeit -s "json_line='a|b|c'" "part_one='|'.split(json_line,1)[0]"
1000000 loops, best of 3: 0.461 usec per loop


I thought this was curious, so I put together a more detailed test. First, I wrote the following small function that generates random strings of a given length, consisting of the first ten capital letters:

from random import choice

# 'A' through 'J'
choices = map(chr, range(65, 75))

def make_random_string(length):
    return ''.join(choice(choices) for i in xrange(length))
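
A quick sanity check of the generator (the value shown is just one possible output, since the result is random):

>>> make_random_string(8)
'CAJBDFHA'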


Then I wrote a pair of test functions that repeatedly time splits of randomly generated strings of a given length:

from timeit import timeit

def time_split_of_size(str_length, n_strs_to_split):
    times = []
    data = [make_random_string(str_length) for i in xrange(n_strs_to_split)]
    for s in data:
        t = timeit("'{s}'.split('A')".format(s=s),
                   setup="from __main__ import make_random_string",
                   number=1000)
        times.append(t)
    return times

def time_split_of_size_with_maxcount(str_length, n_strs_to_split):
    times = []
    data = [make_random_string(str_length) for i in xrange(n_strs_to_split)]
    for s in data:
        t = timeit("'{s}'.split('A', 1)".format(s=s),
                   setup="from __main__ import make_random_string",
                   number=1000)
        times.append(t)
    return times
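
(The two testers differ only in the maxsplit argument, so they could be collapsed into one helper, as in the sketch below; the maxsplit=None convention is mine. Note also that the setup import above is actually unnecessary, since the timed statement is a self-contained literal.)

def time_split_of_size_generic(str_length, n_strs_to_split, maxsplit=None):
    # maxsplit=None means: call split() without a maxsplit argument
    times = []
    data = [make_random_string(str_length) for i in xrange(n_strs_to_split)]
    for s in data:
        if maxsplit is None:
            stmt = "'{s}'.split('A')".format(s=s)
        else:
            stmt = "'{s}'.split('A', {m})".format(s=s, m=maxsplit)
        times.append(timeit(stmt, number=1000))
    return times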


Then I ran these test functions across strings of different sizes:

from collections import OrderedDict
from numpy import mean  # 'mean' was used but never imported in the original; numpy's is assumed

d = OrderedDict()
for str_length in xrange(10, 10*1000, 25):
    no_maxcount = mean(time_split_of_size(str_length, 20))
    with_maxcount = mean(time_split_of_size_with_maxcount(str_length, 20))
    d[str_length] = [no_maxcount, with_maxcount]
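
The plots below came out of this dict; the original doesn't show the plotting code, but a matplotlib sketch along these lines reproduces them (matplotlib and the labels are assumptions on my part):

import matplotlib.pyplot as plt

lengths = d.keys()
no_maxcount_times = [v[0] for v in d.values()]
with_maxcount_times = [v[1] for v in d.values()]

plt.plot(lengths, no_maxcount_times, 'b-', label='no maxsplit')
plt.plot(lengths, with_maxcount_times, 'g-', label='maxsplit=1')
plt.xlabel('string length')
plt.ylabel('mean time for 1000 splits (seconds)')
plt.legend()
plt.show()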


This gives the behavior you would expect: roughly O(1) for the call with maxsplit=1 and O(n) for the full split. Here is a plot of the timings against string length; the barely visible green curve is the maxsplit=1 case and the blue curve is without maxsplit:

[figure: StringSplitTiming - split timings vs. string length]

However, the behavior my colleague discovered for short strings is real. Here's some code that splits many short strings:

from collections import OrderedDict
d = OrderedDict()
for str_length in xrange(1, 50, 2):
    no_maxcount = mean(time_split_of_size(str_length, 500))
    with_maxcount = mean(time_split_of_size_with_maxcount(str_length, 500))
    d[str_length] = [no_maxcount, with_maxcount]


With the following results:

[figure: StringSplitShortString - split timings for short strings]

It looks like there is some overhead for strings shorter than roughly 25 characters. The shape of the green curve is also curious: it rises roughly in parallel with the blue one before leveling off to a constant.

I looked at the source code, which you can find here:

stringobject.c (line 1449) and stringlib/split.h (line 105)

but nothing obvious jumped out at me.

Any idea what is causing the overhead when maxsplit is passed for short strings?



2 answers


The difference actually has nothing to do with what's going on inside string_split. In fact, the time spent inside that function is always slightly longer for the default split than for maxsplit=1, even when there are no splits to do. And it's not a difference in PyArg_ParseTuple (the best measurement I can get without instrumenting the interpreter says it takes 0ns either way, so whatever the difference is, it doesn't matter).

The difference is that the extra parameter requires an extra bytecode.

As Stefan Pochmann pointed out, you can verify this by explicitly passing maxsplit=-1:

In [212]: %timeit ''.split('|')
1000000 loops, best of 3: 267 ns per loop
In [213]: %timeit ''.split('|', 1)
1000000 loops, best of 3: 303 ns per loop
In [214]: %timeit ''.split('|', -1)
1000000 loops, best of 3: 307 ns per loop


So, even in this minimal example, -1 is slightly slower than 1. But we're talking about 4ns of extra work. (I suspect the 4ns comes from preallocating a list of size 12 instead of size 2, but I don't want to run it through a profiler to be sure.)
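
For context, here is a rough Python paraphrase of that preallocation logic from stringlib/split.h (MAX_PREALLOC is 12 in the CPython 2.7 source as I read it; treat this as a sketch, not the actual implementation):

MAX_PREALLOC = 12

def prealloc_size(maxsplit):
    # The default split runs with an effectively unbounded maxsplit, so the
    # result list is preallocated with MAX_PREALLOC = 12 slots; maxsplit=1
    # preallocates only maxsplit + 1 = 2 slots.
    if maxsplit < 0 or maxsplit + 1 > MAX_PREALLOC:
        return MAX_PREALLOC
    return maxsplit + 1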

Meanwhile, a NOP bytecode takes about 32ns to evaluate on my system (from another answer that I'm still trying to find...). I can't imagine that LOAD_CONST would be faster than NOP.

So, until you're doing enough work to swamp that 32ns+, not passing the maxsplit argument will save you time.



If it's not obvious, here is the disassembly for the two cases:

  1           0 LOAD_CONST               0 ('')
              3 LOAD_ATTR                0 (split)
              6 LOAD_CONST               1 ('|')
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 RETURN_VALUE

  1           0 LOAD_CONST               0 ('')
              3 LOAD_ATTR                0 (split)
              6 LOAD_CONST               1 ('|')
              9 LOAD_CONST               3 (-1)
             12 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
             15 RETURN_VALUE
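
That disassembly can be reproduced with the dis module:

import dis

dis.dis(compile("''.split('|')", '<test>', 'eval'))
dis.dis(compile("''.split('|', -1)", '<test>', 'eval'))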



The same cost shows up in other examples:

In [217]: %timeit int()
10000000 loops, best of 3: 94 ns per loop
In [218]: %timeit int(0)
10000000 loops, best of 3: 134 ns per loop
In [235]: %timeit [0].pop()
1000000 loops, best of 3: 229 ns per loop
In [236]: %timeit [0].pop(0)
1000000 loops, best of 3: 270 ns per loop
      

So, the LOAD_CONST takes about 40ns in both of those cases, just like passing -1 rather than no argument to split.

Python 3.4 is a little trickier to test because it caches some things that 2.7 doesn't, but it looks like passing an extra argument costs around 33ns, or 533ns if it's a keyword argument. So, if you need to split tiny strings a billion times in Python 3, use s.split('|', 10) rather than s.split('|', maxsplit=10).
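
If you want to check the Python 3 comparison yourself, the same kind of timeit one-liners work (the exact numbers will of course vary by machine):

$ python3 -m timeit "'a|b|c'.split('|', 1)"
$ python3 -m timeit "'a|b|c'.split('|', maxsplit=1)"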



Corrected initial test (the original test had json_line and '|' swapped):

$ python -m timeit -s "json_line='a|b|c'" "part_one=json_line.split('|')[0]"
1000000 loops, best of 3: 0.239 usec per loop
$ python -m timeit -s "json_line='a|b|c'" "part_one=json_line.split('|',1)[0]"
1000000 loops, best of 3: 0.267 usec per loop
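
For the record, the original statement had the receiver and the separator swapped, so it was timing a split of the one-character string '|' with 'a|b|c' as the separator, which never matches:

>>> '|'.split('a|b|c')
['|']
>>> '|'.split('a|b|c', 1)
['|']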




With the corrected test, the time difference is smaller.
