Using pstats and cProfile in Python. How do I make the array run faster?

This is my first code optimization and I am delighted with it. Read some of the articles, but I still have some questions.

1) First, what is taking a long time in my code below? I think the array here is: array.append (len (set (line.split ()))). I read online that python lists are faster, but I don't see lists here. Does anyone know how this can be improved?

2) Are there any improvements that I am missing?

3) Also, it says online that the code is slowing down very slowly for a loop. Can we improve here? (I think writing code in C would be better, but: D)

4) Why do people suggest to always look at "ncalls" and "tottime"? For me, "percall" makes more sense. It tells you how fast your function or call is.

5) Here in class B's correct answer, he used lists. Is he? For me, I can still see an array and a FOR loop which is supposed to slow things down. Fastest way to create a numeric numeric array

Thank.

New cProfile result:

 618384 function calls in 9.966 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    19686    3.927    0.000    4.897    0.000 <ipython-input-120-d8351bb3dd17>:14(f)
    78744    3.797    0.000    3.797    0.000 {numpy.core.multiarray.array}
    19686    0.948    0.000    0.948    0.000 {range}
    19686    0.252    0.000    0.252    0.000 {method 'partition' of 'numpy.ndarray' objects}
    19686    0.134    0.000    0.930    0.000 function_base.py:2896(_median)
        1    0.126    0.126    9.965    9.965 <ipython-input-120-d8351bb3dd17>:22(<module>)
    19686    0.125    0.000    0.351    0.000 _methods.py:53(_mean)
    19686    0.120    0.000    0.120    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    19686    0.094    0.000    4.793    0.000 function_base.py:2747(_ureduce)
    19686    0.071    0.000    0.071    0.000 {method 'flatten' of 'numpy.ndarray' objects}
    19686    0.065    0.000    0.065    0.000 {method 'format' of 'str' objects}
    78744    0.055    0.000    3.852    0.000 numeric.py:464(asanyarray)

      

NEW CODE:

import numpy
import cProfile

pr = cProfile.Profile()
pr.enable()

#paths to files
read_path = '../tweet_input/tweets.txt'
write_path = "../tweet_output/ft2.txt"


def f(a):  
    for i in range(0, len(array)):
        if a <= array[i]:
            array.insert(i, a)
            break
    if 0 == len(array):
        array.append(a)

try:
    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        #for every line (tweet) in the file
        for line in inf:                                            ###Loop is bad. Builtin function is good
            #append amount of unique words to the array
            wordCount = len(set(line.split()))
            #print wordCount, array
            f(wordCount)
            #write current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)
except IOError as e:
    print 'Operation failed: %s' % e.strerror


###Service
pr.disable()
pr.print_stats(sort = 'time')

      


OLD cProfile result: 551211 function calls in 13.195 seconds Ordered by: internal time
     ncalls tottime percall cumtime percall filename: lineno (function) 78744 10.193 0.000 10.193 0.000 {numpy.core.multiarray.array}

OLD CODE:

    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        #for every line in the file
        for line in inf:                            
            #append amount of unique words to the array
            array.append(len(set(line.split())))
            #write current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)

      

+3


source to share


1 answer


Numpy uses a median search algorithm , which is O (n log n). You call numpy.meadian

once per line, so your algorithm ends up in O (n ^ 2 log n).

There are several ways to improve this. One is to preserve the sorting of the array (i.e. Insert each element at a position that maintains the sorted order). Each insert takes O (n) (an array insert is a linear time operation), and getting the median of a sorted array is O (1), so it ends up in O (n ^ 2).



For profiling, the main thing you want to look at is this tottime

, since it tells you how much time your program spent executing the function as a whole. As in your example, percall

sometimes not very useful, because sometimes if you have a slow function (high percall

) but only called it a few times (low numcalls

), then it tottime

ends up being insignificant compared to other functions.

+1


source







All Articles