Optimized way to count the number of rows with conditions

I've seen that a quick way to count the number of lines in a file is as follows:

n_lines=sum(1 for line in open(myfile))

      

I would like to know if it is possible to put some conditions in the sum function to have something like this:

n_lines=sum(1 for line in open(PATHDIFF) if line=='\n' break if line.startswith('#') continue)

      

Thanks in advance.

+3




6 answers


You can, with certain restrictions. You pass a generator expression as the argument to sum, and a generator expression can take a single filtering expression in its if clause. You can combine your conditions like this:

n_lines=sum(1 for line in open(PATHDIFF)
                if line != '\n' and not line.startswith('#'))

      

However, this does not short-circuit the iteration over your file when a blank line ('\n') is encountered; it continues to read the file to the end. To avoid this, you can use itertools.takewhile, which will only read from the file iterator until it reaches a blank line.

from itertools import takewhile
n_lines = sum(1 for line in takewhile(lambda x: x != '\n',
                                      open(PATHDIFF))
                   if not line.startswith('#'))

      

You can also use itertools.ifilterfalse (named itertools.filterfalse in Python 3) to fill the same role as the generator expression's if clause.

from itertools import takewhile, ifilterfalse
n_lines = sum(1 for line in ifilterfalse(lambda x: x.startswith('#'),
                                         takewhile(lambda x: x != '\n',
                                                   open(PATHDIFF))))

      



Of course, now your code starts to look like you are writing in Scheme or Lisp. The generator expression is a little easier to read, but the itertools module is useful for building modified iterators that you can pass around as distinct objects.
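
As a minimal sketch of that idea, the two itertools steps can be wrapped in a small helper that returns an iterator, which the caller can then count or loop over. The helper name significant_lines and the sample data are made up for illustration, and the code assumes Python 3 (filterfalse rather than ifilterfalse).

```python
from itertools import filterfalse, takewhile


def significant_lines(lines):
    """Return an iterator over non-comment lines before the first blank line.

    `lines` is any iterable of strings, e.g. an open file object.
    """
    # Stop at the first line that is exactly a newline.
    head = takewhile(lambda line: line != '\n', lines)
    # Drop lines that start with '#'.
    return filterfalse(lambda line: line.startswith('#'), head)


# Hypothetical sample data standing in for the lines of a file.
sample = ['# header\n', 'a\n', '# note\n', 'b\n', '\n', 'c\n']
n_lines = sum(1 for _ in significant_lines(sample))
print(n_lines)  # 2
```

Because the helper only builds the iterator, the same object could be passed to sum, a for loop, or any other consumer.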


On another note, you should always be sure to close every file you open, which means not using anonymous file objects in your iterators. The cleanest way to do this is to use the with statement:

with open(PATHDIFF) as f:
    n_lines = sum(1 for line in f if line != '\n' and not line.startswith('#'))

      

Other examples can be modified in a similar way; just replace open(PATHDIFF) with f wherever it occurs.
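
For instance, the takewhile version might look roughly like this once the file is opened in a with block. The function name count_until_blank and the temporary demo file are assumptions added for illustration; they are not part of the original answer.

```python
import os
import tempfile
from itertools import takewhile


def count_until_blank(path):
    """Count non-comment lines before the first blank line of `path`."""
    with open(path) as f:  # the file is closed when this block exits
        head = takewhile(lambda line: line != '\n', f)
        return sum(1 for line in head if not line.startswith('#'))


# Demo on a throwaway file (contents are made up for illustration).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('# comment\nfirst\nsecond\n\nafter blank\n')
demo_count = count_until_blank(tmp.name)
os.remove(tmp.name)
print(demo_count)  # 2
```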

+5




There is actually a fast way (borrowed from Funcy) to count the items of an iterator without storing them in memory; the iterator is still consumed, but at C speed:

Example:



from collections import deque
from itertools import count, izip  # Python 2; on Python 3, drop izip and use the built-in zip


def ilen(seq):
    counter = count()
    deque(izip(seq, counter), maxlen=0)  # (consume at C speed)
    return next(counter)


def lines(filename):
    with open(filename, 'r') as f:
        return ilen(
            None for line in f
            if line != "\n" and not line.startswith("#")
        )


nlines = lines("file.txt")
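
The snippet above relies on Python 2's izip. A rough Python 3 equivalent, assuming the built-in zip (which is lazy in Python 3), could look like the sketch below; the sample list is hypothetical and stands in for the lines of a file.

```python
from collections import deque
from itertools import count


def ilen(iterable):
    """Return the number of items in `iterable`, consuming it."""
    counter = count()
    # zip pulls from `iterable` first, so `counter` advances once per item;
    # a deque with maxlen=0 discards every pair as soon as it is produced.
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)


# Hypothetical sample data standing in for the lines of a file.
sample = ['# c\n', 'a\n', '\n', 'b\n']
n = ilen(None for line in sample if line != '\n' and not line.startswith('#'))
print(n)  # 2
```

The order of the arguments to zip matters: with the counter first, it would be advanced one extra time before zip noticed that the iterable was exhausted.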

      

+2




You cannot use break or continue in a list comprehension or generator expression, so the "right" syntax for your example is as follows:

nlines = 0
with open(PATHDIFF) as f:
    for line in f:
        if line == '\n':
            # not sure that's _really_ what you want
            # => this will exit the loop at the first 'empty' line
            break
        if line.startswith('#'):
            continue
        nlines += 1

      

Now, if you really do want to stop at the first "blank" line and want to make it a one-liner, you can also use itertools.takewhile():

from itertools import takewhile
with open(XXX) as f: 
    nlines = sum(1 for line in takewhile(lambda l: l != '\n', f) 
                 if not line.startswith("#"))

      

+2




from itertools import ifilter, takewhile  # Python 2; on Python 3 use the built-in filter
with open("test.txt") as f:
     fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda line: not line.startswith("#"), f)))
     print(fil)

      

Or maybe indexing will be faster than calling startswith:

 fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda x: x[0] != "#", f)))

      

Using str.strip as the predicate will catch any whitespace-only line, not just a bare '\n'.
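
To make the difference concrete, here is a small sketch comparing the two predicates on made-up data that contains a whitespace-only line before the first bare newline.

```python
from itertools import takewhile

# Hypothetical sample data; the third entry contains only spaces.
lines = ['a\n', 'b\n', '   \n', 'c\n', '\n', 'd\n']

# Stops only at a line that is exactly '\n'; the whitespace-only line is counted.
exact = sum(1 for _ in takewhile(lambda line: line != '\n', lines))

# str.strip returns '' (falsy) for any whitespace-only line, so it stops earlier.
stripped = sum(1 for _ in takewhile(str.strip, lines))

print(exact, stripped)  # 4 2
```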

Indexing looks a little faster:

In [11]: from itertools import ifilter,takewhile

In [12]: %%timeit
   ....: with open("test.txt") as f:
   ....:      fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda x: x[0] != "#", f)))
   ....: 

1000 loops, best of 3: 400 µs per loop

In [13]: %%timeit
   ....: with open("test.txt") as f:
   ....:      fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda line: not line.startswith("#"), f)))
   ....: 

1000 loops, best of 3: 531 µs per loop
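
Outside IPython, a comparison along these lines could be reproduced with the timeit module. This is only a sketch: it assumes Python 3 (built-in filter instead of ifilter) and uses synthetic in-memory data, since the original test.txt is not available, so the absolute numbers will differ from those above.

```python
import timeit
from itertools import takewhile

# Synthetic data (an assumption; every third line is a comment).
lines = ['# comment\n' if i % 3 == 0 else 'data\n' for i in range(10000)]


def by_index():
    kept = filter(lambda x: x[0] != '#', lines)
    return sum(1 for _ in takewhile(str.strip, kept))


def by_startswith():
    kept = filter(lambda x: not x.startswith('#'), lines)
    return sum(1 for _ in takewhile(str.strip, kept))


# Both variants must agree on the count before timing them.
assert by_index() == by_startswith()
print('index     :', timeit.timeit(by_index, number=100))
print('startswith:', timeit.timeit(by_startswith, number=100))
```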

      

+2




If you want speed and don't mind using bash

grep -v '^#' yourfile | wc -l

      

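This counts all lines that don't start with # and will likely be faster than Python. Note that, unlike the Python versions, it also counts blank lines and does not stop at the first one.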

+1




Do you need the number of comment lines or of non-comment lines? If so, something like this should work.

comment_lines = sum([1 for line in open(PATHDIFF) if line.startswith('#')])
non_comment_lines = sum([1 for line in open(PATHDIFF) if not line.startswith('#')])
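
If both counts are needed, one possible variation is to compute them in a single pass so the file is read only once. The function name count_comment_lines is hypothetical, and io.StringIO is used here only as a stand-in for an open file.

```python
import io


def count_comment_lines(lines):
    """Return (comment_lines, non_comment_lines) in a single pass."""
    comment = other = 0
    for line in lines:
        if line.startswith('#'):
            comment += 1
        else:
            other += 1
    return comment, other


# io.StringIO stands in for an open file here; with a real file you would
# write `with open(PATHDIFF) as f:` and pass `f` to the function.
fake_file = io.StringIO('# one\nx\n# two\ny\nz\n')
comment_lines, non_comment_lines = count_comment_lines(fake_file)
print(comment_lines, non_comment_lines)  # 2 3
```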

      

0

