Optimized way to count the number of rows with conditions

Question

I've seen that a quick way to count the number of lines in a file is as follows:

n_lines=sum(1 for line in open(myfile))

I would like to know if it is possible to put some conditions in the sum function to have something like this:

n_lines=sum(1 for line in open(PATHDIFF) if line=='\n' break if line.startswith('#') continue)

Thanks in advance.

+3

python

SOCKet May 27 '15 at 13:41

6 answers

There is actually a quick way (borrowed from Funcy) to compute the length of an iterator without storing its items in a list:

Example:

from collections import deque
from itertools import count, izip  # Python 2; on Python 3 use the built-in zip instead of izip


def ilen(seq):
    counter = count()
    deque(izip(seq, counter), maxlen=0)  # (consume at C speed)
    return next(counter)


def lines(filename):
    with open(filename, 'r') as f:
        return ilen(
            None for line in f
            if line != "\n" and not line.startswith("#")
        )


nlines = lines("file.txt")
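For readers on Python 3, where izip no longer exists, the same idea can be sketched with the built-in zip. This is only a minimal sketch: the helper name count_lines and the in-memory sample list are made up for illustration, and a real file object could be passed in place of the list.

```python
from collections import deque
from itertools import count


def ilen(iterable):
    """Count the items in an iterable without building a list."""
    counter = count()
    # zip pairs each item with the next counter value; a zero-length deque
    # discards the pairs as fast as they are produced.
    deque(zip(iterable, counter), maxlen=0)
    # The counter was advanced once per item, so its next value is the count.
    return next(counter)


def count_lines(lines):
    """Count lines that are neither a bare newline nor a '#' comment.

    Hypothetical helper: `lines` can be any iterable of strings,
    for example an open file object.
    """
    return ilen(
        None for line in lines
        if line != "\n" and not line.startswith("#")
    )


# Made-up sample data standing in for the contents of a file.
sample = ["# header\n", "a\n", "\n", "b\n", "# note\n", "c\n"]
print(count_lines(sample))
```

Note that ilen does consume the iterator; what it avoids is keeping the items in memory.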

+2

James mills May 27 '15 at 13:44

You cannot use break or continue in a generator expression or list comprehension, so the correct way to write your example is as follows:

nlines = 0
with  open(PATHDIFF) as f:
    for line in f:
        if line=='\n':
            # not sure that _really_ what you want
            # => this will exit the loop at the first 'empty' line
            break 
        if line.startswith('#'):
            continue
        nlines += 1

Now, if you really do want to stop at the first blank line and want to make it a one-liner, you can also use itertools.takewhile():

from itertools import takewhile
with open(XXX) as f: 
    nlines = sum(1 for line in takewhile(lambda l: l != '\n', f) 
                 if not line.startswith("#"))

+2

bruno desthuilliers May 27 '15 at 13:53

from itertools import ifilter, takewhile  # Python 2; on Python 3 use the built-in filter instead of ifilter
with open("test.txt") as f:
     fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda line: not line.startswith("#"), f)))
     print(fil)

Or maybe indexing will be faster than calling startswith:

 fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda x: x[0] != "#", f)))

Using str.strip as the takewhile predicate also catches lines that contain only whitespace, not just a bare '\n'.
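To make the difference concrete, here is a small sketch comparing the two predicates on made-up data. It uses the Python 3 built-in filter in place of ifilter, and the list of lines is invented for illustration.

```python
from itertools import takewhile

# Invented sample: the fourth "line" holds only spaces before the newline.
lines = ["a\n", "# comment\n", "b\n", "   \n", "c\n"]

# Comparing against "\n" does not treat the whitespace-only line as blank,
# so iteration runs to the end of the data.
strict = sum(1 for line in takewhile(lambda l: l != "\n", lines)
             if not line.startswith("#"))

# str.strip returns "" (falsy) for a whitespace-only line,
# so takewhile stops as soon as it sees one.
lenient = sum(1 for _ in takewhile(
    str.strip, filter(lambda l: not l.startswith("#"), lines)))

print(strict, lenient)
```

With the strict test the whitespace-only line is counted as a normal line; with str.strip it ends the count.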

Indexing looks a little faster:

In [11]: from itertools import ifilter,takewhile

In [12]: %%timeit
   ....: with open("test.txt") as f:
   ....:      fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda x: x[0] != "#", f)))
   ....: 

1000 loops, best of 3: 400 µs per loop

In [13]: %%timeit
   ....: with open("test.txt") as f:
   ....:      fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda line: not line.startswith("#"), f)))
   ....: 

1000 loops, best of 3: 531 µs per loop

+2

Padraic cunningham May 27 '15 at 14:02

If you want speed and don't mind using bash:

grep -v '^#' yourfile | wc -l

This counts all lines that don't start with # and will be faster than Python.

+1

firelynx May 27 '15 at 13:45

Do you need the number of comment lines or the number of non-comment lines? If so, something like this should work.

comment_lines = sum([1 for line in open(PATHDIFF) if line.startswith('#')])
non_comment_lines = sum([1 for line in open(PATHDIFF) if not line.startswith('#')])
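If both numbers are needed, the file does not have to be read twice. The following is a minimal single-pass sketch; the function name and sample data are made up, and any iterable of lines (such as an open file) could be passed in.

```python
def count_comment_lines(lines):
    """Return (comment_count, non_comment_count) in a single pass.

    Hypothetical helper: `lines` is any iterable of strings.
    """
    comment = other = 0
    for line in lines:
        if line.startswith("#"):
            comment += 1
        else:
            # Blank lines land here too, matching the two one-liners above.
            other += 1
    return comment, other


# Invented sample data standing in for a file.
sample = ["# header\n", "a\n", "\n", "# note\n", "b\n"]
comment_lines, non_comment_lines = count_comment_lines(sample)
print(comment_lines, non_comment_lines)
```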

0

Songy May 27 '15 at 13:49

chepner · Accepted Answer · 2015-05-27T13:50:10+0000

You can, with certain restrictions. You are passing a generator expression as the argument to sum, and a generator expression can take a single expression with an if clause. You can combine your conditions like this:

n_lines=sum(1 for line in open(PATHDIFF)
                if line != '\n' and not line.startswith('#'))

However, this does not short-circuit the iteration over your file when a blank line is hit; it continues to read the file to the end. To avoid this, you can use itertools.takewhile, which will only read from the file iterator until it reaches a blank line.

from itertools import takewhile
n_lines = sum(1 for line in takewhile(lambda x: x != '\n',
                                      open(PATHDIFF))
                   if not line.startswith('#'))

You can also use itertools.ifilterfalse (named filterfalse in Python 3) to fill the same role as the generator expression's if clause.

from itertools import takewhile, ifilterfalse
n_lines = sum(1 for line in ifilterfalse(lambda x: x.startswith('#'),
                                         takewhile(lambda x: x != '\n',
                                                   open(PATHDIFF))))

Of course, now your code starts to look like you are writing in Scheme or Lisp. The generator expression is a little easier to read, but the itertools module is useful for creating modified iterators that you can pass around as distinct objects.

On another note, you should always be sure to close any files you open, which means not using anonymous file objects in your iterators. The cleanest way to do this is to use a with statement:

with open(PATHDIFF) as f:
    n_lines = sum(1 for line in f if line != '\n' and not line.startswith('#'))

Other examples can be modified in a similar way; just replace open(PATHDIFF) with f wherever it occurs.
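As an illustration, the takewhile variant could be wrapped like this. It is a sketch only: the function name count_until_blank and the temporary demo file are made up for the example.

```python
import os
import tempfile
from itertools import takewhile


def count_until_blank(path):
    """Count non-comment lines up to (not including) the first blank line.

    Hypothetical helper; the with statement guarantees the file is closed.
    """
    with open(path) as f:
        return sum(1 for line in takewhile(lambda l: l != "\n", f)
                   if not line.startswith("#"))


# Demo on a throwaway file with invented contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# header\na\nb\n\nc\n")
try:
    print(count_until_blank(tmp.name))
finally:
    os.remove(tmp.name)
```

The sum is fully evaluated inside the with block, so the file is still open while the generator expression reads from it.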
