Optimized way to count the number of rows with conditions
I've seen that a quick way to count the number of lines in a file is as follows:
n_lines=sum(1 for line in open(myfile))
I would like to know if it is possible to put some conditions in the sum function to have something like this:
n_lines=sum(1 for line in open(PATHDIFF) if line=='\n' break if line.startswith('#') continue)
Thanks in advance.
You can, with certain restrictions. You are passing a generator expression as the argument to sum, and a generator expression can take a single condition in its if clause. You can combine your conditions like this:
n_lines = sum(1 for line in open(PATHDIFF)
              if line != '\n' and not line.startswith('#'))
However, this does not short-circuit the iteration over your file when a blank line is encountered; it skips that line and keeps reading to the end of the file. To avoid this, you can use itertools.takewhile, which reads from the file iterator only until it reaches a blank line.
from itertools import takewhile
n_lines = sum(1 for line in takewhile(lambda x: x != '\n',
                                      open(PATHDIFF))
              if not line.startswith('#'))
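To make the difference between the two versions concrete, here is a minimal sketch that records how many lines each one pulls from its source. The helper counting_iter and the sample list are hypothetical stand-ins for the file, introduced only for illustration.

```python
from itertools import takewhile

def counting_iter(lines, seen):
    """Yield each line, recording every line pulled from the source."""
    for line in lines:
        seen.append(line)
        yield line

# Hypothetical sample data standing in for the lines of the file.
sample = ["a\n", "# comment\n", "b\n", "\n", "c\n", "d\n"]

# Filtering only: every line is read, and lines after the blank one still count.
seen_filter = []
n_filter = sum(1 for line in counting_iter(sample, seen_filter)
               if line != "\n" and not line.startswith("#"))

# takewhile: reading stops as soon as the first blank line is pulled.
seen_takewhile = []
n_takewhile = sum(1 for line in takewhile(lambda x: x != "\n",
                                          counting_iter(sample, seen_takewhile))
                  if not line.startswith("#"))

print(n_filter, len(seen_filter))        # 4 6
print(n_takewhile, len(seen_takewhile))  # 2 4
```

Note that the two versions do not just differ in speed: the first skips blank lines and keeps counting, while the second stops counting at the first blank line.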
You can also use itertools.ifilterfalse (Python 2; it is named itertools.filterfalse in Python 3) to fill the same role as the if clause of the generator expression.
from itertools import takewhile, ifilterfalse
n_lines = sum(1 for line in ifilterfalse(lambda x: x.startswith('#'),
                                         takewhile(lambda x: x != '\n',
                                                   open(PATHDIFF))))
Of course, now your code starts to look like you are writing in Scheme or Lisp. The generator expression is a little easier to read, but the itertools module is useful for building modified iterators that you can pass around as distinct objects.
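For readers on Python 3, the same pipeline might look like the sketch below, assuming filterfalse in place of ifilterfalse. The io.StringIO object is a hypothetical in-memory stand-in for the file opened from PATHDIFF.

```python
import io
from itertools import filterfalse, takewhile

# Hypothetical in-memory stand-in for the file; iterating it yields lines.
f = io.StringIO("a\n# comment\nb\n\nc\n")

# Stop at the first blank line, then drop the comment lines.
n_lines = sum(1 for _ in filterfalse(lambda x: x.startswith("#"),
                                     takewhile(lambda x: x != "\n", f)))
print(n_lines)  # 2
```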
On another note, you should always be sure to close any file you open, which means you should not use anonymous file objects in your iterators. The cleanest way to do this is to use the with statement:
with open(PATHDIFF) as f:
    n_lines = sum(1 for line in f if line != '\n' and not line.startswith('#'))
The other examples can be modified in a similar way; just replace open(PATHDIFF) with f wherever it occurs.
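As a sketch of that substitution applied to the takewhile version, the function below keeps the file inside a with block. The function name count_lines_until_blank and the temporary sample file are assumptions made for this illustration.

```python
import os
import tempfile
from itertools import takewhile

def count_lines_until_blank(path):
    """Count non-comment lines that come before the first blank line."""
    with open(path) as f:
        # The sum is computed inside the with block, while f is still open.
        return sum(1 for line in takewhile(lambda x: x != "\n", f)
                   if not line.startswith("#"))

# Hypothetical sample file, created only for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\n# comment\nb\n\nc\n")
    sample_path = tmp.name

n_lines = count_lines_until_blank(sample_path)
os.remove(sample_path)
print(n_lines)  # 2
```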
There is actually a fast way (borrowed from funcy) to count the items of an iterator without storing them.
Example:
from collections import deque
from itertools import count, izip  # Python 2; on Python 3 use the built-in zip instead of izip

def ilen(seq):
    counter = count()
    deque(izip(seq, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

def lines(filename):
    with open(filename, 'r') as f:
        # The generator is fully consumed by ilen before the file is closed.
        return ilen(
            None for line in f
            if line != "\n" and not line.startswith("#")
        )

nlines = lines("file.txt")
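A Python 3 variant of ilen might look like the following sketch, assuming the built-in zip replaces izip. The sample list is hypothetical and stands in for the lines of a file.

```python
from collections import deque
from itertools import count

def ilen(iterable):
    """Count the items of an iterable without storing them."""
    counter = count()
    # zip pulls one item from `iterable`, then one from `counter`;
    # a deque with maxlen=0 discards each pair immediately.
    deque(zip(iterable, counter), maxlen=0)
    # `counter` advanced once per item, so its next value is the item count.
    return next(counter)

# Hypothetical sample data standing in for the lines of a file.
sample = ["a\n", "# comment\n", "\n", "b\n"]
n = ilen(line for line in sample
         if line != "\n" and not line.startswith("#"))
print(n)         # 2
print(ilen([]))  # 0
```

The argument order matters here: because the iterable comes first in zip, the counter is not advanced once the iterable is exhausted.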
You cannot use break or continue in a generator expression or a list comprehension, so the "right" syntax for your example is as follows:
nlines = 0
with open(PATHDIFF) as f:
    for line in f:
        if line == '\n':
            # not sure that's _really_ what you want
            # => this will exit the loop at the first 'empty' line
            break
        if line.startswith('#'):
            continue
        nlines += 1
Now, if you really do want to exit at the first "blank" line and want to make it a one-liner, you can also use itertools.takewhile():
from itertools import takewhile
with open(XXX) as f:
    nlines = sum(1 for line in takewhile(lambda l: l != '\n', f)
                 if not line.startswith("#"))
from itertools import ifilter, takewhile
with open("test.txt") as f:
    fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda line: not line.startswith("#"), f)))
print(fil)
Or maybe indexing will be faster than calling startswith:
fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda x: x[0] != "#", f)))
Using str.strip as the predicate stops at any blank line, including lines that contain only whitespace.
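The difference between str.strip and a comparison against '\n' shows up on whitespace-only lines. The following is a small sketch using a hypothetical list of lines in place of a file.

```python
from itertools import takewhile

# Hypothetical sample data; the third line contains only spaces.
lines = ["a\n", "b\n", "   \n", "c\n"]

# str.strip returns "" (falsy) for a whitespace-only line, so takewhile stops there.
n_strip = sum(1 for _ in takewhile(str.strip, lines))

# Comparing against "\n" does not treat "   \n" as blank, so all four lines count.
n_newline = sum(1 for _ in takewhile(lambda x: x != "\n", lines))

print(n_strip, n_newline)  # 2 4
```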
Indexing looks a little faster:
In [11]: from itertools import ifilter,takewhile
In [12]: %%timeit
   ....: with open("test.txt") as f:
   ....:     fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda x: x[0] != "#", f)))
   ....:
1000 loops, best of 3: 400 µs per loop
In [13]: %%timeit
   ....: with open("test.txt") as f:
   ....:     fil = sum(1 for _ in takewhile(str.strip, ifilter(lambda line: not line.startswith("#"), f)))
   ....:
1000 loops, best of 3: 531 µs per loop
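Outside IPython, a comparable measurement could be sketched with the standard timeit module. This version assumes Python 3, where the built-in filter replaces ifilter, and uses hypothetical in-memory data instead of test.txt; the function names are made up for the example, and the timings will vary by machine and Python version.

```python
import timeit
from itertools import takewhile

# Hypothetical sample data: 1000 lines, every tenth one a comment.
lines = ["# comment\n" if i % 10 == 0 else "value %d\n" % i
         for i in range(1000)]

def count_startswith(lines):
    """Count non-comment lines, testing each line with str.startswith."""
    return sum(1 for _ in takewhile(
        str.strip, filter(lambda line: not line.startswith("#"), lines)))

def count_index(lines):
    """Count non-comment lines, testing the first character by indexing."""
    return sum(1 for _ in takewhile(
        str.strip, filter(lambda line: line[0] != "#", lines)))

t_startswith = timeit.timeit(lambda: count_startswith(lines), number=200)
t_index = timeit.timeit(lambda: count_index(lines), number=200)
print("startswith: %.4f s, indexing: %.4f s" % (t_startswith, t_index))
```

Indexing with line[0] is safe here because a line read from a file always contains at least its newline character; on an empty string it would raise IndexError.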