Print lines between line numbers from a large file
I have a very large text file, over 30 GB in size. I want to read the lines between 1,000,000 and 2,000,000 and compare each against a custom input string; whenever a line matches, I need to write it to another file.
I know how to read a file line by line.
input_file = open('file.txt', 'r')
for line in input_file:
    print(line)
But with a file this large, does reading it line by line hurt performance? How can this be done efficiently?
You can use itertools.islice:
from itertools import islice

with open('file.txt') as fin:
    lines = islice(fin, 1000000, 2000000)  # or whatever range you need
    for line in lines:
        # do something with each line
        ...
Of course, if your lines have a fixed length, you can use fin.seek() to jump directly to the start of the line you want. Otherwise, the approach above still has to read the first n lines before islice produces any output, but it is a convenient way to limit the range.
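To illustrate the fixed-length case, here is a sketch of the seek() idea on a small generated file; RECORD_LEN and the file name are hypothetical, and for the real 30 GB file the same arithmetic applies with the actual record length:

```python
# Sketch: if every line has a fixed length, seek() can jump straight to
# any line number without reading what comes before it.
RECORD_LEN = 8  # bytes per line, newline included (hypothetical value)

# Build a small demo file standing in for the large input.
with open('fixed.txt', 'wb') as f:
    for i in range(100):
        f.write(b'%07d\n' % i)  # each line is exactly 8 bytes

def read_lines(path, start, count, record_len):
    """Read `count` lines starting at 0-based line `start`."""
    with open(path, 'rb') as fin:
        fin.seek(start * record_len)  # byte offset of the first wanted line
        return [fin.readline() for _ in range(count)]

print(read_lines('fixed.txt', 50, 3, RECORD_LEN))
```

The file is opened in binary mode so that the byte offsets computed from the record length line up exactly with what seek() expects.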
You can use linecache.
Let me quote from the docs: "The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file."
import linecache

for i in range(1000000, 2000000):
    print(linecache.getline('file.txt', i))
Do all your lines have the same size? If that were the case, you could use seek()
to jump directly to the first line of interest. Otherwise, you will have to iterate over the entire file, because there is no way to tell ahead of time where each line starts:
input_file = open('file.txt', 'r')
for index, line in enumerate(input_file):
    # Assuming you start counting from zero
    if 1000000 <= index <= 2000000:
        print(line)
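Since nothing after line 2,000,000 can be of interest, it is worth breaking out of the loop once the range has been passed; on a 30 GB file that avoids scanning the remainder. A sketch of that refinement (file name and range are the ones from the question, demonstrated here on a small generated file):

```python
# Sketch: same enumerate() scan, but stop as soon as the range of
# interest has been passed, instead of reading the whole file.
def lines_in_range(path, start, stop):
    """Yield lines whose 0-based index lies in [start, stop]."""
    with open(path) as fin:
        for index, line in enumerate(fin):
            if index > stop:
                break  # nothing after this point can match
            if index >= start:
                yield line

# Tiny demonstration file standing in for the 30 GB input.
with open('demo.txt', 'w') as f:
    f.write(''.join('line %d\n' % i for i in range(10)))

print(list(lines_in_range('demo.txt', 3, 5)))
```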
For small files, the linecache module may be useful.
If you are running on Linux, have you considered using Python's os.system or subprocess modules to execute shell commands such as sed, awk, head or tail directly?
Running the command os.system("tail -n+50000000 test.in | head -n10") will read ten lines starting at line 50,000,000 of test.in.
fooobar.com/questions/217 / ... discusses different ways to invoke commands; if performance is key, there may be more efficient methods than os.system.
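One such method: the subprocess module can capture the command's output, whereas os.system only returns the exit status. A sketch (demonstrated on a small generated file; for the real case the same call would run against test.in with the line numbers above):

```python
# Sketch: capture the output of tail | head with subprocess instead of
# os.system(). Requires a Unix shell with tail and head available.
import subprocess

def shell_slice(path, start, count):
    """Return `count` lines starting at 1-based line `start`, via tail | head."""
    cmd = "tail -n+%d %s | head -n%d" % (start, path, count)
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout.splitlines()

# Small stand-in for test.in.
with open('demo.in', 'w') as f:
    f.write(''.join('%d\n' % i for i in range(1, 101)))

print(shell_slice('demo.in', 50, 3))
```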
- Test file of 100,000,000 lines created with seq 100000000 > test.in
- Reading lines 50,000,000 to 50,000,010
- Tests in no particular order
- Real time as reported by bash's built-in time
A combination of tail and head, or using sed, seems to offer the fastest solutions:
4.373 4.418 4.395 tail -n+50000000 test.in | head -n10
5.210 5.179 6.181 sed -n '50000000,50000010p;57890010q' test.in
5.525 5.475 5.488 head -n50000010 test.in | tail -n10
8.497 8.352 8.438 sed -n '50000000,50000010p' test.in
22.826 23.154 23.195 tail -n50000001 test.in | head -n10
25.694 25.908 27.638 ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574 awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127 awk 'NR >= 57890000 && NR <= 57890010' test.in
source to share
Generally, you cannot simply jump to line number x in a file, because text lines can be of variable length and so may occupy any number of bytes.
However, if you will be searching these files frequently, you can index them: record in a separate file the byte offset at which, say, every thousandth line begins. You can then open the file, use file.seek() to jump close to the part you are interested in, and read from there.
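A minimal sketch of that indexing idea, assuming an index granularity of every 1,000th line (in practice you would persist the offsets to a separate file rather than keep them in memory):

```python
# Sketch: record the byte offset of every STEP-th line once, then seek
# near any line number without rescanning the file from the start.
STEP = 1000  # index granularity (hypothetical choice)

def build_index(path, step=STEP):
    """Return byte offsets of lines 0, step, 2*step, ..."""
    offsets = []
    with open(path, 'rb') as f:
        offset = 0
        for i, line in enumerate(f):
            if i % step == 0:
                offsets.append(offset)
            offset += len(line)
    return offsets

def get_line(path, index, lineno, step=STEP):
    """Return the 0-based line `lineno` using the precomputed index."""
    with open(path, 'rb') as f:
        f.seek(index[lineno // step])   # jump to the nearest indexed line
        for _ in range(lineno % step):  # skip the few remaining lines
            f.readline()
        return f.readline()

# Demonstration on a small variable-length file.
with open('big.txt', 'wb') as f:
    for i in range(5000):
        f.write(b'row %d\n' % i)

idx = build_index('big.txt')
print(get_line('big.txt', idx, 2500))
```

The index costs one full scan up front, after which any lookup reads at most STEP lines past the seek point.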