Print lines between line numbers from a large file

I have a very large text file, over 30 GB in size. I want to read the lines between 1,000,000 and 2,000,000 and compare each one against a custom input string. If a line matches, I need to write it to another file.

I know how to read a file line by line.

input_file = open('file.txt', 'r')
for line in input_file:
    print line

      

But when the file is this large, does reading it line by line really affect performance? How can I solve this problem in an optimized way?

+4




6 answers


You can use itertools.islice:

from itertools import islice
with open('file.txt') as fin:
    lines = islice(fin, 1000000, 2000000) # or whatever ranges
    for line in lines:
        # do something

      



Of course, if your lines are of fixed length, you can use fin.seek() to jump directly to the start of the first line you need. Otherwise, the approach above still has to read the first n lines before islice produces any output; it is simply a convenient way to limit the range.
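
For example, here is a sketch that combines islice with the matching and writing described in the question; target and the output filename matches.txt are placeholders, not anything specified in the original post:

from itertools import islice

target = 'some string'  # placeholder for the custom input string
with open('file.txt') as fin, open('matches.txt', 'w') as fout:
    # skip the first 1,000,000 lines, stop after line 2,000,000
    for line in islice(fin, 1000000, 2000000):
        if target in line:  # or line.strip() == target, whichever comparison is needed
            fout.write(line)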

+7




You can use linecache.

Let me quote from the docs: "The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.":



import linecache

for i in xrange(1000000, 2000000):
    print linecache.getline('file.txt', i)

      

+2




Do all your lines have the same length? If so, you could use seek() to jump directly to the first line you are interested in (see the sketch after the loop below). Otherwise, you will have to iterate over the entire file, because there is no way to tell ahead of time where each line starts:

input_file = open('file.txt', 'r')
for index, line in enumerate(input_file):
    # Assuming you start counting from zero
    if 1000000 <= index <= 2000000:
        print line
    elif index > 2000000:
        break  # no need to read the rest of the file
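
If the lines really are fixed length, a seek-based version might look like the sketch below; RECORD_LEN is an assumed constant (the byte length of one line including its newline), not something given in the question:

RECORD_LEN = 80  # assumption: every line is exactly 80 bytes, newline included
start, end = 1000000, 2000000

with open('file.txt', 'rb') as fin:
    fin.seek(start * RECORD_LEN)  # jump straight to the first line of interest
    for _ in range(end - start):
        line = fin.read(RECORD_LEN)  # each read returns exactly one fixed-width line
        # compare / write the line here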

      

For small files, the linecache module may be useful.

+1




If you are running on Linux, have you considered using Python's os.system or commands modules to execute shell commands such as sed, awk, head or tail directly?

Running os.system("tail -n+50000000 test.in | head -n10") will read lines 50,000,000 to 50,000,010 from test.in.

fooobar.com/questions/217 / ... discusses different ways to invoke commands; if performance is key, there may be more efficient methods than os.system.
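
For example, subprocess can run the same kind of pipeline and hand the selected lines back to Python for the comparison step; the target string and output filename below are placeholders:

import subprocess

target = 'some string'  # placeholder for the custom input string
cmd = "tail -n+1000000 file.txt | head -n1000001"  # lines 1,000,000 to 2,000,000

# note: this buffers all selected lines in memory at once
output = subprocess.check_output(cmd, shell=True, universal_newlines=True)
with open('matches.txt', 'w') as fout:
    for line in output.splitlines(True):  # True keeps the trailing newlines
        if target in line:
            fout.write(line)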

This unix.stackexchange discussion covers in detail how to select specific ranges of lines from a text file on the command line:

  • A 100,000,000-line test file created with seq 100000000 > test.in
  • Reading lines 50,000,000 to 50,000,010
  • Tests in no particular order
  • Real time as reported by bash's builtin time

A combination of tail and head, or using sed, seems to offer the fastest solutions.

 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in 
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in

      

+1




Generally, you cannot just jump to line number x in a file, because text lines can be of variable length, so each one can occupy anywhere from one byte to many.

However, if you want to search these files frequently, you can index them by recording in a separate file the byte offsets at which, say, every thousandth line begins. You can then open the file, use file.seek() to jump to the part you are interested in, and start reading from there.
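
A minimal sketch of that idea; the step size is an assumption, and the index is kept in a Python list here for brevity rather than written to a separate file as suggested above:

STEP = 1000  # assumption: remember the byte offset of every 1,000th line

# build the index once and reuse it for later searches
offsets = []
with open('file.txt', 'rb') as fin:
    pos = 0
    for lineno, line in enumerate(fin):
        if lineno % STEP == 0:
            offsets.append(pos)
        pos += len(line)

# later, jump straight to line 1,000,000 (a multiple of STEP, so the seek lands exactly on it)
with open('file.txt', 'rb') as fin:
    fin.seek(offsets[1000000 // STEP])
    for index, line in enumerate(fin, 1000000):
        if index > 2000000:
            break
        # compare / write the line here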

0




The best way I have found is this:

lines_data = []
# note: this assumes multilinetext already holds the whole file contents in memory
text_arr = multilinetext.split('\n')
for i in range(line_number_begin, line_number_end):
    lines_data.append(text_arr[i])

      

0








