Python: process multiple files iteratively, without an explicit loop

I have a script that uses a large chunk of text to train a model. As it's currently written, it can read either from a file or from stdin:

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('-i', help='input_file', default=sys.stdin)
... # do a bunch of other stuff
args = parser.parse_args()
if args.i is sys.stdin:
    m.train(args.i)
else:
    m.train(open(args.i, 'r'))


Then I can call my script as follows:

python myscript.py -i trainingdata.txt


or

cat trainingdata.txt | python myscript.py


The second version is especially useful if I want to search the file system and use multiple files to train the model (see the shell sketch below). However, because of the pipe, it gets tricky if I also try to profile with cProfile,

i.e. run at the same time:

python -m cProfile myscript.py ... 
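
For reference, the multi-file case I have in mind would look something like this (the directory and file names here are made up):

cat train_a.txt train_b.txt | python myscript.py

find ./corpus -name '*.txt' -exec cat {} + | python myscript.py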


I know I can pass multiple files with the -i parameter and iterate over them, but then I'd have to change the behavior of the train() method to avoid overwriting data, as sketched below.
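
Something like this rough sketch of what I mean (train() would then have to accumulate state across calls instead of starting from scratch each time):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-i', nargs='+', help='one or more input files')
args = parser.parse_args()

for filename in args.i:
    with open(filename, 'r') as f:
        # train() would need to add to the existing model here,
        # not rebuild it for every file
        m.train(f)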

Is there a good way to open an IO channel (for lack of a better expression) that concatenates the input, without explicitly reading and writing line by line?



1 answer


You can chain the open files, using a generator to yield each open file from its filename:

from itertools import chain

def yield_open(filenames):
    # open one file at a time; each file is closed again as soon as
    # the consumer advances to the next one
    for filename in filenames:
        with open(filename, 'r') as file:
            yield file

def train(file):
    for line in file:
        print(line, end='')
    print()

# chain.from_iterable() turns the sequence of files into one
# continuous stream of lines
files = chain.from_iterable(yield_open(filenames=['file1.txt', 'file2.txt']))
train(files)


This has the added advantage that only one of your files is open at a time.



You can also write this as a "data pipeline", which might be more readable:

file_gen = yield_open(filenames=['file1.txt', 'file2.txt'])
files = chain.from_iterable(file_gen)
train(files)
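
If you want to plug this into your existing argparse setup, one possibility (assuming you change -i to take nargs='+', which is an assumption on my part) would be something like:

import sys
from itertools import chain

# assumes: parser.add_argument('-i', nargs='+', default=None)
if args.i:
    # one concatenated stream of lines from all the given files
    data = chain.from_iterable(yield_open(filenames=args.i))
else:
    # no files given: fall back to reading from stdin
    data = sys.stdin

m.train(data)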

