Python: list n files in a directory, then the next n files, and map them to a mapper function

I have a directory containing about a hundred thousand text files. My Python code builds a list of their names:

import os

listoffiles = os.listdir(directory)

I split listoffiles into chunks of 64 files each with a small helper function lol:

# split a list into sublists of size sz
lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
partitioned_listoffiles = lol(listoffiles, 64)

Then I map the chunks over a pool of 2 processes:

from multiprocessing import Pool

pool = Pool(processes=2)
single_count_tuples = pool.map(Map, partitioned_listoffiles)

Inside the Map function I read the files in each chunk and do further processing.
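
A minimal sketch of what such a Map function might look like (the word count is only a placeholder for the real per-file processing, and directory is the same variable as above):

import os

def Map(file_chunk):
    # file_chunk is one of the 64-file sublists produced above
    results = []
    for name in file_chunk:
        with open(os.path.join(directory, name)) as f:
            text = f.read()
        # placeholder processing: count words per file
        results.append((name, len(text.split())))
    return results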

My problem is that this code works fine for a small folder with a few thousand files, but for large directories it runs out of memory. How can I solve this? Can I read the first n files, then the next n files, build listoffiles for each batch, and process those batches in a loop?



2 answers


If the directory is very large, you can use scandir() instead of os.listdir() (a sketch combining these suggestions appears at the end of this answer). But it is unlikely that os.listdir() itself causes the MemoryError, so the problem is more likely in the other two places:

  • Use a generator expression instead of a list comprehension:

    chunks = (lst[i:i+n] for i in range(0, len(lst), n))
    
          

  • Use pool.imap() or pool.imap_unordered() instead of pool.map():

    for result in pool.imap_unordered(Map, chunks):
        pass
    
          



Or better:

files = os.listdir(directory)
for result in pool.imap_unordered(process_file, files, chunksize=100):
    pass
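
For completeness, a minimal sketch combining these suggestions, assuming Python 3.5+ where scandir() is available as os.scandir(); the path and the body of process_file are only placeholders for the real per-file work:

import os
from multiprocessing import Pool

directory = "/path/to/files"   # placeholder path

def process_file(path):
    # placeholder per-file work: count words in the file
    with open(path) as f:
        return path, len(f.read().split())

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # os.scandir() yields directory entries lazily, so the full listing
        # never has to be materialized as one huge list
        paths = (e.path for e in os.scandir(directory) if e.is_file())
        for result in pool.imap_unordered(process_file, paths, chunksize=100):
            pass  # collect or aggregate results here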

      



I had a very similar problem, where I needed to check that a certain number of files were present in a folder. The problem was that the folder could contain up to 20 million very small files. From what I've learned, there is no way to limit Python's listdir to a specific number of items.

My listdir takes quite a while to list the directory and uses a lot of RAM, but it manages to run on a VM with 4 GB of RAM.



You can try using glob instead, which can narrow down the list of files depending on your requirements:

import glob
print(glob.glob("/tmp/*.txt"))
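
If even the matched list would be too large to hold in memory, glob.iglob returns an iterator instead of a list (the pattern below is just an example):

import glob

# glob.iglob yields matching paths one at a time instead of building a list
for path in glob.iglob("/tmp/*.txt"):
    pass  # process each file here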

      
