Python: list n files at a time in a directory and map them to a mapper function
I have a directory in which I have about a hundred thousand text files.
My Python code creates a list of the names of these files:
listoffiles = os.listdir(directory)
I then break listoffiles into chunks of 64 files each with a helper function, lol:
lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
partitioned_listoffiles = lol(listoffiles, 64)
Then I hand the chunks to a pool of 2 worker processes:

from multiprocessing import Pool

pool = Pool(processes=2)
single_count_tuples = pool.map(Map, partitioned_listoffiles)
Inside the Map function, I read these files and do further processing.
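For illustration, assume Map is shaped like this (a simplified, hypothetical stand-in; the real function does more processing per file, but it returns count tuples as the variable name above suggests):

def Map(file_chunk):
    # read every file in the chunk and emit a (name, count) tuple for each
    results = []
    for filename in file_chunk:
        with open(os.path.join(directory, filename)) as f:
            text = f.read()
        results.append((filename, len(text)))
    return results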
My problem is that this code works great for a small folder with a few thousand files, but large directories run out of memory. How can I solve this? Can I read the first n files, process them, then read the next n files, building listoffiles and looping over those steps?
If the directory is very large, you can use scandir() instead of os.listdir() (a minimal sketch follows after the list below). But it is unlikely that os.listdir() itself causes the MemoryError, so the problem is more likely in two other places:
- Use a generator expression instead of a list comprehension:

  chunks = (lst[i:i+n] for i in range(0, len(lst), n))
- Use pool.imap or pool.imap_unordered instead of pool.map():

  for result in pool.imap_unordered(Map, chunks):
      pass
Or better, skip the manual chunking and map a per-file function directly; chunksize=100 makes the pool hand out the filenames in batches of 100:
files = os.listdir(directory)
for result in pool.imap_unordered(process_file, files, chunksize=100):
pass
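As for the scandir() suggestion above, here is a minimal sketch. It assumes Python 3.5+, where os.scandir() is in the standard library (on older versions the third-party scandir package provides the same API):

import os

def iter_files(directory):
    # scandir() yields directory entries lazily instead of
    # building the whole name list in memory at once
    for entry in os.scandir(directory):
        if entry.is_file():
            yield entry.name

iter_files is just an illustrative name; since imap_unordered consumes any iterable lazily, you can feed this generator straight into the pool.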
I had a very similar problem where I needed to check the number of files in a certain folder, which could contain up to 20 million very small files. From what I've learned, there is no way to limit Python's listdir to a specific number of items.
My listdir call takes quite a while to list the directory and uses a lot of RAM, but it does manage to run on a VM with 4 GB of RAM.
You can try using glob instead, which can shrink the list of files depending on your requirements:

import glob

print(glob.glob("/tmp/*.txt"))
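If memory is the main concern, glob.iglob() does the same matching but returns an iterator instead of building the full list up front (a minimal sketch; the pattern is just an example):

import glob

# yields matching paths one at a time instead of as one big list
for path in glob.iglob("/tmp/*.txt"):
    print(path)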