Python: list n files in a directory, then the next n files, and map them to a mapper function

I have a directory containing about a hundred thousand text files. My Python code builds a list of their names:

import os

listoffiles = os.listdir(directory)

I split listoffiles into chunks of 64 files each with a small helper function lol:

# split a list into sublists of size sz
lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
partitioned_listoffiles = lol(listoffiles, 64)

Then I map the chunks over a pool of 2 processes:

from multiprocessing import Pool

pool = Pool(processes=2)
single_count_tuples = pool.map(Map, partitioned_listoffiles)

Inside the Map function I read the files in each chunk and do further processing.
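
A minimal sketch of what such a Map function might look like (the word count is only a placeholder for the real per-file processing, and directory is the same variable as above):

import os

def Map(file_chunk):
    # file_chunk is one of the 64-file sublists produced above
    results = []
    for name in file_chunk:
        with open(os.path.join(directory, name)) as f:
            text = f.read()
        # placeholder processing: count words per file
        results.append((name, len(text.split())))
    return results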

My problem is that this code works fine for a small folder with a few thousand files, but for large directories it runs out of memory. How can I solve this? Can I read the first n files, then the next n files, build listoffiles for each batch, and process those batches in a loop?



2 answers


If the directory is very large, you can use scandir() instead of os.listdir() (a sketch combining these suggestions appears at the end of this answer). But it is unlikely that os.listdir() itself causes the MemoryError, so the problem is more likely in the other two places:

  • Use a generator expression instead of a list comprehension:

    chunks = (lst[i:i+n] for i in range(0, len(lst), n))
    
          

  • Use pool.imap() or pool.imap_unordered() instead of pool.map():

    for result in pool.imap_unordered(Map, chunks):
        pass
    
          



Or better:

files = os.listdir(directory)
for result in pool.imap_unordered(process_file, files, chunksize=100):
    pass
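
For completeness, a minimal sketch combining these suggestions, assuming Python 3.5+ where scandir() is available as os.scandir(); the path and the body of process_file are only placeholders for the real per-file work:

import os
from multiprocessing import Pool

directory = "/path/to/files"   # placeholder path

def process_file(path):
    # placeholder per-file work: count words in the file
    with open(path) as f:
        return path, len(f.read().split())

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # os.scandir() yields directory entries lazily, so the full listing
        # never has to be materialized as one huge list
        paths = (e.path for e in os.scandir(directory) if e.is_file())
        for result in pool.imap_unordered(process_file, paths, chunksize=100):
            pass  # collect or aggregate results here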

      



I had a very similar problem, where I needed to check that a certain number of files were present in a folder. The problem was that the folder could contain up to 20 million very small files. From what I've learned, there is no way to limit Python's listdir to a specific number of items.

My listdir takes quite a while to list the directory and uses a lot of RAM, but it manages to run on a VM with 4 GB of RAM.



You can try using glob instead, which can narrow down the list of files depending on your requirements:

import glob
print(glob.glob("/tmp/*.txt"))
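
If even the matched list would be too large to hold in memory, glob.iglob returns an iterator instead of a list (the pattern below is just an example):

import glob

# glob.iglob yields matching paths one at a time instead of building a list
for path in glob.iglob("/tmp/*.txt"):
    pass  # process each file here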

      
