BeautifulSoup MemoryError when opening multiple files in a directory

Context: Every week I receive a list of lab results as an HTML file. There are approximately 3000 results each week, and each result set has between two and four tables associated with it. For each outcome/test I only care about some standard information that is stored in one of those tables. That table can be uniquely identified because the first cell of the first column always contains the text "Laboratory results".

Problem: The following code works fine when I process one file at a time, i.e. when I point get_data = open() at a specific file instead of looping over a directory. However, I want to pull data from the past few years and would rather not do each file individually, so I used the glob module and a for loop to cycle through all the files in a directory. The problem is that I get a MemoryError by the time I reach the third file in the directory.

Question: Is there a way to clear/reset the memory between files, so that I can loop over everything in the directory instead of typing in each filename individually? As you can see in the code below, I tried deleting the variables with del, but it didn't work.
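For what it's worth, open(FileName, 'r').read() never closes the file, so each pass through the loop also leaves another handle open; that is probably not what causes the MemoryError, but reading inside a with block releases the handle as soon as the contents are in memory. A minimal sketch of that pattern, with "example.html" standing in for one weekly results file:

with open("example.html", 'r') as html_file:   # "example.html" is a placeholder for one results file
    get_data = html_file.read()                # the handle is closed as soon as the block exits

soup = BeautifulSoup(get_data)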

Thanks.

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

      

1 answer


I am a beginner programmer and I ran into the same problem. Three changes seemed to fix it:

  • Call garbage collection (gc.collect()) at the beginning of each iteration.
  • Move the parsing into a function that is called once per iteration, so all the variables become local and are discarded when the function returns.
  • Call soup.decompose() once you are done with the parse tree.

I think the second change is probably what solved it, but I haven't had time to test the changes separately and I don't want to touch code that is now working.



For this code, the solution would be something like this:

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
    # Collect whatever is left over from the previous file before starting a new one.
    gc.collect()

    # Read the file inside a with block so its handle is closed immediately.
    with open(file, 'r') as html_file:
        get_data = html_file.read()

    # Name the built-in parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(get_data, 'html.parser')

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print(complete_row)

            # The with block closes the output file; no explicit close() is needed.
            with open("Results_File.txt", "a") as out_file:
                out_file.write(complete_row + "\n")

    # Explicitly free the parse tree before the function returns.
    soup.decompose()
    gc.collect()


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print("done")
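
One more option, not used above but worth mentioning: since only the tables matter, a SoupStrainer can restrict what BeautifulSoup builds in the first place, which keeps each parse tree much smaller. A minimal sketch of that idea, assuming the same table layout as in the question ("example.html" again stands in for one weekly file):

from bs4 import BeautifulSoup, SoupStrainer

# Only build <table> elements; everything else in the document is skipped.
only_tables = SoupStrainer('table')

with open("example.html", 'r') as html_file:
    soup = BeautifulSoup(html_file.read(), 'html.parser', parse_only=only_tables)

for table in soup.findAll('table'):
    first_cell = table.findAll('tr')[0].findAll('td')[0].text
    if first_cell.strip() == "Clinical Results":
        cells = table.findAll('tr')[1].findAll('td')
        print(cells[0].text.strip() + ";" + cells[1].text.strip())

soup.decompose()

This works with or without the function wrapper; it just makes each soup cheaper to build and to throw away.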

      
