Re.findall regex freezes or very slow

My input file is a large txt file with concatenated texts that I got from an open text library. Now I am trying to fetch only the contents of the book itself and filter out other stuff like disclaimers etc. This way I have about 100 documents in a large text file (about 50 mb).

I then identified the start and end markers of the content itself, and decided to use a Python regex to find everything between the start and end markers. To summarize, the regex should look for a start marker, then match everything after it and stop looking after it reaches the end marker, then repeat these steps until the end of the file is reached.

The following code works flawlessly when I load a small 100KB file into it:

import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\THE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()

      

When I use this regex operation on my larger file it won't do anything, the program just hangs. I checked it overnight to make sure it was just slow and even after about 8 hours the program was still stuck.

I'm pretty sure the source of the problem is the (. *?) Part of the regexp, combined with re.DOTALL. When I use a similar regex at shorter distances, the script runs fine and fast. Now my question is, why does this just freeze everything up? I know the text between the delimiters is small, but the 50MB file shouldn't be too big to handle, right? Perhaps I am missing a better solution?

Thanks in advance.

+3


source to share


1 answer


You are correct that using a sequence .*

that appears multiple times causes problems. The problem is that the solver tries to use many possible combinations .*

, resulting in a result known as catastrophic backtracking .

The usual solution is to replace it .

with a character class that is much more specific, usually what you are trying to break the first .*

with. Something like:



`[^\n]*(.*)`

      

so that the capture group can only match the first newline to the end. Another option is to acknowledge that a regex solution might not be the best approach, and also use a context free expression (like pyparsing

) or be the first to break the input into smaller, easier to process chunks (like c corpus.split('\n')

)

+10


source







All Articles