Python regex for partial quotes only
I have poorly formatted text that I need to clean up. In many cases a quote starts on one line, gets cut off, and ends on the next line. In such cases I prefer to omit the partial quote entirely, BUT I want to keep the regular, complete quotes. I know it can be done iteratively with a counter, but I would rather do it with regular expressions.
Let's take an example:
"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote
Here's my current attempt:
(\"[^\"\n]+?|^[^\"\n]+?\")(\n|$)
Note that it fails in two cases:
- Line 3: the partial quote is followed by the rest of the sentence (a very rare occurrence, so if we can't solve it, it's not the end of the world).
- Line 6: this is an inline quote. This is a serious problem and the main reason I came to SO. My pattern grabs everything from the closing quotation mark of the inline quote to the end of the line.
I figured I could set up an if statement and go through each line, checking whether it has fewer than two quotation marks and only then parsing the partial quotes, but I thought the minds at SO would have a much cleaner solution.
Desired output:
"This is a quote"
This is an end
Here is more text.
This is an end
This is an "embedded" quote
(I handle spaces later)
Here you go,
^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)
Replace each match with \1\n (the kept text from group 1, followed by a newline).
>>> import re
>>> s = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''
>>> m = re.sub(r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)', r'\1\n', s)
>>> print(m)
"This is a quote"
This is an end
Here is more text.
This is an end
This is an "embedded" quote
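To make the pattern easier to read, here is a sketch of the same regex written with re.VERBOSE so that each part carries a comment. The names PARTIAL_QUOTE and strip_partial_quotes are made up for this illustration and are not part of the original answer. Note that the raw result still contains the leftover spaces around the removed quote, which the question says are handled later.

```python
import re

# Same pattern as in the answer above, split across lines with re.VERBOSE.
# PARTIAL_QUOTE and strip_partial_quotes are hypothetical names for this sketch.
PARTIAL_QUOTE = re.compile(r'''
    ^                            # start of a line (needs re.MULTILINE)
    (                            # group 1: the part of the line to keep
        (?:[^"\n]*"[^"\n]*")*    #   any number of complete quote pairs on this line
        [^"\n]*                  #   plain text before the dangling quote
    )
    "[^"\n]*\n                   # opening quote, rest of the line, line break
    [^"\n]*"                     # continuation on the next line up to the closing quote
    (\n|)                        # a newline directly after the closing quote, if any
''', re.MULTILINE | re.VERBOSE)


def strip_partial_quotes(text):
    """Remove quotes that start on one line and end on the next."""
    return PARTIAL_QUOTE.sub(r'\1\n', text)


sample = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''

cleaned = strip_partial_quotes(sample)
print(cleaned)
```

Complete quotes on a line are consumed inside group 1, which is why the "embedded" quote on the last line survives: after it, there is no unmatched opening quote left for the rest of the pattern to match.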
Use this regex instead when a partial quote can run over more than one line break, that is, when there may be several lines between the opening and closing double quotes.
^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)
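A small sketch of how this second pattern behaves on a quote that spans three lines. The sample text and the names MULTILINE_PARTIAL and strip_multiline_partial_quotes are assumptions made for this example only.

```python
import re

# The multi-line variant from the answer; (?m) makes ^ match at every line start.
# MULTILINE_PARTIAL and strip_multiline_partial_quotes are hypothetical names.
MULTILINE_PARTIAL = re.compile(
    r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)'
)


def strip_multiline_partial_quotes(text):
    """Remove quotes that are broken across one or more line breaks."""
    return MULTILINE_PARTIAL.sub(r'\1\n', text)


# Made-up input: the first quote runs over three lines, the second is complete.
broken = 'Start "one\ntwo\nthree" tail\nA "full" quote'
result = strip_multiline_partial_quotes(broken)
print(repr(result))
```

The only change from the first pattern is (?:[^"\n]*\n)+, which lets the removed span cross any number of line breaks before the closing quote. The trade-off is that a single stray quotation mark can now swallow text up to the next quotation mark, however many lines away it is.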