Python regex for partial quotes only

I have poorly formatted text that I need to clean up. In many cases a quote starts on one line, gets cut off, and ends on the next line. In such cases I prefer to omit the partial quote entirely, but I want to keep the regular, complete quotes. I know this can be done iteratively with a counter, but I would rather do it with regular expressions.

Let's take an example:

"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote

Here's my current attempt:

(\"[^\"\n]+?|^[^\"\n]+?\")(\n|$)

Note that it fails in two cases:

  • Line 3: the partial quote is followed by the rest of the sentence (a very rare occurrence, so if we can't solve it, it's not the end of the world).
  • Line 6 contains an inline quote. This is a serious problem and the main reason I came to SO. The regex grabs everything from the closing quotation mark of the inline quote to the end of the line.
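
The second failure can be reproduced with a small sketch. It assumes the pattern is compiled with re.MULTILINE (the question does not say which flags were used) and drops the backslashes before the quotation marks, which are not needed in a raw string. The names ATTEMPT and line are only for illustration.

```python
import re

# The attempt from the question. re.MULTILINE is an assumption,
# so that ^ and $ apply to every line rather than the whole string.
ATTEMPT = re.compile(r'("[^"\n]+?|^[^"\n]+?")(\n|$)', re.MULTILINE)

line = 'This is an "embedded" quote'
match = ATTEMPT.search(line)

# The first alternative starts at the closing quotation mark of the
# inline quote and runs to the end of the line.
print(repr(match.group(0)))
```

Under these assumptions the match is the closing quotation mark plus the trailing word, which is exactly the text that should have been kept.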

I figured I could set up an if statement and run through each line, checking whether it has fewer than two quotes and only then parsing the partial quotes, but I thought the minds at SO would have a much cleaner solution.

Desired output:

"This is a quote"
This is an end 
 Here is more text.
This is an end 
This is an "embedded" quote

(I handle spaces later)

3 answers


Here you go,

^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)

Replace each match with \1\n, with the multiline flag enabled so that ^ matches at the start of every line.



>>> import re
>>> s = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''
>>> m = re.sub(r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)', r'\1\n', s)
>>> print(m)
"This is a quote"
This is an end 
 Here is more text.
This is an end 
This is an "embedded" quote

      

Use this regex instead when a partial quote can span more than two lines:

^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)
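
As a minimal sketch of how this variant behaves, the snippet below applies it to a made-up sample in which one quote is spread over three lines. The sample text and the names MULTI_LINE_PARTIAL, sample and cleaned are assumptions for illustration; the replacement \1\n and the multiline flag are the same as in the session above.

```python
import re

# Same idea as the two-line pattern, but (?:[^"\n]*\n)+ allows one or
# more line breaks inside the partial quote.
MULTI_LINE_PARTIAL = re.compile(
    r'^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)',
    re.MULTILINE,
)

# Hypothetical input: the second quote runs across three lines.
sample = (
    '"Full quote"\n'
    'Start of text "partial\n'
    'quote spanning\n'
    'three lines" trailing text\n'
    'An "embedded" quote'
)

# Keep everything before the partial quote (group 1) and drop the rest.
cleaned = MULTI_LINE_PARTIAL.sub(r'\1\n', sample)
print(cleaned)
```

With this input, the complete quote on the first line and the embedded quote on the last line are left untouched, while the three-line quote is removed.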

      



Perhaps you can use this regex:

"[^"\n]+?\n[^"\n]+?(?:"|$)\s*

      

and replace each match with \n.




"[^"\n]+?\n[^"\n]+?

This part of the pattern matches only partial quotes, because it requires a line break inside the quoted text.
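
Since the demos are not reproduced here, the following is a rough sketch of applying this pattern in Python to the sample text from the question. No flags are used, which is an assumption, and the names PARTIAL_QUOTE, text and cleaned are illustrative.

```python
import re

# Pattern from this answer: an opening quotation mark, a line break
# somewhere inside, then a closing quotation mark or the end of input.
PARTIAL_QUOTE = re.compile(r'"[^"\n]+?\n[^"\n]+?(?:"|$)\s*')

text = (
    '"This is a quote"\n'
    'This is an end "partial-\n'
    'quote" Here is more text.\n'
    'This is an end "partial-\n'
    'quote w/o more text"\n'
    'This is an "embedded" quote'
)

# Each partial quote, plus any whitespace after it, becomes one newline.
cleaned = PARTIAL_QUOTE.sub('\n', text)
print(cleaned)
```

Because of the trailing \s*, the space before "Here is more text." is consumed as well, so that line comes out without the leading space shown in the desired output.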



("[^"\n]*")|"[^"]*(\n)[^"]*"(?![^\n]*")|"[^"]*\n.*?(?=\n[^"]*"[^\n"]*")

      

You can try this. It also handles the case with an odd number of quotes. See the demo:

https://regex101.com/r/dL7oF8/6
