Python Regex Eliminating Multiple New Lines
So, I have a problem parsing text. I am trying to parse music files and they are semi-formatted. For example, I'm trying to exclude choruses from the text. In most cases, the formatting looks like this:
[Chorus: x2]
Some Lyrics
Some More Lyrics

[Verse]
Lyrics
Lyrics
In this case, either of these two substitutions parses it correctly:
subChorus = re.sub(r'\[Chorus.*?\].*?\[', '[', lyrics, flags=re.DOTALL)
subChorus2 = re.sub(r'\[Chorus.*?\].*?(\n{2,})', '', lyrics, flags=re.DOTALL)
However, sometimes the Chorus is the last section of the file:
Lyrics

[Chorus]
Some Lyrics
Other Lyrics
In such a case, I cannot determine the correct expression to remove the chorus. If I just do
subChorusEnd = re.sub(r'\[Chorus.*?\].*?$', '', lyrics, flags=re.DOTALL)
It will work; however, in files where the final chorus section is not at the end, it will also delete the verses after it, which need to be kept. Every chorus block is separated from the verse that follows it by at least two newlines. So I came up with this solution:
subChorusEnd = re.sub(r'\[Chorus.*?\][^(\n{2,})]*?$', '', subChorus4, flags=re.DOTALL)
But that doesn't work. Can anyone please explain the correct regex to make the above statement work? Or is it better to remove ONLY the chorus blocks that sit at the very end of the text, which would also leave files intact when their last chorus is not at the end?
Instead of using regular expressions, I would step through the text line by line and decide whether to keep each line, using what is basically a crappy state machine:
lyrics1 = '''Lyrics
[Chorus]
Some Lyrics
Other Lyrics'''
lyrics2 = '''[Chorus: x2]
Some Lyrics
Some More Lyrics
[Verse]
Lyrics
Lyrics'''
def clean(lyrics):
    result = []
    omitting = False
    for line in lyrics.split('\n'):
        if '[Chorus' in line:
            omitting = True
        if '[' in line and '[Chorus' not in line:
            omitting = False
        if not omitting:
            result.append(line)
    return '\n'.join(result)
print(clean(lyrics1))
print('------------')
print(clean(lyrics2))
Result:
Lyrics
------------
[Verse]
Lyrics
Lyrics
Basically, we turn the flag on when we see a line containing "[Chorus" and stop keeping lines; then, when we see any other bracketed header that is not a chorus, we drop the flag and resume keeping lines.
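The question also says that a chorus is separated from the following verse by a blank line, and that verse may not carry a header of its own. A possible variant of the same idea, sketched under that assumption, also drops the flag on a blank line. The name clean_blocks and the choice to swallow the blank line that closes a chorus are my own and not part of the original code:

```python
def clean_blocks(lyrics):
    """Drop [Chorus...] blocks; a block ends at a blank line or the next [header]."""
    result = []
    omitting = False
    for line in lyrics.split('\n'):
        stripped = line.strip()
        if stripped.startswith('[Chorus'):
            omitting = True      # a chorus header starts a skipped block
        elif omitting and not stripped:
            omitting = False     # a blank line closes the chorus...
            continue             # ...and is dropped together with it
        elif stripped.startswith('['):
            omitting = False     # any other header also closes the chorus
        if not omitting:
            result.append(line)
    return '\n'.join(result)
```

With this version a chorus at the very end of the text is still removed, because nothing ever switches the flag back off.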
I don't know what the actual files you are working with look like, but it is quite possible that such a strategy could prove more fruitful than throwing gargantuan regexes at the problem.
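That said, if you do want to stay with a regex: as far as I can tell, the attempt in the question fails because [^(\n{2,})] is a character class. Inside square brackets the parentheses, braces, "2" and comma are literal characters, so it means "any single character other than those or a newline", not "anything except two newlines". A lookahead is one way to say "stop before a blank line, the next header, or the end of the text". This is only a sketch, assuming the layouts shown in the question; CHORUS_BLOCK and strip_choruses are names I made up:

```python
import re

# Match a chorus header, then as little as possible up to (but not including)
# a run of two or more newlines, a newline followed by '[', or the end of the string.
CHORUS_BLOCK = re.compile(r'\[Chorus[^\]]*\].*?(?=\n{2,}|\n\[|\Z)', flags=re.DOTALL)

def strip_choruses(lyrics):
    """Remove every chorus block, leaving the surrounding newlines in place."""
    return CHORUS_BLOCK.sub('', lyrics)
```

Because the lookahead does not consume anything, the blank lines around a removed chorus stay in the output, so you may still want to collapse runs of newlines afterwards.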