Python Regex Eliminating Multiple New Lines

I have a problem parsing text. I am trying to parse music files that are only semi-formatted. For example, I want to strip the choruses out of the text. In most cases, the formatting looks like this:

[Chorus: x2]
Some Lyrics
Some More Lyrics

[Verse]
Lyrics
Lyrics

In this case, either of these two substitutions removes the chorus correctly:

subChorus = re.sub(r'\[Chorus.*?\].*?\[', '[', lyrics, flags=re.DOTALL)
subChorus2 = re.sub(r'\[Chorus.*?\].*?(\n{2,})', '', lyrics, flags=re.DOTALL)

However, sometimes the Chorus is the last section of the file:

Lyrics

[Chorus]
Some Lyrics
Other Lyrics

In that case, I cannot work out the correct expression to remove the chorus. If I just do

subChorusEnd = re.sub(r'\[Chorus.*?\].*?$', '', lyrics, flags=re.DOTALL)

it works; however, in files where the last chorus is not at the end of the text, it also deletes the verses after it, which need to be kept. Every chorus block is separated from the verse that follows it by at least two newlines, so I came up with this:

subChorusEnd = re.sub(r'\[Chorus.*?\][^(\n{2,})]*?$', '', subChorus4, flags=re.DOTALL)

But that doesn't work. Can anyone explain the correct regex to make the above statement work? Or would it be better to remove ONLY the chorus blocks that sit at the very end of the text, which would also preserve files where the last chorus is not at the end?


3 answers


You can try the following regex to match all chorus blocks.

\[Chorus.*?\].*?(\n{2,}|$)

      

OR



(?!.*\n\n)\[Chorus.*?\].*?$

      

It matches only a chorus block at the end of the text. Don't forget to pass the DOTALL flag with both regular expressions.
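To see how the two patterns behave, here is a minimal sketch that runs both against the two sample layouts from the question. The names `ANY_CHORUS`, `LAST_CHORUS`, `chorus_last` and `chorus_first` are made up for this illustration; the patterns themselves are copied unchanged from above.

```python
import re

# Pattern 1: any chorus block, ending at a blank line or at the end of the text.
ANY_CHORUS = re.compile(r'\[Chorus.*?\].*?(\n{2,}|$)', flags=re.DOTALL)
# Pattern 2: only a chorus block that has no blank line anywhere after it.
LAST_CHORUS = re.compile(r'(?!.*\n\n)\[Chorus.*?\].*?$', flags=re.DOTALL)

# Sample texts modeled on the question (hypothetical variable names).
chorus_last = 'Lyrics\n\n[Chorus]\nSome Lyrics\nOther Lyrics'
chorus_first = ('[Chorus: x2]\nSome Lyrics\nSome More Lyrics\n\n'
                '[Verse]\nLyrics\nLyrics')

print(repr(ANY_CHORUS.sub('', chorus_last)))    # 'Lyrics\n\n'
print(repr(ANY_CHORUS.sub('', chorus_first)))   # '[Verse]\nLyrics\nLyrics'
print(repr(LAST_CHORUS.sub('', chorus_last)))   # 'Lyrics\n\n'
print(repr(LAST_CHORUS.sub('', chorus_first)))  # text comes back unchanged
```

Two things to keep in mind. If the text ends with blank lines, the lookahead in the second pattern finds a `\n\n` after the last chorus and the pattern no longer matches, so it is probably worth stripping trailing whitespace first. As for the attempt in the question: inside a character class, `(`, `{`, `2`, `,`, `}` and `)` are literal characters, so `[^(\n{2,})]` means "any single character other than those or a newline" and cannot express "not two newlines in a row".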



Instead of using regular expressions, I would rather step through the text line by line and decide whether to keep each line, using what is basically a crappy state machine:

lyrics1 = '''Lyrics

[Chorus]
Some Lyrics
Other Lyrics'''

lyrics2 = '''[Chorus: x2]
Some Lyrics
Some More Lyrics

[Verse]
Lyrics
Lyrics'''

def clean(lyrics):
    result = []
    omitting = False  # True while we are inside a chorus block
    for line in lyrics.split('\n'):
        if '[Chorus' in line:
            # A chorus tag starts a block to skip
            omitting = True
        elif '[' in line:
            # Any other bracketed tag ends the skipped block
            omitting = False
        if not omitting:
            result.append(line)
    return '\n'.join(result)

print(clean(lyrics1))
print('------------')
print(clean(lyrics2))

      

Result:



Lyrics

------------
[Verse]
Lyrics
Lyrics

      

Basically, we set the flag when we see a line containing "[Chorus" and stop keeping lines; then, when we see any bracketed tag that is not a chorus tag, we clear the flag and resume keeping lines.

I don't know what the actual files you are working with look like, but it is quite possible that a strategy like this will prove more fruitful than throwing gargantuan regexes at the problem.
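A variation on the same idea, as a sketch only: the question says sections are separated by at least two newlines, so if that holds for your files you could split the text into blocks and drop the ones that start with a chorus tag. This is a different technique from the line-by-line loop above, and `drop_chorus_blocks` is a name invented for this example.

```python
import re

def drop_chorus_blocks(lyrics):
    """Remove blank-line-separated blocks whose first line is a chorus tag.

    Assumes every section is separated from the next by at least one
    blank line, as described in the question.
    """
    # Split on runs of two or more newlines, ignoring leading/trailing ones.
    blocks = re.split(r'\n{2,}', lyrics.strip('\n'))
    kept = [block for block in blocks
            if not block.lstrip().startswith('[Chorus')]
    # Note: separators are normalized to exactly one blank line.
    return '\n\n'.join(kept)

print(drop_chorus_blocks('Lyrics\n\n[Chorus]\nSome Lyrics\nOther Lyrics'))
```

Unlike the loop, this does not leave a dangling blank line where a chorus used to be, but it will not help if a file has a chorus and a verse with no blank line between them.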



\[Chorus:[^\]]+\][\s\S]*?(?=\n{2}|$)

      

Try this for all types of chorus blocks. Replace each match with an empty string. See the demo:

https://regex101.com/r/vN3sH3/77
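A quick check of this pattern against the question's two layouts, as a rough sketch. One caveat: as posted, the pattern requires a colon after `Chorus`, so it matches `[Chorus: x2]` but not a bare `[Chorus]`. The `ANY_TAG` variant below, which makes the colon and annotation optional, is my own assumption about how to cover both, not part of the original pattern; all names here are invented for the example.

```python
import re

# The pattern as posted; [\s\S] matches newlines, so no DOTALL flag is needed.
AS_POSTED = re.compile(r'\[Chorus:[^\]]+\][\s\S]*?(?=\n{2}|$)')
# Hypothetical variant: anything (or nothing) may follow "Chorus" inside the tag.
ANY_TAG = re.compile(r'\[Chorus[^\]]*\][\s\S]*?(?=\n{2}|$)')

bare = 'Lyrics\n\n[Chorus]\nSome Lyrics\nOther Lyrics'
with_count = ('[Chorus: x2]\nSome Lyrics\nSome More Lyrics\n\n'
              '[Verse]\nLyrics\nLyrics')

print(repr(AS_POSTED.sub('', with_count)))  # '\n\n[Verse]\nLyrics\nLyrics'
print(repr(AS_POSTED.sub('', bare)))        # unchanged: no colon in the tag
print(repr(ANY_TAG.sub('', bare)))          # 'Lyrics\n\n'
```

Because the blank line is only checked with a lookahead and not consumed, the separator after a removed chorus stays in the output; call `.strip()` on the result if that matters.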
