Negative lookahead after a newline?

I have a CSV-like text file of about 1000 lines. A long run of dashes separates each record in the file. Records usually end with \n, but sometimes an extra \n appears inside a record, before the \n that ends it. Simplified example:

"1x", "1y", "Hi there"
-------------------------------
"2x", "2y", "Hello - I'm lost"
-------------------------------
"3x", "3y", "How ya
doing?"
-------------------------------

      

I want to replace the extra \n with spaces, i.e. join the lines between the dashes. I thought I could do this (Python 2.5):

import re

text = open("thefile.txt", "r").read()
better_text = re.sub(r'\n(?!\-)', ' ', text)

      

but that seems to replace every \n, not just the ones that aren't followed by a dash. What am I doing wrong?

I am asking this question to improve my own regex skills and to understand the mistake I made. The end goal is to produce a text file in a format that a custom VBA macro for Word can use to generate a Word document, styled so that a Word-friendly CMS can then ingest it.

+2




4 answers


You need to exclude line breaks at the end of the separator lines. Try the following:

\n(?<!-\n)(?!-)

      



This regex uses a negative lookbehind assertion to exclude any \n that is preceded by -.
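To see what the lookbehind changes, here is a minimal sketch run against the sample from the question. The variable names (`sample`, `naive`, `fixed`) are illustrative and not from the original post. The question's pattern also replaces the newline that ends each dash line, because that newline is followed by a quote rather than a dash; the lookbehind protects those newlines.

```python
import re

# Sample data from the question, with one record split across two lines.
sample = (
    '"1x", "1y", "Hi there"\n'
    '-------------------------------\n'
    '"2x", "2y", "Hello - I\'m lost"\n'
    '-------------------------------\n'
    '"3x", "3y", "How ya\n'
    'doing?"\n'
    '-------------------------------\n'
)

# Pattern from the question: only newlines directly before a dash survive,
# so each dash line gets glued to the record that follows it.
naive = re.sub(r'\n(?!-)', ' ', sample)

# Pattern from this answer: newlines directly after a dash are kept as well,
# so only the stray newline inside the third record is replaced.
fixed = re.sub(r'\n(?<!-\n)(?!-)', ' ', sample)

print(fixed)
```

Both patterns assume that no record line ends with a dash and no continuation line starts with one; real data that breaks that assumption would need a stricter separator test.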

+5




This is a good place to use a generator function to skip the ---- lines and get something the csv module can read.

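import csv

def readCleanLines(someFile):
    # Skip separator lines made up only of dashes (blank lines are skipped too).
    for line in someFile:
        if line.strip() == len(line.strip()) * '-':
            continue
        yield line

# someFile is an already-open file object. skipinitialspace=True makes the
# reader ignore the space after each comma so the quotes are recognized.
reader = csv.reader(readCleanLines(someFile), skipinitialspace=True)
for row in reader:
    print(row)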

      

This should handle line breaks inside quotes easily and quietly.
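As a quick check of the generator approach, the following sketch feeds the question's sample through it from an in-memory buffer instead of a real file. The `io.StringIO` buffer and the `skipinitialspace=True` option are assumptions made here to match the sample, which has a space after each comma; neither is part of the original answer.

```python
import csv
import io

def readCleanLines(lines):
    """Yield every line except separators that consist only of dashes."""
    for line in lines:
        stripped = line.strip()
        if stripped and set(stripped) == {'-'}:
            continue
        yield line

# In-memory stand-in for the file from the question.
sample = io.StringIO(
    '"1x", "1y", "Hi there"\n'
    '-------------------------------\n'
    '"2x", "2y", "Hello - I\'m lost"\n'
    '-------------------------------\n'
    '"3x", "3y", "How ya\n'
    'doing?"\n'
    '-------------------------------\n'
)

# The csv reader pulls another line whenever a quoted field is still open,
# so the split record comes back as a single row.
rows = list(csv.reader(readCleanLines(sample), skipinitialspace=True))
for row in rows:
    print(row)
```

The embedded newline stays inside the third field of the last row; replace it with a space afterward if a single-line value is needed.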




If you want to do other things with this file, like save a copy with the ---- lines removed, you can do that.

with open( "source", "r" ) as someFile:
    with open( "destination", "w" ) as anotherFile:
        for line in readCleanLines( someFile ):
            anotherFile.write( line )

      

This will make a copy with the ---- lines removed. It's usually not worth the effort, though, since reading and skipping the lines is very, very fast and doesn't require additional storage.

+7




re.sub(r'(?<!-)\n(?!-)', ' ', text)

      

(A hyphen does not need escaping outside a character class.)
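A small sketch of the escaping point, using a made-up three-line string (`text` is illustrative only): outside a character class the escaped and unescaped hyphen behave identically, while inside a class an unescaped hyphen between two characters denotes a range.

```python
import re

text = 'a\n-b\nc'

# Outside a character class, "\-" and "-" both match a literal hyphen.
escaped = re.sub(r'\n(?!\-)', ' ', text)
unescaped = re.sub(r'\n(?!-)', ' ', text)
print(escaped == unescaped)

# Inside a character class the hyphen is special between two characters.
as_range = re.findall(r'[a-c]', 'a-c')     # matches the range a..c
as_literal = re.findall(r'[a\-c]', 'a-c')  # matches a, -, or c
print(as_range, as_literal)
```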

+1




RegEx is not always the best tool for the job. How about running the text through something like "Split" or "Tokenize"? (I'm sure Python has an equivalent.) Then you have your records, and you can treat the newlines inside each one as mere continuations.
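One way to read this suggestion in Python is to split the text on the separator lines and then flatten the newlines left inside each record. The sketch below assumes a separator is any line consisting solely of dashes; `join_records` is a hypothetical helper name, and `re.split` is used only to recognize the separator lines, not to handle the newlines.

```python
import re

def join_records(text):
    """Split on dash-only lines and flatten each record onto one line."""
    # Assumption: a separator is a whole line of dashes, optionally padded
    # with trailing spaces or tabs.
    chunks = re.split(r'^-+[ \t]*$', text, flags=re.MULTILINE)
    records = []
    for chunk in chunks:
        # Drop the newlines around the record, then treat any newline
        # still inside it as a continuation and turn it into a space.
        record = chunk.strip('\n').replace('\n', ' ')
        if record:
            records.append(record)
    return records

sample = (
    '"1x", "1y", "Hi there"\n'
    '-------------------------------\n'
    '"2x", "2y", "Hello - I\'m lost"\n'
    '-------------------------------\n'
    '"3x", "3y", "How ya\n'
    'doing?"\n'
    '-------------------------------\n'
)

for record in join_records(sample):
    print(record)
```

Because the records come back as a list, writing them out with any separator the downstream macro expects is a single `join` away.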

0








