Negative look after new line?

Question

Negative look after new line?

I have a CSV-like text file that has about 1000 lines. There is a long series of dashes between each record in the file. Entries usually end with \ n, but sometimes an extra \ n appears before the \ n until the end of the entry. Simplified example:

"1x", "1y", "Hi there"
-------------------------------
"2x", "2y", "Hello - I'm lost"
-------------------------------
"3x", "3y", "How ya
doing?"
-------------------------------

I want to replace the extra \ n with spaces, i.e. concatenate lines between dashes. I thought I could do this (Python 2.5):

text = open("thefile.txt", "r").read()    
better_text = re.sub(r'\n(?!\-)', ' ', text)

but that seems to replace every \ n, not just those not followed by a dash. What am I doing wrong?

I am asking this question trying to improve my own regex skills and understand the mistakes I have made. The end goal is to create a text file in a format that can be used with a custom VBA macro for Word that generates a Word document in a style that will then be digested by the Word-friendly CMS.

+2

python regex

fwkb 14 Sep 09 at 18:48

source to share

4 answers

This is a good place to use a generator function to skip lines ----

and get something the csv module can read.

def readCleanLines( someFile ):
    for line in someFile:
        if line.strip() == len(line.strip())*'-':
            continue
        yield line

reader= csv.reader( readCleanLines( someFile ) )
for row in reader:
    print row

This should handle line breaks inside quotes easily and quietly.

If you want to do other things with this file, like save a copy with the lines removed ----

, you can do that.

with open( "source", "r" ) as someFile:
    with open( "destination", "w" ) as anotherFile:
        for line in readCleanLines( someFile ):
            anotherFile.write( line )

This will make a copy with the lines removed ----

. It's not worth the effort as reading and skipping lines is very, very fast and doesn't require additional storage.

+7

S.Lott 14 Sep At 19:08

source to share

re.sub(r'(?<!-)\n(?!-)', ' ', text)

(Hyphen does not require escaping outside the character class.)

+1

chaos 14 Sep 09 at 19:03

source to share

RegEx is not always the best tool for the job. How do I get it through something like "Split" or "Tokenize"? (I'm sure python has an equivalent). Then you have your entries, and you can think of newlines as just continuations.

0

Eric Nicholson 14 Sep 09 at 19:29

source to share

Gumbo · Accepted Answer · 2009-09-14T18:55:20+0000

You need to exclude line breaks at the end of the separator lines. Try the following:

\n(?<!-\n)(?!-)

This regex uses a negative look-behind assertion to exclude \n

those that preceded -

.

Negative look after new line?

More articles: