How can I format a txt file in python to remove extra paragraph lines as well as extra spaces?

Question

I am trying to format a file like this: (random.txt)

        Hi,    im trying   to format  a new txt document so
that extra     spaces between    words   and paragraphs   are only 1.



   This should make     this txt document look like:

This is how it should look like below: (randomoutput.txt)

Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.

This should make this txt document look like:

So far, my code only removes the extra spaces. I am having trouble detecting where a new paragraph starts, so that the blank lines between paragraphs are not removed as well. This is what I have so far:

def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()

Edit:

I also found a way to put blank lines between paragraphs, but for now it treats every line break as a paragraph break and inserts a blank line between each line and the next, using:

m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))

+3

python

J0hn 10 oct. '14 at 19:30

6 answers

The trick is that you want to turn any sequence of 2 or more \n characters into exactly 2 \n characters. That is hard to write with only split and join, but it's dead simple to write with re.sub:

n = re.sub(r'\n\n+', r'\n\n', n)

If you want lines containing only spaces to count as empty, do this after removing the spaces; if you want them to be treated as non-empty, do it before.

You will probably also want to change the whitespace code to use split(' '), not just split(), so it doesn't mess up the newlines. Note that split(' ') produces empty strings for consecutive spaces, so filter those out before joining. (You can also use re.sub for this, but it is not necessary, because turning 1 or more spaces into exactly 1 is not hard to write with split and join.)
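
The two steps above can be combined into one small function. This is only a sketch under assumptions: the helper name normalize_text is hypothetical, and it collapses spaces per line with split() (safe here because each piece no longer contains a newline) instead of calling split(' ') on the whole text. Spaces are removed first, so lines containing only spaces count as empty.

```python
import re


def normalize_text(text):
    """Collapse runs of spaces and runs of blank lines (hypothetical helper)."""
    # Collapse horizontal whitespace line by line, leaving newlines alone.
    lines = [' '.join(line.split()) for line in text.split('\n')]
    text = '\n'.join(lines)
    # Any run of 2 or more newlines becomes exactly 2 (one blank line).
    text = re.sub(r'\n\n+', '\n\n', text)
    # Drop blank lines at the very start and end, keep one final newline.
    return text.strip('\n') + '\n'
```

It reads the whole text into memory, so it suits small files best.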

Alternatively, you can just go line by line and keep track of the previous line (either with an explicit variable inside the loop, or by writing a simple mixed_pairs iterator, for example i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')), and if the current line and the previous line are both empty, don't write the current line.
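
A minimal sketch of the pair-iterator idea follows. The names line_pairs and squeeze_blank_lines are hypothetical, and the sketch pairs each line with the line after it (that is what the tee/zip_longest snippet produces), so it drops a blank line when the following line is also blank. The effect is the same: each run of blank lines shrinks to one.

```python
from itertools import tee, zip_longest


def line_pairs(iterable):
    """Return (current, following) pairs; the last item is paired with ''."""
    i1, i2 = tee(iterable)
    next(i2, None)  # advance the second copy by one item
    return zip_longest(i1, i2, fillvalue='')


def squeeze_blank_lines(lines):
    """Yield lines, skipping a blank line if the line after it is also blank."""
    for current, following in line_pairs(lines):
        if not current.strip() and not following.strip():
            continue
        yield current
```

Because the last line is paired with an empty string, trailing blank lines are dropped as well. Collapsing the spaces inside each line is left as a separate step.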

+1

abarnert 10 oct. 14 at 19:38

split without an argument splits the string at every run of whitespace (space, tab, newline, ...). Write n.split(" ") and it will split only on spaces. Instead of writing the output to a file, put it into a new variable and repeat the step one more time, this time with m.split("\n").

0

sweber 10 oct. 14 at 19:41

First, let's see what the problem is ... You cannot have more than 1 consecutive space or more than 2 consecutive newlines.

You know how to handle 1+ spaces. That approach will not work for newlines, since three situations are possible:
- 1 newline
- 2 newlines
- more than 2 newlines

Great, so .. How do you solve this? There are many solutions. I will list 3 of them.

Based on Regex. This problem is very easy to solve iff ¹ you know how to use regex ... So here is the code:

s = re.sub(r'\n{2,}', r'\n\n', in_file.read())

If you have memory constraints, this is not the best way, as it reads the entire file into memory.

While loop. This code is really self-explanatory, but I wrote this line anyway ...

s = in_file.read()
while "\n\n\n" in s:
    s = s.replace("\n\n\n", "\n\n")

Again, if you have memory constraints, note that this still reads the whole file into memory.

Stateful. Another way to approach this problem is line by line. By keeping track of whether the last line we encountered was blank, we can decide what to do.

was_last_line_blank = False
for line in in_file:
    # Drop the trailing newline so blank lines become empty strings.
    # Use line.strip() instead if lines with only spaces count as blank.
    line = line.rstrip("\n")

    if not line:
        was_last_line_blank = True
        continue
    if was_last_line_blank:
        # Write a single blank line between paragraphs
        out_file.write("\n")
    # Write contents of `line` to the output file
    out_file.write(line + "\n")

    was_last_line_blank = False

The first two need to load the whole file into memory, and the third is more complicated. All of these work, but because they work slightly differently, what they require from the system differs ...

¹ "iff" intentionally.

0

pradyunsg 10 oct. 14 at 19:43

Basically, you want to keep only the non-empty lines (those for which line.strip() does not return an empty string, which is False in a boolean context). You can do this with a list / generator comprehension over the result of str.splitlines(), with an if clause for filtering out the empty lines.

Then, for each line, you want to ensure that all words are separated by a single space. For that you can use ' '.join() on the result of str.split().

So this should do the job for you:

compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines() 
        if line.strip() 
    )

or you can use filter and map with a helper function to make it more readable:

def squash_line(line):
    return ' '.join(line.split())

non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
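
Note that the comprehension above drops every blank line, including the ones that separate paragraphs. If one blank line should survive between paragraphs, a variation could split the text into paragraphs first. This is only a sketch: compress_keep_paragraphs is a hypothetical name, and splitting on blank-line runs with re.split is a swapped-in technique, not part of the answer above.

```python
import re


def squash_line(line):
    """Separate all words in a line by exactly one space."""
    return ' '.join(line.split())


def compress_keep_paragraphs(txt):
    """Squash spaces per line and keep one blank line between paragraphs."""
    # A paragraph break is a run of one or more blank or whitespace-only lines.
    paragraphs = re.split(r'\n\s*\n', txt.strip())
    cleaned = (
        '\n'.join(squash_line(line) for line in par.splitlines() if line.strip())
        for par in paragraphs
    )
    # Skip paragraphs that turned out empty, then join with one blank line.
    return '\n\n'.join(par for par in cleaned if par)
```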

0

m.wasowski 10 oct. 14 at 20:13

To fix the paragraph issue:

import re
data = open("data.txt").read()

result = re.sub("[\n]+", "\n\n", data)
print(result)

-1

jftuga 10 oct. 14 at 19:42

5gon12eder · Accepted Answer · 2014-10-10T19:38:49+0000

You have to process the input line by line. This will not only simplify your program, but it will also be easier on system memory.

The logic for normalizing horizontal space within a line remains the same (split into words and join them with single spaces).

What you need to do for paragraphs is check whether line.strip() is empty (just use it as a boolean expression) and keep a flag that records whether the previous line was also empty. You simply throw blank lines away, but if you encounter a non-blank line and the flag is set, print one blank line in front of it.

with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False

If you want to suppress blank lines at the top of the file, you need an additional flag that you only set after you encounter the first non-blank line.
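
A sketch of that variation is shown below, written as a generator so it can feed any output. The names normalize_lines and seen_text are hypothetical; seen_text is the additional flag, set only once the first non-blank line has been seen.

```python
def normalize_lines(lines):
    """Yield cleaned lines with at most one blank line between paragraphs."""
    seen_text = False  # additional flag: set after the first non-blank line
    new_par = False    # set when one or more blank lines were just skipped
    for line in lines:
        words = line.split()
        if not words:  # blank or whitespace-only line
            new_par = True
            continue
        if new_par and seen_text:
            yield '\n'  # a single blank line before the new paragraph
        yield ' '.join(words) + '\n'
        seen_text = True
        new_par = False
```

With both files open, for example as istr and ostr, it could be driven with ostr.writelines(normalize_lines(istr)), which still processes the input one line at a time.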

If you want something a little fancier, take a look at textwrap, but be aware that it has (or at least used to have, as far as I can tell) some bad worst-case performance issues.
