Python: How to merge words hyphenated across newline characters?
I want to say that Napp Granade
serves in the spirit of a town in our dis-
trict of Georgia called Andersonville.
I have thousands of text files with data like the sample above, where words have been wrapped with a hyphen and a newline.
What I am trying to do is remove the hyphen and put the newline after the end of the rejoined word. If possible, I do not want to touch every hyphenated word, only those split at the end of a line.
with open(filename, encoding="utf8") as f:
    file_str = f.read()
re.sub("\s*-\s*", "", file_str)
with open(filename, "w", encoding="utf8") as f:
    f.write(file_str)
The above code doesn't work, and I have tried it in several ways.
I would like to go through the entire text file and remove any hyphen that marks a word split across lines. For example:
I want to say that Napp Granade
serves in the spirit of a town in our district
of Georgia called Andersonville.
Any help would be appreciated.
You don't need to use a regular expression:
filename = 'test.txt'
# I want to say that Napp Granade
# serves in the spirit of a town in our dis-
# trict of Georgia called Anderson-
# ville.
with open(filename, encoding="utf8") as f:
    lines = [line.strip('\n') for line in f]
for num, line in enumerate(lines):
    # skip the last line and blank following lines: nothing to merge with
    if line.endswith('-') and num + 1 < len(lines) and lines[num+1].split():
        # the end of the word is at the start of next line
        end = lines[num+1].split()[0]
        # we remove the - and append the end of the word
        lines[num] = line[:-1] + end
        # and remove the end of the word and possibly the
        # following space from the next line
        lines[num+1] = lines[num+1][len(end)+1:]
text = '\n'.join(lines)
with open(filename, "w", encoding="utf8") as f:
    f.write(text)
# I want to say that Napp Granade
# serves in the spirit of a town in our district
# of Georgia called Andersonville.
But you can, of course, use a regular expression, and it is shorter:
import re

with open(filename, encoding="utf8") as f:
    text = f.read()
text = re.sub(r'-\n(\w+ *)', r'\1\n', text)
with open(filename, "w", encoding="utf8") as f:
    f.write(text)
We search for a hyphen followed by \n and capture the next word (with any trailing spaces), which is the end of the split word.
We replace all of this with the captured word followed by a newline.
Remember to use a raw string for the replacement so that \1 is interpreted correctly.
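Since the question mentions thousands of files, here is a minimal sketch of how the substitution might be wrapped in a function and applied to a whole folder. The names dehyphenate and dehyphenate_folder and the '*.txt' glob are assumptions for illustration, not part of the original answer. The pattern is also a variant: it uses \S+ instead of \w+ so that trailing punctuation (as in "ville.") stays attached to the word, and it consumes one following newline so that no empty line is left behind when the remainder was alone on its line.

```python
import re
from pathlib import Path

# A hyphen at the end of a line, then the rest of the split word (a run of
# non-space characters, so punctuation stays attached), then optional
# spaces and at most one newline.
_SPLIT_WORD = re.compile(r'-\n(\S+) *\n?')


def dehyphenate(text):
    """Join words split by a line-end hyphen; the line break moves after the word."""
    return _SPLIT_WORD.sub(r'\1\n', text)


def dehyphenate_folder(folder, pattern='*.txt'):
    """Rewrite every matching file in place (hypothetical helper)."""
    for path in Path(folder).glob(pattern):
        original = path.read_text(encoding='utf8')
        path.write_text(dehyphenate(original), encoding='utf8')


sample = (
    "serves in the spirit of a town in our dis-\n"
    "trict of Georgia called Anderson-\n"
    "ville.\n"
)
print(dehyphenate(sample))
```

Hyphens inside a line (for example "well-known") are left alone because the pattern requires the hyphen to be immediately followed by a newline. A genuine compound that happens to break at the hyphen ("well-" at the end of one line, "known" on the next) would lose its hyphen, so it may be worth testing on copies of the files first.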