`re.split()` behaves strangely in Python

I'm in a bit of a fix in Python. I would like to take a .txt file with many comments and split it into a list of words, splitting on all punctuation marks, spaces and \n. When I run the following Python code, it breaks my text file at strange points. Note: below I only try to split on periods and line ends to test this, but it still often drops the last letter of words.

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split('. | \n, nf)

print(wList)

      

+3




3 answers


You need to fix the quotation marks and make a small change to the regex:



import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split(r'\W+', nf)

print(wList)
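
A minimal sketch of what `\W+` does, using the standard-library `re` module and a made-up sample string instead of the file (both are assumptions for illustration). Note that apostrophes also count as non-word characters, and that text ending in punctuation leaves a trailing empty string that you may want to filter out:

```python
import re

# Hypothetical sample text standing in for the contents of the file.
sample = "Hello, world! It's 9 a.m.\nBye."

# Split on every run of non-word characters (punctuation, spaces, newlines).
tokens = re.split(r'\W+', sample)
print(tokens)

# The text ends with ".", so the last element is an empty string; drop empties.
words = [t for t in tokens if t]
print(words)
```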

      

+2




You forgot to close the string, and you need a \ before the period.

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split(r'\. |\n |\s', nf)

print(wList)

      

For details see Split Multiple Delimited Lines? ...



Also, RichieHindle answers your question perfectly:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

      

+2




In a regular expression, the character `.` matches any character (except a newline, by default). You must escape it as `\.` in order to match literal periods.
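
This is likely why the last letter of words disappears in the question: the pattern `. ` matches any character followed by a space, so the letter before each space is consumed as part of the delimiter. A small sketch with the standard-library `re` module and a made-up sample string (both assumptions, not the asker's data):

```python
import re

# Hypothetical sample text standing in for the comments file.
text = "Good care. Nice staff.\nWould return."

# Unescaped: "." matches any character, so ". " swallows the letter before each space.
unescaped = re.split(r'. |\n', text)
print(unescaped)  # ['Goo', 'care', 'Nic', 'staff.', 'Woul', 'return.']

# Escaped: only a literal period followed by a space, or a newline, is a delimiter.
escaped = re.split(r'\. |\n', text)
print(escaped)  # ['Good care', 'Nice staff.', 'Would return.']
```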

+2








