Empty string at the end of the re.split result

Here's a line from a .txt file that I'm reading and assigning to x:

import re
x = 'Wild_lions live mostly in "Africa"'
result = re.split('[^a-zA-Z0-9]+', x)

      

I end up with:

['Wild', 'lions', 'live', 'mostly', 'in', 'Africa', ''] # (there is an empty string as the last element)

      

Why is there an empty string at the end? I understand that I can just do result.remove('') to get rid of it, but for large files, I think that would be rather inefficient.

+3




3 answers


You don't need such a complex regex to split on. A simpler one will do:

result = re.split(r'\s+', x)
result
# ['Wild_lions', 'live', 'mostly', 'in', '"Africa"']

      

\s+ matches one or more whitespace characters (spaces, tabs, newlines, etc.).




If you only want to match alphabetic characters, it is better to use re.compile with findall.

myre = re.compile('[a-zA-Z]+')
myre.findall(x)
# ['Wild', 'lions', 'live', 'mostly', 'in', 'Africa']
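
Since the question mentions large files, here is a minimal sketch of how the compiled pattern might be reused for every line. The names WORD_RE and words_per_line and the second sample line are illustrative assumptions, not part of the original question; io.StringIO stands in for an open file.

```python
import io
import re

# Compile once and reuse the same pattern object for every line.
WORD_RE = re.compile(r'[a-zA-Z]+')

def words_per_line(lines):
    """Yield the list of alphabetic words found on each line (hypothetical helper)."""
    for line in lines:
        yield WORD_RE.findall(line)

# io.StringIO stands in for an open text file; the second line is made up.
sample = io.StringIO('Wild_lions live mostly in "Africa"\nTigers live in "Asia"\n')
results = list(words_per_line(sample))
print(results)
```

Because findall collects matches instead of splitting around separators, it never produces empty strings, so no cleanup pass is needed afterwards.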

      

+2




Try this:

x = "Wild_lions live mostly in 'Africa'"
result = re.split(r'[\s_]+', x)

      



You'll get:

['Wild', 'lions', 'live', 'mostly', 'in', "'Africa'"]

      

+2




The pattern [^a-zA-Z0-9]+ splits the supplied string at any character, or run of characters, that is not an ASCII digit or letter.

The last character in the example line (the closing quote) matches the split pattern. re.split returns the substrings before and after each match (up to the next match or the end of the string). Here, nothing follows the final match, so the substring after it is empty, hence the trailing '' in the reported output.
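
As a small illustration of the explanation above (the second input string is an invented example, not taken from the question), a match at either end of the string leaves an empty string on that side:

```python
import re

x = 'Wild_lions live mostly in "Africa"'
pattern = r'[^a-zA-Z0-9]+'

# The closing quote matches at the very end, so the piece after it is ''.
trailing = re.split(pattern, x)

# A match at the very start yields an empty first element for the same reason.
leading = re.split(pattern, '"Africa" is large')

print(trailing)  # ['Wild', 'lions', 'live', 'mostly', 'in', 'Africa', '']
print(leading)   # ['', 'Africa', 'is', 'large']
```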

The other answers have provided workarounds to get the behavior you want, so I won't repeat them in this answer.

+1








