Correct strip: char with regex

Question

I want to extract the words from a text string in Python:

import re

s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."

result = re.sub("\b[^\w\d_]+\b", " ", s).split()
print(result)

I get:

['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is:', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']

How can I get "is" and not "is:" for words that are followed by ":"? I thought using \b would be enough ...

+3

python

cMinor June 16 15 at 20:46

3 answers

You forgot to make it a raw string literal (r"..."):

>>> import re
>>> s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."
>>> re.sub("\b[^\w\d_]+\b", " ",  s ).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is:', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
>>> re.sub(r"\b[^\w\d_]+\b", " ",  s ).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
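
To make the difference concrete, here is a minimal sketch (written for Python 3, with a shortened sample string of my own) of what the two literals actually contain. In an ordinary string "\b" is the backspace character, so the regex engine never sees a word-boundary assertion. The non-raw pattern is built in pieces here only to avoid invalid-escape warnings for \w and \d; it is meant to be equivalent to the string in the question.

```python
import re

plain = "\b"   # ordinary literal: a single backspace character (U+0008)
raw = r"\b"    # raw literal: backslash + 'b', read by re as a word boundary

print(len(plain), len(raw))
print(plain == "\x08")

s = "now is: science"

# Same characters as the question's "\b[^\w\d_]+\b", assembled piecewise.
non_raw_pattern = "\b" + r"[^\w\d_]+" + "\b"
raw_pattern = r"\b[^\w\d_]+\b"

# No backspace characters occur in s, so nothing is replaced.
unchanged = re.sub(non_raw_pattern, " ", s)
# With real word boundaries, ": " between "is" and "science" is replaced.
fixed = re.sub(raw_pattern, " ", s).split()

print(unchanged)
print(fixed)
```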

+1

jamylak June 16 15 at 20:55

As the other answers pointed out, you need to define a raw string literal using the r prefix, like this: r"..."

If you also want to strip the periods, I believe you can simplify the regex:

result = re.sub(r"[^\w' ]", " ", s ).split()

As you probably know, \w matches a-z, A-Z, 0-9 and the underscore, so the negated class [^\w' ] replaces everything else except apostrophes and spaces.

Digits count as word characters too, so if you can expect that your sentences will not contain numbers, that should do the trick.
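
As a quick check of the simplified pattern above, here is a small sketch in Python 3. The second sample string, mixed, is made up for illustration; it shows that digits and underscores survive because \w matches them, and that apostrophes are kept by the character class.

```python
import re

s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."

# Replace anything that is not a word character, apostrophe, or space.
words = re.sub(r"[^\w' ]", " ", s).split()
print(words)

# Hypothetical input with a contraction, a number, and an underscore.
mixed = "It's 2015, snake_case stays!"
mixed_words = re.sub(r"[^\w' ]", " ", mixed).split()
print(mixed_words)
```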

+1

gffbss June 16 15 at 21:06

Alexander O'Mara · Accepted Answer · 2015-06-16T20:55:25+0000

I think you meant to pass a raw string to re.sub (note the r prefix).

result = re.sub(r"\b[^\w\d_]+\b", " ",  s ).split()

This returns:

['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
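
Note that the last item is still 'wisdom.': the pattern needs a word boundary after the punctuation, and there is none between the final period and the end of the string. If the trailing period should go as well, one possible alternative (a different technique from the answers here, sketched for Python 3) is to collect the words directly with re.findall instead of replacing the separators.

```python
import re

s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."

# The raw-string fix from the answers: the trailing period is not followed
# by a word character, so the closing \b cannot match and "." is kept.
fixed = re.sub(r"\b[^\w\d_]+\b", " ", s).split()
print(fixed[-1])

# Alternative: match runs of word characters (and apostrophes) directly.
words = re.findall(r"[\w']+", s)
print(words)
```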
