Replace one word of a string if that word is within a certain number of words of another word
I have a text column in a dataframe called "DESCRIPTION". I need to find all instances where the word "tile" or "tiles" is within 6 words of the word "roof" and then change "tile"/"tiles" to "rooftile"/"rooftiles". I need to do the same for "floor" and "tiles" (change "tiles" to "floortiles"). This will help distinguish which construction trade we are looking at when certain words are used in conjunction with other words.
To show what I mean, here is some sample data and my latest (wrong) attempt:
import re
import pandas as pd
s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df=pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"])
df
The result I am after should look something like this (as a dataframe):
1. After the storm the roof was damaged and some of the rooftiles are missing
2. I dropped the saw and it fell on the floor and damaged some of the floortiles
3. the roof was leaking and when I checked I saw that some of the tiles were cracked
Here I tried to match with a regex pattern to replace the word "tiles", but this is completely wrong. Is there a way to do what I am trying to do? I'm new to Python.
regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])
UPDATE: SOLUTION
Thanks for the help! I managed to get it to work using Jan's code with a few additions and tricks. Final working code below (using the real file and data, not the example):
import re
import pandas as pd
claims_file = pd.read_csv(project_path + claims_filename) # read input file
# some descriptions were just 'NA' and were read in as NaN, which caused errors; turn them back into strings
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA')
#create the REGEX
rx = re.compile(r'''
( # outer group
\b(floor|roof) # floor or roof
(?:\W+\w+){0,6}\s* # up to six "words" in between
)
\b(tiles?)\b # tile or tiles
''', re.VERBOSE)
#create the reverse REGEX
rx2 = re.compile(r'''
( # outer group
\b(tiles?) # tile or tiles
(?:\W+\w+){0,6}\s* # up to six "words" in between
)
\b(floor|roof)\b # roof or floor
''', re.VERBOSE)
#apply it to every row of Loss Description:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
#apply the reverse regex: prefix tile(s) with the trigger word (\3) and keep that word at the end of the match:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))
# Write results into CSV file and check results
claims_file.to_csv(project_path + output_filename, index=False, encoding='utf-8')
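The two-pass idea can be checked on plain strings without pandas. The following is only a minimal sketch: the helper name tag_tiles is hypothetical, and the reverse replacement is written as \3\1\3 so that the trailing roof/floor word is kept in place.

```python
import re

# Forward pattern: roof/floor, up to six words, then tile(s).
forward = re.compile(r'''
    (                       # group 1: trigger word plus the words in between
      \b(floor|roof)        # group 2: floor or roof
      (?:\W+\w+){0,6}\s*    # up to six intervening words
    )
    \b(tiles?)\b            # group 3: tile or tiles
''', re.VERBOSE)

# Reverse pattern: tile(s), up to six words, then roof/floor.
reverse = re.compile(r'''
    (                       # group 1: tile(s) plus the words in between
      \b(tiles?)            # group 2: tile or tiles
      (?:\W+\w+){0,6}\s*    # up to six intervening words
    )
    \b(floor|roof)\b        # group 3: floor or roof
''', re.VERBOSE)

def tag_tiles(text):
    """Hypothetical helper: prefix tile(s) with a nearby roof/floor."""
    text = forward.sub(r'\1\2\3', text)   # roof ... tiles -> roof ... rooftiles
    return reverse.sub(r'\3\1\3', text)   # tile ... roof -> rooftile ... roof

print(tag_tiles("Can you give me the tile for the roof, please?"))
```

Because "rooftiles" has no word boundary before "tiles", the second pass does not touch words that the first pass already rewrote.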
You can use a regex solution here:
( # outer group
\b(floor|roof) # floor or roof
(?:\W+\w+){1,6}\s* # one to six "words" in between
)
\b(tiles?)\b # tile or tiles
See a demo of the regular expression at regex101.com.
Then just put the captured pieces back together with rx.sub() and apply it to every element of the DESCRIPTION column, so that you end up with the following code:
import pandas as pd, re
s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df = pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"])
rx = re.compile(r'''
( # outer group
\b(floor|roof) # floor or roof
(?:\W+\w+){1,6}\s* # one to six "words" in between
)
\b(tiles?)\b # tile or tiles
''', re.VERBOSE)
# apply it to every row of "DESCRIPTION"
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
print(df["DESCRIPTION"])
Note that your original question was not entirely clear: this solution will find tile or tiles after roof, which means that a sentence like "Can you give me the tile for the roof, please?" will not be matched (although the word tile is within a six-word range of roof, that is).
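To make the limitation concrete, here is a small check using only the re module. The three example sentences are made up for illustration; the pattern is the same one as in the code above.

```python
import re

# Same pattern as above: roof/floor must come first, and at least one
# word has to sit between it and tile(s) because of {1,6}.
rx = re.compile(r'''
    (
      \b(floor|roof)
      (?:\W+\w+){1,6}\s*
    )
    \b(tiles?)\b
''', re.VERBOSE)

examples = [
    "Can you give me the tile for the roof, please?",  # tile before roof
    "The roof tiles are cracked",                      # no word in between
    "The roof has broken tiles",                       # two words in between
]
results = [rx.sub(r'\1\2\3', s) for s in examples]
for before, after in zip(examples, results):
    print(before, "->", after)
```

Only the third sentence is changed. Using {0,6} instead of {1,6} would also cover the adjacent "roof tiles" case.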
I'll show you a quick and dirty partial implementation. You can make it more robust and useful. Let's say s is one of your descriptions:
s = "I dropped the saw and it fell on the roof and damaged roof " +\
"and some of the tiles"
Let's break it down into words first (tokenize; you can remove punctuation if you like):
import nltk  # word_tokenize needs the NLTK tokenizer data to be downloaded
tokens = nltk.word_tokenize(s)
Now select the tokens of interest and sort them alphabetically, but remember their original positions in s:
my_tokens = sorted((w.lower(), i) for i,w in enumerate(tokens)
                   if w.lower() in ("roof", "tiles"))
#[('roof', 9), ('roof', 12), ('tiles', 17)]
Combine identical tokens and create a dictionary where tokens are keys and lists of their positions are values. Use a dictionary comprehension:
import itertools
token_dict = {name: [p0 for _, p0 in pos]
              for name,pos
              in itertools.groupby(my_tokens, key=lambda a:a[0])}
#{'roof': [9, 12], 'tiles': [17]}
Go through the list of positions of tiles, if any, and see if there is a roof nearby, and if so, change the word:
for i in token_dict.get('tiles', []):
    for j in token_dict.get('roof', []):
        if abs(i-j) <= 6:
            tokens[i] = 'rooftiles'
Finally, put the words back together:
' '.join(tokens)
#'I dropped the saw and it fell on the roof and damaged roof and some of the rooftiles'
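As one possible way to make this more general, the steps above can be folded into a single function. This is only a sketch under a few assumptions: the name tag_by_distance and its parameters are hypothetical, str.split() stands in for nltk.word_tokenize to keep it dependency-free, and the rewritten token is lowercased. Like the loop above, it compares token positions, so a difference of 6 means at most five words in between.

```python
def tag_by_distance(text, triggers=("roof", "floor"), max_gap=6):
    """Hypothetical helper: rename tile/tiles that sit near a trigger word."""
    tokens = text.split()
    # Strip surrounding punctuation and lowercase, for comparison only.
    clean = [t.strip('.,;:!?"\'').lower() for t in tokens]
    trigger_pos = [(i, w) for i, w in enumerate(clean) if w in triggers]
    for i, word in enumerate(clean):
        if word not in ("tile", "tiles") or not trigger_pos:
            continue
        # Pick the closest trigger word, whether before or after the tile.
        j, trig = min(trigger_pos, key=lambda p: abs(p[0] - i))
        if abs(j - i) <= max_gap:
            # Keep any attached punctuation; the token itself is lowercased.
            tokens[i] = tokens[i].lower().replace(word, trig + word, 1)
    return " ".join(tokens)

print(tag_by_distance("Can you give me the tile for the roof, please?"))
```

Since the distance is symmetric, this also catches tile appearing before roof, which the forward-only regex does not.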
This could be generalized to more substrings than roof and floor, but this looks like simpler code:
for idx,r in enumerate(df.loc[:,'DESCRIPTION']):
    if "roof" in r and "tile" in r:
        # text after "roof", cut at the eighth space (if there is none, find() returns -1 and only the last character is dropped)
        fill=r[r.find("roof")+4:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        # skip the replacement if a full stop falls inside that window
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "rooftile"))
    elif "floor" in r and "tile" in r:
        # same steps for "floor"
        fill=r[r.find("floor")+5:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "floortile"))
Note that this also includes a full stop (".") check. You can remove it by deleting the variable sixWords and replacing it with fill.
The main problem you are facing is the .* before tiles in your regex. It lets any number of characters sit there and still match. The \b is unnecessary, since those positions are already between whitespace and non-whitespace characters. The () groups weren't used either, so I removed them.
r"roof\s+([^\s]+\s+){0,6}tiles" will only match roof when tiles follows within 6 "words" (runs of non-whitespace characters separated by whitespace). To do the replacement, take all but the last 5 characters of the matched string, add "rooftiles", and put the result back in place of the match. Or you can group everything except tiles with () in the regex and then replace the match with that group followed by rooftiles. You can't use re.sub with a plain replacement string for this, because it replaces the entire match from roof through tiles, not just the word tiles.
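The grouping idea could look roughly like this. It is a sketch rather than a definitive version: the exact pattern and the helper name roof_tiles are my own reading of the description above, and it handles only roof followed by tiles.

```python
import re

# Group 1 captures "roof" plus up to six whitespace-separated words;
# the literal "tiles" stays outside the group.
pattern = re.compile(r"(roof\s+(?:\S+\s+){0,6})tiles")

def roof_tiles(text):
    # \1 puts the captured stretch back unchanged, then appends "rooftiles".
    return pattern.sub(r"\1rooftiles", text)

print(roof_tiles("After the storm the roof was damaged and some of the tiles are missing"))
```

The backreference \1 is what lets re.sub rewrite only the last word of the match while leaving the words in between untouched.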