Replace one word of a string if that word is within a certain number of words of another word

I have a text column in a dataframe called "DESCRIPTION". I need to find all instances where the word "tile" or "tiles" is within 6 words of the word "roof" and then change "tile" / "tiles" to "rooftile" / "rooftiles". I need to do the same for "floor" and "tiles" (change "tiles" to "floortiles"). This will help distinguish which construction trade is meant when certain words are used in conjunction with other words.

To show what I mean, here is some sample data and my latest (unsuccessful) attempt:

import re
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df = pd.DataFrame([list(s1), list(s2), list(s3)], columns=["DESCRIPTION"])
df

      

The result I am after should look something like this (in dataframe format):

1. After the storm the roof was damaged and some of the rooftiles are missing
2. I dropped the saw and it fell on the floor and damaged some of the floortiles
3. the roof was leaking and when I checked I saw that some of the tiles were cracked

      

Here I tried to match with a regex pattern and replace the word "tiles", but this is completely wrong. Is there a way to do what I am trying to do? I'm new to Python.

regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])

      

UPDATE: SOLUTION

Thanks for the help! I managed to get it to work using Jan's code with a few additions and tweaks. Final working code below (using the real file and data, not the example):

claims_file = pd.read_csv(project_path + claims_filename) # Read input file
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA') #get rid of encoding errors generated because some text was just 'NA' and it was read in as NaN
#create the REGEX    
rx =  re.compile(r'''
        (                      # outer group
            \b(floor|roof)     # floor or roof
            (?:\W+\w+){0,6}\s* # up to six "words"
        )
        \b(tiles?)\b           # tile or tiles
        ''', re.VERBOSE)

#create the reverse REGEX
rx2 =  re.compile(r'''
        (                      # outer group
            \b(tiles?)         # tile or tiles
            (?:\W+\w+){0,6}\s* # up to six "words"
        )
        \b(floor|roof)\b       # roof or floor
        ''', re.VERBOSE)
#apply it to every row of Loss Description:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))

#apply the reverse regex; \3\1\3 prefixes the tile word and keeps the trailing roof/floor
#(\3\1\2 would overwrite the trailing roof/floor with "tile")
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))

# Write results into CSV file and check results
claims_file.to_csv(project_path + output_filename, index = False
                       , encoding = 'utf-8')

      

4 answers


You can use a regex solution here:

(                      # outer group
    \b(floor|roof)     # floor or roof
    (?:\W+\w+){1,6}\s* # one to six "words"
)
\b(tiles?)\b           # tile or tiles

      

See a demo of the regular expression at regex101.com.


Then just put the captured groups back together with rx.sub() and apply it to every element of the DESCRIPTION column, so that you end up with the following code:

import re
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])

df = pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])

rx = re.compile(r'''
            (                      # outer group
                \b(floor|roof)     # floor or roof
                (?:\W+\w+){1,6}\s* # one to six "words"
            )
            \b(tiles?)\b           # tile or tiles
            ''', re.VERBOSE)

# apply it to every row of "DESCRIPTION"
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
print(df["DESCRIPTION"])

      




Note that your original question was not entirely clear: this solution finds tile or tiles only after roof, which means that a sentence like "Can you give me the tile for the roof, please?" will not be matched (even though the word tile is within six words of roof).
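If the reverse order from the note above should be covered as well, one option is a second pattern that looks for tile/tiles first. The following is only a sketch under that assumption; the names forward, reverse and tag_tiles are made up for illustration, and it allows zero to six words in between rather than one to six.

```python
import re

# Forward: roof/floor, then up to six words, then tile/tiles.
forward = re.compile(r"\b(floor|roof)\b((?:\W+\w+){0,6}?\W+)(tiles?)\b")
# Reverse: tile/tiles, then up to six words, then roof/floor.
reverse = re.compile(r"\b(tiles?)\b((?:\W+\w+){0,6}?\W+)(floor|roof)\b")

def tag_tiles(text):
    """Prefix tile/tiles with the nearby trade word, in either order."""
    # roof ... tiles  ->  roof ... rooftiles
    text = forward.sub(r"\1\2\1\3", text)
    # tile ... roof   ->  rooftile ... roof
    return reverse.sub(r"\3\1\2\3", text)

print(tag_tiles("Can you give me the tile for the roof, please?"))
```

Words that were already prefixed in the forward pass are not touched again, because \b does not match inside "rooftiles". It can be applied to the column the same way as above, e.g. df["DESCRIPTION"].apply(tag_tiles).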


I'll show you a quick and dirty partial implementation. You can make it more robust and useful. Let's say s is one of your descriptions:

s = "I dropped the saw and it fell on the roof and damaged roof " +\
    "and some of the tiles"

      

Let's break it down into words first (tokenize; you can remove punctuation if you like):

import nltk
tokens = nltk.word_tokenize(s)

      

Now select the tokens of interest and sort them alphabetically, but remember their original positions in s:

my_tokens = sorted((w.lower(), i) for i,w in enumerate(tokens)
                    if w.lower() in ("roof", "tiles"))
#[('roof', 9), ('roof', 12), ('tiles', 17)]

      



Group identical tokens and create a dictionary where tokens are keys and lists of their positions are values. Use a dictionary comprehension:

import itertools

token_dict = {name: [p0 for _, p0 in pos] 
              for name,pos 
              in itertools.groupby(my_tokens, key=lambda a:a[0])}
#{'roof': [9, 12], 'tiles': [17]}

      

Go through the list of positions of tiles, if any, and check whether there is a roof nearby; if so, change the word:

for i in token_dict.get('tiles', []):      # empty list if there is no "tiles" token
    for j in token_dict.get('roof', []):   # empty list if there is no "roof" token
        if abs(i-j) <= 6: 
            tokens[i] = 'rooftiles'

      

Finally, put the words back together:

' '.join(tokens)
#'I dropped the saw and it fell on the roof and damaged roof and some of the rooftiles'
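The steps above could be wrapped into a single function roughly like this. It is only a sketch: it splits on whitespace instead of calling nltk.word_tokenize (so punctuation stays attached to words), and the name tag_by_distance and its parameters are hypothetical.

```python
def tag_by_distance(text, anchor="roof", targets=("tile", "tiles"), max_gap=6):
    """Prefix target words with `anchor` when an anchor token is at most max_gap positions away."""
    tokens = text.split()  # assumption: plain whitespace split, not nltk
    # positions of every anchor token, e.g. every "roof"
    anchor_pos = [i for i, tok in enumerate(tokens) if tok.lower() == anchor]
    for i, tok in enumerate(tokens):
        near_anchor = any(abs(i - j) <= max_gap for j in anchor_pos)
        if tok.lower() in targets and near_anchor:
            tokens[i] = anchor + tok  # tiles -> rooftiles
    return " ".join(tokens)

s = ("I dropped the saw and it fell on the roof and damaged roof "
     "and some of the tiles")
print(tag_by_distance(s))
```

Note that a position difference of 6 allows at most five words in between, which is slightly stricter than the regex answers. The first sample sentence from the question has six words between roof and tiles, so it would need max_gap=7 here.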

      



This could be generalized to more substrings than roof and floor, but the following seems like simpler code:

for idx,r in enumerate(df.loc[:,'DESCRIPTION']):
    if "roof" in r and "tile" in r:
        fill=r[r.find("roof")+4:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "rooftile"))
    elif "floor" in r and "tile" in r:
        fill=r[r.find("floor")+5:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "floortile"))

      

Note that this also includes a full stop (".") check. You can remove it by deleting the variable sixWords and replacing it with fill.
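As for generalizing to more words than roof and floor, here is one way it might look. This sketch swaps the string slicing for a regular expression built from a list of anchor words and keeps the full stop check by not letting the match cross a "."; make_tagger, its parameters and the extra word "wall" are invented for illustration.

```python
import re

def make_tagger(anchors, target="tile", max_words=6):
    """Return a function that prefixes target/targets with the anchor word before it."""
    alternatives = "|".join(re.escape(a) for a in anchors)
    # [^\w.]+ is a word separator that refuses to cross a full stop.
    pattern = re.compile(
        rf"\b({alternatives})\b"
        rf"((?:[^\w.]+\w+){{0,{max_words}}}?[^\w.]+)"
        rf"({re.escape(target)}s?)\b"
    )

    def tag(text):
        # group 1: anchor word, group 2: words in between, group 3: tile or tiles
        return pattern.sub(
            lambda m: m.group(1) + m.group(2) + m.group(1) + m.group(3), text
        )

    return tag

tagger = make_tagger(["roof", "floor", "wall"])  # "wall" is just an example
print(tagger("the wall lost a few tiles"))
print(tagger("The roof leaks. The tiles are fine"))  # full stop in between: unchanged
```

Like the loop above, this only looks at tile/tiles that come after the anchor word.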



The main problem you are facing is the .* before tiles in your regex. It allows any number of characters to appear there and still match. The \b is unnecessary, since those positions are already boundaries between whitespace and non-whitespace. The capturing groups () weren't used either, so I removed them.

r"roof\s+([^\s]+\s+){0,6}tiles" will only match tiles within 6 "words" (runs of non-whitespace characters separated by whitespace) of roof. To do the replacement, take all but the last 5 characters of the matched string, append "rooftiles", and then replace the match with the updated string. Or you can group everything except tiles with () in the regex and then replace the match with that group followed by rooftiles. A plain re.sub with a fixed replacement string won't do here, because it replaces the entire match from roof through tiles, not just the word tiles.
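The second option (group everything except tiles and put that group back) might look like the sketch below. The pattern is a variation on the one above, with a lazy quantifier and a second group so that tile and tiles are both handled; prefix_roof is a made-up helper name.

```python
import re

# Group 1 keeps "roof" plus up to six whitespace-separated "words";
# group 2 is the tile/tiles word that receives the prefix.
pattern = re.compile(r"(\broof\s+(?:\S+\s+){0,6}?)(tiles?)\b")

def prefix_roof(text):
    # \g<1> puts the kept text back unchanged, then "roof" is glued onto group 2.
    return pattern.sub(r"\g<1>roof\g<2>", text)

print(prefix_roof("After the storm the roof was damaged and some of the tiles are missing"))
```

Because the replacement template refers back to the captured groups, re.sub only changes the tile word and leaves the rest of the match as it was.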
