How to return a specific word from a column in R, matched against a list of text in another data frame

I have a data frame with two variables. Say the data frame is df; the two variables are df$id and df$address. df$address contains the full address, e.g. TT Road, Bhopal, Madhya Pradesh 462003. I have another data frame containing 10 locations, one of which is Bhopal, so I want to return only bhopal in a new column. This is just an example: I have over 200,000 IDs and 300 locations. Sample data is below.

data frame 1:

df <- data.frame(id = c("297308272","297308281","297308299"), address = c("MGROAD, AMBIKAPUR, CH-546453","TT Road, Bhopal, Madhya Pradesh 462003","STREET NO. 2, WHITEFIELD, PALI, RJ"))

      

data frame 2:

 AD <- data.frame(place = c("Bhopal", "Pali", "Wardha", "AMBIKAPUR", "Anuhul"))

      



1 answer


Let's start by converting the address column of the data frame and the vector of places (taken from AD$place) to lowercase.

df$address<-tolower(df$address)

#> df
#    id                                address
#1 297308272           mgroad, ambikapur, ch-546453
#2 297308281 tt road, bhopal, madhya pradesh 462003

place <- tolower(AD$place)

#> place
# [1] "bhopal"    "pali"      "wardha"    "ambikapur" "anuhul"

      

Now split each address into words using " " (a single space) as the separator. For this we will use strsplit in R.

listofstrvec<-strsplit(x = df$address,split = " ")

#> listofstrvec
# [[1]]
# [1] "mgroad,"    "ambikapur," "ch-546453" 

# [[2]]
# [1] "tt"      "road,"   "bhopal," "madhya"  "pradesh"
# [6] "462003"

      

We now have a list of string vectors. Next we clean up these strings a little, using gsub in R to remove unnecessary punctuation. At this point, you may have to try several patterns depending on how messy your data is.



listofstrvec<-lapply(listofstrvec,FUN = gsub,pattern="[\\,\\.\\-]",replacement= "")

#> listofstrvec
# [[1]]
# [1] "mgroad"    "ambikapur" "ch546453" 

# [[2]]
# [1] "tt"      "road"    "bhopal"  "madhya"  "pradesh"
# [6] "462003" 
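
As an optional shortcut, the split and the punctuation clean-up can be combined by splitting on runs of whitespace and punctuation. This is only a sketch and not part of the steps above; `addresses` and `tokens` are names made up for illustration. Note that it behaves differently on hyphens: ch-546453 becomes two tokens here, whereas the gsub step above joins it into ch546453.

```r
# Illustrative input: two already-lowercased addresses from the sample data
addresses <- c("mgroad, ambikapur, ch-546453",
               "tt road, bhopal, madhya pradesh 462003")

# Split on one or more whitespace or punctuation characters,
# so ", " counts as a single separator and no commas are left behind
tokens <- strsplit(addresses, "[[:space:][:punct:]]+")

tokens[[2]]
# [1] "tt"      "road"    "bhopal"  "madhya"  "pradesh" "462003"
```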

      

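Now we will use match to look up each word of every string vector in the vector of places. The result gives, for each word, its position in place, or NA if the word is not a place.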

matched.place<-lapply(X = listofstrvec,FUN = match,table=place)
#> matched.place
#[[1]]
#[1] NA  4 NA

#[[2]]
#[1] NA NA  1 NA NA NA
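
To make the output above easier to read, here is a small standalone sketch of what match returns for one address. The variable names `words` and `idx` are made up for illustration; the values are taken from the sample data.

```r
place <- c("bhopal", "pali", "wardha", "ambikapur", "anuhul")
words <- c("tt", "road", "bhopal", "madhya", "pradesh", "462003")

# For each word: its position in `place`, or NA when it is not a place
idx <- match(words, table = place)
idx
# [1] NA NA  1 NA NA NA

# Dropping the NAs and indexing back into `place` recovers the name
place[idx[!is.na(idx)]]
# [1] "bhopal"
```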

      

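Finally, using a combination of sapply, is.na and length, you can get the location in a vector. If an address matches more than one place, only the first match is kept; if it matches none, the result is NA.

df$place <- sapply(matched.place, function(t) {
  hits <- t[!is.na(t)]   # positions in place that matched
  if (length(hits) > 0) place[hits[1]] else NA_character_
})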

#> df
#         id                                address     place
#1 297308272           mgroad, ambikapur, ch-546453 ambikapur
#2 297308281 tt road, bhopal, madhya pradesh 462003    bhopal
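
The steps above can also be gathered into one function, which may be easier to reuse on the full data. This is a minimal sketch under a few assumptions: `find_place` is a hypothetical helper name, the punctuation pattern is a simplified equivalent of the one used earlier, and only the first matching place per address is returned. Because addresses are split on spaces, a place name made of several words would never match a single token with this approach.

```r
# Sample data from the question
df <- data.frame(
  id = c("297308272", "297308281", "297308299"),
  address = c("MGROAD, AMBIKAPUR, CH-546453",
              "TT Road, Bhopal, Madhya Pradesh 462003",
              "STREET NO. 2, WHITEFIELD, PALI, RJ"),
  stringsAsFactors = FALSE
)
AD <- data.frame(place = c("Bhopal", "Pali", "Wardha", "AMBIKAPUR", "Anuhul"),
                 stringsAsFactors = FALSE)

# find_place() is a hypothetical helper, not part of the original answer
find_place <- function(address, places) {
  places <- tolower(places)
  # Lowercase, split on spaces, then strip commas, dots and hyphens
  words <- strsplit(tolower(address), split = " ")
  words <- lapply(words, gsub, pattern = "[,.-]", replacement = "")
  # For each address keep the first word that is a known place, else NA
  vapply(words, function(w) {
    hits <- match(w, places)
    hits <- hits[!is.na(hits)]
    if (length(hits) > 0) places[hits[1]] else NA_character_
  }, character(1))
}

df$place <- find_place(df$address, AD$place)
df$place
# [1] "ambikapur" "bhopal"    "pali"
```

Unlike the step-by-step version, this leaves df$address in its original case and only adds the new column.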

      
