How to extract a specific word from a column in R using a list of words in another data frame
I have a data frame, df, with two variables: df$id and df$address. df$address contains the full address, e.g. TT Road, Bhopal, Madhya Pradesh 462003. I have another data frame containing 10 locations, one of which is Bhopal, so for this address I want to return only bhopal in a new column. This is just an example; I have over 200,000 IDs and 300 locations. Sample data is below.
data frame 1:
df <- data.frame(id = c("297308272","297308281","297308299"), address = c("MGROAD, AMBIKAPUR, CH-546453","TT Road, Bhopal, Madhya Pradesh 462003","STREET NO. 2, WHITEFIELD, PALI, RJ"))
data frame 2:
AD <- data.frame(place = c("Bhopal", "Pali", "Wardha", "AMBIKAPUR", "Anuhul"))
Let's start by converting the address column of df and the place column of AD to lowercase, so that the comparison is case-insensitive.
df$address<-tolower(df$address)
#> df
# id address
#1 297308272 mgroad, ambikapur, ch-546453
#2 297308281 tt road, bhopal, madhya pradesh 462003
place<-tolower(AD$place)
#> place
# "bhopal" "pali" "wardha" "ambikapur"
# [5] "anuhul"
Now split each address into words, using a single space (" ") as the separator. For this we will use strsplit in R.
listofstrvec<-strsplit(x = df$address,split = " ")
#> listofstrvec
# [[1]]
# [1] "mgroad," "ambikapur," "ch-546453"
# [[2]]
# [1] "tt" "road," "bhopal," "madhya" "pradesh"
# [6] "462003"
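If the punctuation is mostly commas, one option is to split on runs of commas and spaces in a single step, since strsplit treats split as a regular expression. This is only a sketch and not part of the steps in this answer; the names addr and words are made up for the illustration.

```r
# Illustrative input: two of the lowercased addresses from the example
addr <- c("mgroad, ambikapur, ch-546453",
          "tt road, bhopal, madhya pradesh 462003")

# split is a regular expression: one or more commas or spaces
words <- strsplit(addr, split = "[, ]+")
print(words)
```

With this pattern the trailing commas never end up attached to the words, although other punctuation (such as the hyphen in ch-546453) is left alone.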
We now have a list of character vectors. Next we clean up these strings a little, using gsub in R to remove unwanted punctuation. At this point, you may have to try several patterns depending on how messy your data is.
listofstrvec<-lapply(listofstrvec,FUN = gsub,pattern="[\\,\\.\\-]",replacement= "")
#> listofstrvec
# [[1]]
# [1] "mgroad" "ambikapur" "ch546453"
# [[2]]
# [1] "tt" "road" "bhopal" "madhya" "pradesh"
# [6] "462003"
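To see what the lapply/gsub combination does, here is a small standalone sketch. It assumes you are happy to drop all punctuation, so it uses the POSIX class [[:punct:]] instead of listing characters one by one; tokens and cleaned are hypothetical names used only here.

```r
# Toy input shaped like the output of strsplit
tokens <- list(c("mgroad,", "ambikapur,", "ch-546453"),
               c("street", "no.", "2,", "pali,", "rj"))

# lapply hands each vector to gsub as its x argument;
# pattern and replacement are passed through by name
cleaned <- lapply(tokens, gsub, pattern = "[[:punct:]]", replacement = "")
print(cleaned)
```

Whether [[:punct:]] is too aggressive depends on your data: it would also strip characters such as apostrophes or slashes inside place names.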
Now we will use match to look up each word of every vector in place. For each word it returns the index of the matching place, or NA if the word is not a place.
matched.place<-lapply(X = listofstrvec,FUN = match,table=place)
#> matched.place
#[[1]]
#[1] NA 4 NA
#[[2]]
#[1] NA NA 1 NA NA NA
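The way match behaves is easier to see on a single vector. The sketch below uses the lowercased places from the example and one cleaned address; words and idx are names chosen for the illustration.

```r
place <- c("bhopal", "pali", "wardha", "ambikapur", "anuhul")
words <- c("tt", "road", "bhopal", "madhya", "pradesh", "462003")

# For each word: its position in place, or NA if it is not a place
idx <- match(words, place)
print(idx)

# Dropping the NAs and indexing back into place recovers the name
found <- place[idx[!is.na(idx)]]
print(found)
```

Note that match compares whole strings, so a place name made of two words would not be found after splitting on spaces.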
Finally, using a combination of sapply, is.na and length, you can get the matched place for each address as a vector.
df$place<-sapply(matched.place,function(t){
hits<-t[!is.na(t)] # indices of the places found in this address
if(length(hits)>0) place[hits[1]] else NA_character_ # first match, or NA if none
})
#> df
# id address place
#1 297308272 mgroad, ambikapur, ch-546453 ambikapur
#2 297308281 tt road, bhopal, madhya pradesh 462003 bhopal
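With 200,000 addresses and 300 places, a vectorized regular-expression lookup may be worth trying instead of splitting every address into words. The following is a different technique from the one above, offered only as a sketch: it assumes the place names contain no regex metacharacters, and it keeps only the first place found in each address. The names places, pattern, addr and hit are introduced here for illustration.

```r
df <- data.frame(id = c("297308272", "297308281", "297308299"),
                 address = c("MGROAD, AMBIKAPUR, CH-546453",
                             "TT Road, Bhopal, Madhya Pradesh 462003",
                             "STREET NO. 2, WHITEFIELD, PALI, RJ"),
                 stringsAsFactors = FALSE)
AD <- data.frame(place = c("Bhopal", "Pali", "Wardha", "AMBIKAPUR", "Anuhul"),
                 stringsAsFactors = FALSE)

places <- tolower(AD$place)
# One alternation with word boundaries, e.g. \b(bhopal|pali|...)\b
pattern <- paste0("\\b(", paste(places, collapse = "|"), ")\\b")

addr <- tolower(df$address)
hit <- regexpr(pattern, addr, perl = TRUE)  # first match per address, -1 if none

df$place <- NA_character_
# regmatches returns the matched text only for addresses that had a match
df$place[hit > 0] <- regmatches(addr, hit)
print(df)
```

Because the whole address is searched at once, this also copes with place names that contain spaces, which the word-by-word match cannot find.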