How to map rows of vectors to a dataframe column in R
I have the following data frame in R:
Id titles
1 emami paper mills slips 10% on dismal q4 numbers
2 jsw steel q4fy17 standalone net profit rises 173.33%
3 fmcg major hul q4fy17 standalone net profit rises 6.2
4 chennai petroleum, allsec tech slip 6-7% on poor q4
And I have the names in a vector:
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
I want to match each title in the data frame against the names in the vector and put the matching name in a new column. My desired data frame:
Id titles names
1 emami paper mills slips 10% on dismal q4 numbers emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai petroleum corp ltd
I tried the following code, but it doesn't give me what I want.
df[grepl(paste(names, collapse="|"), df$titles),]
How do I do this in R?
If I understand you correctly, you can use base R's gregexpr along with regmatches and gsub to accomplish your task.
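To make the idea concrete, here is a minimal sketch of how these three functions fit together on a single title. The variable names keys, pattern, found and full_name are only illustrative, and the sketch assumes that the first word of each name is enough to identify it.

```r
# One title and the names vector from the question
title <- "fmcg major hul q4fy17 standalone net profit rises 6.2"
names <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs",
           "chennai petroleum corp ltd")

# gsub(): keep only the first word of each name as the search key
# (assumption: the first word identifies the company)
keys <- gsub("^(\\w+).*", "\\1", names)

# Combine the keys into a single alternation pattern, key1|key2|...
pattern <- paste0(keys, collapse = "|")

# gregexpr() locates the matches; regmatches() extracts the matched text
found <- regmatches(title, gregexpr(pattern, title))[[1]]

# Map the first matched key back to the full name
full_name <- names[match(found[1], keys)]
print(full_name)
```

The same steps are applied to the whole titles column in the code further down.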
Data (edited after the OP changed the question):
options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
"jsw steel q4fy17 standalone net profit rises 173.33%",
"fmcg major hul q4fy17 standalone net profit rises 6.2",
"chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
Regex
library(dplyr)
library(stringr)
# Keep only the first word of each name as the search key
newnames <- gsub("^(\\w+).*","\\1",names)
# Extract whichever key occurs in each title
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
# Titles without any match get NA
regmat[lengths(regmat) == 0] <- NA
# Use the first match per title as the join key
df$newnames <- sapply(regmat,`[`,1)
# Lookup table from key back to full name
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
You can also use the stringr library, as shown below:
library(stringr)
newnames <- str_replace(names,"^(\\w+).*","\\1")
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
Output :
> left_join(df,df1,by="newnames")
titles newnames names
1 emami paper mills slips 10% on dismal q4 numbers emami emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai chennai petroleum corp ltd
It is also possible to use a sqldf "fuzzy" merge for this type of problem.
Build search:
names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)
Merge:
library(sqldf)
res <- sqldf("SELECT l.*, r.name
FROM df as l
LEFT JOIN names as r
ON l.titles LIKE '%'||r.lookup||'%'")
A few notes: I'm pulling the first word from each search term since you said you wanted "hul", not "hul india". Also, in SQL, || means concatenate and % is a wildcard (which will match anything), so this will match if any search term appears anywhere in the text, no matter what comes before or after it.
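If you'd rather not depend on sqldf, the LIKE '%'||lookup||'%' condition can be approximated in base R with a fixed-string grepl. This is only a rough sketch: hits, idx and res are illustrative names, it keeps just the first matching lookup per title (the SQL join would return one row per match), and unlike SQLite's LIKE it is case sensitive.

```r
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2",
                            "chennai petroleum, allsec tech slip 6-7% on poor q4"),
                 stringsAsFactors = FALSE)
names <- data.frame(name = c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs"),
                    stringsAsFactors = FALSE)
names$lookup <- gsub("(\\w+).*", "\\1", names$name)

# One column per lookup, one row per title: TRUE where the lookup occurs
# anywhere in the title (rough analogue of titles LIKE '%lookup%')
hits <- sapply(names$lookup, function(key) grepl(key, df$titles, fixed = TRUE))

# Index of the first matching lookup for each title, NA when nothing matches
idx <- apply(hits, 1, function(row) which(row)[1])

# Left-join behaviour: every title is kept, unmatched ones get name = NA
res <- data.frame(titles = df$titles, name = names$name[idx],
                  stringsAsFactors = FALSE)
print(res)
```

The last title stays in the result with NA because this names table has no entry for it.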
Another option, using Reduce and then merge:
df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), names$lookup))
merge(df, names)
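One thing worth knowing about this variant: a title that matches no lookup keeps its full text in the lookup column, so a plain merge (an inner join) silently drops that row. Below is a small sketch of the difference, assuming you want to keep unmatched titles; inner and res are just illustrative names.

```r
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2",
                            "chennai petroleum, allsec tech slip 6-7% on poor q4"),
                 stringsAsFactors = FALSE)
names <- data.frame(name = c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs"),
                    stringsAsFactors = FALSE)
names$lookup <- gsub("(\\w+).*", "\\1", names$name)

# Reduce() starts from the titles and, for each lookup in turn,
# overwrites every title that contains the lookup with the lookup itself
df$lookup <- Reduce(function(x, y) { x[grepl(y, x)] <- y; x },
                    c(list(df$titles), names$lookup))

# Inner join: the unmatched "chennai ..." title is dropped
inner <- merge(df, names, by = "lookup")

# all.x = TRUE keeps every title; unmatched ones get name = NA
res <- merge(df, names, by = "lookup", all.x = TRUE)
print(res)
```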
To add to the previous answer, I created a function that includes some of the previous comments:
df <- data.frame(title=c("emami paper mills slips 10% on dismal q4 numbers",
"jsw steel q4fy17 standalone net profit rises 173.33%",
"fmcg major hul q4fy17 standalone net profit rises 6.2"))
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs")
find_string <- function(data,names){
  ### Clean the names: keep only the first word of each name
  newnames <- gsub("^(\\w+).*","\\1",names)
  ### Start with an empty column so rows without a match stay NA
  data$names <- NA_character_
  ### Loop over the names to find which sentences contain them
  for(i in seq_along(newnames)){
    hits <- grep(newnames[i],data$title)
    if(length(hits) != 0){
      data$names[hits] <- newnames[i]
    }else{
      print(paste(names[i],"not found in the data!"))
    }
  }
  return(data)
}
### Run the function
find_string(df,names)
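If you want the full name (as in the desired output of the question) rather than the shortened key, something along these lines could work. It is a sketch only: keys and first_hit are illustrative names, matching is a plain substring test, and when several keys occur in one title only the first is used.

```r
df <- data.frame(title = c("emami paper mills slips 10% on dismal q4 numbers",
                           "jsw steel q4fy17 standalone net profit rises 173.33%",
                           "fmcg major hul q4fy17 standalone net profit rises 6.2"),
                 stringsAsFactors = FALSE)
names <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs")

# First word of each name, used as the search key
keys <- gsub("^(\\w+).*", "\\1", names)

# For each title, the position of the first key it contains (NA if none)
first_hit <- vapply(df$title, function(t) {
  hit <- which(vapply(keys, grepl, logical(1), x = t, fixed = TRUE))
  if (length(hit) > 0) hit[[1]] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

# Index the original vector to get the full names back
df$names <- names[first_hit]
print(df)
```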
Hope this helps!