How to map rows of vectors to a dataframe column in R

I have the following data frame in R:

Id    titles
1     emami paper mills slips 10% on dismal q4 numbers
2     jsw steel q4fy17 standalone net profit rises 173.33%
3     fmcg major hul q4fy17 standalone net profit rises 6.2
4     chennai petroleum, allsec tech slip 6-7% on poor q4

      

And I have names in a vector:

names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")

      

I want to match each title in the data frame against the entries of the vector and put the matching entry in a new column. My desired data frame:

 Id    titles                                                    names
1     emami paper mills slips 10% on dismal q4 numbers           emami ltd
2     jsw steel q4fy17 standalone net profit rises 173.33%       jsw steel ltd
3     fmcg major hul q4fy17 standalone net profit rises 6.2      hul india ltd
4     chennai petroleum, allsec tech slip 6-7% on poor q4        chennai petroleum corp ltd

      

I do it with the following code, but it doesn't give me what I want.

df[grepl(paste(names, collapse="|"), df$titles),]

      

How do I do this in R?



4 answers


If I understand you correctly, you can use base R's gregexpr along with regmatches and gsub to accomplish your task.

Data (edited after the OP changed the question):

options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2",
                            "chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)

names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")

      

Regex

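library(dplyr)  # for left_join()

# Reduce each name to its first word, e.g. "emami ltd" -> "emami"
newnames <- gsub("^(\\w+).*","\\1",names)
# Extract every short name that occurs in each title
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
# Keep the first match per title, or NA when nothing matched
df$newnames <- vapply(regmat, function(m) if (length(m)) m[1] else NA_character_, character(1))
# Lookup table from short name back to full name
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = FALSE)
left_join(df,df1,by="newnames")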
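
To see what the two regex steps produce, here is a minimal sketch of the intermediate values, using the same names vector. The word-boundary variant at the end (the strict pattern and the "hulk" test string) is my own assumption, not part of the answer above; it only guards against a short name matching inside a longer word.

```r
names <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs",
           "chennai petroleum corp ltd")

# First word of each name
newnames <- gsub("^(\\w+).*", "\\1", names)
newnames
# "emami" "jsw" "abc" "hul" "tcs" "chennai"

# Single alternation pattern used for the search
pattern <- paste0(newnames, collapse = "|")
pattern
# "emami|jsw|abc|hul|tcs|chennai"

# Hypothetical stricter variant: match whole words only
strict <- paste0("\\b(", pattern, ")\\b")
grepl(pattern, "hulk toys q4 results")              # TRUE, "hul" matches inside "hulk"
grepl(strict, "hulk toys q4 results", perl = TRUE)  # FALSE
```

If partial-word hits are a concern in your data, the strict pattern can be passed to gregexpr with perl = TRUE in place of the plain one.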

      



You can also use the stringr library, as shown below:

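library(stringr)
library(dplyr)  # for left_join()

# Reduce each name to its first word
newnames <- str_replace(names,"^(\\w+).*","\\1")
# First short name found in each title (NA when none is found)
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = FALSE)
left_join(df,df1,by="newnames")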

      

Output :

    > left_join(df,df1,by="newnames")
                                                 titles newnames                      names
1      emami paper mills slips 10% on dismal q4 numbers    emami                  emami ltd
2  jsw steel q4fy17 standalone net profit rises 173.33%      jsw              jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2      hul              hul india ltd
4   chennai petroleum, allsec tech slip 6-7% on poor q4  chennai chennai petroleum corp ltd

      



Remove ltd from your names:



names <- gsub(" ltd","",names)
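
A quick check of that suggestion on the question's data, as a sketch: after stripping the suffix, only names whose remaining words appear verbatim in a title will match, so multi-word names such as "hul india" still need the first-word approach from the other answers. The titles vector is copied from the question; stripped and found are illustrative names of my own.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
names <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs",
           "chennai petroleum corp ltd")

# Strip a trailing " ltd" only
stripped <- sub(" ltd$", "", names)

# Does each stripped name occur literally in any title?
found <- vapply(stripped, function(n) any(grepl(n, titles, fixed = TRUE)), logical(1))
names(found)[found]
# "emami" "jsw steel"
```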

      



It is also possible to use a sqldf "fuzzy" merge for this kind of task.

Build the lookup table:

names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)

      

Merge:

library(sqldf)
res <- sqldf("SELECT l.*, r.name
       FROM df as l
       LEFT JOIN names as r
       ON l.titles LIKE '%'||r.lookup||'%'")

      

A few notes: I'm pulling the first word from each name since you said you wanted "hul", not "hul india". Also, in SQL, || means concatenate and % is a wildcard that matches anything, so the join matches whenever a lookup word appears anywhere in the title, no matter what comes before or after it.
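
If you would rather not go through SQL, the same "contains" test can be sketched in base R. This is an assumed equivalent, not what sqldf runs internally. Two differences to keep in mind: SQLite's LIKE ignores case for ASCII letters by default, whereas grepl with fixed = TRUE is case-sensitive, and a LEFT JOIN returns one row per match while this sketch keeps only the first match per title. The names hit and matched are illustrative.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
lookup <- c("emami", "jsw", "abc", "hul", "tcs")

# hit[i, j] is TRUE when title i contains lookup j, like: titles LIKE '%lookup%'
hit <- sapply(lookup, function(l) grepl(l, titles, fixed = TRUE))

# First matching lookup per title; NA when none matches (the row is still kept)
matched <- apply(hit, 1, function(r) if (any(r)) lookup[which(r)[1]] else NA_character_)
data.frame(titles = titles, lookup = matched)
```

With the lookup table from this answer, the fourth title gets NA because no "chennai" entry was included.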


Another option, using Reduce and then merge:

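# Replace each title that contains a lookup word with that word, one lookup at a time
df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), as.character(names$lookup)))
# Join on the shared "lookup" column; titles without a match are dropped
merge(df, names)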
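
To make the Reduce step less opaque, here is a sketch that unrolls it with accumulate = TRUE on a two-title subset. The helper name step and the variable states are mine; passing the titles as init is equivalent to putting them first in the list as above.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%")
lookup <- c("emami", "jsw")

# One step: every element of x that contains y is overwritten with y
step <- function(x, y) { x[grepl(y, x)] <- y; x }

# Keep every intermediate state, starting from the raw titles
states <- Reduce(step, lookup, titles, accumulate = TRUE)

states[[2]]  # after "emami": first title replaced, second untouched
states[[3]]  # after "jsw":   "emami" "jsw"
```

Because each step overwrites whole elements, a title that matches no lookup word keeps its full text, which is why the inner join in merge then drops it.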

      



To add to the previous answer, I created a function that includes some of the previous comments:

df <-  data.frame(title=c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2"))


names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs")

find_string <- function(data,names){

    ### Clean the names: keep only the first word
    newnames <- gsub("^(\\w+).*","\\1",names)

    ### Start with NA so that unmatched titles stay NA
    data$names <- NA_character_

    ### Loop over the names to find which sentences contain them
    for(i in seq_along(newnames)){
        hits <- grep(newnames[i],data$title)

        if(length(hits) != 0){
            data$names[hits] <- newnames[i]

        }else{
            print(paste(names[i],"not found in the data!"))
        }
    }
    return(data)
}

### Run the function

find_string(df,names)

      

Hope this helps!
