Assign a unique index to similar strings in R based on Levenshtein distance

I am trying to figure out how to assign an index that identifies similar strings in R. Example data below:

test_data <- data.frame(char_name = c("star Lord", "Star Lords", "Star Lords", "Star Lord", 
                                      rep("Gamora", 2), rep("GamOOOra", 2)),
                        address = rep(c("Space", "Universe"), 4),
                        phone = c(rep(123, 4), rep(456, 4)))

      

And the desired output:

output_data <- data.frame(char_name = c("star Lord", "Star Lords", "Star Lords", "Star Lord", 
                                      rep("Gamora", 2), rep("GamOOOra", 2)),
                        address = (c(rep(c("Space", "Universe"), 4))),
                        phone = c(rep(123, 4), rep(456, 4)), 
                        same_person_ind = c(rep(1, 4), rep(2, 4)))

      

Logic for same_person_ind:

  • Group members with similar char_name together, based on a Levenshtein distance less than or equal to 3.
  • For each group of similar char_name, if the Levenshtein distance of address OR phone is less than or equal to 3, then a unique identifier is assigned to the group.

I've looked at both the stringdist and dplyr packages, but I don't know how to implement my logic in R. Any help would be greatly appreciated.

Many thanks,



1 answer


Here's a helper function that groups the elements of a vector that lie within a given distance of each other. I am using adist here:

### col.     : vector of words to group by distance
### max_dist : words are grouped when their distance is strictly below this value
create_groups <- 
  function(col., max_dist = 3) { 
    nn <- as.character(col.)
    ## one logical column per distinct neighbourhood: TRUE where adist < max_dist
    grp_names_id <- 
      as.data.frame(t(unique(adist(nn) < max_dist)))

    ## words belonging to group x, labelled with the group number
    .to_data_frame <- 
      function(x)
        data.frame(char_name = nn[grp_names_id[, x]], grp = x)
    res <- 
      unique(do.call(rbind,
                     lapply(seq_len(ncol(grp_names_id)),
                            .to_data_frame)))

    res
  }
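To see what the helper is comparing, it can help to look at the distance matrix directly. The sketch below uses only names from the question's data; the variable names `nms` and `d` are mine and not part of the helper. Note that the helper tests `adist(nn) < max_dist`, so with the default `max_dist = 3` a distance of exactly 3 is not treated as similar.

```r
# Names taken from the question's example data
nms <- c("star Lord", "Star Lords", "Star Lord", "Gamora", "GamOOOra")

# adist() (package utils, attached by default) returns the matrix of
# pairwise generalized Levenshtein distances; it is case-sensitive by default
d <- adist(nms)
dimnames(d) <- list(nms, nms)

d["star Lord", "Star Lords"]  # 2: one substitution plus one insertion
d["Gamora", "GamOOOra"]       # 3: one substitution plus two insertions

# Strict versus non-strict comparison at the threshold
d["Gamora", "GamOOOra"] < 3   # FALSE
d["Gamora", "GamOOOra"] <= 3  # TRUE
```

If "less than or equal to 3" is what you need, calling the helper with `max_dist = 4` would be one way to get it, since the distances are whole numbers.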

      

For example, applying this to char_name, we get 3 groups:

res <- create_groups(test_data$char_name)
##    char_name grp
## 1  star Lord   1
## 2 Star Lords   1
## 4  Star Lord   1
## 5     Gamora   2
## 7   GamOOOra   3
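If Gamora and GamOOOra should end up in the same group, as in the desired output, a different technique may be worth trying: single-linkage hierarchical clustering on the same distances, cut at height 3. This is not what `create_groups` does, only a sketch of an alternative, and the names `nms`, `tree` and `grp` are mine. It assumes that "similar" may be transitive, i.e. two names share a group whenever a chain of names connects them with each step at distance 3 or less.

```r
nms <- c("star Lord", "Star Lords", "Star Lords", "Star Lord",
         rep("Gamora", 2), rep("GamOOOra", 2))

# Single-linkage clustering on the pairwise Levenshtein distances
tree <- hclust(as.dist(adist(nms)), method = "single")

# Cut the tree at height 3: every merge at distance <= 3 is kept
grp <- cutree(tree, h = 3)
grp
## [1] 1 1 1 1 2 2 2 2
```

Be aware that single linkage can chain: on larger data, a long sequence of small edits may connect names that are far apart from each other, so the threshold needs checking against real data.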

      



Applying this to your data and combining the result:

res <- create_groups(test_data$char_name)
res <- merge(test_data, res)
##    char_name  address phone grp
## 1   GamOOOra    Space   456   3
## 2   GamOOOra Universe   456   3
## 3     Gamora    Space   456   2
## 4     Gamora Universe   456   2
## 5  star Lord    Space   123   1
## 6  Star Lord Universe   123   1
## 7 Star Lords Universe   123   1
## 8 Star Lords    Space   123   1

      

Now the idea is to apply the same process within each subgroup formed in the previous step. It is natural to use data.table here to apply operations by group. For example:

library(data.table)
setkey(setDT(res), grp, char_name)

## run create_groups() on address within each (grp, char_name) subgroup
res[, c("key", "grp1") := create_groups(address), by = "grp,char_name"]

##     char_name  address phone grp      key grp1
## 1:  star Lord    Space   123   1    Space    1
## 2:  Star Lord Universe   123   1    Space    1
## 3: Star Lords Universe   123   1    Space    1
## 4: Star Lords    Space   123   1 Universe    2
## 5:     Gamora    Space   456   2    Space    1
## 6:     Gamora Universe   456   2 Universe    2
## 7:   GamOOOra    Space   456   3    Space    1
## 8:   GamOOOra Universe   456   3 Universe    2

      
