Assign a unique index to similar strings in R based on Levenshtein distance
I am trying to figure out how to assign an index that identifies similar strings in R. Example data:
test_data <- data.frame(char_name = c("star Lord", "Star Lords", "Star Lords", "Star Lord",
rep("Gamora", 2), rep("GamOOOra", 2)),
address = rep(c("Space", "Universe"), 4),
phone = c(rep(123, 4), rep(456, 4)))
And the desired output:
output_data <- data.frame(char_name = c("star Lord", "Star Lords", "Star Lords", "Star Lord",
rep("Gamora", 2), rep("GamOOOra", 2)),
address = (c(rep(c("Space", "Universe"), 4))),
phone = c(rep(123, 4), rep(456, 4)),
same_person_ind = c(rep(1, 4), rep(2, 4)))
Logic for same_person_ind:
- Group rows with similar char_name together, based on a Levenshtein distance less than or equal to 3.
- For each group of similar char_name, if the Levenshtein distance of address OR phone is less than or equal to 3, assign a unique identifier to the group.
I've looked at the stringdist and dplyr packages, but I don't know how to implement this logic in R. Any help would be greatly appreciated.
Many thanks,
Here's a helper function that groups the elements of a vector that lie within a given distance of each other. I am using adist here:
### col.     : vector of words to group by distance
### max_dist : words closer than this (strictly less than) count as similar
create_groups <- function(col., max_dist = 3) {
  nn <- as.character(col.)
  # Each unique row of the logical distance matrix defines one group
  grp_names_id <- as.data.frame(t(unique(adist(nn) < max_dist)))
  # Build a (char_name, grp) data frame for group number x
  .to_data_frame <- function(x)
    data.frame(char_name = nn[grp_names_id[, x]], grp = x)
  res <- unique(do.call(rbind,
                        lapply(seq_len(ncol(grp_names_id)), .to_data_frame)))
  res
}
For example, applying this to char_name, we get 3 groups:
res <- create_groups(test_data$char_name)
## char_name grp
## 1 star Lord 1
## 2 Star Lords 1
## 4 Star Lord 1
## 5 Gamora 2
## 7 GamOOOra 3
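To see why Gamora and GamOOOra land in separate groups, it may help to look at the pairwise distances directly. This is a minimal sketch using base R's adist on the unique names from the question; the variable names nm and d are only for illustration.

```r
# Unique names from the question's example data
nm <- c("star Lord", "Star Lords", "Star Lord", "Gamora", "GamOOOra")

# Pairwise Levenshtein distances (adist() comes with base R's utils package)
d <- adist(nm)
dimnames(d) <- list(nm, nm)
d

# The helper compares with a strict "<", so a distance equal to max_dist does not match
d["Gamora", "GamOOOra"]       # 3: not below the default max_dist = 3
d["star Lord", "Star Lords"]  # 2: below 3, so these are grouped
```

Since the question asks for "less than or equal to 3", calling create_groups with max_dist = 4 would be one way to make a distance of exactly 3 count as similar.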
Applying this to your data and combining the result:
res <- create_groups(test_data$char_name)
res <- merge(test_data, res)
## char_name address phone grp
## 1 GamOOOra Space 456 3
## 2 GamOOOra Universe 456 3
## 3 Gamora Space 456 2
## 4 Gamora Universe 456 2
## 5 star Lord Space 123 1
## 6 Star Lord Universe 123 1
## 7 Star Lords Universe 123 1
## 8 Star Lords Space 123 1
Now the idea is to apply the same process to the subgroups formed in the previous step. data.table is a natural fit here for applying operations by group. For example:
library(data.table)
setkey(setDT(res),grp,char_name)
res[,c("key","grp1"):= {
create_groups(address)
},"grp,char_name"]
## char_name address phone grp key grp1
## 1: star Lord Space 123 1 Space 1
## 2: Star Lord Universe 123 1 Space 1
## 3: Star Lords Universe 123 1 Space 1
## 4: Star Lords Space 123 1 Universe 2
## 5: Gamora Space 456 2 Space 1
## 6: Gamora Universe 456 2 Universe 2
## 7: GamOOOra Space 456 3 Space 1
## 8: GamOOOra Universe 456 3 Universe 2
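The steps above stop short of a single same_person_ind column. As a possible alternative for the name-matching step only, here is a sketch that swaps in a different technique: single-linkage hierarchical clustering on the adist matrix, cut at height 3. It is not the create_groups approach, it ignores address and phone, and it chains names together through intermediate spellings, which may or may not be what you want.

```r
# Names from the question's example data
char_name <- c("star Lord", "Star Lords", "Star Lords", "Star Lord",
               rep("Gamora", 2), rep("GamOOOra", 2))

# Pairwise Levenshtein distances, converted to a dist object for hclust()
d <- as.dist(adist(char_name))

# Single linkage merges two clusters as soon as any pair of members is close
tree <- hclust(d, method = "single")

# Cut at height 3: names linked by a chain of distances <= 3 share an id
same_person_ind <- cutree(tree, h = 3)

data.frame(char_name, same_person_ind)
```

On this data the cut puts the four Star Lord variants in one group and the Gamora variants in another, because Gamora and GamOOOra are exactly 3 edits apart and a cut at h = 3 keeps merges at that height.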