Calculating Levenshtein Distance QWERTY Resolution in R

I want to calculate the Levenshtein distance in R between user-entered Fortune 1000 usernames, but taking QWERTY typographical errors into account. For example, it Mcdimldes

should have a distance of 2 from McDonalds

because it i

is next to o

and m

next to n

.

There was another attempt at implementation, but in Python (click here). Any help is greatly appreciated.

Please let me know if more information needs to be added to clarify the issue.

+3


source to share


1 answer


Maybe you can build something on this:

## from the link in the linked python answer:
# txt <- "'q': {'x':0, 'y':0}, 'w': {'x':1, 'y':0}, 'e': {'x':2, 'y':0}, 'r': {'x':3, 'y':0}, 't': {'x':4, 'y':0}, 'y': {'x':5, 'y':0}, 'u': {'x':6, 'y':0}, 'i': {'x':7, 'y':0}, 'o': {'x':8, 'y':0}, 'p': {'x':9, 'y':0}, 'a': {'x':0, 'y':1},'z': {'x':0, 'y':2},'s': {'x':1, 'y':1},'x': {'x':1, 'y':2},'d': {'x':2, 'y':1},'c': {'x':2, 'y':2}, 'f': {'x':3, 'y':1}, 'b': {'x':4, 'y':2}, 'm': {'x':5, 'y':2}, 'j': {'x':6, 'y':1}, 'g': {'x':4, 'y':1}, 'h': {'x':5, 'y':1}, 'j': {'x':6, 'y':1}, 'k': {'x':7, 'y':1}, 'l': {'x':8, 'y':1}, 'v': {'x':3, 'y':2}, 'n': {'x':5, 'y':2}"
# txt <- strsplit(txt, "\\},\\s?")[[1]]
# m <- t(sapply(regmatches(txt, regexec("'(.)':\\s*\\{'x':(\\d+),\\s*'y':(\\d+).*", txt)), "[", -1))
# m <- matrix(as.numeric(m[,-1]), ncol=2, dimnames = list(m[,1],c("x","y")))
# dput(m)
m <- structure(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 1, 1, 2, 2, 3, 
  4, 5, 6, 4, 5, 6, 7, 8, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
  2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2), .Dim = c(27L, 
  2L), .Dimnames = list(c("q", "w", "e", "r", "t", "y", "u", "i", 
  "o", "p", "a", "z", "s", "x", "d", "c", "f", "b", "m", "j", "g", 
  "h", "j", "k", "l", "v", "n"), c("x", "y")))
m["m", ] <- c(6,2) # 5,2 seems wrong...

f <- function(a, b) {
  posis <- lapply(strsplit(c(a, b), "", T), function(x) m[x,,drop=F])
  d <- abs(posis[[1]]-posis[[2]])
  idx <- which(rowSums(d>1)==0)
  if (length(idx)>0) rownames(posis[[1]])[idx] <- rownames(posis[[2]])[idx]
  paste(rownames(posis[[1]]), collapse="")
}
a <- tolower("Mcdimldes") # make it case-insensitive
b <- tolower("McDonalds")
adist(a,b) # regular distance
# [1,]    4
newa <- f(a, b) # replace possible typo chars
adist(newa,b) # new dist is 2 - as requested
#      [,1]
# [1,]    2

      

Matrix keyboard layout:



keyb <- sweep(m, 2, c(1, -1), "*")
plot(keyb, type = "n")
text(keyb, rownames(keyb))
grid()

      

enter image description here

+1


source







All Articles