Calculating Levenshtein Distance QWERTY Resolution in R
I want to calculate the Levenshtein distance in R between user-entered Fortune 1000 usernames, but taking QWERTY typographical errors into account. For example, it Mcdimldes
should have a distance of 2 from McDonalds
because it i
is next to o
and m
next to n
.
There was another attempt at implementation, but in Python (click here). Any help is greatly appreciated.
Please let me know if more information needs to be added to clarify the issue.
source to share
Maybe you can build something on this:
## from the link in the linked python answer:
# txt <- "'q': {'x':0, 'y':0}, 'w': {'x':1, 'y':0}, 'e': {'x':2, 'y':0}, 'r': {'x':3, 'y':0}, 't': {'x':4, 'y':0}, 'y': {'x':5, 'y':0}, 'u': {'x':6, 'y':0}, 'i': {'x':7, 'y':0}, 'o': {'x':8, 'y':0}, 'p': {'x':9, 'y':0}, 'a': {'x':0, 'y':1},'z': {'x':0, 'y':2},'s': {'x':1, 'y':1},'x': {'x':1, 'y':2},'d': {'x':2, 'y':1},'c': {'x':2, 'y':2}, 'f': {'x':3, 'y':1}, 'b': {'x':4, 'y':2}, 'm': {'x':5, 'y':2}, 'j': {'x':6, 'y':1}, 'g': {'x':4, 'y':1}, 'h': {'x':5, 'y':1}, 'j': {'x':6, 'y':1}, 'k': {'x':7, 'y':1}, 'l': {'x':8, 'y':1}, 'v': {'x':3, 'y':2}, 'n': {'x':5, 'y':2}"
# txt <- strsplit(txt, "\\},\\s?")[[1]]
# m <- t(sapply(regmatches(txt, regexec("'(.)':\\s*\\{'x':(\\d+),\\s*'y':(\\d+).*", txt)), "[", -1))
# m <- matrix(as.numeric(m[,-1]), ncol=2, dimnames = list(m[,1],c("x","y")))
# dput(m)
m <- structure(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 1, 1, 2, 2, 3,
4, 5, 6, 4, 5, 6, 7, 8, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2), .Dim = c(27L,
2L), .Dimnames = list(c("q", "w", "e", "r", "t", "y", "u", "i",
"o", "p", "a", "z", "s", "x", "d", "c", "f", "b", "m", "j", "g",
"h", "j", "k", "l", "v", "n"), c("x", "y")))
m["m", ] <- c(6,2) # 5,2 seems wrong...
f <- function(a, b) {
posis <- lapply(strsplit(c(a, b), "", T), function(x) m[x,,drop=F])
d <- abs(posis[[1]]-posis[[2]])
idx <- which(rowSums(d>1)==0)
if (length(idx)>0) rownames(posis[[1]])[idx] <- rownames(posis[[2]])[idx]
paste(rownames(posis[[1]]), collapse="")
}
a <- tolower("Mcdimldes") # make it case-insensitive
b <- tolower("McDonalds")
adist(a,b) # regular distance
# [1,] 4
newa <- f(a, b) # replace possible typo chars
adist(newa,b) # new dist is 2 - as requested
# [,1]
# [1,] 2
Matrix keyboard layout:
keyb <- sweep(m, 2, c(1, -1), "*")
plot(keyb, type = "n")
text(keyb, rownames(keyb))
grid()
source to share