Measuring distance on the Tanimoto scale
The Tanimoto similarity coefficient (which is not a true distance metric) is defined as
d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|)- x.y)
for bit vectors x and y.
Now compare this with the cosine similarity coefficient,
d(x,y) = x.y / (|x| * |y|)
The numerators are identical; only the denominators differ. The Tanimoto and cosine similarity coefficients will be the same if x.y is equal to zero, since both are then 0 (for nonzero x and y).
Geometrically, x.y is equal to zero if and only if x and y are perpendicular.
Since x and y are bit vectors (that is, their value in each dimension can only be 0 or 1), x.y equal to zero means
x1*y1 + x2*y2 + ... + xn*yn = 0
Each term xi * yi is either 0 or 1. If any xi * yi = 1 * 1 = 1, the whole sum is positive. For the entire sum to be equal to zero, no term xi * yi can equal 1. Every term must be equal to 0:
x1*y1 = 0, x2*y2 = 0, ..., xn*yn = 0
In other words, x and y never have a 1 in the same position: if xi is 1, then yi must be 0, and vice versa (both may be 0).
Thus, there are many examples where the Tanimoto similarity is equal to the cosine similarity, e.g.:
x = (0,1,0,1)
y = (1,0,0,0)
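As a quick check of the argument above, here is a minimal Python sketch that evaluates both coefficients directly from the two formulas. The helper names `dot`, `tanimoto`, and `cosine`, and the extra vector `z`, are illustrative and not part of the original answer; both functions assume nonzero vectors so the denominators are not zero.

```python
import math

def dot(x, y):
    # Plain dot product of two equal-length sequences.
    return sum(a * b for a, b in zip(x, y))

def tanimoto(x, y):
    # x.y / (|x|*|x| + |y|*|y| - x.y), as in the definition above.
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def cosine(x, y):
    # x.y / (|x| * |y|)
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

x = (0, 1, 0, 1)
y = (1, 0, 0, 0)   # shares no 1-position with x, so x.y = 0
z = (0, 1, 1, 1)   # hypothetical vector that overlaps x in two positions

print(tanimoto(x, y), cosine(x, y))  # 0.0 0.0
print(tanimoto(x, z), cosine(x, z))  # the two coefficients differ here
```

For the overlapping pair the dot product is 2, so the Tanimoto value is 2/3 while the cosine value is 2/sqrt(6); the coefficients coincide on the perpendicular pair only because the shared numerator is zero.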
Even though the general form of Tanimoto distance has been presented, you should always remember that computationally there is a binary form and a continuous form.
Binary form:
d(x,y) = n(X ∩ Y) / [ n(X) + n(Y) - n(X ∩ Y) ]
while continuous form:
d(x,y) = X.Y / (||X||^2 + ||Y||^2 - X.Y)
The difference matters in practice. If a programmer is writing the code for you, tell them that n(X ∩ Y), n(X), and n(Y) only involve counting the number of ones in the vectors. For the continuous form, ||X|| is the length of the vector X from the origin (also called the norm), that is, the square root of (X1^2 + X2^2 + ... + Xp^2), and the formula uses its square, so the sum of squares has to be computed from the actual component values. For the binary form that sum of squares is simply the count of ones, so taking square roots is unnecessary and would be wasteful when analyzing big data, as square roots are comparatively expensive. For the continuous variant, the full sum of squares must be computed.
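The two forms can be sketched side by side in Python. This is only an illustration under stated assumptions: the function names are hypothetical, bit vectors are represented as non-negative Python integers, and the continuous version uses the sum of squares for each vector in the denominator, matching the |x|*|x| terms of the definition at the top of the thread. Neither function guards against an all-zero input.

```python
def tanimoto_binary(x_bits, y_bits):
    # Bit vectors stored as non-negative ints; only counting ones is needed.
    common = bin(x_bits & y_bits).count("1")   # n(X ∩ Y)
    n_x = bin(x_bits).count("1")               # n(X)
    n_y = bin(y_bits).count("1")               # n(Y)
    return common / (n_x + n_y - common)

def tanimoto_continuous(x, y):
    # Real-valued vectors; sums of squares are computed from the values.
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return xy / (xx + yy - xy)

# The same pair of bit vectors, once as ints and once as tuples.
print(tanimoto_binary(0b0101, 0b0111))
print(tanimoto_continuous((0, 1, 0, 1), (0, 1, 1, 1)))

# A real-valued pair, where counting ones would not apply.
print(tanimoto_continuous((0.5, 2.0), (1.0, 1.0)))
```

On 0/1 data both functions return the same value, which is why the cheaper counting version is enough for the binary case.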
To summarize, always remember that there are two forms of the Tanimoto coefficient: binary and continuous.