Measuring distance on the Tanimoto scale

Can two objects have the same cosine coefficient and Tanimoto distance, where

Tanimoto distance measure, d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|) - x.y)

      

and

cosine measure, d(x,y) = x.y / (|x| * |y|)

      



2 answers


The Tanimoto similarity coefficient (which is not a true distance measure) is defined as

d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|)- x.y)

      

for bit vectors x and y.

Now compare this with the cosine similarity coefficient:

 d(x,y) = x.y / (|x| * |y|)

      

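The two coefficients share the numerator x.y and differ only in their denominators. For nonzero x and y, both are therefore zero, and hence equal, whenever x.y is zero. (They are also equal, both being 1, when x and y are identical.)

Geometrically, x.y is zero if and only if x and y are perpendicular.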

Since x and y are bit vectors (that is, each component can only be 0 or 1), x.y being zero means



x1*y1 + x2*y2 + ... + xn*yn = 0

      

No term can be negative, so if any term xi * yi = 1 * 1 = 1, the whole sum is positive. For the sum to be zero, no term xi * yi can equal 1; every term must equal 0:

x1*y1 = 0
x2*y2 = 0
...
xn*yn = 0

      

In other words, wherever xi is 1, yi must be 0, and wherever yi is 1, xi must be 0 (both may be 0).
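The argument above can be checked by brute force. The following is a minimal sketch, not part of the original answer: the helper names `dot` and `share_a_one` and the vector length 4 are illustrative assumptions. It confirms, for every pair of 4-bit vectors, that x.y is zero exactly when no position holds a 1 in both vectors.

```python
from itertools import product

def dot(x, y):
    """Dot product x1*y1 + ... + xn*yn of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def share_a_one(x, y):
    """True if some position i has xi == 1 and yi == 1."""
    return any(a == 1 and b == 1 for a, b in zip(x, y))

n = 4  # illustrative length; any small n works
for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        # x.y == 0 holds exactly when the vectors never have a 1 in the same place
        assert (dot(x, y) == 0) == (not share_a_one(x, y))

print("checked all", 2 ** n * 2 ** n, "pairs")
```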

Thus, there are many pairs for which the Tanimoto similarity equals the cosine similarity, e.g.

x = (0,1,0,1)
y = (1,0,0,0)
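To make the example concrete, here is a small sketch that evaluates both coefficients with the formulas given in this answer. The function names are illustrative assumptions, and the code assumes neither vector is all zeros (otherwise the denominators vanish). Besides the pair above, it prints the values for an identical pair and for a partially overlapping pair `z`, which is an extra vector chosen here for comparison.

```python
import math

def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def tanimoto(x, y):
    """x.y / (|x|*|x| + |y|*|y| - x.y); assumes x and y are not both zero."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def cosine(x, y):
    """x.y / (|x| * |y|); assumes both vectors are nonzero."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

x = (0, 1, 0, 1)
y = (1, 0, 0, 0)
z = (0, 1, 1, 0)  # shares exactly one 1 with x

print(tanimoto(x, y), cosine(x, y))  # 0.0 0.0 (perpendicular)
print(tanimoto(x, x), cosine(x, x))  # 1.0 1.0 (identical)
print(tanimoto(x, z), cosine(x, z))  # about 0.333 versus 0.5 (they differ)
```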



Even though the general form of Tanimoto distance has been presented, you should always remember that computationally there is a binary form and a continuous form.

Binary form:

d(x,y) = n(X ∩ Y) / [ n(X) + n(Y) - n(X ∩ Y) ]

      

and the continuous form:

d(x,y) = X.Y / (||X||^2 + ||Y||^2 - X.Y)

The difference matters in implementation. If someone is coding this for you, tell them that n(X ∩ Y), n(X), and n(Y) only involve counting the ones in the vectors. For ||X||^2 and ||Y||^2, specify the sum of squares X1^2 + X2^2 + ... + Xp^2. ||X|| itself, the length of the X vector from the origin (also called the norm), is the square root of that sum, but the formula uses the squared norm, so no square root has to be computed in either form. For 0/1 vectors the sum of squares is simply the count of ones, so the binary form is the continuous form specialized to bit vectors. Computing it by counting avoids floating-point multiplications, which matters when analyzing big data.
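As a rough sketch of the two forms, the code below implements the counting version and the dot-product version side by side. It assumes the squared-norm denominator |x|*|x| + |y|*|y| - x.y given in the first answer, and the function names and sample vectors are illustrative, not from the original post. On 0/1 input both functions return the same value.

```python
def tanimoto_binary(x, y):
    """Counting form n(X ∩ Y) / (n(X) + n(Y) - n(X ∩ Y)) for 0/1 sequences."""
    both = sum(1 for a, b in zip(x, y) if a and b)  # positions where both are 1
    nx = sum(1 for a in x if a)                     # ones in x
    ny = sum(1 for b in y if b)                     # ones in y
    return both / (nx + ny - both)

def tanimoto_continuous(x, y):
    """Dot-product form X.Y / (X.X + Y.Y - X.Y) for real-valued sequences."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)  # squared norm of x, no square root needed
    yy = sum(b * b for b in y)  # squared norm of y
    return xy / (xx + yy - xy)

bits_x = (1, 1, 0, 1)
bits_y = (1, 0, 0, 1)
print(tanimoto_binary(bits_x, bits_y))      # 2 / (3 + 2 - 2)
print(tanimoto_continuous(bits_x, bits_y))  # same value on 0/1 input
print(tanimoto_continuous((0.5, 2.0, 1.0), (1.0, 1.0, 0.0)))  # real-valued input
```

Neither function guards against two all-zero vectors, where the denominator is zero; a real implementation would need to decide what to return in that case.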

To summarize, always remember that the Tanimoto coefficient comes in two forms: binary and continuous.
