We need a similarity measure for these vectors

Question

I have a Python function that takes a block of text and returns a special 2D vector (a dictionary of integer lists) whose size depends on a chosen length n. Sample output might look like this:

1: [6, 8, 1]
2: [6, 16, 4, 4, 5, 11, 5, 8]
3: [4, 7, 8, 4]
..
..
n: [5, 2, 1, 4, 5, 6]

Keys 1 through n represent positions in the input text; for example, if n = 12, key 5 holds data from roughly 5/12 of the way through the document.

The length of the integer list at each key is arbitrary, so another block of text with the same value of n might well produce this:

1: [4, 5, 16, 7, 6]
2: None
3: [7, 9, 12]
..
..
n: [3]

I want to create a similarity measure for any two such vectors of the same length n. One thing I've tried is to consider only the average of each integer list in the dictionary, which gives simple 1D vectors that can be compared with plain cosine similarity.

But that loses a little more information than I would like (not to mention the problems with the occasional None values).
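
For concreteness, here is a minimal sketch of the averaging approach described above. The helper names `mean_vector` and `cosine_similarity` are illustrative only, and treating `None` (or a missing key) as 0.0 is just one assumed way of handling the gaps, not necessarily the right one.

```python
import math

def mean_vector(rep, n, fill=0.0):
    """Collapse each key's integer list to its mean.

    Keys that are missing, None, or empty become `fill` (an assumption).
    """
    out = []
    for key in range(1, n + 1):
        values = rep.get(key)
        out.append(sum(values) / len(values) if values else fill)
    return out

def cosine_similarity(u, v):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# The first three keys of the two sample outputs shown earlier.
doc_a = {1: [6, 8, 1], 2: [6, 16, 4, 4, 5, 11, 5, 8], 3: [4, 7, 8, 4]}
doc_b = {1: [4, 5, 16, 7, 6], 2: None, 3: [7, 9, 12]}

similarity = cosine_similarity(mean_vector(doc_a, 3), mean_vector(doc_b, 3))
```

Note that every list collapses to a single number here, so two keys with very different distributions but the same mean look identical, which is exactly the information loss mentioned above.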

Since I can create different vectors (different "granularities" of the representation) by choosing different values of n, would there be any value in comparing two documents by creating multiple pairs of vectors over a range of matching values of n and then taking the mean of the resulting similarities?
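
A rough sketch of that multi-granularity idea, under stated assumptions: `reps_a` and `reps_b` are hypothetical dictionaries mapping each n to the representation produced at that n, each representation is reduced to per-key means, and `None` or missing keys count as 0.0. The function name `multi_granularity_similarity` is made up for illustration.

```python
import math

def _mean_vector(rep, n):
    # None, missing, or empty keys are treated as 0.0 (an assumption).
    return [sum(rep[k]) / len(rep[k]) if rep.get(k) else 0.0
            for k in range(1, n + 1)]

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def multi_granularity_similarity(reps_a, reps_b):
    """Average the per-n cosine similarities over the values of n both documents share."""
    shared = sorted(set(reps_a) & set(reps_b))
    if not shared:
        raise ValueError("no matching values of n")
    scores = [_cosine(_mean_vector(reps_a[n], n), _mean_vector(reps_b[n], n))
              for n in shared]
    return sum(scores) / len(scores)

# Toy data: representations of two documents at n = 1 and n = 2.
reps_a = {1: {1: [6, 8, 1]}, 2: {1: [6, 8], 2: [1, 4]}}
reps_b = {1: {1: [4, 5]}, 2: {1: [4], 2: None}}

score = multi_granularity_similarity(reps_a, reps_b)
```

With non-negative means, n = 1 always yields a similarity of 1.0 (two one-element vectors point the same way), so very coarse granularities contribute little and might reasonably be left out or down-weighted.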

Or would it be better to approach it in a completely different way? I could simply represent the input texts as 1D vectors and still capture the idea I want, but they would have different lengths, which makes comparison difficult. (Come to think of it, having different lengths at each key in the original representation doesn't solve this problem either ... ha, but still ...)

python comparison algorithm vector cosine-similarity

nicole 04 Sep 14 at 21:54

No one has answered this question yet
