Need a similarity measure for these vectors

I have a Python function that takes a block of text and returns a special 2D vector / dictionary representation of it, for a chosen length n. Sample output might look like this:

1: [6, 8, 1]
2: [6, 16, 4, 4, 5, 11, 5, 8]
3: [4, 7, 8, 4]
..
..
n: [5, 2, 1, 4, 5, 6]


Keys 1 through n represent positions in the input text; for example, if n = 12, key 5 will contain data from roughly 5/12 of the way through the document.

The length of the int list at each key is arbitrary, so another block of text with the same value of n might well have produced this:

1: [4, 5, 16, 7, 6]
2: None
3: [7, 9, 12]
..
..
n: [3]
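
For concreteness, this is roughly the shape in Python (just the two samples above, truncated to n = 3 for brevity):

vec_a = {1: [6, 8, 1], 2: [6, 16, 4, 4, 5, 11, 5, 8], 3: [4, 7, 8, 4]}
vec_b = {1: [4, 5, 16, 7, 6], 2: None, 3: [7, 9, 12]}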


I want to create a similarity measure for any two such vectors of the same length n. One thing I've tried is to consider only the average of each integer list in the dictionary, which yields simple 1D vectors that allow straightforward cosine comparisons.

But that loses a little more information than I'd like (not to mention the problem of the occasional None values).
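
Here is a minimal sketch of that averaging approach, reusing vec_a / vec_b from above. Zero-filling the None entries is just one arbitrary choice (dropping those keys from both vectors would be another), and the helper names are my own:

import math

def to_mean_vector(vec, n, fill=0.0):
    # Collapse the dict-of-lists into a 1D vector of per-key means.
    # None or empty entries become `fill`.
    out = []
    for k in range(1, n + 1):
        xs = vec.get(k)
        out.append(sum(xs) / len(xs) if xs else fill)
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(to_mean_vector(vec_a, 3), to_mean_vector(vec_b, 3)))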

Since I can create different vectors / different "granularities" of view by choosing different values of n, would there be value in comparing two documents by creating multiple pairs of vectors over a range of matching n values and then taking the mean of the resulting similarities?
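
What I have in mind, as a sketch: vectorize(text, n) is a hypothetical stand-in for my extraction function, to_mean_vector / cosine are the helpers above, and the choice of scales is arbitrary:

def multi_scale_similarity(text_a, text_b, ns=(4, 8, 12)):
    # Compare the two documents at several granularities and average
    # the per-n cosine similarities.
    sims = []
    for n in ns:
        u = to_mean_vector(vectorize(text_a, n), n)  # vectorize() is hypothetical
        v = to_mean_vector(vectorize(text_b, n), n)
        sims.append(cosine(u, v))
    return sum(sims) / len(sims)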

Or would it be better to approach this in a completely different way? I could, for example, just treat the input texts as plain 1D vectors and still capture the idea I'm after, but those vectors would have different lengths, which makes comparison difficult. (Come to think of it, the varying list lengths at each key in the original representation don't escape this problem either... ha, but still...)
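
One standard workaround for the length mismatch, sketched here with NumPy (not something I've tried, and the grid length is arbitrary), would be to linearly resample each 1D sequence onto a fixed-length grid, after which the usual cosine comparison applies:

import numpy as np

def resample(xs, length=64):
    # Linearly interpolate a variable-length sequence onto a fixed
    # grid so two documents of different sizes become comparable;
    # like any resampling, this loses some fine detail.
    xs = np.asarray(xs, dtype=float)
    old = np.linspace(0.0, 1.0, num=len(xs))
    new = np.linspace(0.0, 1.0, num=length)
    return np.interp(new, old, xs)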
