Determine Rows That Satisfy The Distance Matrix

I am trying to create a list of rows from a clutter distance matrix. Each line must contain 20 characters with a 4-letter alphabet (A, B, C, D). For example, let's say I have the following distance matrix:

   S1 S2 S3
S1  0  5 12
S2  5  0 14
S3 12 14  0

      

From this matrix I need to create 3 rows, for example:

S1 = "ABBBBAAAAAAAAAABBBBB"
S2 = "BAAAAAAAAAAAAAABBBBB"
S3 = "CBBBABBBBBBBBBBBBBBB"

      

I created these rows manually, but I need to do it for a noisy distance matrix representing 100 rows, which is impractical to do manually. Can anyone suggest an algorithm that can do this?

Thank you, Chris

+3


source to share


1 answer


It's a fun exercise. :-)

The following octave

script randomly generates n

length lines len

. Subsequently, it calculates the noise distance between all these lines.

It will be shown below that the strings are compared in pairs. If, for example, you are searching [5 12 14]

, you will find a table n

to contain rows that are sized by 5

and 12

, as well as rows located by 12

and 14

. The next task is to find a circuit in which those that are 5

and 12

from each other, may be combined with those 12

and 14

segregated in such a way that the circuit "locked."



% We generate n strings of length len
n = 50;
len = 20;

% We have a categorical variable of size 4 (ABCD)
cat = 4;

% We want to generate strings that correspond with the following hamming distance matrix
search = [5 12 14];
% search = [10 12 14 14 14 16];
S = squareform (search);

% Note that we generate each string totally random. If you need small distances it makes sense to introduce
% correlations across the strings
X = randi (cat-1, n, len);

% Calculate the hamming distances
t = pdist (X, 'hamming') * len;

% The big matrix we have to find our little matrix S within
Y = squareform (t);

% All the following might be replaced by something like submatrix (Y, S) if that would exist
R = zeros (size (S), size (Y));
for j = 1: size (S)
    M = zeros (size (Y), size (S));
    for i = 1: size (Y)
        M (i,:) = ismember (S (j, :), Y (i, :));
    endfor
    R (j,:) = all (M ');
endfor

[x, y] = find (R);

% A will be a set of cells that contains the indices of the columns / rows that will make up our submatrices
A = accumarray (x, y, [], @ (v) {sort (v). '});

% If for example the distance 5 doesn't occur at all, we can already drop out
if (sum (cellfun (@ isempty, A))> 0)  
    printf ("There are no matches \ n");
    return
endif

% We are now gonna get all possible submatrices with the values ​​in "search"
C = cell (1, numel (A));
[C {:}] = ndgrid (A {:});

N = cell2mat (cellfun (@ (v) v (:), C, 'UniformOutput', false));
N = unique (sort (N, 2), 'rows');

printf ("Found% i potential matches (but contains duplicates) \ n", size (N, 1));

% We are now further filtering (remove duplicates)
[f, g] = mode (N, 2);
h = g == 1;
N = N (h, :);

printf ("Found% i potential matches \ n", size (N, 1));

M = zeros (size (N), size (search, 2));
for i = 1: size (N) 
    f = N (i, :);
    M (i,:) = squareform (Y (f, f)) ';
endfor

F = squareform (S) ';

% For now we forget about wrong permutations, so for search> 3 you need to filter these out!
M = sort (M, 2);
F = sort (F, 2);

% Get the sorted search string out of the (large) table M
% We search for the ones that "close" the circuit
D = ismember (M, F, 'rows');
mf = find (D);

if (mf) 
    matches = size (mf, 1);
    printf ("Found% i matches \ n", matches);  
    for i = 1: matches
        r = mf (i);
        printf ("We return match% i (only check permutations now) \ n", r);
        t = N (r, :) ';
        str = X (t, :);
        check = squareform (pdist (str, 'hamming') * len);
        strings = char (str + 64)
        check
    endfor
else
    printf ("There are no matches \ n");
endif

It will generate lines like:

ABAACCBCACABBABBAABA
ABACCCBCACBABAABACBA
CABBCBBBABCBBACAAACC
+1


source







All Articles