Determine Rows That Satisfy The Distance Matrix
I am trying to create a list of rows from a clutter distance matrix. Each line must contain 20 characters with a 4-letter alphabet (A, B, C, D). For example, let's say I have the following distance matrix:
S1 S2 S3
S1 0 5 12
S2 5 0 14
S3 12 14 0
From this matrix I need to create 3 rows, for example:
S1 = "ABBBBAAAAAAAAAABBBBB"
S2 = "BAAAAAAAAAAAAAABBBBB"
S3 = "CBBBABBBBBBBBBBBBBBB"
I created these rows manually, but I need to do it for a noisy distance matrix representing 100 rows, which is impractical to do manually. Can anyone suggest an algorithm that can do this?
Thank you, Chris
source to share
It's a fun exercise. :-)
The following octave
script randomly generates n
length lines len
. Subsequently, it calculates the noise distance between all these lines.
It will be shown below that the strings are compared in pairs. If, for example, you are searching [5 12 14]
, you will find a table n
to contain rows that are sized by 5
and 12
, as well as rows located by 12
and 14
. The next task is to find a circuit in which those that are 5
and 12
from each other, may be combined with those 12
and 14
segregated in such a way that the circuit "locked."
% We generate n strings of length len n = 50; len = 20; % We have a categorical variable of size 4 (ABCD) cat = 4; % We want to generate strings that correspond with the following hamming distance matrix search = [5 12 14]; % search = [10 12 14 14 14 16]; S = squareform (search); % Note that we generate each string totally random. If you need small distances it makes sense to introduce % correlations across the strings X = randi (cat-1, n, len); % Calculate the hamming distances t = pdist (X, 'hamming') * len; % The big matrix we have to find our little matrix S within Y = squareform (t); % All the following might be replaced by something like submatrix (Y, S) if that would exist R = zeros (size (S), size (Y)); for j = 1: size (S) M = zeros (size (Y), size (S)); for i = 1: size (Y) M (i,:) = ismember (S (j, :), Y (i, :)); endfor R (j,:) = all (M '); endfor [x, y] = find (R); % A will be a set of cells that contains the indices of the columns / rows that will make up our submatrices A = accumarray (x, y, [], @ (v) {sort (v). '}); % If for example the distance 5 doesn't occur at all, we can already drop out if (sum (cellfun (@ isempty, A))> 0) printf ("There are no matches \ n"); return endif % We are now gonna get all possible submatrices with the values ββin "search" C = cell (1, numel (A)); [C {:}] = ndgrid (A {:}); N = cell2mat (cellfun (@ (v) v (:), C, 'UniformOutput', false)); N = unique (sort (N, 2), 'rows'); printf ("Found% i potential matches (but contains duplicates) \ n", size (N, 1)); % We are now further filtering (remove duplicates) [f, g] = mode (N, 2); h = g == 1; N = N (h, :); printf ("Found% i potential matches \ n", size (N, 1)); M = zeros (size (N), size (search, 2)); for i = 1: size (N) f = N (i, :); M (i,:) = squareform (Y (f, f)) '; endfor F = squareform (S) '; % For now we forget about wrong permutations, so for search> 3 you need to filter these out! M = sort (M, 2); F = sort (F, 2); % Get the sorted search string out of the (large) table M % We search for the ones that "close" the circuit D = ismember (M, F, 'rows'); mf = find (D); if (mf) matches = size (mf, 1); printf ("Found% i matches \ n", matches); for i = 1: matches r = mf (i); printf ("We return match% i (only check permutations now) \ n", r); t = N (r, :) '; str = X (t, :); check = squareform (pdist (str, 'hamming') * len); strings = char (str + 64) check endfor else printf ("There are no matches \ n"); endif
It will generate lines like:
ABAACCBCACABBABBAABA ABACCCBCACBABAABACBA CABBCBBBABCBBACAAACC
source to share