Newick tree view for scipy.cluster.hierarchy binding matrix format

I have a set of genes that have been aligned and grouped based on DNA sequences and I have this set of genes in a Newick tree view ( https://en.wikipedia.org/wiki/Newick_format ). Does anyone know how to convert this format to scipy.cluster.hierarchy.linkage format? From the scipy docs for the relationship matrix:

A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z [i, 0] and Z [i, 1] are combined to form cluster n + i. the cluster with index less than n matches one of the n original observations. The distance between the clusters Z [i, 0] and Z [i, 1] is equal to the specified Z [i, 2]. The fourth value Z [i, 3] represents the number of initial observations in the newly formed cluster.

At least from the scipy docs, their description of how this matrix of links is structured is pretty confusing. What do they mean by "iteration"? Also, how does this view keep track of which source observations are in the cluster?

I would like to figure out how to do this transformation since the results of other cluster analyzes in my project were done with scipy view and I used it for build purposes.

+3


source to share


1 answer


I figured out how a matrix of relationships is created from a tree view, thanks to @cel for clarification. Take an example from the Newick wiki ( https://en.wikipedia.org/wiki/Newick_format )

Tree in string format:

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

      

First you need to calculate the distances between all the leaves. If, for example, we want to calculate the distance of A and B, then the method is to traverse the tree from A to B through the nearest branch. Since we are given the distance between each leaf and branch in the Newick format, the distance from A to B is simple 0.1 + 0.2 = 0.3

. For A-D we would have to do 0.1 + (0.5 + 0.4) = 1.0

since the distance from D to the nearest branch is set to 0.4 and the distance from D-branch to A is 0.5. Thus, the distance matrix looks like this (indexed A=0

, B=1

, C=2

, D=3

):

distance_matrix=
 [[0.0, 0.3, 0.9, 1.0],
  [0.3, 0.0, 1.0, 1.1],
  [0.9, 1.0, 0.0, 0.7],
  [1.0, 1.1, 0.1, 0.0]]

      

From here it is easy to find the matrix of connections. Since starting observations we already have clusters n=4

( A

, B

, C

, D

), we need to find additional clusters n-1

of wood. Each step simply merges two clusters into a new one, and we take the two clusters that are closest to each other. In this case, A and B are closer to each other, so the first row of the relationship matrix will look like this:

[A,B,0.3,2]

      

From now on, we consider A and B as one cluster, the distance to the nearest branch of which is the distance between A and B.

Now we've got 3 cluster AB

, C

and D

. We can update the distance matrix to see which clusters are closest to each other. Let AB

has an index 0

in the updated distance matrix.



distance_matrix=
[[0.0, 1.1, 1.2],
 [1.1, 0.0, 0.7],
 [1.2, 0.7, 0.0]]

      

Now we can see that C and D are closest to each other, so combine them into a new cluster. The second line in the matrix of links will now be

[C,D,0.7,2]

      

Now there are only two clusters left, AB

and CD

. The distance from these clusters to the root branch is 0.3 and 0.7, respectively, so their distance is 1.0. The last row of the binding matrix will be:

[AB,CD,1.0,4]

      

Now the scipy matrix doesn't actually have rows, as I showed here, we would use an indexing scheme since we first combined A and B, AB

would have index 4 and CD

would have index 5. So the actual result we should be seen in the scipy linkage matrix would be:

[[0,1,0.3,2],
 [2,3,0.7,2],
 [4,5,1.0,4]]

      

This is a common way to go from tree view to matrix view scipy linkage. However, there are already tools from other python packages for reading in trees in Newick format, and from them we can find the distance matrix quite easily and then pass that to scipy linkage functions. Below is a little script that does exactly that for this example.

from ete2 import ClusterTree, TreeStyle
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance
import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations


tree = ClusterTree('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);')
leaves = tree.get_leaf_names()
ts = TreeStyle()
ts.show_leaf_name=True
ts.show_branch_length=True
ts.show_branch_support=True

idx_dict = {'A':0,'B':1,'C':2,'D':3}
idx_labels = [idx_dict.keys()[idx_dict.values().index(i)] for i in range(0, len(idx_dict))]

#just going through the construction in my head, this is what we should get in the end
my_link = [[0,1,0.3,2],
        [2,3,0.7,2],
        [4,5,1.0,4]]

my_link = np.array(my_link)


dmat = np.zeros((4,4))

for l1,l2 in combinations(leaves,2):
    d = tree.get_distance(l1,l2)
    dmat[idx_dict[l1],idx_dict[l2]] = dmat[idx_dict[l2],idx_dict[l1]] = d

print 'Distance:'
print dmat


schlink = sch.linkage(scipy.spatial.distance.squareform(dmat),method='average',metric='euclidean')

print 'Linkage from scipy:'
print schlink

print 'My link:'
print my_link

print 'Did it right?: ', schlink == my_link

dendro = sch.dendrogram(my_link,labels=idx_labels)
plt.show()

tree.show(tree_style=ts)

      

+4


source







All Articles