Create an array from list and dictionary with Python
I am trying to build a matrix from a list and then fill it with values from a dictionary. It works with small data, but the computer crashes when large data is used (not enough RAM). My script is clearly too heavy, but I don't see how to improve it (this is my first time programming). Thanks.
import numpy as np

liste = ["a","b","c","d","e","f","g","h","i","j"]
dico = {"a/b": 4, "c/d" : 2, "f/g" : 5, "g/h" : 2}
# now I'd like to build a square array (liste x liste) and fill it up with the values of
# my dict.
def make_array(liste,dico):
    array1 = []
    liste_i = [] # each line of the array
    for i in liste:
        if liste_i :
            array1.append(liste_i)
            liste_i = []
        for j in liste:
            if dico.has_key(i+"/"+j):
                liste_i.append(dico[i+"/"+j])
            elif dico.has_key(j+"/"+i):
                liste_i.append(dico[j+"/"+i])
            else :
                liste_i.append(0)
    array1.append(liste_i) # append the last line
    print array1
    matrix = np.array(array1)
    print matrix.shape # shape is an attribute, not a method
    print matrix
    return matrix

make_array(liste,dico)
Thanks a lot for the answers. Using "in dico" or a list comprehension improves the speed of the script, and that was very helpful. But it looks like my problem is caused by the following function:
import collections
import scipy.spatial.distance
import scipy.cluster.hierarchy

def clustering(matrix, liste_globale_occurences, output2):
    most_common_groups = []
    Y = scipy.spatial.distance.pdist(matrix)
    Z = scipy.cluster.hierarchy.linkage(Y,'average', 'euclidean')
    scipy.cluster.hierarchy.dendrogram(Z)
    clust_h = scipy.cluster.hierarchy.fcluster(Z, t = 15, criterion='distance')
    print clust_h
    print len(clust_h)
    # keep the labels of the three largest clusters
    most_common = collections.Counter(clust_h).most_common(3)
    group1 = most_common[0][0]
    group2 = most_common[1][0]
    group3 = most_common[2][0]
    most_common_groups.append(group1)
    most_common_groups.append(group2)
    most_common_groups.append(group3)
    with open(output2, 'w') as results: # here is the beginning of the problem
        for group in most_common_groups:
            for i, val in enumerate(clust_h):
                if group == val:
                    mise_en_page = "{0:36s} groupe co-occurences = {1:5s} \n"
                    results.write(mise_en_page.format(str(liste_globale_occurences[i]),str(val)))
When using a small file, I get correct results, e.g.:
pin a = groupe 2
pin b = groupe 2
pin c = groupe 2
pin d = groupe 2
contact e = groupe 3
pin f = groupe 3
But when a large file is used, I only get one example for each group, repeated:
pin a = groupe 2
pin a = groupe 2
pin a = groupe 2
pin a = groupe 2
contact e = groupe 3
contact e = groupe 3
Your problem is O(n^2), because you want every combination of liste with itself. Thus, you must have an inner loop.
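To get a feel for how fast that quadratic growth uses up RAM, here is a rough estimate of the memory a dense square matrix needs. It is only a sketch, written in Python 3 rather than the Python 2 of the question. The helper name dense_matrix_bytes is made up for illustration, and the 8 bytes per cell assume NumPy's int64 dtype.

```python
import numpy as np

def dense_matrix_bytes(n, dtype=np.int64):
    """Bytes needed for a dense n x n array of the given dtype (hypothetical helper)."""
    return n * n * np.dtype(dtype).itemsize

# Print the estimated size for a few list lengths.
for n in (10, 1000, 100000):
    print(n, dense_matrix_bytes(n))
```

Multiplying the length of liste by 10 multiplies the memory by 100, and that is before counting the intermediate Python lists, which cost considerably more per cell than the final NumPy array.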
What you can try is to write each line to a file and then, in a new process, create the matrix from the file. The new process will use less memory because it won't need to store your large inputs liste and dico. So something like this:
def make_array(liste,dico):
    f = open('/temp/matrix.txt', 'w')
    for i in liste:
        for j in liste:
            # Short-circuit evaluation of logical or: it picks the first value that is truthy, else 0
            f.write('%s ' % (dico.get(i+"/"+j) or dico.get(j+"/"+i) or 0))
        f.write('\n')
    f.close()
    return
Then, once that is done, you can call
print np.loadtxt('/temp/matrix.txt', dtype=int)
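For reference, here is how the write-then-load round trip might look as one self-contained Python 3 script. This is a sketch under a few assumptions: the function name write_matrix and the temporary directory are made up for the example, the data is a shortened version of the question's, and both steps run in the same process for brevity, whereas the suggestion above is to do the loading in a separate one.

```python
import os
import tempfile

import numpy as np

def write_matrix(liste, dico, path):
    """Write the matrix row by row, so only one row is built at a time."""
    with open(path, "w") as f:
        for i in liste:
            # Same short-circuit lookup as above: i/j first, then j/i, else 0.
            row = (dico.get(i + "/" + j) or dico.get(j + "/" + i) or 0 for j in liste)
            f.write(" ".join(str(v) for v in row) + "\n")

liste = ["a", "b", "c", "d"]
dico = {"a/b": 4, "c/d": 2}

# Hypothetical location for the intermediate file.
path = os.path.join(tempfile.mkdtemp(), "matrix.txt")
write_matrix(liste, dico, path)

matrix = np.loadtxt(path, dtype=int)
print(matrix)
```

Keep in mind that the loaded array still needs room for every cell, so this only avoids the intermediate Python lists, not the size of the matrix itself.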
I used short-circuit evaluation to cut down the lines of code in your if statements. In fact, if you use a list comprehension, you can reduce your function make_array to this:
def make_array(liste,dico):
    return np.array([[dico.get(i+"/"+j) or dico.get(j+"/"+i) or 0 for j in liste] for i in liste])
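A different variant that may be worth trying, which is not what the code above does: since most cells are 0, you could allocate the zero matrix up front and loop over dico only, which skips the nested list entirely. The sketch below is Python 3 and rests on some assumptions: every key has the form "x/y", both labels appear in liste, labels contain no "/", and each pair occurs only once in dico. The names make_array_from_dico and index are made up for the example.

```python
import numpy as np

def make_array_from_dico(liste, dico):
    """Fill a preallocated zero matrix from the dictionary entries only."""
    # Map each label to its row/column position (assumes labels are unique).
    index = {name: k for k, name in enumerate(liste)}
    matrix = np.zeros((len(liste), len(liste)), dtype=int)
    for key, value in dico.items():
        i, j = key.split("/")  # assumes exactly one "/" per key
        # Store the value symmetrically, like the i/j and j/i lookups.
        matrix[index[i], index[j]] = value
        matrix[index[j], index[i]] = value
    return matrix

liste = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
dico = {"a/b": 4, "c/d": 2, "f/g": 5, "g/h": 2}
matrix = make_array_from_dico(liste, dico)
print(matrix)
```

This does one pass over dico instead of two dictionary lookups per cell, but the matrix itself is still n x n, so it does not change the overall memory needed for the result.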