Create an array from list and dictionary with Python

I am trying to build a matrix from a list and then fill it with the values of a dict. It works with small data, but the computer crashes when large data is used (not enough RAM). My script is clearly too heavy, but I don't see how to improve it (first time programming). Thanks.

import numpy as np
liste = ["a","b","c","d","e","f","g","h","i","j"]

dico = {"a/b": 4, "c/d" : 2, "f/g" : 5, "g/h" : 2}

#now i'd like to build a square array (liste x liste) and fill it up with the values of
# my dict.


def make_array(liste,dico):
    array1 = []
    liste_i = [] #each line of the array
    for i in liste:
        if liste_i :
            array1.append(liste_i)
            liste_i = []
        for j in liste:
            if dico.has_key(i+"/"+j): 
                liste_i.append(dico[i+"/"+j])
            elif dico.has_key(j+"/"+i):
                liste_i.append(dico[j+"/"+i])
            else :
                liste_i.append(0)
    array1.append(liste_i)
    print array1
    matrix = np.array(array1)
    print matrix.shape  # shape is an attribute, not a method
    print matrix
    return matrix

make_array(liste,dico)

      

Thanks a lot for the answers. Using "in dico" or list comprehensions improved the speed of the script, which was very helpful. But it looks like my problem is caused by the following function:

import collections
import scipy.cluster.hierarchy
import scipy.spatial.distance

def clustering(matrix, liste_globale_occurences, output2):
    most_common_groups = []
    Y = scipy.spatial.distance.pdist(matrix)
    Z = scipy.cluster.hierarchy.linkage(Y,'average', 'euclidean')
    scipy.cluster.hierarchy.dendrogram(Z)
    clust_h = scipy.cluster.hierarchy.fcluster(Z, t = 15, criterion='distance')
    print clust_h
    print len(clust_h)
    most_common = collections.Counter(clust_h).most_common(3)
    group1 = most_common[0][0]
    group2 = most_common[1][0]
    group3 = most_common[2][0]
    most_common_groups.append(group1)
    most_common_groups.append(group2)
    most_common_groups.append(group3)
    with open(output2, 'w') as results: # here the beginning of the problem
        for group in most_common_groups: 
            for i, val in enumerate(clust_h):
                if group == val:
                    mise_en_page = "{0:36s} groupe co-occurences = {1:5s} \n"
                    results.write(mise_en_page.format(str(liste_globale_occurences[i]),str(val)))

      

When using a small file, I get correct results, e.g.:

pin a = groupe 2

pin b = groupe 2

pin c = groupe 2

pin d = groupe 2

contact e = groupe 3

pin f = groupe 3

But when a large file is used, I only get one example for each group:

pin a = groupe 2

pin a = groupe 2

pin a = groupe 2

pin a = groupe 2

contact e = groupe 3

contact e = groupe 3



2 answers


You can create a len(liste) x len(liste) matrix of zeros, then go through dico and split each key on '/': the part before '/' identifies the row and the part after '/' identifies the column. This way you do not need the "has_key" lookups.
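A minimal sketch of this approach, assuming every key in dico has the form "row/col", that both parts appear in liste, and that no label contains '/'. The names index and make_array_from_keys are illustrative and not from the question. Because the question's lookup treats "a/b" and "b/a" as the same pair, the sketch fills both symmetric cells.

```python
import numpy as np

def make_array_from_keys(liste, dico):
    # Map each label to its row/column position (hypothetical helper mapping).
    index = {label: pos for pos, label in enumerate(liste)}
    matrix = np.zeros((len(liste), len(liste)), dtype=int)
    for key, value in dico.items():
        row_label, col_label = key.split("/")
        r, c = index[row_label], index[col_label]
        # Fill both cells to mirror the symmetric lookup in the question.
        matrix[r, c] = value
        matrix[c, r] = value
    return matrix

liste = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
dico = {"a/b": 4, "c/d": 2, "f/g": 5, "g/h": 2}
matrix = make_array_from_keys(liste, dico)
print(matrix)
```

This makes one pass over dico instead of len(liste) squared dictionary lookups, although the dense matrix itself still needs len(liste) squared cells of memory.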





Your problem looks like O(n^2), because you want all combinations of liste with itself. Thus, you must have an inner loop.
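To get a feel for why a quadratic matrix can exhaust RAM, here is a rough estimate of the size of the dense array alone. The label counts are made-up examples, not figures from the question, and 8 bytes per cell assumes a 64-bit NumPy dtype.

```python
import numpy as np

def dense_matrix_bytes(n, dtype=np.int64):
    # An n x n array stores n * n items of dtype.itemsize bytes each.
    return n * n * np.dtype(dtype).itemsize

for n in (1000, 10000, 50000):  # hypothetical label counts
    gib = dense_matrix_bytes(n) / float(2 ** 30)
    print("%6d labels -> %.2f GiB" % (n, gib))
```

The intermediate list of lists that the question builds before calling np.array takes additional memory on top of this.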

What you can try is to write each row to a file and then, in a new process, create the matrix from the file. The new process will use less memory because it won't need to hold your large inputs liste and dico. So something like this:

def make_array(liste,dico):
    f = open('/temp/matrix.txt', 'w')
    for i in liste:
        for j in liste:
            # Short-circuit evaluation of logical or: it yields the first value that is truthy
            f.write('%s ' % (dico.get(i+"/"+j) or dico.get(j+"/"+i) or 0))
        f.write('\n')
    f.close()
    return

      

Then, once that is done, you can call



print np.loadtxt('/temp/matrix.txt', dtype=int)
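As a self-contained check of the write-then-load idea, the sketch below writes the rows to a temporary file and reads them back. It uses tempfile to pick a path, which is an assumption made so the example runs anywhere, and the function name write_matrix is likewise illustrative.

```python
import os
import tempfile
import numpy as np

def write_matrix(liste, dico, path):
    # Write one text row per label, so only a single row is held in memory at a time.
    with open(path, "w") as f:
        for i in liste:
            row = [dico.get(i + "/" + j) or dico.get(j + "/" + i) or 0 for j in liste]
            f.write(" ".join(str(v) for v in row) + "\n")

liste = ["a", "b", "c", "d"]
dico = {"a/b": 4, "c/d": 2}

fd, path = tempfile.mkstemp(suffix=".txt")  # hypothetical location for the example
os.close(fd)
try:
    write_matrix(liste, dico, path)
    matrix = np.loadtxt(path, dtype=int)
finally:
    os.remove(path)
print(matrix)
```

Note that np.loadtxt still allocates the full array, so this only saves the memory taken by liste and dico.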

      

I used short-circuit evaluation to shorten your if statements. In fact, if you use a list comprehension, you can reduce your make_array function to this:

def make_array(liste,dico):
    return np.array([[dico.get(i+"/"+j) or dico.get(j+"/"+i) or 0 for j in liste] for i in liste])
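The or chain relies on Python returning the first truthy operand. A small illustration with made-up dictionary values follows; the helper name lookup is hypothetical. Note that a stored 0 is falsy and falls through to the next operand, which is harmless here only because the final fallback is also 0.

```python
dico = {"a/b": 4, "c/d": 0}

def lookup(i, j):
    # dict.get returns None for a missing key; `or` then moves on to the next operand.
    return dico.get(i + "/" + j) or dico.get(j + "/" + i) or 0

print(lookup("a", "b"))  # found directly
print(lookup("b", "a"))  # found through the reversed key
print(lookup("a", "c"))  # missing both ways, falls back to 0
print(lookup("c", "d"))  # stored 0 is falsy, so the final 0 is returned
```

If the fallback were anything other than 0, an explicit membership test would be the safer choice.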

      
