Extract: Python dictionary, multi-valued key

I have two files and I am trying to extract some values from file 1, for example:

File1:
2    word1
4    word2
4    word2_1
4    word2_2
8    word5
8    word5_3

File 2:
4
8

      

I want every line of file 1 that starts with 4 or 8 (the keys listed in file 2), and there are a lot of them. Usually, when only one line matches each key, I use a Python dictionary: one value per key is easy. But now several lines match the same key, and my script only fetches the last one (obviously, each assignment overwrites the previous value). So I understand that this is not how it works, but I have no idea what to do instead, and I would be very glad if someone could help me get started.

Here's my "regular" code:

gene_count = {}
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] = count

names_file = open('file2.txt')
output_file = open('output.txt', 'w')

for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))

output_file.close()

      



2 answers


Make the values of your dictionary lists, and append to them.

Generally:

from collections import defaultdict

# Each missing key starts out as an empty list
my_dict = defaultdict(list)

for x in range(101):
    if x % 2 == 0:
        my_dict['evens'].append(str(x))
    else:
        my_dict['odds'].append(str(x))

print('evens:', ' '.join(my_dict['evens']))
print('odds:', ' '.join(my_dict['odds']))
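If you would rather stay with a plain dict, `dict.setdefault` gives the same "list per key" behaviour. This is only a minimal sketch; the `pairs` list is made-up sample data standing in for the (key, value) lines of a file, and it also shows why plain assignment keeps just the last match.

```python
# Hypothetical sample pairs standing in for (key, value) lines of a file.
pairs = [("2", "word1"), ("4", "word2"), ("4", "word2_1"), ("8", "word5")]

overwritten = {}
grouped = {}
for key, value in pairs:
    # Plain assignment replaces whatever was stored under the key before.
    overwritten[key] = value
    # setdefault creates an empty list on first sight of the key, then we append.
    grouped.setdefault(key, []).append(value)

print(overwritten["4"])  # word2_1
print(grouped["4"])      # ['word2', 'word2_1']
```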

      



In your case, your values are already lists, so add (concatenate) them to your dictionary's lists:

from collections import defaultdict

# Every gene maps to a list that accumulates the values from all of its lines
gene_count = defaultdict(list)

with open('file1.txt') as my_file:
    for line in my_file:
        columns = line.strip().split()
        if not columns:
            continue  # skip blank lines
        gene = columns[0]
        count = columns[1:13]
        gene_count[gene] += count  # extend the list instead of overwriting it

with open('file2.txt') as names_file, open('output.txt', 'w') as output_file:
    for line in names_file:
        gene = line.strip()
        if not gene:
            continue  # skip blank lines
        count = gene_count[gene]
        output_file.write('{0}\t{1}\n'.format(gene, "\t".join(count)))

      

If what you actually want to print is the number of values for each gene, then replace "\t".join(count) with len(count), the length of the list.
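If you need one output line per matching line of file 1 (rather than all values joined on a single line per key), one option is to append each line's columns as a separate entry. The sketch below is only an illustration: it uses `io.StringIO` objects as stand-ins for file1.txt, file2.txt and output.txt so it runs on its own, and the name `rows_by_gene` is made up for the example.

```python
import io
from collections import defaultdict

# In-memory stand-ins for file1.txt and file2.txt (sample data from the question).
file1 = io.StringIO("2 word1\n4 word2\n4 word2_1\n4 word2_2\n8 word5\n8 word5_3\n")
file2 = io.StringIO("4\n8\n")

# Each key maps to a list of rows; every row is the list of columns of one line.
rows_by_gene = defaultdict(list)
for line in file1:
    columns = line.split()
    if not columns:
        continue  # skip blank lines
    rows_by_gene[columns[0]].append(columns[1:])

# Stand-in for output.txt: write one line per stored row of each requested key.
output = io.StringIO()
for line in file2:
    gene = line.strip()
    # .get avoids creating empty entries for keys that never appeared in file 1.
    for row in rows_by_gene.get(gene, []):
        output.write("{0}\t{1}\n".format(gene, "\t".join(row)))

result = output.getvalue()
print(result, end="")
```

With real files you would swap the `StringIO` objects for `open(...)` calls; the grouping logic stays the same.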



Have you considered using pandas? You can load the files into a DataFrame and then filter them:

In [4]: import pandas as pn

In [5]: file1 = pn.read_csv('file1',sep='    ',
                            names=['number','word'],
                            engine='python')

In [6]: file1
Out[6]: 
   number     word
0       2    word1
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [9]: file1[(file1.number==4) | (file1.number==8)]
Out[9]: 
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [13]: pn.concat([file1[(file1.number==4) | (file1.number==8)], file2[(file2.number==4) | (file2.number==8)]])
Out[13]: 
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

      

In [5] loads the file into a DataFrame, In [9] filters its rows by the values in the number column, and In [13] concatenates the two filtered DataFrames (file2 is assumed to have been loaded the same way as file1).
Then you can sort the result and do your calculations much more easily than with a dictionary.
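As a rough illustration of the "sort it and make your calculations" step, here is a small sketch. It assumes pandas is installed and reads the question's sample data from an in-memory string (a whitespace separator is used here instead of the four-space one above); the names `ordered` and `counts` are made up for the example.

```python
import io
import pandas as pd

# Sample data from the question, read from memory instead of a file on disk.
text = "2 word1\n4 word2\n4 word2_1\n4 word2_2\n8 word5\n8 word5_3\n"
file1 = pd.read_csv(io.StringIO(text), sep=r"\s+", names=["number", "word"])

# Same boolean filter as in the session above.
filtered = file1[(file1.number == 4) | (file1.number == 8)]

# Sort the matching rows, then count how many rows each key has.
ordered = filtered.sort_values(["number", "word"], ascending=[False, True])
counts = filtered.groupby("number").size()

print(ordered)
print(counts)
```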



UPDATE
To filter a pandas.DataFrame on the condition that a column value is in some list, you can use isin, passing either an explicit list or, for example, a range.

In [46]: file1[file1.number.isin([1,2,3])]
Out[46]: 
   number   word
0       2  word1
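Since the keys in the question come from file 2, the list passed to `isin` can be read from that file instead of being typed by hand. A minimal sketch, assuming pandas is installed and that file 2 holds one number per line; both files are read from in-memory strings here so the example is self-contained, and `matches` is just an illustrative name.

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of on disk.
file1_text = "2 word1\n4 word2\n4 word2_1\n4 word2_2\n8 word5\n8 word5_3\n"
file2_text = "4\n8\n"

file1 = pd.read_csv(io.StringIO(file1_text), sep=r"\s+", names=["number", "word"])
file2 = pd.read_csv(io.StringIO(file2_text), names=["number"])

# Keep only the rows of file1 whose key appears anywhere in file2.
matches = file1[file1["number"].isin(file2["number"])]

print(matches.to_string(index=False))
```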

      
