Extract: Python dictionary, multi-valued key

Question

I have two files and I am trying to extract some values from file 1, for example:

File1:
2    word1
4    word2
4    word2_1
4    word2_2
8    word5
8    word5_3

File 2:
4
8

I want all lines that start with 4 or 8 (the values listed in file 2), and there are a lot of them. When only one line matches each key, I would normally use a Python dictionary: one value per key is easy. But now several lines share the same key, so my script only fetches the last one (obviously, since each new value overwrites the previous one). I understand that a plain dictionary does not work this way, but I don't know what to use instead, and I would be very glad if someone could help me get started.

Here's my "regular" code:

gene_count = {}
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] = count

names_file = open('file2.txt')
output_file = open('output.txt', 'w')

for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))

output_file.close()

python dictionary

user3188922 03 Sep '14 at 7:34

2 answers

OregonTrail · Answer 1 · 2014-09-03T07:39:11+0000

Make the values of your dictionary lists, and append to them.

In general:

from collections import defaultdict

# Missing keys start out as an empty list, so append() always works.
my_dict = defaultdict(list)

for x in range(101):
    if x % 2 == 0:
        my_dict['evens'].append(str(x))
    else:
        my_dict['odds'].append(str(x))

print('evens:', ' '.join(my_dict['evens']))
print('odds:', ' '.join(my_dict['odds']))

In your case each value is already a list (a slice of the columns), so concatenate it onto the list stored for that key:

from collections import defaultdict

# Each gene maps to a list that collects the columns from every matching line.
gene_count = defaultdict(list)

with open('file1.txt') as my_file:
    for line in my_file:
        columns = line.strip().split()
        gene = columns[0]
        count = columns[1:13]
        gene_count[gene] += count  # extend the list instead of overwriting it

with open('file2.txt') as names_file, open('output.txt', 'w') as output_file:
    for line in names_file:
        gene = line.strip()
        count = gene_count[gene]
        output_file.write('{0}\t{1}\n'.format(gene, "\t".join(count)))

If what you actually want to print is a count for each gene, then replace "\t".join(count) with len(count), the length of the list.
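
If you would rather get one output line per matching line of file 1 (which is how I read "all lines that start with 4 and 8"), a variation is to append each line's columns as a separate entry instead of concatenating them. This is only a minimal sketch in Python 3: the FILE1 and FILE2 strings are in-memory stand-ins for file1.txt and file2.txt, and collect_rows and matching_lines are names I made up for illustration.

```python
from collections import defaultdict
import io

# Hypothetical in-memory stand-ins for file1.txt and file2.txt from the question.
FILE1 = "2    word1\n4    word2\n4    word2_1\n4    word2_2\n8    word5\n8    word5_3\n"
FILE2 = "4\n8\n"


def collect_rows(lines):
    """Map each key (first column) to a list with one entry per input line."""
    rows = defaultdict(list)
    for line in lines:
        columns = line.split()
        if not columns:  # skip blank lines
            continue
        rows[columns[0]].append(columns[1:])
    return rows


def matching_lines(rows, keys):
    """Yield one tab-separated line per stored row for every requested key."""
    for key in keys:
        key = key.strip()
        # .get() avoids creating empty entries for keys missing from file 1
        for values in rows.get(key, []):
            yield "\t".join([key] + values)


rows = collect_rows(io.StringIO(FILE1))
result = list(matching_lines(rows, io.StringIO(FILE2)))
print("\n".join(result))
```

With real files, you would pass the open file objects in place of the io.StringIO wrappers.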

Pawel wisniewski · Answer 2 · 2014-09-03T07:55:38+0000

Have you considered using pandas? You can load the files into a DataFrame and then filter them (the session below assumes pandas was imported as pn):

In [5]: file1 = pn.read_csv('file1',sep='    ', 
                            names=['number','word'],
                            engine='python')

In [6]: file1
Out[6]: 
   number     word
0       2    word1
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [9]: file1[(file1.number==4) | (file1.number==8)]
Out[9]: 
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [13]: foo = file1[(file1.number==4) | (file1.number==8)].append(file2[(file2.number==4) | (file2.number==8)])
Out[13]: 
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [5] reads the file into a DataFrame, In [9] filters it by the values in the number column, and In [13] concatenates two filtered frames.
You can then sort the result and do your calculations much more easily than with a dictionary.

UPDATE
To filter a pandas.DataFrame on the condition that a column's value is in some list, you can use isin, passing either an explicit list or, for example, a range.

In [46]: file1[file1.number.isin([1,2,3])]
Out[46]: 
   number   word
0       2  word1
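
The same isin filter can take its values straight from file 2 instead of a hard-coded list. The following is a rough sketch rather than a tested recipe for your real data: it assumes pandas is installed and imported as pd, that the columns are separated by whitespace, and it uses in-memory strings (FILE1, FILE2) as stand-ins for the two files.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for file1 and file2 from the question.
FILE1 = "2    word1\n4    word2\n4    word2_1\n4    word2_2\n8    word5\n8    word5_3\n"
FILE2 = "4\n8\n"

# Split on any run of whitespace; the column names are chosen here, not read from the file.
file1 = pd.read_csv(io.StringIO(FILE1), sep=r"\s+", names=["number", "word"])

# File 2 holds a single column with the keys to keep.
wanted = pd.read_csv(io.StringIO(FILE2), names=["number"])["number"]

# Keep every row of file1 whose number appears in file 2.
filtered = file1[file1["number"].isin(wanted)]
print(filtered.to_string(index=False))
```

Note that DataFrame.append, used in In [13] above, was removed in pandas 2.0; on current versions pd.concat([a, b]) does the same job.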
