Extract: Python dictionary, multi-valued key
I have two files and I am trying to extract some values from file 1, for example:
File 1:
2 word1
4 word2
4 word2_1
4 word2_2
8 word5
8 word5_3
File 2:
4
8
I want all the lines in file 1 that start with 4 or 8 (the keys listed in file 2), and there are a lot of them. Usually, when only one line matches each key, I would use a Python dictionary: one value per key is easy! But now that several lines match the same key, my script only fetches the last one (obviously, each new value overwrites the previous one). So I understand that this is not how it works, but I have no idea what to do instead, and I would be very glad if someone could help me get started.
Here's my "regular" code:
gene_count = {}
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] = count
names_file = open('file2.txt')
output_file = open('output.txt', 'w')
for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))
output_file.close()
Make the values of your dictionary lists, and append to them.
Generally:
from collections import defaultdict
# Missing keys start out as an empty list.
my_dict = defaultdict(list)
for x in range(101):
    if x % 2 == 0:
        my_dict['evens'].append(str(x))
    else:
        my_dict['odds'].append(str(x))
print('evens:', ' '.join(my_dict['evens']))
print('odds:', ' '.join(my_dict['odds']))
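For comparison, the same grouping also works with a plain dict and setdefault, which does not create keys on lookup the way defaultdict does. This is only a minimal Python 3 sketch; the pairs list is made-up sample data mirroring file 1 from the question, and the names pairs and grouped are illustrative.

```python
# Sample (key, value) pairs mirroring file 1 in the question (illustrative data).
pairs = [("2", "word1"), ("4", "word2"), ("4", "word2_1"),
         ("4", "word2_2"), ("8", "word5"), ("8", "word5_3")]

grouped = {}
for key, value in pairs:
    # setdefault returns the list already stored under key,
    # or stores a new empty list and returns that.
    grouped.setdefault(key, []).append(value)

print(grouped["4"])
```

Looking up a key that was never seen raises KeyError here, which can be useful if a key in file 2 has no match in file 1.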
In your case, your values are already lists, so add (concatenate) them to your dictionary's lists:
from collections import defaultdict
# Every gene starts with an empty list, so repeated keys accumulate.
gene_count = defaultdict(list)
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    # Extend the existing list instead of overwriting it.
    gene_count[gene] += count
my_file.close()
names_file = open('file2.txt')
output_file = open('output.txt', 'w')
for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))
names_file.close()
output_file.close()
If what you actually want to print is a count for each gene, then replace "\t".join(count) with len(count), the length of the list.
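If the goal is simply to copy every matching line from file 1, a dictionary may not be needed at all: the keys from file 2 can go into a set, and file 1 can be streamed once. The following is a rough Python 3 sketch of that alternative, not the approach above; filter_lines is a hypothetical helper name, and the two lists stand in for the contents of file1.txt and file2.txt.

```python
def filter_lines(data_lines, wanted_keys):
    """Yield each line whose first whitespace-separated field is in wanted_keys."""
    for line in data_lines:
        fields = line.split()
        if fields and fields[0] in wanted_keys:
            yield line.rstrip("\n")

# In-memory stand-ins for file1.txt and file2.txt from the question.
file1_lines = ["2 word1\n", "4 word2\n", "4 word2_1\n",
               "4 word2_2\n", "8 word5\n", "8 word5_3\n"]
file2_lines = ["4\n", "8\n"]

# Collect the keys once; set membership tests are fast even for many keys.
wanted = {line.strip() for line in file2_lines if line.strip()}

matches = list(filter_lines(file1_lines, wanted))
for match in matches:
    print(match)
```

With real files, the lists can be replaced by file objects opened in with blocks, since filter_lines only needs something it can iterate over line by line.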
Have you considered using pandas? You can load the files into a DataFrame and then filter them (the session below assumes pandas was imported as pn):
In [5]: file1 = pn.read_csv('file1',sep=' ',
                            names=['number','word'],
                            engine='python')
In [6]: file1
Out[6]:
   number     word
0       2    word1
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
In [9]: file1[(file1.number==4) | (file1.number==8)]
Out[9]:
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
In [13]: foo = file1[(file1.number==4) | (file1.number==8)].append(file2[(file2.number==4) | (file2.number==8)])
Out[13]:
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
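In [5] loads the file into a DataFrame, In [9] filters it by the values in the number column, and In [13] concatenates the two filtered frames. Note that DataFrame.append was removed in pandas 2.0; on current versions, pandas.concat does the same job.
Then you can sort the result and do your calculations much more easily than with a dictionary.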
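As a rough illustration of the concatenate, sort, and calculate steps, here is a small sketch using pd.concat for the concatenation. The frames first and second are made-up stand-ins for two filtered results, and the per-key row count is just one example of a calculation, assumed here for illustration.

```python
import pandas as pd

# Two small hypothetical frames standing in for the filtered results.
first = pd.DataFrame({"number": [4, 8], "word": ["word2", "word5"]})
second = pd.DataFrame({"number": [4, 8], "word": ["word2_1", "word5_3"]})

# pd.concat stacks the frames; ignore_index renumbers the rows from 0.
combined = pd.concat([first, second], ignore_index=True)

# Stable sort by key so rows with the same number keep their relative order.
combined = combined.sort_values("number", kind="mergesort").reset_index(drop=True)

# One possible calculation: how many rows each key has.
counts = combined.groupby("number").size()

print(combined)
print(counts)
```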
UPDATE
To filter a pandas.DataFrame by the condition that a column value is in some list, you can use isin, passing the list explicitly or using range, for example.
In [46]: file1[file1.number.isin([1,2,3])]
Out[46]:
   number   word
0       2  word1
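Applied to the original question, the list passed to isin can come straight from file 2. The sketch below assumes both files are whitespace-separated as shown in the question and uses in-memory strings in place of the real files; the variable names keys and selected are illustrative.

```python
import io
import pandas as pd

# In-memory copies of the two files from the question.
file1_text = "2 word1\n4 word2\n4 word2_1\n4 word2_2\n8 word5\n8 word5_3\n"
file2_text = "4\n8\n"

file1 = pd.read_csv(io.StringIO(file1_text), sep=r"\s+", names=["number", "word"])
keys = pd.read_csv(io.StringIO(file2_text), names=["number"])["number"]

# Keep only the rows whose number appears among the keys from file 2.
selected = file1[file1["number"].isin(keys)]
print(selected)
```

With real files, the io.StringIO objects would be replaced by the file paths.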