Find matches in two files using python

Question


I am analyzing sequencing data and have several candidate genes whose function I need to find.

After editing the available human database, I want to compare my candidate genes against the database and derive a function for each candidate gene.

I only have basic Python skills, so I thought a script might speed up the job of finding the functions of my candidate genes.

file1, which contains the candidate genes, looks like this:

Gene
AQP7
RLIM
SMCO3
COASY
HSPA6

and the database file2.csv looks like this:

Gene   function 
PDCD6  Programmed cell death protein 6 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a

Desired output:

Gene (from file1), function (matching from file2)

I tried using this code:

file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'

with open(file1) as inf:
    match = set(line.strip() for line in inf)

with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)

I am only getting a blank page.

I also tried using the set intersection method:

with open('file1.csv', 'r') as ref:
    with open('file2.csv','r') as com:
       with open('common_genes_function','w') as output:
           same = set(ref).intersection(com)
                print same

That doesn't work either.

Please help; otherwise I will have to do it manually.


Jan Shamsani Apr 29 '15 at 7:39


2 answers

Using basic Python, you can try this:

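import re

# Build a gene -> function lookup from the database file
gene_function = {}
with open('file2.csv', 'r') as infile:
    lines = [line.strip() for line in infile.readlines()[1:]]  # skip the header row
    for line in lines:
        # First word is the gene symbol; the rest of the line is its function
        match = re.search(r"(\w+)\s+(.*)", line)
        if match is None:
            continue  # skip blank or malformed lines
        gene = match.group(1)
        function = match.group(2)
        if gene not in gene_function:  # keep the first entry for duplicated genes
            gene_function[gene] = function

# Print the function of every candidate gene found in the database
with open('file1.csv', 'r') as infile:
    genes = [i.strip() for i in infile.readlines()[1:]]
    for gene in genes:
        if gene in gene_function:
            print("{}, {}".format(gene, gene_function[gene]))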
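
The same lookup can also be written without a regular expression. This is only a minimal sketch, assuming the gene symbol is the first whitespace-delimited token on each line and the rest of the line is its function. The helper names load_functions and annotate are hypothetical, and the in-memory data reuses rows from the question, with a candidate list that mixes one gene from each sample file so that a match exists.

```python
import io


def load_functions(lines):
    """Map each gene symbol to the first function text seen for it."""
    gene_function = {}
    for line in lines:
        # Split on the first run of whitespace only, so the function text stays whole
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and rows without a function
        gene, function = parts
        gene_function.setdefault(gene, function)  # keep the first entry for duplicates
    return gene_function


def annotate(genes, gene_function):
    """Return (gene, function) pairs for the candidates found in the database."""
    return [(g, gene_function[g]) for g in genes if g in gene_function]


# In-memory stand-ins for file2.csv and file1.csv
database = io.StringIO(
    "Gene   function\n"
    "PDCD6  Programmed cell death protein 6\n"
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
)
candidates = io.StringIO("Gene\nAQP7\nPDCD6\n")

next(database)  # skip the header row
functions = load_functions(database)

next(candidates)  # skip the header row
genes = [line.strip() for line in candidates if line.strip()]

result = annotate(genes, functions)
for gene, function in result:
    print("{}, {}".format(gene, function))
```

With real files, the two StringIO objects would be replaced by open('file2.csv') and open('file1.csv').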


MervS Apr 29 '15 at 8:11


RaJa · Accepted Answer · 2015-04-29T08:01:48+0000

I would recommend using the pandas merge function. However, this requires a clear separator between the "Gene" and "function" columns. In my example, I assume it is a tab:

import pandas as pd

# File names taken from the question; adjust them to your own files
filepath1 = 'file1.csv'
filepath2 = 'file2.csv'
filepath3 = 'file3.txt'

# Read both files as pandas DataFrames (tab-separated)
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')

# Merge on the 'Gene' column using 'inner', which yields
# the intersection of both datasets
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=('1', '2'))
file3.to_csv(filepath3, sep=',', index=False)
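
The sample database in the question repeats the CDC2 row several times, and an inner merge would repeat the match once per duplicate row. Below is a minimal sketch, assuming tab-separated input as above, that drops the duplicate rows before merging. The in-memory strings stand in for the real files, and the candidate list mixes genes from both sample files in the question so that one match exists.

```python
import io

import pandas as pd

# In-memory stand-ins for the two files (assumption: tab-separated columns)
candidates_txt = "Gene\nAQP7\nCDC2\n"
database_txt = (
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
)

candidates = pd.read_csv(io.StringIO(candidates_txt), sep="\t")
database = pd.read_csv(io.StringIO(database_txt), sep="\t")

# Remove repeated database rows so that each gene matches only once
database = database.drop_duplicates()

# The inner merge keeps only genes present in both tables
matched = pd.merge(candidates, database, how="inner", on="Gene")
print(matched.to_csv(index=False))
```

Passing how="left" instead would keep the candidates that have no match, with a missing value in the function column.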
