What's the best way to replace thousands of identifier strings with matching names in Python?
I have two datasets. One contains 16169 rows and 5 columns, and I would like to replace the values in one of its columns with their corresponding names. The matching names come from a different dataset.
For example:
UniProtID  NAME
Q15173     PPP2R5B
P30154     PPP2R1B
P63151     PPP2R2A

DrugBankID  Name       Type         UniProtID  UniProt Name
DB00001     Lepirudin  BiotechDrug  P00734     Prothrombin
DB00002     Cetuximab  BiotechDrug  P00533     Epidermal growth factor receptor
DB00002     Cetuximab  BiotechDrug  O75015     Low affinity immunoglobulin gamma Fc region receptor III-B
In this example, I want to replace all UniProt IDs with the corresponding names from the top dataset example. What would be the best way to do this?
I'm really new to programming and Python, so any suggestions or help are appreciated.
It seems to me you need map with a Series created by set_index. If some values don't match, you get NaN:
# sample data changed so that some IDs match
print (df1)
UniProtID NAME
0 O75015 PPP2R5B
1 P00734 PPP2R1B
2 P63151 PPP2R2A
df2['UniProt Name'] = df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
print (df2)
DrugBankID Name Type UniProtID UniProt Name
0 DB00001 Lepirudin BiotechDrug P00734 PPP2R1B
1 DB00002 Cetuximab BiotechDrug P00533 NaN
2 DB00002 Cetuximab BiotechDrug O75015 PPP2R5B
If the original values are needed instead of NaN:
# parentheses are needed so the chained call can continue on the next line
df2['UniProt Name'] = (df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
                                       .fillna(df2['UniProt Name']))
print (df2)
DrugBankID Name Type UniProtID \
0 DB00001 Lepirudin BiotechDrug P00734
1 DB00002 Cetuximab BiotechDrug P00533
2 DB00002 Cetuximab BiotechDrug O75015
UniProt Name
0 PPP2R1B
1 Epidermal growth factor receptor
2 PPP2R5B
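The map approach above can be sketched as a self-contained script. The frames below are small hypothetical stand-ins for the real datasets, and the drop_duplicates step is an assumption on my part: map is expected to fail when the lookup index contains repeated IDs, so deduplicating first seems a reasonable precaution for a table with thousands of rows.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets from the question.
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151", "P00734"],  # one ID is repeated
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A", "PPP2R1B"],
})
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "Name": ["Lepirudin", "Cetuximab", "Cetuximab"],
    "Type": ["BiotechDrug"] * 3,
    "UniProtID": ["P00734", "P00533", "O75015"],
    "UniProt Name": [
        "Prothrombin",
        "Epidermal growth factor receptor",
        "Low affinity immunoglobulin gamma Fc region receptor III-B",
    ],
})

# Keep one row per ID so the lookup Series has a unique index.
lookup = df1.drop_duplicates("UniProtID").set_index("UniProtID")["NAME"]

# Map IDs to names; IDs without a match become NaN and fall back
# to the value that was already in the column.
df2["UniProt Name"] = df2["UniProtID"].map(lookup).fillna(df2["UniProt Name"])

print(df2[["UniProtID", "UniProt Name"]])
```

If repeated IDs in the lookup table carry different names, keeping the first one silently may not be what you want, so it is worth inspecting them before deduplicating.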
And a solution with merge: a left join is needed, then fillna or combine_first, and finally drop to delete the helper column:
df = pd.merge(df2, df1, on="UniProtID", how='left')
df['UniProt Name'] = df['NAME'].fillna(df['UniProt Name'])
#alternative
#df['UniProt Name'] = df['NAME'].combine_first(df['UniProt Name'])
df.drop('NAME', axis=1, inplace=True)
print (df)
DrugBankID Name Type UniProtID \
0 DB00001 Lepirudin BiotechDrug P00734
1 DB00002 Cetuximab BiotechDrug P00533
2 DB00002 Cetuximab BiotechDrug O75015
UniProt Name
0 PPP2R1B
1 Epidermal growth factor receptor
2 PPP2R5B
If NaN is acceptable for IDs without a match, drop the original column and rename NAME instead:
df = pd.merge(df2, df1, on="UniProtID", how='left')
df = df.drop('UniProt Name', axis=1).rename(columns={'NAME':'UniProt Name'})
print (df)
DrugBankID Name Type UniProtID UniProt Name
0 DB00001 Lepirudin BiotechDrug P00734 PPP2R1B
1 DB00002 Cetuximab BiotechDrug P00533 NaN
2 DB00002 Cetuximab BiotechDrug O75015 PPP2R5B
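With thousands of rows it can help to see which IDs found no match. A minimal sketch, assuming the same column names as above and small hypothetical frames: the indicator and validate parameters of pd.merge are used here as a check and are not part of the solution above.

```python
import pandas as pd

# Hypothetical small frames using the column names from the answer.
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "UniProtID": ["P00734", "P00533", "O75015"],
})

# indicator=True adds a "_merge" column telling where each row came from;
# validate="many_to_one" raises an error if df1 repeats a UniProtID.
merged = pd.merge(df2, df1, on="UniProtID", how="left",
                  indicator=True, validate="many_to_one")

# Rows marked "left_only" had no matching ID in df1.
unmatched = merged.loc[merged["_merge"] == "left_only", "UniProtID"].tolist()
print(unmatched)
```

The "_merge" column can be dropped once the unmatched IDs have been checked.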
A more general approach to this problem is to perform a SQL-like join on two tables.
Note: this can be expensive for large datasets; I have not measured the performance.
import pandas as pd
left = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
"Name": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})
right = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
"UniProt Name": ["Prothrombin", "Epidermal growth factor receptor", "Low affinity immunoglobulin gamma Fc region receptor III-B"],
"Type": ["BiotechDrug", "BiotechDrug", "BiotechDrug"],
"DrugBankID": ["DB00001", "DB00002", "DB00003"]})
result = pd.merge(left, right, on="UniProtID")
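One thing to keep in mind with the call above: pd.merge defaults to an inner join, so rows whose ID is missing from the other table are dropped. The sketch below, with a hypothetical right table that matches only one ID, compares this with how="left".

```python
import pandas as pd

left = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                     "Name": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})
# Hypothetical table in which only Q15173 also appears in `left`.
right = pd.DataFrame({"UniProtID": ["Q15173", "P00533"],
                      "DrugBankID": ["DB00001", "DB00002"]})

# Default (inner) join: only IDs present in both tables survive.
inner = pd.merge(left, right, on="UniProtID")

# Left join: every row of `left` is kept, with NaN where nothing matched.
left_join = pd.merge(left, right, on="UniProtID", how="left")

print(len(inner), len(left_join))
```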
Link: https://pandas.pydata.org/pandas-docs/stable/merging.html#overlapping-value-columns