What's the best way to replace thousands of identifier strings with their matching names in Python?

I have two datasets. One contains 16169 rows and 5 columns, and I would like to replace the values in one of its columns with their corresponding names. The matching names come from a different dataset.

For example:

UniProtID NAME
Q15173 PPP2R5B
P30154 PPP2R1B
P63151 PPP2R2A

DrugBankID Name Type UniProtID UniProt Name
DB00001 Lepirudin BiotechDrug P00734 Prothrombin
DB00002 Cetuximab BiotechDrug P00533 Epidermal growth factor receptor
DB00002 Cetuximab BiotechDrug O75015 Low affinity immunoglobulin gamma Fc region receptor III-B

In this example, I want to replace every UniProt ID in the second dataset with the corresponding name from the first dataset. What would be the best way to do this?

I'm really new to programming and Python, so any suggestion or help is appreciated.



2 answers


It seems to me you need map with a Series created by set_index; if some values don't match, you get NaN:

# sample data changed so that some IDs match
print (df1)
  UniProtID     NAME
0    O75015  PPP2R5B
1    P00734  PPP2R1B
2    P63151  PPP2R2A

df2['UniProt Name'] = df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
print (df2)
  DrugBankID       Name         Type UniProtID UniProt Name
0    DB00001  Lepirudin  BiotechDrug    P00734      PPP2R1B
1    DB00002  Cetuximab  BiotechDrug    P00533          NaN
2    DB00002  Cetuximab  BiotechDrug    O75015      PPP2R5B

      

If the original values are needed instead of NaN:

# parentheses are required so the chained call can continue on the next line
df2['UniProt Name'] = (df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
                                       .fillna(df2['UniProt Name']))
print (df2)
  DrugBankID       Name         Type UniProtID  \
0    DB00001  Lepirudin  BiotechDrug    P00734   
1    DB00002  Cetuximab  BiotechDrug    P00533   
2    DB00002  Cetuximab  BiotechDrug    O75015   

                       UniProt Name  
0                           PPP2R1B  
1  Epidermal growth factor receptor  
2                           PPP2R5B  
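
For anyone who wants to run the map approach end to end, here is a minimal self-contained sketch. It rebuilds the sample frames from the printed output above, and the names id_to_name and mapped are only illustrative. It assumes each UniProtID appears only once in df1; if the lookup table has duplicate IDs, you would probably need to drop them first.

```python
import pandas as pd

# Lookup table: UniProt ID -> name (sample values from the answer above)
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})

# Table whose 'UniProt Name' column should be replaced
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "Name": ["Lepirudin", "Cetuximab", "Cetuximab"],
    "Type": ["BiotechDrug"] * 3,
    "UniProtID": ["P00734", "P00533", "O75015"],
    "UniProt Name": [
        "Prothrombin",
        "Epidermal growth factor receptor",
        "Low affinity immunoglobulin gamma Fc region receptor III-B",
    ],
})

# Build the ID -> name Series once (assumes UniProtID is unique in df1)
id_to_name = df1.set_index("UniProtID")["NAME"]

# NaN wherever the ID is missing from the lookup table
mapped = df2["UniProtID"].map(id_to_name)

# Keep the original name for rows that had no match
df2["UniProt Name"] = mapped.fillna(df2["UniProt Name"])
print(df2)
```

Keeping mapped as a separate variable also makes it easy to inspect which rows had no match (mapped.isna()) before overwriting the column.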

      



And a solution with merge: you need a left join, then fillna or combine_first, and finally drop to delete the helper column:

df = pd.merge(df2, df1, on="UniProtID", how='left')
df['UniProt Name'] = df['NAME'].fillna(df['UniProt Name'])
#alternative
#df['UniProt Name'] = df['NAME'].combine_first(df['UniProt Name'])
df.drop('NAME', axis=1, inplace=True)
print (df)
  DrugBankID       Name         Type UniProtID  \
0    DB00001  Lepirudin  BiotechDrug    P00734   
1    DB00002  Cetuximab  BiotechDrug    P00533   
2    DB00002  Cetuximab  BiotechDrug    O75015   

                       UniProt Name  
0                           PPP2R1B  
1  Epidermal growth factor receptor  
2                           PPP2R5B  

If NaN is acceptable for IDs without a match, drop the old column and rename NAME instead:

      


df = pd.merge(df2, df1, on="UniProtID", how='left')
df = df.drop('UniProt Name', axis=1).rename(columns={'NAME':'UniProt Name'})
print (df)
  DrugBankID       Name         Type UniProtID UniProt Name
0    DB00001  Lepirudin  BiotechDrug    P00734      PPP2R1B
1    DB00002  Cetuximab  BiotechDrug    P00533          NaN
2    DB00002  Cetuximab  BiotechDrug    O75015      PPP2R5B
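
One thing worth checking with the merge approach on a real dataset is whether the lookup table has duplicate IDs, because a left join would then silently add rows. A possible sketch, using the validate and indicator parameters of pd.merge on the same sample data (the variable names merged and unmatched are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "UniProtID": ["P00734", "P00533", "O75015"],
})

merged = pd.merge(
    df2, df1,
    on="UniProtID",
    how="left",
    validate="many_to_one",  # raises MergeError if df1 has duplicate IDs
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# IDs from df2 that were not found in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "UniProtID"].tolist()
print(unmatched)
```

With many_to_one the row count of the result should equal the row count of df2, so the 16169 rows stay 16169 rows.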

      



A more general approach to this problem is to perform a SQL-like join on two tables.

Note: this can be expensive for large datasets; I have not experimented with performance.



import pandas as pd

left = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                     "Name": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})

right = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                      "UniProt Name": ["Prothrombin", "Epidermal growth factor receptor", "Low affinity immunoglobulin gamma Fc region receptor III-B"],
                      "Type": ["BiotechDrug", "BiotechDrug", "BiotechDrug"],
                      "DrugBankID": ["DB00001", "DB00002", "DB00003"]})

result = pd.merge(left, right, on="UniProtID")

      

Link: https://pandas.pydata.org/pandas-docs/stable/merging.html#overlapping-value-columns
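
One caveat with the join above: pd.merge defaults to how="inner", so rows whose ID has no match are dropped. A small sketch using the rows from the question, where none of the IDs overlap, shows the difference (the frame names names and drugs are only illustrative):

```python
import pandas as pd

# Rows copied from the question; the two tables share no UniProt IDs
names = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                      "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})
drugs = pd.DataFrame({"DrugBankID": ["DB00001", "DB00002", "DB00002"],
                      "UniProtID": ["P00734", "P00533", "O75015"]})

# Default inner join: only rows with a match on both sides survive
inner = pd.merge(drugs, names, on="UniProtID")

# Left join: every row of drugs is kept, NAME is NaN where nothing matched
left = pd.merge(drugs, names, on="UniProtID", how="left")

print(len(inner), len(left))
```

If the goal is to keep all rows of the drug table, how="left" is probably the safer choice.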
