Replace column names in pandas dataframe that partially match the row

Background

I would like to identify the column names in the dataframe that partially match the string and replace them with the original names plus some new elements added to them. The new items are integers defined by the list. Here is a similar question , but I'm afraid the proposed solution won't be flexible enough in my particular case. And here is another article with some great answers that come close to the problem I'm having.

Some research

I know I can combine two lists of strings, match them by park in a dictionary, and rename columns using the dictionary as input to the function df.rename

. But this seems too complicated and not very flexible, given that the number of existing columns will differ. As well as the number of columns to be renamed.

The following snippet will give an example of input:

# Libraries
import numpy as np
import pandas as pd
import itertools

# A dataframe
Observations = 5
Columns = 5
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
              columns = ['Price','obs_1','obs_2','obs_3','obs_4'])

datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'),
                     periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
print(df)

      

Input

enter image description here

I want to identify the column names starting with obs_

and add items (integers) from the list newElements = [5, 10, 15, 20]

after the '=' sign. The name column Price

remains the same. Other columns that appear after the columns obs_

must also remain unchanged.

The following snippet will display the desired output:

# Desired output
Observations = 5
Columns = 5
np.random.seed(123)
df2 = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
              columns = ['Price','Obs_1 = 5','Obs_2 = 10','Obs_3 = 15','Obs_4 = 20'])

df2['Dates'] = datelist
df2 = df2.set_index(['Dates'])
print(df2)

      

Output

enter image description here

My attempt

# Define the partial string I'm lookin for
stringMatch = 'Obs_'

# Put existing column names in a list
oldnames = list(df)

# Put elements that should be added to the column names
# where the three first letters match 'obs_'
newElements = [5, 10, 15, 20]
oldElements = [1, 2, 3, 4]

# Change types of the elements in the list
str_newElements = [str(x) for x in newElements]
str_oldElements = [str(y) for y in oldElements]
str_newNames = str_newElements.copy()

# Since I know the first column should not be renamed,
# I start with 'Price' in a list
newnames = ['Price']

# Then I add the renamed parts to the same list
i = 0
for oldElement in str_oldElements:   
    #print(repr(oldElement) + repr(str_newElements[i]))
    newnames.append(stringMatch + oldElement + ' = ' + str_newElements[i])
    i = i + 1

# Rename columns using the dict as input in df.rename
df.rename(columns = dict(zip(oldnames, newnames)), inplace = True)

print('My attempt: ', df)

      

enter image description here

Having made a complete list of the new column names I could use as well df.columns = newnames

, but hopefully one of you has a suggestion using a df.rename

more pythonic way than I could.

Thanks for any suggestions!

Here's all the code for a simple copy-paste:

# Libraries
import numpy as np
import pandas as pd
import itertools

# A dataframe
Observations = 5
Columns = 5
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['Price','obs_1','obs_2','obs_3','obs_4'])

datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'),
                         periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
print('Input: ', df)

# Desired output
Observations = 5
Columns = 5
np.random.seed(123)
df2 = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['Price','Obs_1 = 5','Obs_2 = 10','Obs_3 = 15','Obs_4 = 20'])

df2['Dates'] = datelist
df2 = df2.set_index(['Dates'])
print('Desired output: ', df2)

# My attempts
# Define the partial string I'm lookin for
stringMatch = 'Obs_'

# Put existing column names in a list
oldnames = list(df)

# Put elements that should be added to the column names
# where the three first letters match 'obs_'
newElements = [5, 10, 15, 20]
oldElements = [1, 2, 3, 4]

# Change types of the elements in the list
str_newElements = [str(x) for x in newElements]
str_oldElements = [str(y) for y in oldElements]
str_newNames = str_newElements.copy()


# Since I know the first column should not be renamed,
# I start with 'Price' in a list
newnames = ['Price']

# Then I add the renamed parts to the same list
i = 0
for oldElement in str_oldElements:

    #print(repr(oldElement) + repr(str_newElements[i]))
    newnames.append(stringMatch + oldElement + ' = ' + str_newElements[i])
    i = i + 1

# Rename columns using the dict as input in df.rename
df.rename(columns = dict(zip(oldnames, newnames)), inplace = True)

print('My attempt: ', df)

      

EDIT: Aftermath

So many good answers after just one day is amazing! This made it very difficult to decide which answer to accept. I don't know if the following will add meaning to the post as a whole, but I went ahead and included all the suggestions in functions and tested them with% timeit.

Here are the results: enter image description here

The HH1 proposal snippet was the first to be posted and is also one of the fastest in terms of runtime. I'll include the code later if anyone is interested.

EDIT 2

A suggestion from suvy showed these results when I tried it: enter image description here

The snippet worked fine until the last line. After executing the line, the df = df.rename(columns=dict(zip(names,renames)))

dataframe looked like this:

enter image description here

+3


source to share


5 answers


does it work?



df.columns = [col + ' = ' + str(newElements.pop(0)) if col.startswith(stringMatch) else col for col in df.columns]

      

+2


source


You can use a list comprehension:

df.columns = [ i if "_" not in i else i + "=" + str(newElements[int(i[-1])-1]) for i in df.columns]

      



output

    Price   obs_1=5 obs_2=10    obs_3=15    obs_4=20
0   103     92       92         96          107
1   109     100      91         90          107
2   105     99       90         104         90
3   105     109      104        94          90
4   106     94       107        93          92

      

+3


source


starting from your input dataframe called here df

            Price  obs_1  obs_2  obs_3  obs_4
Dates                                        
2017-06-15    103     92     92     96    107
2017-06-16    109    100     91     90    107
2017-06-17    105     99     90    104     90
2017-06-18    105    109    104     94     90
2017-06-19    106     94    107     93     92


newElements = [5, 10, 15, 20]
names = list(filter(lambda x: x.startswith('obs'), df.columns.values))
renames = list(map(lambda x,y: ' = '.join([x,str(y)]), names, newElements))
df = df.rename(columns=dict(zip(names,renames)))

      

returns

            Price   obs_1 = 5   obs_2 = 10  obs_3 = 15  obs_4 = 20
Dates                   
2017-06-19  103     92          92          96          107
2017-06-20  109     100         91          90          107
2017-06-21  105     99          90          104         90
2017-06-22  105     109         104         94          90
2017-06-23  106     94          107         93          92

      

+3


source


Select the columns you want, make the necessary changes and join to the original df

obs_cols = df.columns[df.columns.str.startswith('obs')]

obs_cols = [col + ' = ' + str(val) for col, val in zip(obs_cols, newElements)]

df.columns = list(df.columns[~df.columns.str.startswith('obs')]) + obs_cols


    Price   obs_1 = 5   obs_2 = 10  obs_3 = 15  obs_4 = 20
0   103     92          92          96          107
1   109     100         91          90          107
2   105     99          90          104         90
3   105     109         104         94          90
4   106     94          107         93          92

      

+2


source


For completeness, as you mention df.rename

, you can create an input for this with a dictionary comprehension, similar to the list comprehension in the other answers.

# Where Observations = len(df.index) as in the example
>>>newcols = {col: col+' = '+str(int(col[col.rfind('_')+1:])*Observations)
              for col in df.columns if col.find('obs_') != -1}
>>>df.rename(columns=newcols)
            Price  obs_1 = 5  obs_2 = 10  obs_3 = 15  obs_4 = 20
Dates                                                           
2017-06-15    103         92          92          96         107
2017-06-16    109        100          91          90         107
2017-06-17    105         99          90         104          90
2017-06-18    105        109         104          94          90
2017-06-19    106         94         107          93          92

      

I also made some assumptions here about why you are adding certain new items. if those assumptions are wrong df.rename

and vocabulary comprehension can still be used with the method from one of the other answers.

+1


source







All Articles