Replace column names in pandas dataframe that partially match the row
Background
I would like to identify the column names in the dataframe that partially match the string and replace them with the original names plus some new elements added to them. The new items are integers defined by the list. Here is a similar question , but I'm afraid the proposed solution won't be flexible enough in my particular case. And here is another article with some great answers that come close to the problem I'm having.
Some research
I know I can combine two lists of strings, match them by park in a dictionary, and rename columns using the dictionary as input to the function df.rename
. But this seems too complicated and not very flexible, given that the number of existing columns will differ. As well as the number of columns to be renamed.
The following snippet will give an example of input:
# Libraries
import numpy as np
import pandas as pd
import itertools
# A dataframe
Observations = 5
Columns = 5
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
columns = ['Price','obs_1','obs_2','obs_3','obs_4'])
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'),
periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
print(df)
Input
I want to identify the column names starting with obs_
and add items (integers) from the list newElements = [5, 10, 15, 20]
after the '=' sign. The name column Price
remains the same. Other columns that appear after the columns obs_
must also remain unchanged.
The following snippet will display the desired output:
# Desired output
Observations = 5
Columns = 5
np.random.seed(123)
df2 = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
columns = ['Price','Obs_1 = 5','Obs_2 = 10','Obs_3 = 15','Obs_4 = 20'])
df2['Dates'] = datelist
df2 = df2.set_index(['Dates'])
print(df2)
Output
My attempt
# Define the partial string I'm lookin for
stringMatch = 'Obs_'
# Put existing column names in a list
oldnames = list(df)
# Put elements that should be added to the column names
# where the three first letters match 'obs_'
newElements = [5, 10, 15, 20]
oldElements = [1, 2, 3, 4]
# Change types of the elements in the list
str_newElements = [str(x) for x in newElements]
str_oldElements = [str(y) for y in oldElements]
str_newNames = str_newElements.copy()
# Since I know the first column should not be renamed,
# I start with 'Price' in a list
newnames = ['Price']
# Then I add the renamed parts to the same list
i = 0
for oldElement in str_oldElements:
#print(repr(oldElement) + repr(str_newElements[i]))
newnames.append(stringMatch + oldElement + ' = ' + str_newElements[i])
i = i + 1
# Rename columns using the dict as input in df.rename
df.rename(columns = dict(zip(oldnames, newnames)), inplace = True)
print('My attempt: ', df)
Having made a complete list of the new column names I could use as well df.columns = newnames
, but hopefully one of you has a suggestion using a
df.rename
more pythonic way than I could.
Thanks for any suggestions!
Here's all the code for a simple copy-paste:
# Libraries
import numpy as np
import pandas as pd
import itertools
# A dataframe
Observations = 5
Columns = 5
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
columns = ['Price','obs_1','obs_2','obs_3','obs_4'])
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'),
periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
print('Input: ', df)
# Desired output
Observations = 5
Columns = 5
np.random.seed(123)
df2 = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
columns = ['Price','Obs_1 = 5','Obs_2 = 10','Obs_3 = 15','Obs_4 = 20'])
df2['Dates'] = datelist
df2 = df2.set_index(['Dates'])
print('Desired output: ', df2)
# My attempts
# Define the partial string I'm lookin for
stringMatch = 'Obs_'
# Put existing column names in a list
oldnames = list(df)
# Put elements that should be added to the column names
# where the three first letters match 'obs_'
newElements = [5, 10, 15, 20]
oldElements = [1, 2, 3, 4]
# Change types of the elements in the list
str_newElements = [str(x) for x in newElements]
str_oldElements = [str(y) for y in oldElements]
str_newNames = str_newElements.copy()
# Since I know the first column should not be renamed,
# I start with 'Price' in a list
newnames = ['Price']
# Then I add the renamed parts to the same list
i = 0
for oldElement in str_oldElements:
#print(repr(oldElement) + repr(str_newElements[i]))
newnames.append(stringMatch + oldElement + ' = ' + str_newElements[i])
i = i + 1
# Rename columns using the dict as input in df.rename
df.rename(columns = dict(zip(oldnames, newnames)), inplace = True)
print('My attempt: ', df)
EDIT: Aftermath
So many good answers after just one day is amazing! This made it very difficult to decide which answer to accept. I don't know if the following will add meaning to the post as a whole, but I went ahead and included all the suggestions in functions and tested them with% timeit.
The HH1 proposal snippet was the first to be posted and is also one of the fastest in terms of runtime. I'll include the code later if anyone is interested.
EDIT 2
A suggestion from suvy showed these results when I tried it:
The snippet worked fine until the last line. After executing the line, the df = df.rename(columns=dict(zip(names,renames)))
dataframe looked like this:
source to share
starting from your input dataframe called here df
Price obs_1 obs_2 obs_3 obs_4 Dates 2017-06-15 103 92 92 96 107 2017-06-16 109 100 91 90 107 2017-06-17 105 99 90 104 90 2017-06-18 105 109 104 94 90 2017-06-19 106 94 107 93 92 newElements = [5, 10, 15, 20] names = list(filter(lambda x: x.startswith('obs'), df.columns.values)) renames = list(map(lambda x,y: ' = '.join([x,str(y)]), names, newElements)) df = df.rename(columns=dict(zip(names,renames)))
returns
Price obs_1 = 5 obs_2 = 10 obs_3 = 15 obs_4 = 20
Dates
2017-06-19 103 92 92 96 107
2017-06-20 109 100 91 90 107
2017-06-21 105 99 90 104 90
2017-06-22 105 109 104 94 90
2017-06-23 106 94 107 93 92
source to share
Select the columns you want, make the necessary changes and join to the original df
obs_cols = df.columns[df.columns.str.startswith('obs')]
obs_cols = [col + ' = ' + str(val) for col, val in zip(obs_cols, newElements)]
df.columns = list(df.columns[~df.columns.str.startswith('obs')]) + obs_cols
Price obs_1 = 5 obs_2 = 10 obs_3 = 15 obs_4 = 20
0 103 92 92 96 107
1 109 100 91 90 107
2 105 99 90 104 90
3 105 109 104 94 90
4 106 94 107 93 92
source to share
For completeness, as you mention df.rename
, you can create an input for this with a dictionary comprehension, similar to the list comprehension in the other answers.
# Where Observations = len(df.index) as in the example
>>>newcols = {col: col+' = '+str(int(col[col.rfind('_')+1:])*Observations)
for col in df.columns if col.find('obs_') != -1}
>>>df.rename(columns=newcols)
Price obs_1 = 5 obs_2 = 10 obs_3 = 15 obs_4 = 20
Dates
2017-06-15 103 92 92 96 107
2017-06-16 109 100 91 90 107
2017-06-17 105 99 90 104 90
2017-06-18 105 109 104 94 90
2017-06-19 106 94 107 93 92
I also made some assumptions here about why you are adding certain new items. if those assumptions are wrong df.rename
and vocabulary comprehension can still be used with the method from one of the other answers.
source to share