Indexing columns based on cell value in pandas

Question

I have a dataframe of race results. I would like to create a series that takes the position in the last stage and subtracts the average of all the stages before it. Here is a small snippet of df (there may be more stages, countries and rows):

race_location     stage1_position  stage2_position  stage3_position  number_of_stages
AUS               2.0              2.0              NaN              2
AUS               1.0              5.0              NaN              2
AUS               3.0              4.0              NaN              2
AUS               4.0              8.0              NaN              2
AUS               10.0             6.0              NaN              2
AUS               9.0              7.0              NaN              2
FRA               23.0             1.0              10.0             3
FRA               6.0              12.0             24.0             3
FRA               14.0             11.0             14.0             3
FRA               18.0             10.0             1.0              3
FRA               15.0             14.0             4.0              3
USA               24.0             NaN              NaN              1
USA               7.0              NaN              NaN              1
USA               22.0             NaN              NaN              1
USA               11.0             NaN              NaN              1
USA               8.0              NaN              NaN              1
USA               16.0             NaN              NaN              1
USA               13.0             NaN              NaN              1
USA               19.0             NaN              NaN              1
USA               5.0              NaN              NaN              1
USA               25.0             NaN              NaN              1

The desired output would be:

last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5             
-13            
-10.5           
0
0
0
0
0
0
0
0
0
0
0

This doesn't work, but I was thinking of something like this:

new_series = []
for country in country_list:

    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']

    differnce = df.ix[df['race_location'] == country, num_stages] -
        df.iloc[:, 0:num_stages-1].mean(axis=1)

    new_series.append(difference)

I'm not sure how to do this. Any help or direction would be awesome!

+3

python pandas dataframe

moto Apr 25. 17 at 12:24 am

3 answers

I would use filter to get only the stage columns, and then stack and groupby.

stages = df.filter(regex=r'^stage\d+.*')

stages.stack().groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)

0      0.0
1      4.0
2      1.0
3      4.0
4     -4.0
5     -2.0
6     -2.0
7     15.0
8      1.5
9    -13.0
10   -10.5
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     0.0
19     0.0
20     0.0
dtype: float64

How it works

stack automatically drops NaN values when converting to a series.
Position -1 is now the last value within each group, if we group by the first level of the new MultiIndex.
So we use a lambda to calculate the mean of everything up to the last value, x.iloc[:-1].mean(),
and subtract it from the last value, x.iloc[-1].
For rows with only one stage, the mean of the empty slice is NaN, which is why the result is passed through fillna(0).
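The steps above can be sketched on three rows taken from the question's sample data. This is a minimal illustration, not a definitive implementation: the small frame is rebuilt by hand, and the explicit dropna() call is an addition here, made on the assumption that the NaN handling of stack may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question's sample, one per country.
df = pd.DataFrame({
    "race_location": ["AUS", "FRA", "USA"],
    "stage1_position": [1.0, 6.0, 24.0],
    "stage2_position": [5.0, 12.0, np.nan],
    "stage3_position": [np.nan, 24.0, np.nan],
    "number_of_stages": [2, 3, 1],
})

# Keep only the stage columns.
stages = df.filter(regex=r"^stage\d+")

# Long format: one entry per (row, stage). The explicit dropna() is an
# assumption-safe addition so that missing stages are removed regardless
# of how stack treats NaN in the installed pandas version.
stacked = stages.stack().dropna()

# Group by the original row label (level 0 of the MultiIndex); within each
# group the last element is the last completed stage.
result = stacked.groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)

print(result)
```

The row with a single stage has no earlier values, so the mean of the empty slice is NaN and fillna(0) turns it into 0, matching the question's desired output.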

+2

piRSquared Apr 25. 17 at 0:31

"subtracts the average of all the stages before it"

Not that it matters, but I'm just curious! Unlike your desired output, but in line with your description: if a racer has only finished one race, shouldn't their result be inf or NaN instead of 0? (This would distinguish them from someone who has done 2~3 races but whose last result is exactly the same as the average of the earlier races, e.g. racer #1 vs. racers #11~20.)

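df_sp = df.filter(regex=r'^stage\d+.*')
# forward-fill along each row so the last column holds the last completed stage
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
# mean of the earlier stages; 0/0 gives NaN for racers with a single stage
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])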

0      0.0
1      4.0
2      1.0
3      4.0
4     -4.0
5     -2.0
6     -2.0
7     15.0
8      1.5
9    -13.0
10   -10.5
11     NaN
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20     NaN
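To make the distinction concrete, here is a small sketch using two rows from the question: one racer whose last stage equals the earlier average, and one racer with only a single stage. The frame is rebuilt by hand and the variable names (stage_cols, earlier_sum, earlier_mean, diff) are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Row 0: positions 2.0 and 2.0 (last stage equals the earlier average).
# Row 1: a single stage with position 24.0.
df = pd.DataFrame({
    "stage1_position": [2.0, 24.0],
    "stage2_position": [2.0, np.nan],
    "stage3_position": [np.nan, np.nan],
    "number_of_stages": [2, 1],
})

stage_cols = df.filter(regex=r"^stage\d+")

# Forward-fill along each row so the last column holds the last completed stage.
last = stage_cols.ffill(axis=1).iloc[:, -1]

# sum(axis=1) skips NaN by default, so subtracting the last stage leaves
# the sum of the earlier stages.
earlier_sum = stage_cols.sum(axis=1) - last

# For a single-stage racer this is 0.0 / 0, which pandas evaluates to NaN.
earlier_mean = earlier_sum / (df["number_of_stages"] - 1)

diff = last - earlier_mean
print(diff)
```

With this approach the two-stage racer gets 0.0 while the single-stage racer gets NaN, so the two cases stay distinguishable unless fillna(0) is applied afterwards.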

0

su79eu7k Apr 25. 17 at 1:14

Allen · Accepted Answer · 2017-04-25T00:32:52+0000

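import numpy as np
# Use pandas apply to take the mean of the first n-1 stages and subtract it from the last stage.
# This relies on column order: race_location is column 0, so stage k sits at position k.
df.apply(lambda x: x.iloc[x.number_of_stages] - np.mean(x.iloc[1:x.number_of_stages]), axis=1).fillna(0)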
Out[264]: 
0      0.0
1      4.0
2      1.0
3      4.0
4     -4.0
5     -2.0
6     -2.0
7     15.0
8      1.5
9    -13.0
10   -10.5
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     0.0
19     0.0
20     0.0
dtype: float64
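The one-liner above depends on race_location being the first column. As a hedged variant of the same row-wise idea, the sketch below selects the stage columns by name instead. The helper name last_minus_earlier_mean and the stage_cols list are hypothetical, and the code assumes the stage columns appear in stage order, as they do in the question.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question's sample, one per country.
df = pd.DataFrame({
    "race_location": ["AUS", "FRA", "USA"],
    "stage1_position": [1.0, 6.0, 24.0],
    "stage2_position": [5.0, 12.0, np.nan],
    "stage3_position": [np.nan, 24.0, np.nan],
    "number_of_stages": [2, 3, 1],
})

# Assumption: stage columns are named "stage<k>_position" and are in order.
stage_cols = [c for c in df.columns if c.startswith("stage")]

def last_minus_earlier_mean(row):
    """Last completed stage minus the mean of the stages before it."""
    n = int(row["number_of_stages"])
    positions = row[stage_cols].astype(float).to_numpy()
    if n < 2:
        # No earlier stages; the question's desired output uses 0 here.
        return 0.0
    return positions[n - 1] - positions[: n - 1].mean()

result = df.apply(last_minus_earlier_mean, axis=1)
print(result)
```

Row-wise apply is easy to read but slower than the vectorized answers on large frames, so it is best suited to small tables like this one.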
