Indexing columns based on cell value in pandas

I have a dataframe of race results. I would like to create a series that takes the position in the last stage and subtracts the average of all the stages before it. Here is a small snippet of df (there may be more stages, countries and rows):

race_location     stage1_position  stage2_position  stage3_position  number_of_stages
AUS               2.0              2.0              NaN              2
AUS               1.0              5.0              NaN              2
AUS               3.0              4.0              NaN              2
AUS               4.0              8.0              NaN              2
AUS               10.0             6.0              NaN              2
AUS               9.0              7.0              NaN              2
FRA               23.0             1.0              10.0             3
FRA               6.0              12.0             24.0             3
FRA               14.0             11.0             14.0             3
FRA               18.0             10.0             1.0              3
FRA               15.0             14.0             4.0              3
USA               24.0             NaN              NaN              1
USA               7.0              NaN              NaN              1
USA               22.0             NaN              NaN              1
USA               11.0             NaN              NaN              1
USA               8.0              NaN              NaN              1
USA               16.0             NaN              NaN              1
USA               13.0             NaN              NaN              1
USA               19.0             NaN              NaN              1
USA               5.0              NaN              NaN              1
USA               25.0             NaN              NaN              1
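
For anyone who wants to reproduce the table above, here is a minimal sketch that builds it by hand. The values are copied from the snippet; the construction itself (list repetition, the `nan` alias, the cast to float) is just one convenient way to do it, not part of the original question.

```python
import numpy as np
import pandas as pd

nan = np.nan  # short alias, only for readability in this sketch

df = pd.DataFrame({
    "race_location": ["AUS"] * 6 + ["FRA"] * 5 + ["USA"] * 10,
    "stage1_position": [2, 1, 3, 4, 10, 9, 23, 6, 14, 18, 15,
                        24, 7, 22, 11, 8, 16, 13, 19, 5, 25],
    "stage2_position": [2, 5, 4, 8, 6, 7, 1, 12, 11, 10, 14] + [nan] * 10,
    "stage3_position": [nan] * 6 + [10, 24, 14, 1, 4] + [nan] * 10,
    "number_of_stages": [2] * 6 + [3] * 5 + [1] * 10,
}).astype({"stage1_position": float})  # match the float positions shown above

print(df)
```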

      

The desired output is

last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5             
-13            
-10.5           
0
0
0
0
0
0
0
0
0
0
0

      

This doesn't work, but I was thinking of something like this:

new_series = []
for country in country_list:

    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']

    difference = df.ix[df['race_location'] == country, num_stages] -
        df.iloc[:, 0:num_stages-1].mean(axis=1)

    new_series.append(difference)

      

I'm not sure how to do this. Any help or direction would be awesome!



3 answers


import numpy as np

# Use apply row by row: take the last stage and subtract the mean of the first n-1 stages.
# Column 0 is race_location, so stage k sits at position k within each row.
df.apply(lambda x: x.iloc[x.number_of_stages] - np.mean(x.iloc[1:x.number_of_stages]), axis=1).fillna(0)
Out[264]: 
0      0.0
1      4.0
2      1.0
3      4.0
4     -4.0
5     -2.0
6     -2.0
7     15.0
8      1.5
9    -13.0
10   -10.5
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     0.0
19     0.0
20     0.0
dtype: float64

      





I would use filter to get only the stage columns, and then stack and groupby:

stages = df.filter(regex=r'^stage\d+.*')

stages.stack().groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)

0      0.0
1      4.0
2      1.0
3      4.0
4     -4.0
5     -2.0
6     -2.0
7     15.0
8      1.5
9    -13.0
10   -10.5
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     0.0
19     0.0
20     0.0
dtype: float64

      




How it works

  • stack automatically drops the NaN values when converting to a series.
  • Position -1 is then the last value within each group, when we group by the first level of the new MultiIndex.
  • So we use a lambda to calculate the mean of everything up to the last value, x.iloc[:-1].mean()
  • And subtract it from the last value, x.iloc[-1]
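
The steps above can be traced on a tiny frame. This is only a sketch: the three rows are taken from the question's table (racers with 2, 3 and 1 completed stages). Depending on the pandas version, stack may keep the NaN cells instead of dropping them, so the sketch calls dropna() explicitly to get the behaviour described above either way.

```python
import numpy as np
import pandas as pd

# Three rows from the question's table: 2, 3 and 1 completed stages.
stages = pd.DataFrame({
    "stage1_position": [1.0, 23.0, 24.0],
    "stage2_position": [5.0, 1.0, np.nan],
    "stage3_position": [np.nan, 10.0, np.nan],
})

# Long format: one entry per (row, stage) pair, with missing stages removed.
stacked = stages.stack().dropna()
print(stacked)

# Group by the original row label (level 0 of the MultiIndex).
result = stacked.groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()  # last stage minus mean of earlier ones
).fillna(0)  # a single-stage row has no earlier stages, so its NaN becomes 0
print(result)
```

Only six of the nine cells survive in stacked, which is why x.iloc[-1] picks the last completed stage rather than the last column.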



"subtracts the average of all the stages before it"

This is a side note, but I'm curious: contrary to your desired output, but in line with your description, if a racer only finished one stage, shouldn't their result be inf or NaN instead of 0? (That would distinguish them from someone who has completed 2-3 stages and whose last result is exactly equal to the average of the earlier ones, e.g. racer #1 vs. racers #11-20.)

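# Keep only the stage columns.
df_sp = df.filter(regex=r'^stage\d+.*')
# Forward-fill along each row so the last column holds the last completed stage.
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
# Mean of the earlier stages; 0 / 0 yields NaN for single-stage racers.
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])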

0      0.0
1      4.0
2      1.0
3      4.0
4     -4.0
5     -2.0
6     -2.0
7     15.0
8      1.5
9    -13.0
10   -10.5
11     NaN
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20     NaN
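
To make the difference between a real 0 and an undefined result concrete, here is a small sketch on three rows from the question's table. The variable names last, mean_before and diff are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "stage1_position": [2.0, 23.0, 24.0],
    "stage2_position": [2.0, 1.0, np.nan],
    "stage3_position": [np.nan, 10.0, np.nan],
    "number_of_stages": [2, 3, 1],
})

df_sp = df.filter(regex=r"^stage\d+")

# Last completed stage per row (forward-fill across the columns).
last = df_sp.ffill(axis=1).iloc[:, -1]

# Sum of the earlier stages divided by how many there are.
# A single-stage row gives 0 / 0, which pandas evaluates to NaN.
mean_before = (df_sp.sum(axis=1) - last) / (df["number_of_stages"] - 1)

diff = last - mean_before
print(diff)
```

The first row comes out as a genuine 0.0 (last stage equals the earlier average), while the single-stage row stays NaN, so the two cases are no longer mixed up.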

      







