Difference between maximum and second maximum from DataFrame
I have a DataFrame. I want the difference between the maximum and the second maximum of each row, added to the DataFrame as a new column.
For example, the DataFrame looks like this (the real one is rather large):
gene_id Time_1 Time_2 Time_3
a 0.01489251 8.00246 8.164309
b 6.67943235 0.8832114 1.048761
So far I've tried the following:
largest = max(df)
second_largest = max(item for item in df if item < largest)
It only returns the header value.
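The behaviour described above follows from how iteration over a DataFrame works: it yields column labels, not cell values, so the built-in max() ends up comparing label strings. A minimal sketch, assuming the sample data from the question is loaded with gene_id as an ordinary column:

```python
import pandas as pd

# Sample data from the question, with gene_id as a regular column (an assumption).
df = pd.DataFrame({
    "gene_id": ["a", "b"],
    "Time_1": [0.01489251, 6.67943235],
    "Time_2": [8.00246, 0.8832114],
    "Time_3": [8.164309, 1.048761],
})

# Iterating a DataFrame yields its column labels, not its values.
labels = list(df)

# The built-in max() therefore compares the label strings with each other.
largest_label = max(df)

print(labels)
print(largest_label)
```

To work on the values themselves, use the DataFrame's own methods (for example df.max(axis=1)) or apply a function row by row.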
You can define a func that takes the values, sorts them in descending order, slices the top 2 values ([:2]), then calculates the difference and returns the second value (as the first value is NaN). You apply this function and pass the arg axis=1 so that it is applied row-wise:
In [195]:
def func(x):
    # Series.sort() was removed from pandas; sort_values() returns a sorted copy
    return -x.sort_values(ascending=False)[:2].diff().iloc[1]
df['diff'] = df.loc[:,'Time_1':].apply(func, axis=1)
df
Out[195]:
gene_id Time_1 Time_2 Time_3 diff
0 a 0.014893 8.002460 8.164309 0.161849
1 b 6.679432 0.883211 1.048761 5.630671
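A variant of the same row-wise idea, sketched here with Series.nlargest instead of a full sort. This is a swapped-in technique, not the code from the answer above, and the helper name top_two_gap is hypothetical:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "gene_id": ["a", "b"],
    "Time_1": [0.01489251, 6.67943235],
    "Time_2": [8.00246, 0.8832114],
    "Time_3": [8.164309, 1.048761],
})

def top_two_gap(row):
    # nlargest(2) returns the two largest values in descending order.
    top = row.nlargest(2)
    return top.iloc[0] - top.iloc[1]

# Restrict to the numeric columns before applying the function row by row.
time_cols = df.loc[:, "Time_1":]
df["diff"] = time_cols.apply(top_two_gap, axis=1)
print(df)
```

Because nlargest keeps duplicates, a row whose two largest values are equal gets a difference of 0.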
Here is my solution:
# Load data
import numpy as np
import pandas as pd
data = {'a': [0.01489251, 8.00246, 8.164309], 'b': [6.67943235, 0.8832114, 1.048761]}
df = pd.DataFrame.from_dict(data, orient='index')
The trick is to partition the values in linear time and keep the top 2 using numpy.argpartition. You then take the absolute value of the difference between the two largest values. The function is applied row-wise.
def f(x):
    ind = np.argpartition(x.values, -2)[-2:]
    return np.abs(x.iloc[ind[0]] - x.iloc[ind[1]])
df.apply(f, axis=1)
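The same partitioning idea can also be run on the whole array at once, without apply. The following is only a sketch of that variant, using numpy.partition along axis=1 rather than the per-row argpartition call above; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Same data as above, one row per gene.
data = {"a": [0.01489251, 8.00246, 8.164309],
        "b": [6.67943235, 0.8832114, 1.048761]}
df = pd.DataFrame.from_dict(data, orient="index")

values = df.to_numpy()

# Partitioning at position -2 puts the second-largest value of each row at
# index -2 and the largest at index -1; the rest of the row is left unsorted.
top_two = np.partition(values, -2, axis=1)[:, -2:]

# Largest minus second largest, re-attached to the original row labels.
diff = pd.Series(top_two[:, 1] - top_two[:, 0], index=df.index)
print(diff)
```

Since the largest value always ends up in the last position, no absolute value is needed here.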
Here's an elegant solution that doesn't require sorting or defining any functions. It is also fully vectorized, as it avoids the apply method.
maxes = df.max(axis=1)
less_than_max = df.where(df.lt(maxes, axis='rows'))
seconds = less_than_max.max(axis=1)
df['diff'] = maxes - seconds
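A self-contained run of this approach, assuming the DataFrame holds only the numeric time columns (gene_id is used as the index here). The row labelled "tie" is hypothetical and is added only to show how repeated maxima behave:

```python
import pandas as pd

# Numeric columns only; gene_id serves as the index. The "tie" row is made up.
df = pd.DataFrame(
    {"Time_1": [0.01489251, 6.67943235, 5.0],
     "Time_2": [8.00246, 0.8832114, 5.0],
     "Time_3": [8.164309, 1.048761, 2.0]},
    index=["a", "b", "tie"],
)

# Row-wise maximum.
maxes = df.max(axis=1)

# Mask every value that is not strictly below its row maximum (becomes NaN).
less_than_max = df.where(df.lt(maxes, axis="index"))

# The maximum of what remains is the largest value strictly below the row maximum.
seconds = less_than_max.max(axis=1)

diff = maxes - seconds
print(diff)
```

Note the design consequence: when the maximum occurs more than once in a row, all copies are masked, so the result is the gap to the next smaller distinct value (3.0 for the "tie" row) rather than 0.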