Difference between maximum and second maximum from DataFrame

Question

I have a DataFrame. For each row, I want the difference between the maximum and the second-largest value, added to the DataFrame as a new column.

For example, the DataFrame looks like this (the real one is rather large):

gene_id  Time_1      Time_2     Time_3
a        0.01489251  8.00246    8.164309
b        6.67943235  0.8832114  1.048761

So far I've tried the following, but it only looks at the column headers:

largest = max(df)
second_largest = max(item for item in df if item < largest)

and it only returns a header value.
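For context, the attempt returns a header because iterating over a DataFrame yields its column labels, not its values, so `max(df)` compares the label strings. A minimal sketch, assuming the example data above is loaded with `gene_id` as an ordinary column:

```python
import pandas as pd

# Assumed reconstruction of the example data from the question
df = pd.DataFrame({
    'gene_id': ['a', 'b'],
    'Time_1': [0.01489251, 6.67943235],
    'Time_2': [8.00246, 0.8832114],
    'Time_3': [8.164309, 1.048761],
})

# Iterating a DataFrame walks over the column labels
labels = list(df)  # ['gene_id', 'Time_1', 'Time_2', 'Time_3']

# So max() compares the label strings, not the numbers in the cells
largest = max(df)  # 'gene_id' (lowercase 'g' sorts after uppercase 'T')
second_largest = max(item for item in df if item < largest)  # 'Time_3'
```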

+3

python pandas

user1017373 23 jul. At 10:09 am

3 answers

Here is my solution:

import numpy as np
import pandas as pd

# Load data
data = {'a': [0.01489251, 8.00246, 8.164309], 'b': [6.67943235, 0.8832114, 1.048761]}
df = pd.DataFrame.from_dict(data, orient='index')

The trick is to do a linear-time selection on the values and keep the top two using numpy.argpartition. You then take the absolute value of the difference between the two largest values. The function is applied row-wise.

def f(x):
    ind = np.argpartition(x.values, -2)[-2:]
    return np.abs(x.iloc[ind[0]] - x.iloc[ind[1]])

df.apply(f, axis=1)
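As a variation on the row-wise function above (not part of the original answer), the same selection can be done for all rows at once with `np.partition` along `axis=1`. This is a sketch that assumes every column is numeric and there are at least two columns; the name `top_two` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same numeric data as above, one row per gene
df = pd.DataFrame(
    {'Time_1': [0.01489251, 6.67943235],
     'Time_2': [8.00246, 0.8832114],
     'Time_3': [8.164309, 1.048761]},
    index=['a', 'b'],
)

values = df.to_numpy()

# Partition each row so that its last two positions hold the two largest
# values: the second largest at index -2 and the maximum at index -1
top_two = np.partition(values, -2, axis=1)[:, -2:]

# Maximum minus second maximum, one value per row
diff = pd.Series(top_two[:, 1] - top_two[:, 0], index=df.index)
```

Because the maximum always ends up in the last position, no absolute value is needed here.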

+1

Kikohs 23 jul. 15 at 10:24

Here's an elegant solution that doesn't require sorting or defining any functions. It is also fully vectorized, as it avoids using the apply method.

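# Row-wise maximum
maxes = df.max(axis=1)
# Keep only the values strictly below each row's maximum (others become NaN)
less_than_max = df.where(df.lt(maxes, axis='index'))
# The largest remaining value in each row is the second maximum
seconds = less_than_max.max(axis=1)
df['diff'] = maxes - seconds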
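One thing worth knowing about this approach is how it treats ties: the "second maximum" is the largest value strictly below the maximum, so a row whose maximum appears twice does not give a difference of 0. The sketch below assumes the question's layout (a `gene_id` column followed by numeric columns) and adds a hypothetical row `c` purely to show the tie case.

```python
import pandas as pd

df = pd.DataFrame({
    'gene_id': ['a', 'b', 'c'],           # row 'c' is hypothetical
    'Time_1': [0.01489251, 6.67943235, 5.0],
    'Time_2': [8.00246, 0.8832114, 5.0],
    'Time_3': [8.164309, 1.048761, 1.0],
})

# Work on the numeric columns only, leaving gene_id out of the comparison
times = df.loc[:, 'Time_1':]

maxes = times.max(axis=1)
# Mask out everything that is not strictly below the row maximum
seconds = times.where(times.lt(maxes, axis='index')).max(axis=1)
df['diff'] = maxes - seconds

# Row 'c' has its maximum 5.0 twice, so the second maximum is 1.0
# and the difference is 4.0, not 0.0
```

If tied maxima should count as a difference of 0, a sort-based or partition-based approach is the safer choice.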

+1

JoeCondron 23 jul. 15 at 11:42

EdChum · Accepted Answer · 2015-07-23T10:28:36+0000

You can define a func that takes the values, sorts them, slices the top 2 values ([:2]), then calculates the difference and returns the second value (as the first value is NaN). You apply this func and pass the arg axis=1 so that it is applied row-wise:

In [195]:
def func(x):
    # Sort descending, keep the top two, and negate the (negative) second-minus-first difference
    return -x.sort_values(ascending=False)[:2].diff().iloc[1]

df['diff'] = df.loc[:,'Time_1':].apply(func, axis=1)
df

Out[195]:
  gene_id    Time_1    Time_2    Time_3      diff
0       a  0.014893  8.002460  8.164309  0.161849
1       b  6.679432  0.883211  1.048761  5.630671
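To see why the second element is the one returned, it may help to trace the steps on a single row. This sketch uses the values of row `a` from the question; the intermediate names (`top_two`, `step`, `result`) are only illustrative.

```python
import pandas as pd

# Row 'a' from the question, as func would receive it from apply(axis=1)
row = pd.Series({'Time_1': 0.01489251, 'Time_2': 8.00246, 'Time_3': 8.164309})

# Sort descending and keep the two largest: Time_3 first, then Time_2
top_two = row.sort_values(ascending=False).iloc[:2]

# diff() subtracts the previous element from each element, so the first
# entry is NaN and the second is (second max - max), which is negative
step = top_two.diff()

# Negating the second entry gives max - second max
result = -step.iloc[1]
```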
