How do I create a pandas dataframe containing only the rows with the top 20% of values in a column?
There is a pandas dataframe:
import pandas as pd
df = pd.DataFrame({'c1':['a','b','c','d','e','f','g','h','i','j'],
                   'c2':[10,12,23,4,18,98,11,23,33,99]})
  c1  c2
0  a  10
1  b  12
2  c  23
3  d   4
4  e  18
5  f  98
6  g  11
7  h  23
8  i  33
9  j  99
I want to create a new dataframe that only contains the top 20% of the rows according to the values in column c2, in this case:
output:
  c1  c2
0  f  98
1  j  99
+3
freefrog
4 answers
You can use the quantile method to compute the 80th-percentile threshold and keep the rows whose values lie above it:
df[df.c2.gt(df.c2.quantile(0.8))]
# c1 c2
#5 f 98
#9 j 99
Or use nlargest:
df.nlargest(int(len(df) * 0.2), 'c2')
# c1 c2
#9 j 99
#5 f 98
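As a rough sketch that goes beyond the answer above: the question's expected output is numbered from 0, which neither snippet produces on its own. Assuming that renumbering is wanted, reset_index(drop=True) can be chained on. The variable names below (threshold, top_by_quantile, top_by_nlargest) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'c1': list('abcdefghij'),
                   'c2': [10, 12, 23, 4, 18, 98, 11, 23, 33, 99]})

# 80th-percentile threshold; quantile interpolates linearly by default,
# so this is roughly 46 for the sample data
threshold = df['c2'].quantile(0.8)

# Rows strictly above the threshold, renumbered from 0
top_by_quantile = df[df['c2'] > threshold].reset_index(drop=True)

# Same row count via nlargest; sort_index restores the original row order
# before renumbering
n_rows = int(len(df) * 0.2)
top_by_nlargest = df.nlargest(n_rows, 'c2').sort_index().reset_index(drop=True)

print(top_by_quantile)
print(top_by_nlargest)
```

Note that the two approaches are not guaranteed to agree in general: the quantile filter can return more or fewer rows than int(len(df) * 0.2) when values are tied around the threshold, while nlargest always returns a fixed number of rows.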
+2
Psidom
In the interest of diversity ...
top_percentage = 0.2
df.sort_values('c2').tail(int(len(df) * top_percentage))
# Output:
# c1 c2
# 5 f 98
# 9 j 99
+2
Alexander
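df = df.sort_values(by=['c2'], ascending=True)  # largest values end up last
split_len = int(0.8 * len(df))                  # position where the top 20% starts
df = df.iloc[split_len:]                        # keep everything from that position on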
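One thing worth checking, as a hedged aside: cutting at int(0.8 * len(df)) and taking tail(int(len(df) * 0.2)) give the same rows for the 10-row example, but they can differ by one row when the length is not a multiple of 5, because int() truncates in both cases. The 11-row frame below is made up purely to illustrate that.

```python
import pandas as pd

# Hypothetical 11-row frame, used only to show the rounding difference
df11 = pd.DataFrame({'c2': range(11)})
ordered = df11.sort_values('c2')

# Drop the bottom 80%: int(8.8) == 8, so 3 rows remain
keep_from_cut = ordered.iloc[int(0.8 * len(ordered)):]

# Keep the top 20%: int(2.2) == 2, so 2 rows remain
keep_from_tail = ordered.tail(int(len(ordered) * 0.2))

print(len(keep_from_cut), len(keep_from_tail))
```

Which of the two is preferable depends on whether "top 20%" should round the number of kept rows up or down.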
+1
Bhushan mehta
Using the pct=True parameter of the pd.Series.rank method:
df[df.c2.rank(pct=True).gt(.8)]
c1 c2
5 f 98
9 j 99
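To see why this works, it may help to look at the intermediate percentile ranks. The sketch below assumes the default tie handling (method='average'); pct_rank and top are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'c1': list('abcdefghij'),
                   'c2': [10, 12, 23, 4, 18, 98, 11, 23, 33, 99]})

# Rank each value, then divide by the number of rows to get a fraction in (0, 1].
# Tied values (the two 23s) share the average of their ranks by default.
pct_rank = df['c2'].rank(pct=True)

# Keep rows whose percentile rank is strictly above 0.8
top = df[pct_rank > 0.8]

print(pd.concat([df, pct_rank.rename('pct_rank')], axis=1))
print(top)
```

Here the value 33 has a percentile rank of 0.8, so the strict comparison excludes it; using ge(.8) instead would include it and return three rows.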
+1
piRSquared