Python Pandas: return DataFrame rows where a value's count exceeds a given number
I have a Pandas DataFrame and I only want to return the rows for a customer number if that number occurs more than a certain number of times.
Here's an example DataFrame:
114 2017-04-26 1 7507 34 13
115 2017-04-26 3 77314 41 14
116 2017-04-27 7 4525 190 315
117 2017-04-27 7 5525 67 94
118 2017-04-27 1 6525 43 378
119 2017-04-27 3 7415 38 27
120 2017-04-27 2 7613 47 10
121 2017-04-27 2 77314 9 3
122 2017-04-28 1 227 17 4
123 2017-04-28 8 4525 205 341
124 2017-04-28 1 7415 31 20
125 2017-04-28 2 77314 8 2
I can check whether each customer number occurs more than 5 times using this code:
print(zip_data_df['Customers'].value_counts()>5)
7415 True
4525 True
5525 True
77314 True
6525 True
4111 True
227 True
206 False
7507 False
7613 False
4108 False
3046 False
2605 False
4139 False
4119 False
I expected the following to work:
print(zip_data_df[zip_data_df['Customers'].value_counts()>5])
I expected it to show me the entire DataFrame for customers that occur more than 5 times, but I got a Boolean indexing error. I understand why now: one Series just tells me whether each customer number occurs more than 5 times, while the DataFrame has a row for every occurrence of a customer number. They are not the same length. But how can I get the DataFrame to return only the records where the customer occurs more than 5 times?
I'm sure there is a simple answer that I am missing, but I'd appreciate any help you can give me.
So, the problem here is indexing: value_counts() returns a Series indexed by the values of 'Customers', while zip_data_df seems to be indexed on something else. You can do something like:
cust_counts = zip_data_df['Customers'].value_counts().rename('cust_counts')
zip_data_df = zip_data_df.merge(cust_counts.to_frame(), left_on='Customers', right_index=True)
From there, you can select conditionally from zip_data_df like this:
zip_data_df[zip_data_df.cust_counts > 5]
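For anyone who wants to try the merge approach end to end, here is a minimal sketch on made-up data. Only the 'Customers' column name comes from the question; the 'Sales' column, the values, and the lowered threshold of 2 are assumptions so that the small sample produces a match.

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name is from the question.
zip_data_df = pd.DataFrame({
    "Customers": [7415, 7415, 7415, 4525, 4525, 227],
    "Sales": [38, 31, 12, 190, 205, 17],
})

# Count occurrences per customer number; the result is indexed by customer number.
cust_counts = zip_data_df["Customers"].value_counts().rename("cust_counts")

# Attach each row's count by matching the 'Customers' column against that index.
merged = zip_data_df.merge(cust_counts.to_frame(), left_on="Customers", right_index=True)

# Threshold lowered to 2 (the question uses 5) so the toy data has matches.
frequent = merged[merged["cust_counts"] > 2]
print(frequent)
```

After the merge every row carries its own count, so the Boolean mask has the same length as the DataFrame and the selection works.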
I believe what you are looking for is:
zip_data_df['Customers'].value_counts()[zip_data_df['Customers'].value_counts()>5]
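One thing worth checking before using this: the expression filters the counts themselves, so it appears to give one entry per frequent customer number rather than the original rows. A small sketch on made-up data (the values and the threshold of 2 are assumptions):

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name is from the question.
zip_data_df = pd.DataFrame({"Customers": [7415, 7415, 7415, 4525, 4525, 227]})

# Compute the counts once and reuse them instead of calling value_counts() twice.
counts = zip_data_df["Customers"].value_counts()

# Keep only the counts above the threshold (2 here; the question uses 5).
frequent_counts = counts[counts > 2]
print(frequent_counts)
```

The result is a Series of counts indexed by customer number, which is useful for seeing which customers qualify but still needs another step to select rows from zip_data_df.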
I had a similar problem and solved it this way.
cust_counts = zip_data_df['Customers'].value_counts()
cust_list = cust_counts[cust_counts > 5].index.tolist()
zip_data_df = zip_data_df[zip_data_df['Customers'].isin(cust_list)]
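Here is the same isin approach as a runnable sketch on made-up data, next to a one-step variant using groupby(...).transform('size'). The variant is not from this thread; it is just another way to get a per-row count. The 'Sales' column, the values, and the threshold of 2 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name is from the question.
zip_data_df = pd.DataFrame({
    "Customers": [7415, 7415, 7415, 4525, 4525, 227],
    "Sales": [38, 31, 12, 190, 205, 17],
})
threshold = 2  # the question uses 5; lowered so the toy data has matches

# isin approach: collect the frequent customer numbers, then select their rows.
cust_counts = zip_data_df["Customers"].value_counts()
cust_list = cust_counts[cust_counts > threshold].index.tolist()
via_isin = zip_data_df[zip_data_df["Customers"].isin(cust_list)]

# Variant: transform('size') returns each row's group size, aligned to the original index.
row_counts = zip_data_df.groupby("Customers")["Customers"].transform("size")
via_transform = zip_data_df[row_counts > threshold]

print(via_isin)
```

Both selections keep the original index and columns, so neither adds a helper column to the DataFrame.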