Python Pandas: return only the DataFrame rows whose value count exceeds a given number

I have a Pandas DataFrame, and I want to return only the rows whose client number occurs more than a certain number of times.

Here's an example DataFrame:

114  2017-04-26      1       7507       34      13
115  2017-04-26      3      77314       41      14
116  2017-04-27      7       4525      190     315
117  2017-04-27      7       5525       67      94
118  2017-04-27      1       6525       43     378
119  2017-04-27      3       7415       38      27
120  2017-04-27      2       7613       47      10
121  2017-04-27      2      77314        9       3
122  2017-04-28      1        227       17       4
123  2017-04-28      8       4525      205     341
124  2017-04-28      1       7415       31      20
125  2017-04-28      2      77314        8       2

To check which clients occur more than 5 times, I use this code:

print(zip_data_df['Customers'].value_counts()>5)

7415      True
4525      True
5525      True
77314     True
6525      True
4111      True
227       True
206      False
7507     False
7613     False
4108     False
3046     False
2605     False
4139     False
4119     False

I then expected the following to work:

print(zip_data_df[zip_data_df['Customers'].value_counts()>5])

I expected it to show me the rows of the DataFrame for clients that occur more than 5 times, but I got a Boolean indexing error. I now understand why: the value_counts() result only tells me, for each client number, whether it occurs more than 5 times, while the DataFrame has one row for every occurrence of a client number, so the two are not the same length. But how can I get a DataFrame that contains only the records of clients that occur more than 5 times?

I'm sure there is a simple answer that I am missing, but I appreciate any help you can give me.

+3


3 answers


So, the problem here is indexing: value_counts() returns a Series indexed by the values of 'Customers', while zip_data_df seems to be indexed on something else. You can do something like:

cust_counts = zip_data_df['Customers'].value_counts().rename('cust_counts')

zip_data_df = zip_data_df.merge(cust_counts.to_frame(),
                                left_on='Customers',
                                right_index=True)

From there, you can select conditionally from zip_data_df like this:

zip_data_df[zip_data_df.cust_counts > 5]
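
The same index-alignment idea can also be sketched without a merge, by mapping the counts back onto each row. This is only an illustration on made-up toy data: the "Visits" column and the threshold of 2 are assumptions, and only the 'Customers' column name comes from the question.

```python
import pandas as pd

# Toy data, made up for illustration; "Visits" is a hypothetical column.
df = pd.DataFrame({
    "Customers": [4525, 4525, 4525, 7415, 7415, 227],
    "Visits": [190, 205, 67, 38, 31, 17],
})

# value_counts() is indexed by customer number, not by the row labels of df.
counts = df["Customers"].value_counts()

# map() looks up each row's customer number in counts, which gives a
# Series with one entry per row of df, aligned with df's own index.
per_row_counts = df["Customers"].map(counts)

threshold = 2  # illustrative value; the question uses 5
result = df[per_row_counts > threshold]
print(result)
```

Because per_row_counts has the same index as df, the comparison produces a Boolean mask of the right length, which is what the original attempt was missing.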

+3


I believe what you are looking for is:



counts = zip_data_df['Customers'].value_counts()
counts[counts > 5]
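
Note that this expression yields the counts for the frequent clients, not the rows of zip_data_df. If the rows themselves are wanted, one possible route is groupby(...).filter(...). The sketch below uses made-up toy data; the "Sales" column and the threshold of 2 are assumptions for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration; "Sales" is a hypothetical column.
df = pd.DataFrame({
    "Customers": [7415, 4525, 7415, 227, 4525, 7415],
    "Sales": [38, 190, 31, 17, 205, 12],
})

threshold = 2  # illustrative value; the question uses 5

# filter() keeps all rows of every group for which the function returns True,
# so only customers with more than `threshold` rows survive.
frequent_rows = df.groupby("Customers").filter(lambda group: len(group) > threshold)
print(frequent_rows)
```

The surviving rows keep their original index labels and order, which may be convenient if the result is compared against the source DataFrame later.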

+1


I had a similar problem and solved it this way.

cust_counts = zip_data_df['Customers'].value_counts()
cust_list = cust_counts[cust_counts > 5].index.tolist()
zip_data_df = zip_data_df[zip_data_df['Customers'].isin(cust_list)]
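
For reuse, the three steps above could be wrapped in a small helper. This is a minimal sketch: the function name rows_with_frequent_values, the "Amount" column, and the toy data are all hypothetical and not part of the original question.

```python
import pandas as pd


def rows_with_frequent_values(df, column, min_count):
    """Return the rows whose value in `column` occurs more than `min_count` times."""
    counts = df[column].value_counts()
    # Index of the values that pass the threshold (here: customer numbers).
    frequent = counts[counts > min_count].index
    # isin() builds a Boolean mask with one entry per row of df.
    return df[df[column].isin(frequent)]


# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Customers": [1, 1, 1, 2, 2, 3],
    "Amount": [10, 20, 30, 40, 50, 60],
})

filtered = rows_with_frequent_values(toy, "Customers", 1)
print(filtered)
```

With the question's data, the call would presumably look like rows_with_frequent_values(zip_data_df, 'Customers', 5).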

0

