Sort entire csv by frequency of occurrence in one column
I have a large csv file that is a caller data log.
Short file snippet:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the entire list by customer frequency so that it looks like this:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I tried groupby, but that only prints the company name and its frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
But this gives me an error: ValueError: Wrong number of items passed 1, indices imply 24
I looked at something like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
but this only prints two columns, and I want to sort my entire CSV. My output should be the whole CSV, sorted by how often each value in the first column occurs.
Thanks for the help in advance!
This is similar to what you want: add a count column using groupby and transform with value_counts, and then you can sort by that column:
In [22]:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
You can remove the extraneous column using df.drop:
In [24]:
df.drop('count', axis=1)
Out[24]:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
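For reference, here is a minimal, self-contained sketch of the same idea run against the sample data from the question. It makes a few assumptions that are not part of the answer above: it uses sort_values rather than sort, transform("count") rather than pd.Series.value_counts, and kind="mergesort" so that rows with equal counts keep their original relative order, which is what the desired output in the question shows.

```python
import io

import pandas as pd

# Sample data copied from the question (whitespace-separated).
data = """CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Helper column: how many rows share each row's CompanyName.
df["count"] = df.groupby("CompanyName")["CompanyName"].transform("count")

# Stable sort on the helper column, then drop it again.
# mergesort keeps tied rows in their original relative order.
sorted_df = (
    df.sort_values("count", ascending=False, kind="mergesort")
      .drop(columns="count")
)
print(sorted_df)
```

To write the result back to disk, something like sorted_df.to_csv("sorted.csv", index=False) should work, where the file name is only a placeholder.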
The top-voted answer needs a minor addition: sort was deprecated (and later removed) in favor of sort_values and sort_index. sort_values works like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = \
    df.groupby('a')['a']\
    .transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted:
a b count
0 1 1 2
2 1 3 2
1 2 2 1
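If the helper column is not wanted at all, one possible variation is to pass the frequency lookup as the key argument of sort_values. This is only a sketch and assumes pandas 1.1 or newer, where the key parameter was introduced; kind="mergesort" is again an assumption made here to keep tied rows in their original order.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})

# The key function replaces each value of 'a' by how often it occurs,
# so the rows are ordered by frequency without adding a 'count' column.
result = df.sort_values(
    'a',
    key=lambda s: s.map(s.value_counts()),
    ascending=False,
    kind='mergesort',
)
print(result)
```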
I think there must be a better way to do this, but this should work:
Data preparation:
import io
import pandas as pd

data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
# r"\s+" splits on runs of whitespace; r"\s*" can also match the empty string
df = pd.read_table(io.StringIO(data), sep=r"\s+")
And let's do the transformation:
# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())
# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")
# output the original data frame in the order of the new index.
df.reindex(new_index.index)
Output:
CompanyName HighPriority QualityIssue
3 Customer3 No Equipment
5 Customer3 No User
6 Customer3 Yes User
7 Customer3 Yes Equipment
0 Customer1 Yes User
1 Customer1 Yes User
4 Customer1 No Neither
8 Customer4 No User
2 Customer2 No User
What happens here is probably not intuitive, but at the moment I can't think of a better way to do it. I tried to comment as much as possible.
The tricky part is that the index of count_df consists of the (unique) customer names. So I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic here is that count_df is already sorted by the number of occurrences, so we don't need an explicit sort. All we have to do is reorder the rows of the original data frame to match the rows of the merged data frame, and we get the expected result.
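The point that value_counts already returns the customers in descending order of frequency can also be shown without a merge. The following is only an illustrative sketch, not the method used above: it walks over the customers in that order and concatenates their rows. The data frame is rebuilt by hand from the question's sample so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3",
                    "Customer1", "Customer3", "Customer3", "Customer3",
                    "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# value_counts() lists the customers from most to least frequent.
order = df["CompanyName"].value_counts().index

# Collect each customer's rows in that order; no explicit sort call is needed.
# Boolean filtering keeps the original row order within each customer.
result = pd.concat([df[df["CompanyName"] == name] for name in order])
print(result)
```

This loops once per distinct customer, so for a log with very many distinct customers the merge or a helper count column is likely the better choice.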