Sort entire csv by frequency of occurrence in one column

I have a large CSV file that is a caller data log.

Short file snippet:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User


I want to sort the entire list by customer frequency so that it looks like this:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User


I tried groupby, but that only gives the company name and its frequency, not the other columns. I also tried

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]


and

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]


But this gives me an error: ValueError: Wrong number of items passed 1, placement implies 24

I looked at something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)


but this only prints two columns, and I want to sort my entire CSV. The output should be all of my rows, sorted by the frequency of the values in the first column.

Thanks for the help in advance!

+3




3 answers


This is similar to what you want: add a count column by doing a groupby and transform with value_counts, and then sort by that column:

In [22]:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1




You can remove the extra column afterwards using df.drop:

In [24]:
df.drop('count', axis=1)

Out[24]:
  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User


+5




The top-voted answer needs one minor addition: sort is deprecated in favor of sort_values and sort_index. With sort_values it works like this:



    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
    df['count'] = df.groupby('a')['a'].transform(pd.Series.value_counts)
    df.sort_values('count', inplace=True, ascending=False)
    print('df sorted: \n{}'.format(df))


df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
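
Not part of the answer above, but a related note: groupby(...).transform('count') builds the same count column, and passing a stable sort algorithm to sort_values keeps rows within each group in their original order. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})

# 'count' as a named aggregation gives each row the size of its group,
# equivalent to the value_counts-based transform above
df['count'] = df.groupby('a')['a'].transform('count')

# mergesort is stable, so tied rows keep their original relative order
df = df.sort_values('count', ascending=False, kind='mergesort')
print(df)
```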


+3




I think there must be a better way to do this, but this should work:

Data preparation:

import io
import pandas as pd

data = """
CompanyName  HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+", engine="python")


And let's do the transformation:

# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())

# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")

# output the original data frame in the order of the new index.
df.reindex(new_index.index)


Output:

   CompanyName HighPriority QualityIssue
3    Customer3           No    Equipment
5    Customer3           No         User
6    Customer3          Yes         User
7    Customer3          Yes    Equipment
0    Customer1          Yes         User
1    Customer1          Yes         User
4    Customer1           No      Neither
8    Customer4           No         User
2    Customer2           No         User


What's going on here is probably unintuitive, but at the moment I can't think of a cleaner way to do it. I tried to comment the code as much as possible.

The tricky part is that the index of count_df consists of the (unique) customer names. So I am joining the index of count_df (left_index=True) against the CompanyName column of df (right_on="CompanyName").

The magic here is that count_df is already sorted by the number of occurrences, so no explicit sort is needed. All that remains is to reorder the rows of the original dataframe to match the rows of the merged dataframe, which gives the expected result.
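
The steps above can be put together as one runnable script (a sketch; the sep=r"\s+" and engine="python" arguments are my additions to parse the whitespace-separated snippet without parser warnings):

```python
import io
import pandas as pd

data = """
CompanyName  HighPriority  QualityIssue
Customer1    Yes           User
Customer1    Yes           User
Customer2    No            User
Customer3    No            Equipment
Customer1    No            Neither
Customer3    No            User
Customer3    Yes           User
Customer3    Yes           Equipment
Customer4    No            User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+", engine="python")

# value_counts() is already sorted by count, descending
count_df = pd.DataFrame(df.CompanyName.value_counts())

# joining count_df's index against df's CompanyName column yields a frame
# whose index is df's original row labels, reordered by descending count
new_index = count_df.merge(df[["CompanyName"]],
                           left_index=True, right_on="CompanyName")

# reorder the original rows to match the merged frame's index
sorted_df = df.reindex(new_index.index)
print(sorted_df)
```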

0



