Count values in Python / Pandas in one column, but return values in multiple columns

Question

In Python, I'm trying to execute value_counts on a Pandas column. I can get this to work, but I can't figure out how to get some of the other related columns.
Code:

import pandas as pd

myframe = pd.DataFrame({"Server":["Server_1","Server_1","Server_1","Server_1","Server_1","Server_2","Server_2","Server_2","Server_2","Server_3","Server_3","Server_3","Server_3","Server_3"], 
"CVE_ID":["CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1116","CVE-2017-1117"],
"VulnName":["Java Update 1","Java Update 2","Java Update 3","Adobe 1","Chrome 1","Java Update 1","Java Update 2","Java Update 3","Adobe 1","Java Update 3","Adobe 1","Chrome 1","Chrome 2","Chrome 3"],
"ServerOwner":["Alice","Alice","Alice","Alice","Alice","Bob","Bob","Bob","Bob","Carol","Carol","Carol","Carol","Carol"]})

print("The dataframe: \n", myframe)
print("Top 10 offending CVEs, Vulnerability and Count: \n")
print(myframe['CVE_ID'].value_counts())

The last line displays two columns: the CVE and the number of times it occurs. But I want to print something like this, where the relationship between the CVE and the name of the vulnerability is kept (see the middle column):

Top 10 offending CVEs, Vulnerability and Count:
CVE-2017-1113   Java Update 1     3
CVE-2017-1114   Java Update 2     3
...etc...

How should I do it? Everything I do keeps throwing errors.

+3

python pandas count value

user3688402 11 Apr 17 at 19:16

2 answers

Use join to add value_counts:

myframe.join(myframe['CVE_ID'].value_counts().rename('Count'), on='CVE_ID')

           CVE_ID    Server ServerOwner       VulnName  Count
0   CVE-2017-1111  Server_1       Alice  Java Update 1      2
1   CVE-2017-1112  Server_1       Alice  Java Update 2      2
2   CVE-2017-1113  Server_1       Alice  Java Update 3      3
3   CVE-2017-1114  Server_1       Alice        Adobe 1      3
4   CVE-2017-1115  Server_1       Alice       Chrome 1      2
5   CVE-2017-1111  Server_2         Bob  Java Update 1      2
6   CVE-2017-1112  Server_2         Bob  Java Update 2      2
7   CVE-2017-1113  Server_2         Bob  Java Update 3      3
8   CVE-2017-1114  Server_2         Bob        Adobe 1      3
9   CVE-2017-1113  Server_3       Carol  Java Update 3      3
10  CVE-2017-1114  Server_3       Carol        Adobe 1      3
11  CVE-2017-1115  Server_3       Carol       Chrome 1      2
12  CVE-2017-1116  Server_3       Carol       Chrome 2      1
13  CVE-2017-1117  Server_3       Carol       Chrome 3      1
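To make the mechanics of that join easier to see, here is a minimal sketch on a smaller frame. The data and the names df, cve_counts and with_counts are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Small stand-in frame (hypothetical data, same column names as the question).
df = pd.DataFrame({
    "Server": ["Server_1", "Server_1", "Server_2", "Server_3"],
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1113", "CVE-2017-1113", "CVE-2017-1113"],
    "VulnName": ["Java Update 1", "Java Update 3", "Java Update 3", "Java Update 3"],
})

# value_counts() returns a Series indexed by the distinct CVE_ID values.
cve_counts = df["CVE_ID"].value_counts().rename("Count")

# join(..., on="CVE_ID") looks up each row's CVE_ID in that index,
# so every original row keeps its other columns and gains a Count.
with_counts = df.join(cve_counts, on="CVE_ID")
print(with_counts)
```

The point is that the counts are attached to the original rows rather than replacing them, which is why VulnName and the other columns survive.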

If you want to restrict it to the top n (2 shown in my example), use head and how='inner':

myframe.join(
    myframe['CVE_ID'].value_counts().head(2).rename('Count'),
    on='CVE_ID', how='inner')

           CVE_ID    Server ServerOwner       VulnName  Count
2   CVE-2017-1113  Server_1       Alice  Java Update 3      3
7   CVE-2017-1113  Server_2         Bob  Java Update 3      3
9   CVE-2017-1113  Server_3       Carol  Java Update 3      3
3   CVE-2017-1114  Server_1       Alice        Adobe 1      3
8   CVE-2017-1114  Server_2         Bob        Adobe 1      3
10  CVE-2017-1114  Server_3       Carol        Adobe 1      3
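A small sketch of the same top-n idea with n as a variable, on made-up data (df, n, top_n and top_rows are names chosen here for illustration). It also sorts the result explicitly, on the assumption that you should not rely on the row order an inner join happens to produce.

```python
import pandas as pd

# Hypothetical data: CVE-2017-1113 appears 3 times, -1114 twice, -1116 once.
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1113", "CVE-2017-1114", "CVE-2017-1116",
               "CVE-2017-1113", "CVE-2017-1114", "CVE-2017-1113"],
    "Server": ["Server_1", "Server_1", "Server_1",
               "Server_2", "Server_2", "Server_3"],
})

n = 2  # how many of the most frequent CVE_ID values to keep
top_n = df["CVE_ID"].value_counts().head(n).rename("Count")

# how="inner" drops rows whose CVE_ID is not among the top n.
top_rows = df.join(top_n, on="CVE_ID", how="inner")

# Sort explicitly if the order of the rows matters.
top_rows = top_rows.sort_values(["Count", "CVE_ID"], ascending=[False, True])
print(top_rows)
```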

+1

piRSquared 11 Apr 17 at 23:15

Chuck · Accepted Answer · 2017-04-11T19:34:03+0000

Edit: Changed so that the output has access to the column names. (Note the addition of as_index=False and .reset_index() in [1]; see sources 5 and 6.)

[1] First groupby on CVE_ID (together with VulnName and ServerOwner) and use size:

counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner'], as_index=False).size().unstack(fill_value=0).reset_index()


ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol
0            CVE-2017-1111  Java Update 1      1    1      0
1            CVE-2017-1112  Java Update 2      1    1      0
2            CVE-2017-1113  Java Update 3      1    1      1
3            CVE-2017-1114        Adobe 1      1    1      1
4            CVE-2017-1115       Chrome 1      1    0      1
5            CVE-2017-1116       Chrome 2      0    0      1
6            CVE-2017-1117       Chrome 3      0    0      1
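One caveat, hedged because it depends on your pandas version: in more recent releases (1.1 onward, as far as I know), size() combined with as_index=False returns a DataFrame rather than a Series, so the unstack step may not produce the pivot shown above. A sketch that should behave the same on old and new versions is to leave out as_index=False. The data and the names df, sizes and pivot below are made up for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as the question.
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1111", "CVE-2017-1116"],
    "VulnName": ["Java Update 1", "Java Update 1", "Chrome 2"],
    "ServerOwner": ["Alice", "Bob", "Carol"],
})

# Without as_index=False, size() returns a Series whose index has
# three levels: CVE_ID, VulnName, ServerOwner.
sizes = df.groupby(["CVE_ID", "VulnName", "ServerOwner"]).size()

# unstack() moves the innermost level (ServerOwner) into columns;
# fill_value=0 fills owner/CVE combinations that never occur.
pivot = sizes.unstack(fill_value=0).reset_index()
print(pivot)
```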

[2] Then sum across the Alice, Bob, and Carol columns to get the total count:

counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)

ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol  Count
0            CVE-2017-1111  Java Update 1      1    1      0      2
1            CVE-2017-1112  Java Update 2      1    1      0      2
2            CVE-2017-1113  Java Update 3      1    1      1      3
3            CVE-2017-1114        Adobe 1      1    1      1      3
4            CVE-2017-1115       Chrome 1      1    0      1      2
5            CVE-2017-1116       Chrome 2      0    0      1      1
6            CVE-2017-1117       Chrome 3      0    0      1      1

[3] Then remove the name columns by calling df.drop on the names:

counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)

ServerOwner         CVE_ID       VulnName  Count
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

[4] Then use sort_values on the Count column:

counts.sort_values(by='Count', ascending=False, inplace=True)

ServerOwner         CVE_ID       VulnName  Count
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

Combined:

counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner'], as_index=False).size().unstack(fill_value=0).reset_index()
counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)
counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)
counts.sort_values(by='Count', ascending=False, inplace=True)

print("The dataframe: \n", myframe)
print("Top 10 offending CVEs, Vulnerability and Count: \n")
print(counts)

Top 10 offending CVEs, Vulnerability and Count: 

ServerOwner         CVE_ID       VulnName  Count
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

Optionally, you can use reset_index() to reset the index at this point.
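If only the per-CVE total is needed and not the per-owner breakdown, a shorter route might be to group on CVE_ID and VulnName alone. This is not the method used in this answer, just a sketch on made-up data that appears to produce the same kind of table; df and counts are names chosen for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as the question.
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1111", "CVE-2017-1113",
               "CVE-2017-1113", "CVE-2017-1113", "CVE-2017-1116"],
    "VulnName": ["Java Update 1", "Java Update 1", "Java Update 3",
                 "Java Update 3", "Java Update 3", "Chrome 2"],
    "ServerOwner": ["Alice", "Bob", "Alice", "Bob", "Carol", "Carol"],
})

counts = (
    df.groupby(["CVE_ID", "VulnName"])
      .size()                                  # rows per (CVE_ID, VulnName) pair
      .reset_index(name="Count")               # back to a flat DataFrame
      .sort_values("Count", ascending=False)   # most frequent first
      .reset_index(drop=True)                  # renumber rows 0..n-1
)
print(counts)
```

This assumes each CVE_ID always maps to the same VulnName; if it does not, each distinct pair is counted separately.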

Edit: In response to a comment about the index label ServerOwner: you can reset the index, discard the old index, and rename the new index:

counts.reset_index(drop=True, inplace = True)
counts.index.names = ['index']

gives:

ServerOwner         CVE_ID       VulnName  Count
index                                           
0            CVE-2017-1113  Java Update 3      3
1            CVE-2017-1114        Adobe 1      3
2            CVE-2017-1111  Java Update 1      2
3            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1

(The name ServerOwner remains as a leftover from the original groupby command, showing which column was used.)
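If that leftover label is unwanted, it can be cleared. As far as I can tell it is stored as the name of the columns axis, so setting it to None removes it from the printed frame. A minimal sketch on made-up data (df, pivot and label_before are names chosen for illustration):

```python
import pandas as pd

# Hypothetical data: one CVE seen on servers owned by Alice and Bob.
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1111"],
    "ServerOwner": ["Alice", "Bob"],
})

pivot = (
    df.groupby(["CVE_ID", "ServerOwner"])
      .size()
      .unstack(fill_value=0)
      .reset_index()
)

# The label printed in the top-left corner is the name of the columns axis.
label_before = pivot.columns.name
print(label_before)

# Clearing it removes the label from the printed frame.
pivot.columns.name = None
print(pivot)
```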

Sources for this answer:

[1] Group value for pandas dataframe- pandas

[2] Pandas: Sum DataFrame Rows for Column Data

[3] Remove column from pandas DataFrame

[4] python, sort a dataframe in descending order with pandas

[5] Convert pandas GroupBy object to DataFrame

[6] How to GroupBy Dataframe in pandas and save columns
