Count values in one column in Python / Pandas, but return values from multiple columns
In Python, I'm trying to execute value_counts on a Pandas column. I can get this to work, but I can't figure out how to get some of the other related columns.
Code:
import pandas as pd
myframe = pd.DataFrame({"Server":["Server_1","Server_1","Server_1","Server_1","Server_1","Server_2","Server_2","Server_2","Server_2","Server_3","Server_3","Server_3","Server_3","Server_3"],
"CVE_ID":["CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1111","CVE-2017-1112","CVE-2017-1113","CVE-2017-1114","CVE-2017-1113","CVE-2017-1114","CVE-2017-1115","CVE-2017-1116","CVE-2017-1117"],
"VulnName":["Java Update 1","Java Update 2","Java Update 3","Adobe 1","Chrome 1","Java Update 1","Java Update 2","Java Update 3","Adobe 1","Java Update 3","Adobe 1","Chrome 1","Chrome 2","Chrome 3"],
"ServerOwner":["Alice","Alice","Alice","Alice","Alice","Bob","Bob","Bob","Bob","Carol","Carol","Carol","Carol","Carol"]})
print("The dataframe: \n", myframe)
print("Top 10 offending CVEs, Vulnerability and Count: \n")
print(myframe['CVE_ID'].value_counts())
The last line displays 2 columns: one with the CVE and one with the number of times it occurs. But I want to print something like this, where it maintains the relationship between the CVE and the name of the vulnerability (see middle column):
Top 10 offending CVEs, Vulnerability and Count:
CVE-2017-1113 Java Update 1 3
CVE-2017-1114 Java Update 2 3
...etc...
How should I do it? Everything I do keeps throwing errors.
Edit: Changed so that the output has access to the column names (note the addition of .reset_index() in [1]; see sources 5 and 6).
[1] First use groupby on the columns CVE_ID, VulnName and ServerOwner with size, then unstack ServerOwner into columns:
counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner']).size().unstack(fill_value=0).reset_index()
ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol
0            CVE-2017-1111  Java Update 1      1    1      0
1            CVE-2017-1112  Java Update 2      1    1      0
2            CVE-2017-1113  Java Update 3      1    1      1
3            CVE-2017-1114        Adobe 1      1    1      1
4            CVE-2017-1115       Chrome 1      1    0      1
5            CVE-2017-1116       Chrome 2      0    0      1
6            CVE-2017-1117       Chrome 3      0    0      1
[2] Then sum across the Alice, Bob, and Carol columns to get:
counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)
ServerOwner         CVE_ID       VulnName  Alice  Bob  Carol  Count
0            CVE-2017-1111  Java Update 1      1    1      0      2
1            CVE-2017-1112  Java Update 2      1    1      0      2
2            CVE-2017-1113  Java Update 3      1    1      1      3
3            CVE-2017-1114        Adobe 1      1    1      1      3
4            CVE-2017-1115       Chrome 1      1    0      1      2
5            CVE-2017-1116       Chrome 2      0    0      1      1
6            CVE-2017-1117       Chrome 3      0    0      1      1
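As a quick sanity check on steps [1] and [2], the per-CVE row sums should match what value_counts reports for the same column. The sketch below uses a small stand-in frame (df, wide, row_sums and direct are illustrative names, not from the answer) and assumes a pandas version where unstack accepts fill_value.

```python
import pandas as pd

# Small stand-in for the question's frame (illustrative values only).
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1113", "CVE-2017-1111",
               "CVE-2017-1113", "CVE-2017-1113"],
    "ServerOwner": ["Alice", "Alice", "Bob", "Bob", "Carol"],
})

# One row per CVE, one column per owner, 0 where an owner has no such CVE.
wide = df.groupby(["CVE_ID", "ServerOwner"]).size().unstack(fill_value=0)

# Summing across the owner columns gives the total occurrences per CVE.
row_sums = wide.sum(axis=1)

# value_counts on the original column counts the same thing directly.
direct = df["CVE_ID"].value_counts()

print(row_sums.sort_index())
print(direct.sort_index())
```

Both printed Series should list the same totals per CVE once sorted by index.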
[3] Then remove the owner-name columns with df.drop:
counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)
ServerOwner         CVE_ID       VulnName  Count
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1
[4] Then use sort_values on the Count column:
counts.sort_values(by='Count', ascending=False, inplace=True)
ServerOwner         CVE_ID       VulnName  Count
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1
Combined:
counts = myframe.groupby(['CVE_ID','VulnName','ServerOwner']).size().unstack(fill_value=0).reset_index()
counts['Count'] = counts[['Alice','Bob','Carol']].sum(axis=1)
counts.drop(['Carol','Bob','Alice'],inplace=True,axis=1)
counts.sort_values(by='Count', ascending=False, inplace=True)
print("The dataframe: \n", myframe)
print("Top 10 offending CVEs, Vulnerability and Count: \n")
print(counts)
Top 10 offending CVEs, Vulnerability and Count:
ServerOwner         CVE_ID       VulnName  Count
2            CVE-2017-1113  Java Update 3      3
3            CVE-2017-1114        Adobe 1      3
0            CVE-2017-1111  Java Update 1      2
1            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1
Optionally, you can use reset_index() to reset the index at this point.
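If the per-owner breakdown is not needed, the four steps can probably be collapsed by grouping on CVE_ID and VulnName only. This is a sketch of that alternative, not the method used above; it assumes each CVE_ID always maps to the same VulnName, and df and short_counts are illustrative names.

```python
import pandas as pd

# Small stand-in for the question's frame (illustrative values only).
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1113", "CVE-2017-1113",
               "CVE-2017-1111", "CVE-2017-1113"],
    "VulnName": ["Java Update 1", "Java Update 3", "Java Update 3",
                 "Java Update 1", "Java Update 3"],
    "ServerOwner": ["Alice", "Alice", "Bob", "Bob", "Carol"],
})

# Count rows per (CVE_ID, VulnName) pair. reset_index(name=...) turns the
# resulting Series into a DataFrame with a named count column.
short_counts = (
    df.groupby(["CVE_ID", "VulnName"])
      .size()
      .reset_index(name="Count")
      .sort_values(by="Count", ascending=False)
      .reset_index(drop=True)
)

print(short_counts)
```

Appending .head(10) to the chain would limit the result to the ten most frequent CVEs.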
Edit: In response to a comment about the ServerOwner index label, you can reset the index, discard the old index, and name the new index:
counts.reset_index(drop=True, inplace = True)
counts.index.names = ['index']
gives:
ServerOwner         CVE_ID       VulnName  Count
index
0            CVE-2017-1113  Java Update 3      3
1            CVE-2017-1114        Adobe 1      3
2            CVE-2017-1111  Java Update 1      2
3            CVE-2017-1112  Java Update 2      2
4            CVE-2017-1115       Chrome 1      2
5            CVE-2017-1116       Chrome 2      1
6            CVE-2017-1117       Chrome 3      1
(The name ServerOwner remains as a leftover of the original groupby command, indicating which column was used.)
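If that leftover label is unwanted, one option is to clear the name of the column axis. The sketch below uses a small stand-in frame (df, wide and clean are illustrative names) and relies on rename_axis with axis=1.

```python
import pandas as pd

# Small stand-in frame (illustrative values only).
df = pd.DataFrame({
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1111", "CVE-2017-1112"],
    "ServerOwner": ["Alice", "Bob", "Alice"],
})

# After unstack, the column axis carries the name of the unstacked level.
wide = df.groupby(["CVE_ID", "ServerOwner"]).size().unstack(fill_value=0).reset_index()
print(wide.columns.name)

# rename_axis(None, axis=1) returns a copy whose column axis has no name,
# so the label no longer appears in the header row when printing.
clean = wide.rename_axis(None, axis=1)
print(clean)
```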
Sources for this answer:
[1] Group value for pandas dataframe- pandas
[2] Pandas: Sum DataFrame Rows for Column Data
[3] Remove column from pandas DataFrame
[4] python, sort a downstream frame with pandas
Use join to add the value_counts:
myframe.join(myframe['CVE_ID'].value_counts().rename('Count'), on='CVE_ID')
           CVE_ID    Server  ServerOwner       VulnName  Count
0   CVE-2017-1111  Server_1        Alice  Java Update 1      2
1   CVE-2017-1112  Server_1        Alice  Java Update 2      2
2   CVE-2017-1113  Server_1        Alice  Java Update 3      3
3   CVE-2017-1114  Server_1        Alice        Adobe 1      3
4   CVE-2017-1115  Server_1        Alice       Chrome 1      2
5   CVE-2017-1111  Server_2          Bob  Java Update 1      2
6   CVE-2017-1112  Server_2          Bob  Java Update 2      2
7   CVE-2017-1113  Server_2          Bob  Java Update 3      3
8   CVE-2017-1114  Server_2          Bob        Adobe 1      3
9   CVE-2017-1113  Server_3        Carol  Java Update 3      3
10  CVE-2017-1114  Server_3        Carol        Adobe 1      3
11  CVE-2017-1115  Server_3        Carol       Chrome 1      2
12  CVE-2017-1116  Server_3        Carol       Chrome 2      1
13  CVE-2017-1117  Server_3        Carol       Chrome 3      1
If you want to restrict it to the top n (2 shown in my example), use head and how='inner':
myframe.join(
    myframe['CVE_ID'].value_counts().head(2).rename('Count'),
    on='CVE_ID', how='inner')
           CVE_ID    Server  ServerOwner       VulnName  Count
2   CVE-2017-1113  Server_1        Alice  Java Update 3      3
7   CVE-2017-1113  Server_2          Bob  Java Update 3      3
9   CVE-2017-1113  Server_3        Carol  Java Update 3      3
3   CVE-2017-1114  Server_1        Alice        Adobe 1      3
8   CVE-2017-1114  Server_2          Bob        Adobe 1      3
10  CVE-2017-1114  Server_3        Carol        Adobe 1      3
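The join output above keeps one row per server, so each CVE appears several times. To get one line per CVE, as the question asks, a possible variation is to drop duplicate (CVE_ID, VulnName) pairs before joining. This is a sketch on a small stand-in frame; df, counts and summary are illustrative names, and it assumes each CVE_ID has a single VulnName.

```python
import pandas as pd

# Small stand-in for the question's frame (illustrative values only).
df = pd.DataFrame({
    "Server": ["Server_1", "Server_1", "Server_2", "Server_2", "Server_3"],
    "CVE_ID": ["CVE-2017-1111", "CVE-2017-1113", "CVE-2017-1111",
               "CVE-2017-1113", "CVE-2017-1113"],
    "VulnName": ["Java Update 1", "Java Update 3", "Java Update 1",
                 "Java Update 3", "Java Update 3"],
})

# Occurrences per CVE, as a Series named 'Count' indexed by CVE_ID.
counts = df["CVE_ID"].value_counts().rename("Count")

# Keep one row per (CVE_ID, VulnName) pair, then attach the count.
summary = (
    df[["CVE_ID", "VulnName"]]
      .drop_duplicates()
      .join(counts, on="CVE_ID")
      .sort_values(by="Count", ascending=False)
      .reset_index(drop=True)
)

print(summary)
```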