Number of null / NaN values in data frame by columns
I am trying to count the number of unique values for each row in the columns of a dataframe.
More context in my previous post and my answer
Here is the current dataframe:
[in] df
[out]
PID CID PPID PPPID PPPPID PPPPPID
0 2015-01-02 456 2014-01-02 2014-01-02 2014-01-02 2014-01-02
1 2015-02-02 500 2014-02-02 2013-02-02 2012-02-02 2012-02-10
2 2010-12-04 300 2010-12-04 2010-12-04 2010-12-04 2010-12-04
All columns, except for CID (contract_ID), are datetimes. I would like to add another column to the dataframe that counts the number of unique datetimes in each row (in order to figure out how many contracts there are in the "chain").
I have tried different implementations of the methods .count()
and .sum()
, but cannot get them to work on a line basis (the output is all lines with the same value).
Example:
df_merged['COUNT'] = df_merged2.count(axis=1)
Fills the entire "COUNT" column with "6" when I would like to be different for each row.
Removing the argument axis=1
makes the entire column "NaN"
source to share
You need apply(your_func, axis=1)
to work on a line basis.
df
Out[19]:
PID CID PPID PPPID PPPPID PPPPPID
0 2015-01-02 456 2014-01-02 2014-01-02 2014-01-02 2014-01-02
1 2015-02-02 500 2014-02-02 2013-02-02 2012-02-02 2012-02-10
2 2010-12-04 300 2010-12-04 2010-12-04 2010-12-04 2010-12-04
df['counts'] = df.drop('CID', axis=1).apply(lambda row: len(pd.unique(row)), axis=1)
Out[20]:
PID CID PPID PPPID PPPPID PPPPPID counts
0 2015-01-02 456 2014-01-02 2014-01-02 2014-01-02 2014-01-02 2
1 2015-02-02 500 2014-02-02 2013-02-02 2012-02-02 2012-02-10 5
2 2010-12-04 300 2010-12-04 2010-12-04 2010-12-04 2010-12-04 1
[3 rows x 7 columns]
source to share
Another way would be to call unique
on the transpose of your df:
In [26]:
df['counts'] = df.drop('CID', axis=1).T.apply(lambda x: len(pd.Series.unique(x)))
df
Out[26]:
PID CID PPID PPPID PPPPID PPPPPID counts
0 2015-01-02 456 2014-01-02 2014-01-02 2014-01-02 2014-01-02 2
1 2015-02-02 500 2014-02-02 2013-02-02 2012-02-02 2012-02-10 5
2 2010-12-04 300 2010-12-04 2010-12-04 2010-12-04 2010-12-04 1
source to share
You can directly use nunique
in DataFrame
. This is from pd.__version__ == u'0.20.0'
forward.
In [169]: df['counts'] = df.drop('CID', axis=1).nunique(axis=1)
In [170]: df
Out[170]:
PID CID PPID PPPID PPPPID PPPPPID counts
0 2015-01-02 456 2014-01-02 2014-01-02 2014-01-02 2014-01-02 2
1 2015-02-02 500 2014-02-02 2013-02-02 2012-02-02 2012-02-10 5
2 2010-12-04 300 2010-12-04 2010-12-04 2010-12-04 2010-12-04 1
source to share