How to fill in missing values based on a column in pandas?
I have this DataFrame in pandas:
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60]
})
How can I fill it with missing values so that every value of column t has the same entries in column n? That is, each value of t must contain entries for a, b, c, x, written as NaN if they are missing:
n t v
a 0 10
b 0 20
c 0 30
x NaN NaN
a 1 40
b 1 50
c NaN NaN
x 1 60
From what I understand, you want every value in "n" to appear in each of the subgroups grouped by "t". I also assume that the values of "n" are not duplicated within these subgroups.
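The no-duplicates assumption can be checked directly before reshaping. This is a minimal sketch on the question's data; the variable name has_duplicates is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# True if any ("t", "n") pair occurs more than once.
has_duplicates = df.duplicated(subset=["t", "n"]).any()
print(has_duplicates)
```

If this prints True, an aggregation such as "first" decides which of the duplicated values is kept.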
Given that these assumptions are correct, pd.pivot_table seems like a good option for this use case. Here, the values under "n" become the column names, "t" becomes the grouped index, and the DF is populated with the values under "v". Then stack the DF, keeping the NaN entries, and set the corresponding cells in "t" to NaN with the .loc accessor.
df1 = pd.pivot_table(df, "v", "t", "n", "first").stack(dropna=False).reset_index(name="v")
df1.loc[df1['v'].isnull(), "t"] = np.nan
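The same wide-then-long idea can also be sketched with pivot and melt instead of pivot_table and stack. This is a variant, not the code above; it assumes each ("t", "n") pair is unique, and the names wide, long_df and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# Wide layout: one row per "t", one column per "n"; absent pairs become NaN.
wide = df.pivot(index="t", columns="n", values="v")

# Back to long layout; melt keeps the NaN cells.
long_df = (
    wide.reset_index()
        .melt(id_vars=["t"], var_name="n", value_name="v")
        .sort_values(["t", "n"], ignore_index=True)
)

# Blank out "t" wherever "v" is missing; mask upcasts the column to float.
long_df["t"] = long_df["t"].mask(long_df["v"].isna())
result = long_df[["n", "t", "v"]]
print(result)
```

Using mask rather than assigning NaN through .loc sidesteps writing a float into an integer column in place.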
plan
- get the unique values of column 'n'; we will use them for reindex
- apply f to each group of column 't', reindexing with idx to ensure that all items of idx are represented for each unique 't'
- set the index to 'n' first so that we can reindex group by group
idx = df.n.unique()
f = lambda x: x.reindex(idx)
df.set_index('n').groupby('t', group_keys=False).apply(f).reset_index()
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 60.0
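The plan above can also be written as an explicit loop over the groups, which makes each step visible. This is a sketch of the same per-group reindex, not a replacement for the one-liner; the names parts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# All values of "n" that every group should contain.
idx = pd.Index(df["n"].unique(), name="n")

parts = []
for _, group in df.groupby("t"):
    # Index the group by "n", then reindex so missing labels become NaN rows.
    parts.append(group.set_index("n").reindex(idx))

result = pd.concat(parts).reset_index()
print(result)
```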
If df contains no NaN values beforehand, you can create a MultiIndex and then reindex. Afterwards, set t to NaN wherever column v is NaN:
cols = ["n", "t"]
df1 = df.set_index(cols)
mux = pd.MultiIndex.from_product(df1.index.levels, names=cols)
df1 = df1.reindex(mux).sort_index(level=[1,0]).reset_index()
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 60.0
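To see what the reindex adds, the full product index can be compared with the pairs that already exist. This is a small sketch on the question's data; the names existing and missing are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

cols = ["n", "t"]
existing = df.set_index(cols).index

# Every combination of the distinct "n" and "t" values.
mux = pd.MultiIndex.from_product(existing.levels, names=cols)

# Pairs present in the product but absent from the data.
missing = mux.difference(existing)
print(len(mux))
print(list(missing))
```

The pairs listed in missing are exactly the rows that reindex fills with NaN.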
Another solution is to add the NaN rows with unstack and stack:
cols = ["n", "t"]
df1 = df.set_index(cols)['v'].unstack().stack(dropna=False)
df1 = df1.sort_index(level=[1,0]).reset_index(name='v')
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 60.0
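But if some values are already NaN, masking t by v no longer works. In that case you need groupby with per-group label selection (loc, or reindex in current pandas) on the unique values of column n: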
df = pd.DataFrame({"n": ["a", "b", "c", "a", "b", "x"],
"t": [0, 0, 0, 1, 1, 1],
"v": [10,20,30,40,50,np.nan]})
print (df)
n t v
0 a 0 10.0
1 b 0 20.0
2 c 0 30.0
3 a 1 40.0
4 b 1 50.0
5 x 1 NaN
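# Parentheses allow the method chain to span several lines.
# .reindex is used because .loc raises KeyError for missing labels in pandas >= 1.0.
df1 = (df.set_index('n')
         .groupby('t', group_keys=False)
         .apply(lambda x: x.reindex(df.n.unique()))
         .reset_index())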
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 NaN
# Same idea, setting the index inside each group instead.
df1 = (df.groupby('t', group_keys=False)
         .apply(lambda x: x.set_index('n').reindex(df.n.unique()))
         .reset_index())
print (df1)
n t v
0 a 0.0 10.0
1 b 0.0 20.0
2 c 0.0 30.0
3 x NaN NaN
4 a 1.0 40.0
5 b 1.0 50.0
6 c NaN NaN
7 x 1.0 NaN
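When v already contains NaN, another way to tell genuine rows from added ones is a left merge against the full grid with indicator=True. This is a hedged sketch of an alternative technique, not the groupby code above; the names grid, merged and result are illustrative, and _merge is the column pandas adds for indicator=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, np.nan],
})

# Full grid of every "t" with every "n", as a plain DataFrame.
grid = pd.MultiIndex.from_product(
    [df["t"].unique(), df["n"].unique()], names=["t", "n"]
).to_frame(index=False)

# "_merge" is "left_only" for grid rows that have no match in df.
merged = grid.merge(df, on=["t", "n"], how="left", indicator=True)

# Blank out "t" only for rows added by the grid, not for genuine NaN in "v".
merged["t"] = merged["t"].mask(merged["_merge"] == "left_only")
result = merged[["n", "t", "v"]]
print(result)
```

Here the row for x with t equal to 1 keeps its t value even though its v is NaN, matching the output shown above.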