How to fill in missing values based on a column in pandas?

Question

I have this DataFrame in pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60]
})

How can I fill it with missing values so that every value of column t has the same entries in column n? That is, each value of t must contain entries for a, b, c and x, written as NaN if they are missing:

   n  t   v
   a  0  10
   b  0  20
   c  0  30
   x  NaN NaN
   a  1  40
   b  1  50
   c  NaN NaN
   x  1  60

+3

python numpy pandas

jll 24 Mar 17 at 2:59 am

4 answers

Plan

Get the unique values of column 'n'. We will use them for reindex.
Apply f to each group of column 't', reindexing with idx, which ensures that every item in idx is represented in each group of unique 't'.
Set 'n' as the index first so that we can reindex.

In code:

idx = df.n.unique()
f = lambda x: x.reindex(idx)
df.set_index('n').groupby('t', group_keys=False).apply(f).reset_index()

   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
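
The same plan can be sketched with an explicit loop over the groups instead of groupby(...).apply, which may be easier to follow and does not depend on how apply treats the grouping column in a given pandas version. This is only a sketch under that assumption; the names pieces and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# All labels that every 't' group should contain, kept in order of appearance.
idx = pd.Index(df["n"].unique(), name="n")

pieces = []  # illustrative name: one reindexed frame per 't' group
for t_value, group in df.groupby("t"):
    # Labels missing from this group become rows filled with NaN.
    pieces.append(group.set_index("n").reindex(idx))

result = pd.concat(pieces).reset_index()
print(result)
```

Rows that had to be added carry NaN in both t and v, as in the output above.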

+2

piRSquared 24 Mar 17 at 5:38 am

If df contains no NaN values beforehand, you can create a MultiIndex and then reindex. NaN in column t is then set wherever column v is NaN:

cols = ["n", "t"]
df1 = df.set_index(cols)
mux = pd.MultiIndex.from_product(df1.index.levels, names=cols)
df1 = df1.reindex(mux).sort_index(level=[1,0]).reset_index()
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
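
To make the last step concrete, here is a minimal sketch of what Series.mask does in isolation: it replaces values with NaN wherever the condition is True. The two small Series are made-up illustration data, not taken from the question.

```python
import pandas as pd

# Made-up data: 't' is fully populated, 'v' has two gaps.
t = pd.Series([0, 0, 1, 1])
v = pd.Series([10.0, None, 40.0, None])

# Wherever v is missing, the matching entry of t is replaced by NaN.
masked = t.mask(v.isnull())
print(masked)
```

Because NaN cannot be stored in an integer column, the masked result is converted to float, which is why t appears as 0.0 and 1.0 in the output above.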

Another solution is to add the NaN rows with unstack and stack:

cols = ["n", "t"]
df1 = df.set_index(cols)['v'].unstack().stack(dropna=False)
df1 = df1.sort_index(level=[1,0]).reset_index(name='v')
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
    n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
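
A variant that avoids stack altogether is sketched below. It may be useful if stack(dropna=False) behaves differently in your pandas version (an assumption; check the documentation for your version). It relies on DataFrame.unstack returning a Series that keeps the NaN cells when the frame has a single-level index. The names wide and long are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# Wide form: one row per 'n', one column per 't'; absent pairs become NaN.
wide = df.set_index(["n", "t"])["v"].unstack("t")

# Back to long form; the result is indexed by (t, n) and keeps the NaN cells.
long = wide.unstack()

df1 = long.reset_index(name="v")
# Blank out 't' on the rows that were added.
df1["t"] = df1["t"].mask(df1["v"].isnull())
df1 = df1[["n", "t", "v"]]
print(df1)
```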

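But if some values are already NaN, masking by column v no longer works. In that case use groupby and look up the unique values of column n in each group (with loc, or with reindex on pandas versions where loc raises KeyError for missing labels):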

df = pd.DataFrame({"n": ["a", "b", "c", "a", "b", "x"], 
                       "t": [0, 0, 0, 1, 1, 1], 
                       "v": [10,20,30,40,50,np.nan]})
print (df)
   n  t     v
0  a  0  10.0
1  b  0  20.0
2  c  0  30.0
3  a  1  40.0
4  b  1  50.0
5  x  1   NaN

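# Parentheses let the method chain span several lines.
# reindex is used because .loc raises KeyError for missing labels in pandas >= 1.0.
df1 = (df.set_index('n')
         .groupby('t', group_keys=False)
         .apply(lambda x: x.reindex(df.n.unique()))
         .reset_index())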

print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0   NaN

# Same idea, but the index is set inside each group.
df1 = (df.groupby('t', group_keys=False)
         .apply(lambda x: x.set_index('n').reindex(df.n.unique()))
         .reset_index())
print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0   NaN
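
When v already contains NaN, another way to tell original rows from added ones is a left merge against the full grid with indicator=True. This is a different technique from the groupby approach above, shown only as a hedged sketch; the names grid, merged and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, np.nan],
})

# Every combination of 't' and 'n' that should appear in the result.
grid = pd.MultiIndex.from_product(
    [df["t"].unique(), df["n"].unique()], names=["t", "n"]
).to_frame(index=False)

# The '_merge' column marks rows that exist only in the grid as 'left_only'.
merged = grid.merge(df, on=["t", "n"], how="left", indicator=True)

# Blank out 't' only on the added rows, not on original rows whose 'v' is NaN.
merged["t"] = merged["t"].mask(merged["_merge"] == "left_only")
result = merged[["n", "t", "v"]]
print(result)
```

The row for x under t = 1 keeps its t value even though its v is NaN, because it was present in the original data.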

+1

jezrael 24 Mar 17 at 5:36 am

It looks like you are mistaken. Usually NaNs are read in automatically or you supply them yourself. You can put in a NaN manually with np.nan if you have import numpy as np at the top. Alternatively, older pandas versions expose numpy internally, so you can get NaN via pandas.np.nan (the pandas.np alias was removed in pandas 2.0).

0

Charlie 24 Mar 17 at 3:07

Nickil maveli · Accepted Answer · 2017-03-24T07:14:11+0000

From what I understand, you want each value in "n" to be evenly distributed among the subgroups grouped by "t". I am also assuming that these "n" values cannot be duplicated within those subgroups.
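
The no-duplicates assumption can be checked before pivoting. The following is a minimal sketch using the question's data; the names has_duplicates and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# True if any (t, n) pair occurs more than once.
has_duplicates = df.duplicated(subset=["t", "n"]).any()

# Number of rows per (t, n) pair, useful for locating offending pairs.
counts = df.groupby(["t", "n"]).size()

print(has_duplicates)
print(counts[counts > 1])
```

If duplicates were present, an aggregation such as "first" would silently keep only one value per pair.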

Given that these assumptions are correct, pd.pivot_table seems like a good option for this use case. Here, the values under "n" become the column names, "t" becomes the grouped index, and the contents of the DF are populated with the values under "v". Then stack the DF, keeping the NaN entries, and set the corresponding cells in "t" to NaN with the .loc accessor.

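# Pivot so "t" is the index and "n" the columns, then stack back while keeping the NaN cells.
df1 = pd.pivot_table(df, values="v", index="t", columns="n", aggfunc="first").stack(dropna=False).reset_index(name="v")
# Blank out "t" wherever "v" is missing.
df1.loc[df1['v'].isnull(), "t"] = np.nan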
