Parsing time in nested pandas dataframe

Question

I am struggling to parse a date in Pandas. Here's my short example:

df.iloc[:10,10:]
Out[45]: 
                                 response_date         revision scheduleClosedAt  scheduleEventIndex scheduleId scheduleOpenedAt
0  {u'$date': u'2012-01-10T11:00:00.000+0000'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
1  {u'$date': u'2012-01-19T13:00:00.000+0000'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
2  {u'$date': u'2011-06-15T09:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
3  {u'$date': u'2011-06-22T00:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
4  {u'$date': u'2011-06-30T09:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
5  {u'$date': u'2011-07-05T00:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
6  {u'$date': u'2011-07-14T10:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
7  {u'$date': u'2011-07-20T09:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
8  {u'$date': u'2011-07-26T00:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN
9  {u'$date': u'2011-08-02T00:00:00.000+0100'}  {u'Measure': 1}              NaN                 NaN        NaN              NaN

I need to get rid of the nested column "response_date" and convert it to a normal datetime column, keeping the column name "response_date".

I tried:

>> df_respons = df.response_date.apply(pd.Series)
>> df_new_response = pd.to_datetime(df_respons)

but got the error:

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

Any neat way to handle nested datetime in nice columns?

EDIT

How do I ignore missing values?

43025    {u'$date': u'2015-11-18T10:35:00.000+0000'}
43026    {u'$date': u'2015-11-18T14:23:00.000+0000'}
43027    {u'$date': u'2015-11-18T14:23:00.000+0000'}
43028    {u'$date': u'2015-11-18T15:20:00.000+0000'}
43029    {u'$date': u'2015-11-18T15:20:00.000+0000'}
43030                                            NaN
43031                                            NaN
43032    {u'$date': u'2015-11-19T08:00:00.000+0000'}
43033    {u'$date': u'2015-11-19T08:00:00.000+0000'}
43034    {u'$date': u'2015-11-24T08:00:00.000+0000'}

which gives a new column "0":

        0                 response_date
43027 NaN  2015-11-18T14:23:00.000+0000
43028 NaN  2015-11-18T15:20:00.000+0000
43029 NaN  2015-11-18T15:20:00.000+0000
43030 NaN                           NaN
43031 NaN                           NaN
43032 NaN  2015-11-19T08:00:00.000+0000
43033 NaN  2015-11-19T08:00:00.000+0000
43034 NaN  2015-11-24T08:00:00.000+0000

python pandas dataframe

Arnold klein May 25 '17 at 18:02

3 answers

It sounds like you want something like df.apply(lambda row: pd.to_datetime(row['response_date']['$date']), axis=1):

In [41]: df
Out[41]:
                               response_date
0  {'$date': '2011-06-15T09:00:00.000+0100'}

In [42]: df['response_date'] = df.apply(lambda row: pd.to_datetime(row['response_date']['$date']), axis=1)

In [43]: df
Out[43]:
        response_date
0 2011-06-15 08:00:00
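
As a side note on why the attempt in the question raised the ValueError: df.response_date.apply(pd.Series) returns a DataFrame with a single '$date' column, and pd.to_datetime treats a DataFrame as a set of year/month/day columns to assemble. A minimal sketch of selecting the '$date' column before converting is below. The two-row frame is made up for illustration, and utc=True is an assumption added here so the result is the same timezone-aware dtype across pandas versions.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the nested column from the question.
df = pd.DataFrame({"response_date": [
    {"$date": "2012-01-10T11:00:00.000+0000"},
    {"$date": "2012-01-19T13:00:00.000+0000"},
]})

# Expanding the dicts yields a DataFrame whose only column is '$date'.
expanded = df["response_date"].apply(pd.Series)

# Passing the whole DataFrame asks pandas to assemble dates from
# year/month/day columns, which do not exist here.
error_message = None
try:
    pd.to_datetime(expanded)
except ValueError as exc:
    error_message = str(exc)

# Converting the single '$date' column works.
df["response_date"] = pd.to_datetime(expanded["$date"], utc=True)
print(df)
```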

fuglede May 25 '17 at 18:08

Try the following:

In [70]: pd.to_datetime(
             df.response_date.map(lambda x: 
                                  x['$date'] if isinstance(x, dict) and '$date' in x
                                             else x),
             errors='coerce')
Out[70]:
0   2012-01-10 11:00:00
1   2012-01-19 13:00:00
2   2011-06-15 08:00:00
3   2011-06-21 23:00:00
4   2011-06-30 08:00:00
5                   NaT
6                   NaT
7   2011-07-20 08:00:00
8   2011-07-25 23:00:00
9   2011-08-01 23:00:00
Name: response_date, dtype: datetime64[ns]
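
The sample data mixes +0000 and +0100 offsets. Newer pandas versions warn about, or reject, parsing mixed offsets into one datetime column unless utc=True is passed, so a version-tolerant variant of the snippet above might look like the following sketch. The three-row series is made up for illustration, and utc=True is an addition that is not part of the answer above; it yields timezone-aware UTC values rather than the naive ones shown in the output.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two different UTC offsets plus a missing value.
s = pd.Series(
    [
        {"$date": "2012-01-10T11:00:00.000+0000"},
        {"$date": "2011-06-15T09:00:00.000+0100"},
        np.nan,
    ],
    name="response_date",
)

# Pull the string out of each dict; leave anything else (e.g. NaN) untouched.
raw = s.map(lambda x: x["$date"] if isinstance(x, dict) and "$date" in x else x)

# utc=True converts every offset to UTC; errors='coerce' turns bad values into NaT.
parsed = pd.to_datetime(raw, utc=True, errors="coerce")
print(parsed)
```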

MaxU May 25 '17 at 18:36

jezrael · Accepted Answer · 2017-05-25T18:48:36+0000

You can use combine_first or fillna to replace NaN with an empty dict, then convert the column to a NumPy array with values and to a list with tolist, and pass that list to the DataFrame constructor:

d = {'$date': 'response_date'}
# one empty dict per row (current pandas does not broadcast a length-1 list)
s = pd.Series([{} for _ in df.index], index=df.index)
df = pd.DataFrame(df['0'].combine_first(s).values.tolist()).rename(columns=d)
# alternatively
# df = pd.DataFrame(df['0'].fillna(s).values.tolist()).rename(columns=d)
df['response_date'] = pd.to_datetime(df['response_date'])
print(df)
        response_date
0 2015-11-18 10:35:00
1 2015-11-18 14:23:00
2 2015-11-18 14:23:00
3 2015-11-18 15:20:00
4 2015-11-18 15:20:00
5                 NaT
6                 NaT
7 2015-11-19 08:00:00
8 2015-11-19 08:00:00
9 2015-11-24 08:00:00
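
Note that the output above has a fresh 0..9 index, because the DataFrame constructor is given a plain list. If the original row labels matter, one possible variant (a sketch, not the answerer's code) passes the original index to the constructor. The three-row series and the names records, expanded, and result are made up for illustration, and utc=True is an assumption added for consistent behaviour across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the original row labels and one missing value.
s = pd.Series(
    [
        {"$date": "2015-11-18T15:20:00.000+0000"},
        np.nan,
        {"$date": "2015-11-19T08:00:00.000+0000"},
    ],
    index=[43029, 43030, 43032],
    name="response_date",
)

# Replace every non-dict entry (NaN) with an empty dict.
records = [x if isinstance(x, dict) else {} for x in s]

# Building the frame with the original index keeps the row labels.
expanded = pd.DataFrame(records, index=s.index)

# Rows built from an empty dict have NaN in '$date' and become NaT.
result = pd.to_datetime(expanded["$date"], utc=True).rename("response_date")
print(result)
```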

Another solution with map:

df['response_date'] = pd.to_datetime(
    df['response_date'].map(lambda x: x['$date'] if isinstance(x, dict) else x))
print (df)
            response_date
43025 2015-11-18 10:35:00
43026 2015-11-18 14:23:00
43027 2015-11-18 14:23:00
43028 2015-11-18 15:20:00
43029 2015-11-18 15:20:00
43030                 NaT
43031                 NaT
43032 2015-11-19 08:00:00
43033 2015-11-19 08:00:00
43034 2015-11-24 08:00:00
