Pandas: Synchronizing dates in different columns with read_csv

Question

I have an ascii file where dates are formatted like this:

Jan 20 2015 00:00:00.000
Jan 20 2015 00:10:00.000
Jan 20 2015 00:20:00.000
Jan 20 2015 00:30:00.000
Jan 20 2015 00:40:00.000

When I load the file into pandas, each whitespace-separated field above gets its own column in the dataframe. I've tried the following:

from pandas import read_csv
from datetime import datetime

df = read_csv('file.txt', header=None, delim_whitespace=True,
              parse_dates={'datetime': [0, 1, 2, 3]},
              date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H %M %S'))

I am getting a couple of errors:

TypeError: <lambda>() takes 1 positional argument but 4 were given
ValueError: time data 'Jun 29 2017 00:35:00.000' does not match format '%b %d %Y %H %M %S'

I am confused because:

I am passing a dict for parse_dates to parse different columns as one date.
I use: %b - abbreviated month name, %d - day of month, %Y - year with century, %H - 24 hour, %M - minute and %S - second

Does anyone see what I am doing wrong?

Edit:

I have tried date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H:%M:%S'), which returns ValueError: unconverted data remains: .000
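For reference, the two ValueErrors can be reproduced without pandas. The sketch below is only an illustration: the helper name try_format is made up here, and the sample string assumes pandas hands the parser the four fields joined by single spaces, which is what the error message in the question suggests. The TypeError is a separate matter: according to the pandas documentation for date_parser, pandas first tries calling the function with one array per column (four arguments here) before falling back to a single concatenated string, so a one-argument lambda fails on that first attempt.

```python
from datetime import datetime

# One line from the question, with the four fields joined by single spaces
# (assumed to match what pandas passes to the parser after concatenation).
sample = "Jan 20 2015 00:00:00.000"


def try_format(fmt):
    """Hypothetical helper: return the parsed datetime, or the error text."""
    try:
        return datetime.strptime(sample, fmt)
    except ValueError as exc:
        return str(exc)


# Spaces in the format cannot match the ':' separators in the time field.
print(try_format("%b %d %Y %H %M %S"))

# The colons now match, but '.000' is left over after %S.
print(try_format("%b %d %Y %H:%M:%S"))

# %f consumes the fractional seconds, so the whole string is parsed.
print(try_format("%b %d %Y %H:%M:%S.%f"))
```

The format has to account for every character of the input, including the separators and the fractional seconds.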

Edit 2:

I tried what @MaxU suggested in his update, but it was problematic because my original data is formatted like this:

Jan   1  2017  00:00:00.000   123 456 789 111 222 333

I'm only interested in the first 7 columns, so I import a file with the following:

df = read_csv(fn, header=None, delim_whitespace=True, usecols=[0, 1, 2, 3, 4, 5, 6])

Then, to create a column with date and time information from the first 4 columns, I try:

df['datetime'] = to_datetime(df.ix[:, :3], format='%b %d %Y %H:%M:%S.%f')

However, this doesn't work because to_datetime expects "integer, float, string, datetime, list, tuple, 1-d array, Series" as the first argument, and df.ix[:, :3] returns a dataframe with the following format:

         0   1     2             3
0      Jan   1  2017  00:00:00.000

How can I feed each row of the first four columns to to_datetime as a single value, to get one column of datetimes?
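One possible way to do this, sketched under assumptions rather than taken from the answers below: join the four columns row by row into a single string Series and pass that Series to to_datetime. The second data row and the in-memory sample file are invented for illustration; only the first row comes from the question.

```python
import io

import pandas as pd

# In-memory stand-in for the real file; the second row is made up.
sample = io.StringIO(
    "Jan   1  2017  00:00:00.000   123 456 789 111 222 333\n"
    "Jan   1  2017  00:10:00.000   124 457 790 112 223 334\n"
)

# Keep only the first 7 whitespace-separated columns, as in the question.
df = pd.read_csv(sample, header=None, sep=r"\s+", usecols=list(range(7)))

# Cast the four date/time columns to strings and join them row by row,
# giving a single Series such as "Jan 1 2017 00:00:00.000".
parts = df[[0, 1, 2, 3]].astype(str)
joined = parts[0] + " " + parts[1] + " " + parts[2] + " " + parts[3]

# A Series of strings is an accepted input for to_datetime.
df["datetime"] = pd.to_datetime(joined, format="%b %d %Y %H:%M:%S.%f")
print(df[["datetime", 4, 5, 6]])
```

Casting to str first matters because pandas reads the day and year columns as integers, and integers cannot be concatenated with the month name directly.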

Edit 3:

I think I solved the second problem. I just use the following command and do everything when I read my file (I was just missing %f to parse the fractional seconds):

df = read_csv(fileName, header=None, delim_whitespace=True,
              parse_dates={'datetime': [0, 1, 2, 3]},
              date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H:%M:%S.%f'),
              usecols=[0, 1, 2, 3, 4, 5, 6])

The whole reason I wanted to parse it manually, instead of letting pandas handle it as @MaxU suggested, was to check whether giving the format explicitly would be faster, and it is! From my tests, the snippet above is about 5-6 times faster than letting pandas work out the parsing for you.

+3

python pandas datetime parsing dataframe

Arda arslan 13 jul. 17 at 20:43

2 answers

Take this easier approach:

import pandas as pd

# names= also tells pandas there is no header row, so the first line is kept as data
df = pd.read_csv('file.txt', names=['date'])

df should now be a single-column dataframe. After that, try casting this column to datetime:

df['date'] = pd.to_datetime(df['date'])

+3

Diego aguado 13 jul. 17 at 20:47

MaxU · Accepted Answer · 2017-07-13T20:49:09+0000

Pandas (tested with version 0.20.1) is smart enough to do this for you:

In [4]: pd.read_csv(fn, sep='\s+', parse_dates={'datetime': [0, 1, 2, 3]})
Out[4]:
             datetime
0 2015-01-20 00:10:00
1 2015-01-20 00:20:00
2 2015-01-20 00:30:00
3 2015-01-20 00:40:00

UPDATE: if all records are in the same format, you can try doing it like this:

df = pd.read_csv(fn, sep='~', names=['datetime'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%b %d %Y %H:%M:%S.%f')
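A self-contained version of the update above, as a sketch: the file is replaced by an in-memory buffer holding the five sample lines from the question, and '~' is assumed not to occur anywhere in the data, so each line is read as one field. Because names is passed, pandas treats the file as having no header row; Out[4] above starts at 00:10:00, which suggests the first line was consumed as a header in that call.

```python
import io

import pandas as pd

# The five sample lines from the question, standing in for the real file.
raw = io.StringIO(
    "Jan 20 2015 00:00:00.000\n"
    "Jan 20 2015 00:10:00.000\n"
    "Jan 20 2015 00:20:00.000\n"
    "Jan 20 2015 00:30:00.000\n"
    "Jan 20 2015 00:40:00.000\n"
)

# '~' never appears in the data (an assumption), so every line becomes a
# single string column; names= means the first line is kept as data.
df = pd.read_csv(raw, sep="~", names=["datetime"])

# An explicit format avoids per-value format inference.
df["datetime"] = pd.to_datetime(df["datetime"], format="%b %d %Y %H:%M:%S.%f")
print(df)
```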
