Strip the time from an object-dtype date column in pandas
I am having problems with some dates from zipped xlsx files. These files are loaded into an sqlite database and then exported as .csv. Each file is about 40,000 lines a day. The problem I'm running into is that pd.to_datetime doesn't seem to work on these objects (I guess the Excel date format causes the problem, since pure .csv files work fine with this command). This is actually OK, as I don't need them to be in datetime format.
I am trying to create a ShortDate column formatted as %m/%d/%Y. How do I do that from a date column stored as object (format mm/dd/yyyy hh:mm:ss, from Excel)? Next, I'll create a new column named RosterID that concatenates the EmployeeID and ShortDate fields, along with a unique ID.
I am very new to pandas and currently I only use it to process .csv files (renaming and selecting specific columns, creating unique ids for use in filters in Tableau, etc.).
rep = pd.read_csv(r'C:\Users\Desktop\test.csv.gz', dtype = 'str', compression = 'gzip', usecols = ['etc','etc2'])
print('Read successfully.')
rep['Total']=1
rep['UniqueID']= rep['EmployeeID'] + rep['InteractionID']
rep['ShortDate'] = ??? #what do I do here to get what I am looking for?
rep['RosterID']= rep['EmployeeID'] + rep['ShortDate'] # this is my goal
print('Modified successfully.')
Here is some of the raw data from the .csv. Column names would be
InteractionID, Created Date, EmployeeID, Repeat Date
07927,04/01/2014 14:05:10,912a,04/01/2014 14:50:03
02158,04/01/2014 13:44:05,172r,04/04/2014 17:47:29
44279,04/01/2014 17:28:36,217y,04/07/2014 22:06:19
Create a new column, then apply simple datetime functions to it using lambda and apply.
In [14]: df['Short Date']= pd.to_datetime(df['Created Date'])
In [15]: df
Out[15]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06
Short Date
0 2014-04-01 14:05:00
1 2014-04-01 13:44:00
2 2014-04-01 17:28:00
In [16]: df['Short Date'] = df['Short Date'].apply(lambda x:x.date().strftime('%m%d%y'))
In [17]: df
Out[17]:
InteractionID Created Date EmployeeID Repeat Date Short Date
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50 040114
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47 040114
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06 040114
Then just join the two columns. Converting the Short Date column to strings with map(str) avoids errors when concatenating strings and integers.
In [32]: df['Roster ID'] = df['EmployeeID'] + df['Short Date'].map(str)
In [33]: df
Out[33]:
InteractionID Created Date EmployeeID Repeat Date Short Date \
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50 040114
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47 040114
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06 040114
Roster ID
0 912a040114
1 172r040114
2 217y040114
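The question asked for ShortDate as %m/%d/%Y rather than the %m%d%y used above. Here is a minimal sketch of the same apply-based approach with that format, assuming every value matches mm/dd/yyyy hh:mm:ss exactly. The sample rows are copied from the question, and io.StringIO only stands in for the real file:

```python
import io

import pandas as pd

# Sample rows copied from the question. Reading with dtype=str keeps the
# leading zeros in InteractionID (07927 would otherwise become 7927).
csv_text = """InteractionID,Created Date,EmployeeID,Repeat Date
07927,04/01/2014 14:05:10,912a,04/01/2014 14:50:03
02158,04/01/2014 13:44:05,172r,04/04/2014 17:47:29
44279,04/01/2014 17:28:36,217y,04/07/2014 22:06:19
"""
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Parse with an explicit format so pandas does not have to guess it,
# then format each timestamp back to text without the time part.
created = pd.to_datetime(df['Created Date'], format='%m/%d/%Y %H:%M:%S')
df['ShortDate'] = created.apply(lambda ts: ts.strftime('%m/%d/%Y'))

# Both columns are strings, so plain + concatenates them.
df['RosterID'] = df['EmployeeID'] + df['ShortDate']
print(df[['EmployeeID', 'ShortDate', 'RosterID']])
```

Because the whole frame is read as strings, no map(str) is needed before concatenating here.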
You can apply a post-processing step that first converts the string to a date-time and then applies a lambda to keep only the date part:
In [29]:
df['Created Date'] = pd.to_datetime(df['Created Date']).apply(lambda x: x.date())
df['Repeat Date'] = pd.to_datetime(df['Repeat Date']).apply(lambda x: x.date())
df
Out[29]:
InteractionID Created Date EmployeeID Repeat Date
0 7927 2014-04-01 912a 2014-04-01
1 2158 2014-04-01 172r 2014-04-04
2 44279 2014-04-01 217y 2014-04-07
EDIT
After revisiting this: you can access just the date component with dt.date if your pandas version is 0.15.0 or later:
In [18]:
df['just_date'] = df['Repeat Date'].dt.date
df
Out[18]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date
0 2014-04-01
1 2014-04-04
2 2014-04-07
Also, you can now use dt.strftime instead of apply to achieve the desired result:
In [28]:
df['short_date'] = df['Repeat Date'].dt.strftime('%m%d%Y')
df
Out[28]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date short_date
0 2014-04-01 04012014
1 2014-04-04 04042014
2 2014-04-07 04072014
So generating the Roster ID is now a trivial exercise of adding the two columns together:
In [30]:
df['Roster ID'] = df['EmployeeID'] + df['short_date']
df
Out[30]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date short_date Roster ID
0 2014-04-01 04012014 912a04012014
1 2014-04-04 04042014 172r04042014
2 2014-04-07 04072014 217y04072014
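Since the question mentions that pd.to_datetime fails on some of the Excel-derived values, it may help to find the offending rows first. This is only a sketch, assuming the expected format is mm/dd/yyyy hh:mm:ss. The 'not a date' value is a made-up bad row for illustration, not something from the real data:

```python
import pandas as pd

df = pd.DataFrame({
    'EmployeeID': ['912a', '172r', '217y'],
    # The last value is a hypothetical malformed entry, added for illustration.
    'Repeat Date': ['04/01/2014 14:50:03', '04/04/2014 17:47:29', 'not a date'],
})

# errors='coerce' turns values that do not match the format into NaT
# instead of raising, so the bad rows can be inspected afterwards.
parsed = pd.to_datetime(df['Repeat Date'],
                        format='%m/%d/%Y %H:%M:%S',
                        errors='coerce')
bad_rows = df[parsed.isna()]
print(bad_rows)

# Build the short date and the Roster ID, leaving unparseable rows missing.
df['short_date'] = parsed.dt.strftime('%m/%d/%Y')
roster = df['EmployeeID'] + df['short_date'].fillna('')
df['Roster ID'] = roster.where(parsed.notna())
print(df)
```

Looking at bad_rows should show what the problem values actually contain, which is usually quicker than guessing at the Excel formatting.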
You can also do this using only the standard library (in whatever format you want: "%m/%d/%Y", "%m-%d-%Y", or other orders and formats):
In [118]:
import time
df['Created Date'] = df['Created Date'].apply(lambda x: time.strftime('%m/%d/%Y', time.strptime(x, '%m/%d/%Y %H:%M:%S')))
In [120]:
print(df)
InteractionID Created Date EmployeeID Repeat Date
0 7927 04/01/2014 912a 04/01/2014 14:50:03
1 2158 04/01/2014 172r 04/04/2014 17:47:29
2 44279 04/01/2014 217y 04/07/2014 22:06:19
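The same idea can be written with the standard datetime module and a named helper instead of a lambda, which makes the output format easy to swap. This is a sketch only: to_short_date and its parameter names are made up for illustration, and it writes to a new ShortDate column so that Created Date is left untouched:

```python
from datetime import datetime

import pandas as pd


def to_short_date(text, in_fmt='%m/%d/%Y %H:%M:%S', out_fmt='%m/%d/%Y'):
    """Parse a date-time string and return it re-formatted without the time.

    Hypothetical helper; raises ValueError if text does not match in_fmt.
    """
    return datetime.strptime(text, in_fmt).strftime(out_fmt)


# Sample values copied from the question.
df = pd.DataFrame({
    'EmployeeID': ['912a', '172r', '217y'],
    'Created Date': ['04/01/2014 14:05:10',
                     '04/01/2014 13:44:05',
                     '04/01/2014 17:28:36'],
})

df['ShortDate'] = df['Created Date'].apply(to_short_date)
df['RosterID'] = df['EmployeeID'] + df['ShortDate']
print(df)
```

For a different layout, pass another format, for example to_short_date(value, out_fmt='%m-%d-%Y').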