Speeding up timestamped operations
The following conversion (ms -> datetime -> Eastern time) takes a long time (4 minutes), presumably because I am working with a large dataframe:
for column in ['A', 'B', 'C', 'D', 'E']:
    # Data comes in unix time (ms) so I need to convert it to datetime
    df[column] = pd.to_datetime(df[column], unit='ms')
    # Get the times in EST
    df[column] = df[column].apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))
Is there a way to speed it up? Am I already using Pandas data structures and methods in the most efficient way?
These operations are available as DatetimeIndex methods, which will be much faster:
df[column] = pd.DatetimeIndex(df[column]).tz_localize('UTC').tz_convert('US/Eastern')
Note: as of 0.15.0, you can access them via the dt accessor:
df[column] = df[column].dt.tz_localize('UTC').tz_convert('US/Eastern')
I would try the date command in Bash. date turns out to be faster than even gawk for routine conversions, and even Python will have trouble competing with it.
To speed it up further, export column A into one temp file, column B into another, etc. (you can even do this from Python). Then process the 5 columns in parallel.
with open('thefileA', 'w') as thefileA:
    for value in df['A']:
        print(value, file=thefileA)
with open('thefileB', 'w') as thefileB:
    for value in df['B']:
        print(value, file=thefileB)
Then a Bash script:
#!/usr/bin/env bash
readarray -t a < thefileA
for i in "${a[@]}"; do
    date -r "$i"
done
You will need a master Bash script that runs the first part with python pythonscript.py, and then launches each of the per-column Bash scripts in the background, e.g. ./FILEA.sh &. This will run each column individually and let the OS schedule them. I am not 100% sure the syntax of my Bash loop after readarray is correct. If you are using Linux, use date -d @$i instead of date -r.
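A self-contained sketch of the per-column conversion loop on GNU/Linux (the file name and sample timestamps are made up; date -d is GNU-specific, and note that date expects epoch seconds, so the millisecond values from the question would need dividing by 1000 first):

```shell
#!/usr/bin/env bash
# Write made-up epoch-second timestamps to a temp file, as the Python
# export step would, then convert each one with GNU date.
printf '%s\n' 1388534400 1388538000 > thefileA
readarray -t a < thefileA
for i in "${a[@]}"; do
    date -u -d "@$i" +%Y-%m-%dT%H:%M:%S
done
```

Running this prints 2014-01-01T00:00:00 and 2014-01-01T01:00:00. Whether forking one date process per row actually beats a vectorized pandas conversion is doubtful, though; the DatetimeIndex/dt approach above avoids process startup costs entirely.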