Speeding up timestamp operations

The following conversion (ms -> datetime -> EST time) takes a long time (4 minutes), possibly because I am working with a large dataframe:

for column in ['A', 'B', 'C', 'D', 'E']:
    # Data comes in unix time (ms) so I need to convert it to datetime
    df[column] = pd.to_datetime(df[column], unit='ms')

    # Get times in EST
    df[column] = df[column].apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))


Is there a way to speed it up? Am I already using Pandas data structures and methods in the most efficient way?


2 answers


tz_localize and tz_convert are also available as DatetimeIndex methods, which will be much faster than calling them row by row with apply:

df[column] = pd.DatetimeIndex(df[column]).tz_localize('UTC').tz_convert('US/Eastern')




Note: from pandas 0.15.0 you can access them via the .dt accessor:

df[column] = df[column].dt.tz_localize('UTC').tz_convert('US/Eastern')
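
For context, a minimal sketch of the question's loop rewritten with this vectorized approach (same df and column names as in the question; assumes pandas 0.15.0+ for the .dt accessor):

import pandas as pd

for column in ['A', 'B', 'C', 'D', 'E']:
    # Parse the Unix ms timestamps (vectorized, as in the question)
    df[column] = pd.to_datetime(df[column], unit='ms')
    # Localize and convert the whole column at once instead of a per-row apply()
    df[column] = df[column].dt.tz_localize('UTC').tz_convert('US/Eastern')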




I would try it in Bash with the date command. date turns out to be faster than even gawk for routine conversions, and Python struggles to keep up with it for this kind of work.

To speed it up even more, export column A into one temp file, column B into another, and so on (you can do this from Python). Then process the 5 columns in parallel:

# Dump each column's raw millisecond values to its own temp file, one per line
with open('thefileA', 'w') as f:
    for value in df['A']:
        f.write('%d\n' % value)
with open('thefileB', 'w') as f:
    for value in df['B']:
        f.write('%d\n' % value)




Then a Bash script:

#!/usr/bin/env bash
# Read the exported timestamps (one per line) and convert each of them.
# The question's values are in milliseconds, so divide by 1000 to get
# the epoch seconds that date expects.
readarray -t a < thefileA
for i in "${a[@]}"; do
    date -r "$((i / 1000))"    # BSD/macOS; on Linux use: date -d "@$((i / 1000))"
done


You will need a master Bash script that first runs the Python part (python pythonscript.py) and then launches each of the per-column Bash scripts in the background (./FILEA.sh &, and so on). This processes each column separately and lets the OS spread the work across cores. I am not 100% sure about the syntax of the Bash loop after readarray. If you are on Linux, use date -d @$i instead of date -r.
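
For illustration, a rough Python sketch of that master step (the script names FILEA.sh through FILEE.sh are hypothetical; each is assumed to wrap the date loop above for one of the temp files):

import subprocess

# Launch one per-column conversion script in the background, then wait for all of them
scripts = ['./FILEA.sh', './FILEB.sh', './FILEC.sh', './FILED.sh', './FILEE.sh']
procs = [subprocess.Popen(['bash', s]) for s in scripts]
for p in procs:
    p.wait()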
