Converting from numpy.datetime64 to pandas.tslib.Timestamp error?

I have a Python module that loads data directly into a dict of numpy.ndarray arrays for use in a pandas.DataFrame. However, I noticed a problem with "NA" values. My file format represents NA values as -9223372036854775808 (boost::integer_traits&lt;int64_t&gt;::const_min). The non-NA values load into the pandas.DataFrame as expected (with correct values). I believe what is happening is that my module loads the data into a numpy.datetime64 ndarray, which is then converted to a list of pandas.tslib.Timestamp objects, and this conversion doesn't preserve the const_min integer. Try the following:

>>> pandas.tslib.Timestamp(-9223372036854775808)
NaT
>>> pandas.tslib.Timestamp(numpy.datetime64(-9223372036854775808))
<Timestamp: 1969-12-31 15:58:10.448384>


Is this a pandas bug? I suppose I could cripple my module by not using a numpy.ndarray in this case and instead handing pandas something it won't convert (perhaps pre-building the list of tslib.Timestamp objects myself).
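A minimal sketch of that idea, assuming a reasonably recent pandas where the NaT singleton is exposed as `pandas.NaT` (the names `SENTINEL` and `raw_us` are illustrative, not from my module):

```python
import pandas as pd

SENTINEL = -9223372036854775808  # int64 min, the on-disk NA marker

raw_us = [SENTINEL, 1326834000090451]  # microseconds since the epoch
# Substitute the NaT singleton for the sentinel before pandas ever sees
# the raw int64s; scale the rest from microseconds to nanoseconds.
values = [pd.NaT if v == SENTINEL else pd.Timestamp(v * 1000) for v in raw_us]
s = pd.Series(values)
print(s)
```

This sidesteps the datetime64 conversion entirely, at the cost of a Python-level loop.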

Here's another example of the unexpected behavior:

>>> npa = numpy.ndarray(1, dtype=numpy.datetime64)
>>> npa[0] = -9223372036854775808
>>> pandas.Series(npa)
0   NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>


Following up on Jeff's answer below, I have more information about what's going wrong.

>>> npa = numpy.ndarray(2, dtype=numpy.int64)
>>> npa[0] = -9223372036854775808
>>> npa[1] = 1326834000090451
>>> npa
array([-9223372036854775808,     1326834000090451])
>>> s_npa = pandas.Series(npa, dtype='M8[us]')
>>> s_npa
0                          NaT
1   2012-01-17 21:00:00.090451


Hooray! The Series preserved both the NA and my timestamp. However, if I try to create a DataFrame from this Series, the NaT disappears.

>>> pandas.DataFrame({'ts':s_npa})
                      ts
0 1969-12-31 15:58:10.448384
1 2012-01-17 21:00:00.090451


Ho-hum. On a whim, I tried interpreting my integers as nanoseconds since the epoch. To my surprise, the DataFrame handled it correctly:

>>> s2_npa = pandas.Series(npa, dtype='M8[ns]')
>>> s2_npa
0                             NaT
1   1970-01-16 08:33:54.000090451
>>> pandas.DataFrame({"ts":s2_npa})
                             ts
0                           NaT
1 1970-01-16 08:33:54.000090451


Of course my timestamp is now wrong. My point is that pandas.DataFrame is behaving inconsistently here. Why does it preserve NaT with dtype='M8[ns]' but not with dtype='M8[us]'?
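For what it's worth, the int64-min bit pattern is NumPy's own NaT sentinel, which is why reinterpreting the bits at nanosecond resolution works. A minimal check, assuming a reasonably recent NumPy (for `np.isnat`) and pandas:

```python
import numpy as np
import pandas as pd

SENTINEL = -9223372036854775808  # int64 min

# Reinterpret the sentinel bits as datetime64[ns]: no arithmetic happens,
# so the bit pattern survives and numpy recognizes it as NaT.
arr = np.array([SENTINEL], dtype=np.int64).view('M8[ns]')
print(np.isnat(arr[0]))
print(pd.Series(arr))
```

Under 'M8[us]' the value instead goes through a unit conversion to nanoseconds, which is where the sentinel gets destroyed.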

I am currently using this conversion workaround, which slows things down but works:

>>> s = pandas.Series([1000*ts if ts != -9223372036854775808 else ts for ts in npa], dtype='M8[ns]')
>>> pandas.DataFrame({'ts':s})
                          ts
0                        NaT
1 2012-01-17 21:00:00.090451
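The same transformation can be vectorized in NumPy, which should be much faster than the per-element list comprehension. A sketch, under the same assumptions as above (`SENTINEL` is my illustrative name for the NA marker):

```python
import numpy as np
import pandas as pd

SENTINEL = -9223372036854775808  # int64 min, the on-disk NA marker

raw = np.array([SENTINEL, 1326834000090451], dtype=np.int64)  # microseconds

# Scale everything except the sentinel from microseconds to nanoseconds,
# then reinterpret the bits as datetime64[ns]; the sentinel becomes NaT.
ns = np.where(raw == SENTINEL, raw, raw * 1000)
s = pd.Series(ns.view('M8[ns]'))
print(pd.DataFrame({'ts': s}))
```

(`np.where` evaluates `raw * 1000` for every element, so the sentinel slot does overflow, but that lane's result is discarded and the original sentinel is kept.)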


(A few hours later...)

Okay, I've made progress. I dug deeper into the code and saw that the repr of a Series eventually calls "_format_datetime64", which checks "isnull" and prints "NaT". This explains the difference between these two:

>>> pandas.Series(npa)
0   NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>


The first appears to honor the NA, but only when printing. I suppose there may be other pandas functions that call "isnull" and act on the result, which would make NA timestamps appear partially supported in this case. However, I know the Series is still wrong because of the type of the null element: it is a Timestamp, but it should be NaTType. My next step is to dive into the Series constructor to figure out when/how pandas uses the NaT value during construction. Presumably it misses the case where I specify dtype='M8[us]'... (more to come).
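In current pandas, a quick way to distinguish a real NaT from a wrapped-around Timestamp is an identity check against the `pandas.NaT` singleton (a sketch; `SENTINEL` is an illustrative name):

```python
import numpy as np
import pandas as pd

SENTINEL = -9223372036854775808  # int64 min

arr = np.array([SENTINEL], dtype=np.int64).view('M8[ns]')
s = pd.Series(arr)

# The null element should be the NaT singleton, not a Timestamp:
print(s[0] is pd.NaT)
print(type(s[0]).__name__)  # NaTType in current pandas
```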

Following Andy's suggestion in the comments, I tried using pandas.tslib.Timestamp directly to resolve the issue. It didn't work. Here's an example of the results:

>>> npa = numpy.ndarray(1, dtype='i8')
>>> npa[0] = -9223372036854775808
>>> npa
array([-9223372036854775808])
>>> pandas.tslib.Timestamp(npa.view('M8[ns]')[0]).value
-9223372036854775808
>>> pandas.tslib.Timestamp(npa.view('M8[us]')[0]).value
-28909551616000
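The `.value` attribute is always nanoseconds since the epoch, so the 'M8[ns]' view round-trips the raw bits unchanged while the 'M8[us]' view goes through a unit conversion (multiplying by 1000 on the way to nanoseconds), which is what mangles the sentinel. A sketch with a valid (non-NA) value, assuming a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

us_since_epoch = 1326834000090451  # a valid microsecond timestamp
arr = np.array([us_since_epoch], dtype='i8')

# Timestamp.value is always expressed in nanoseconds since the epoch:
ns_value = pd.Timestamp(arr.view('M8[ns]')[0]).value  # bits round-trip unchanged
us_value = pd.Timestamp(arr.view('M8[us]')[0]).value  # scaled us -> ns
print(ns_value)
print(us_value)
```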



1 answer


Answer: No.

Well, technically speaking, it is. I filed an issue on GitHub and got this response: https://github.com/pydata/pandas/issues/2800#issuecomment-13161074

"Units other than nanoseconds are currently not supported for indexing, etc. This must be strictly followed."



All of the tests I've run with "ns" instead of "us" work fine, so I'm looking forward to a future release.

For anyone interested: I modified my C++ Python module to iterate over the int64_t arrays loaded from disk and multiply everything by 1000, except for the NA values (boost::integer_traits&lt;int64_t&gt;::const_min). I was worried about performance, but the difference in load times is negligible for me. (Doing the same thing in Python is very, very slow.)
