Handling HUGE numbers in numpy or pandas

I participate in competitions where the data is provided to me anonymized. Some columns have HUGE values. The largest was 40 digits! I used pd.read_csv, but these columns were converted to objects in the result.

My original plan was to scale the data, but since they are treated like objects, I cannot do arithmetic on them.

Does anyone have a suggestion on how to handle huge numbers in Pandas or Numpy?

Please note that I tried converting the values to uint64 with no luck. I get the error "too big to convert".



3 answers


You can use the converters argument of pd.read_csv to call int, or some other custom converter function, on each string during import:

import pandas as pd
from io import StringIO  # Python 3 location of StringIO

txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,1,"Tiny"
4,-9999999999999999999999999999999999999999,"Really negative"
'''

# Parse Big_Num with Python's arbitrary-precision int instead of a fixed-width dtype.
df = pd.read_csv(StringIO(txt), converters={'Big_Num': int})

print(df)

      

Prints:



   line                                    Big_Num                           text
0     1   1234567890123456789012345678901234567890      That sure is a big number
1     2   9999999999999999999999999999999999999999  That is an even BIGGER number
2     3                                          1                           Tiny
3     4  -9999999999999999999999999999999999999999                Really negative

      

Now test the arithmetic:

n = df["Big_Num"][1]
print(n, n + 1)

      

Prints:



9999999999999999999999999999999999999999 10000000000000000000000000000000000000000

      

If the column contains values that would make int raise an error, you can do this:

txt='''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,0.000000000000000001,"Tiny"
4,"a string","Use 0 for strings"
'''

def conv(s):
    # Try an exact integer first, then a float, and fall back to 0 for anything else.
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return 0

df = pd.read_csv(StringIO(txt), converters={'Big_Num': conv})
print(df)

      

Prints:



   line                                   Big_Num                           text
0     1  1234567890123456789012345678901234567890      That sure is a big number
1     2  9999999999999999999999999999999999999999  That is an even BIGGER number
2     3                                     1e-18                           Tiny
3     4                                         0              Use 0 for strings

      

Then each value in the column will be either Python int or float and will support arithmetic.
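
Since the original goal was to scale the data, here is a minimal sketch of one way to do that once the column holds only Python ints. It assumes Python 3 and that min-max scaling is what is wanted; the helper name scale_big_ints is hypothetical, not a pandas function. The division is done exactly with fractions.Fraction, and only the final ratio is rounded to a float.

```python
from fractions import Fraction

import pandas as pd


def scale_big_ints(col):
    """Min-max scale an object column of Python ints into floats in [0, 1].

    Hypothetical helper: the arithmetic is exact (Fraction), and only the
    final ratio is rounded to a float. Assumes every value is a Python int.
    """
    lo, hi = min(col), max(col)
    span = hi - lo
    if span == 0:
        # All values are equal; map everything to 0.0 to avoid dividing by zero.
        return pd.Series([0.0] * len(col), index=col.index, dtype="float64")
    return col.map(lambda v: float(Fraction(v - lo, span))).astype("float64")


big = pd.Series(
    [
        1234567890123456789012345678901234567890,
        9999999999999999999999999999999999999999,
        1,
        -9999999999999999999999999999999999999999,
    ],
    dtype=object,
)
scaled = scale_big_ints(big)
print(scaled)
```

The scaled column is an ordinary float64 Series, so the usual vectorized numpy and pandas operations apply to it from there on. The rounding happens once, at the end, instead of when the raw 40-digit values are read.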



If you have a mixed-type column (some integers, some strings) stored with dtype=object, you can still convert it to int and do the arithmetic. Starting with a mixed column:

>>> df = pd.DataFrame({"A": [11**44, "11"*22]})
>>> df
                                                A
0  6626407607736641103900260617069258125403649041
1    11111111111111111111111111111111111111111111

[2 rows x 1 columns]
>>> df.dtypes, list(map(type, df.A))
(A    object
dtype: object, [<type 'long'>, <type 'str'>])

      

We can convert to ints:



>>> df["A"] = df["A"].apply(int)
>>> df.dtypes, list(map(type, df.A))
(A    object
dtype: object, [<type 'long'>, <type 'long'>])
>>> df
                                                A
0  6626407607736641103900260617069258125403649041
1    11111111111111111111111111111111111111111111

[2 rows x 1 columns]

      

And then do the arithmetic:

>>> df // 11
                                               A
0  602400691612421918536387328824478011400331731
1    1010101010101010101010101010101010101010101

[2 rows x 1 columns]
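
The session above was recorded on Python 2, where big integers show up as long. A minimal Python 3 sketch of the same steps, assuming a reasonably current pandas, might look like this; int is arbitrary-precision there, so no separate long type is involved.

```python
import pandas as pd

# Mixed column: one Python int and one numeric string, stored as dtype=object.
df = pd.DataFrame({"A": [11**44, "11" * 22]})

# Convert every element to a Python int; the column stays dtype=object
# because the values do not fit into a 64-bit integer.
df["A"] = df["A"].apply(int)

# Element-wise integer arithmetic is exact on Python ints.
quotient = df["A"] // 11
print(quotient.tolist())
```

Because the column is dtype=object, each operation runs as a Python-level loop over the elements, so it is exact but slower than arithmetic on a native int64 or float64 column.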

      



Edit: these values can't be represented accurately as floats; the conversion just doesn't raise when you try it. It is probably best to use an object dtype holding longs, as in DSM's answer.

But you can do it imprecisely (using @DSM's example):

In [11]: df = pd.DataFrame({"A": [11**44, "11"*22]}).astype(float)

In [12]: df
Out[12]: 
              A
0  6.626408e+45
1  1.111111e+43

[2 rows x 1 columns]

In [13]: df.dtypes
Out[13]: 
A    float64
dtype: object

      

But that might not be what you want ...

In [21]: df.iloc[0, 0]
Out[21]: 6.6264076077366411e+45

In [22]: long(df.iloc[0, 0])
Out[22]: 6626407607736641089115845702792172379125579776L

In [23]: 11 ** 44
Out[23]: 6626407607736641103900260617069258125403649041L
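
To see why the float round trip above goes wrong: a float64 carries a 53-bit significand, roughly 15 to 17 significant decimal digits, so a 46-digit integer cannot survive the conversion. Below is a small sketch in Python 3, where long is spelled int. decimal.Decimal is shown only as one possible alternative; it is not part of the answers here and is an assumption about what might suit cases that need exact non-integer values.

```python
import sys
from decimal import Decimal

exact = 11**44

# float64 keeps 53 bits of significand, so most of the 46 digits are rounded away.
roundtrip = int(float(exact))
print(sys.float_info.mant_dig)  # number of significand bits in a float
print(exact - roundtrip)        # non-zero: the rounding error of the round trip

# Decimal parses the digits exactly. Arithmetic on Decimals is then rounded
# to the context precision (28 significant digits by default), so results
# with more digits need a higher precision or plain int arithmetic.
as_decimal = Decimal("6626407607736641103900260617069258125403649041")
print(int(as_decimal) == exact)
```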

      

As DSM points out, convert to long (plain int on Python 3) and keep the object dtype so you don't lose precision:

In [31]: df = pd.DataFrame({"A": [11**44, "11"*22]}).apply(long, 1)

In [32]: df
Out[32]: 
0    6626407607736641103900260617069258125403649041
1      11111111111111111111111111111111111111111111
dtype: object

      
