Handling HUGE numbers in numpy or pandas
I participate in competitions where the data is provided to me anonymized. Some columns have HUGE values; the largest was 40 digits! I used pd.read_csv, but these columns were converted to object dtype in the result.
My original plan was to scale the data, but since the values are treated as objects, I cannot do arithmetic on them.
Does anyone have a suggestion on how to handle huge numbers in Pandas or NumPy?
Please note that I tried converting the values to uint64 with no luck; I get the error "too big to convert".
You can use the converters argument of pd.read_csv to call int, or some other custom converter function, on each string as it is imported:
import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,1,"Tiny"
4,-9999999999999999999999999999999999999999,"Really negative"
'''
# Parse Big_Num with Python's arbitrary-precision int instead of a fixed-width dtype.
df = pd.read_csv(StringIO(txt), converters={'Big_Num': int})
print(df)
Prints:
line Big_Num text
0 1 1234567890123456789012345678901234567890 That sure is a big number
1 2 9999999999999999999999999999999999999999 That is an even BIGGER number
2 3 1 Tiny
3 4 -9999999999999999999999999999999999999999 Really negative
Now test the arithmetic:
n = df["Big_Num"][1]
print(n, n + 1)
Prints:
9999999999999999999999999999999999999999 10000000000000000000000000000000000000000
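For context on the "too big to convert" error in the question, here is a small check of why a 40-digit value cannot fit in uint64. It is a plain-Python sketch (no pandas needed); the names UINT64_MAX, fits and digits_max are made up for illustration.

```python
# Largest value an unsigned 64-bit integer can hold.
UINT64_MAX = 2 ** 64 - 1

n = 9999999999999999999999999999999999999999  # the 40-digit value from the sample data

# uint64 tops out at 20 decimal digits, so a 40-digit number cannot be stored in it.
digits_max = len(str(UINT64_MAX))
fits = n <= UINT64_MAX

print(digits_max, len(str(n)), fits)
```

Python's own int has no such fixed width, which is why the converter approach above keeps every digit.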
If the column contains values that would make int raise an error, you can do this:
txt='''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,0.000000000000000001,"Tiny"
4,"a string","Use 0 for strings"
'''
def conv(s):
    # Try an exact integer first, then a float, and fall back to 0 for anything else.
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return 0
df = pd.read_csv(StringIO(txt), converters={'Big_Num': conv})
print(df)
Prints:
line Big_Num text
0 1 1234567890123456789012345678901234567890 That sure is a big number
1 2 9999999999999999999999999999999999999999 That is an even BIGGER number
2 3 1e-18 Tiny
3 4 0 Use 0 for strings
Then each value in the column will be either Python int or float and will support arithmetic.
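Since the original goal was to scale the data, here is a minimal sketch of min-max scaling a list of arbitrarily large Python ints. It assumes Python 3 and that the values are already parsed to int (for example via list(df["Big_Num"])); the helper name scale_min_max is hypothetical. Fraction keeps the subtraction and division exact, and only the final result is rounded to a float.

```python
from fractions import Fraction

def scale_min_max(values):
    """Scale exact Python ints to floats in [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    # Fraction divides exactly; float() rounds once at the very end.
    return [float(Fraction(v - lo, span)) for v in values]

big = [
    1234567890123456789012345678901234567890,
    9999999999999999999999999999999999999999,
    1,
    -9999999999999999999999999999999999999999,
]
scaled = scale_min_max(big)
print(scaled)
```

The scaled values are ordinary floats, so they can go into a regular float64 column for further numeric work.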
If you have a mixed-type column (some integers, some strings) stored with dtype=object, you can still convert to int and do the arithmetic. Starting with a mixed column:
>>> df = pd.DataFrame({"A": [11**44, "11"*22]})
>>> df
A
0 6626407607736641103900260617069258125403649041
1 11111111111111111111111111111111111111111111
[2 rows x 1 columns]
>>> df.dtypes, list(map(type, df.A))
(A object
dtype: object, [<type 'long'>, <type 'str'>])
We can convert to ints:
>>> df["A"] = df["A"].apply(int)
>>> df.dtypes, list(map(type, df.A))
(A object
dtype: object, [<type 'long'>, <type 'long'>])
>>> df
A
0 6626407607736641103900260617069258125403649041
1 11111111111111111111111111111111111111111111
[2 rows x 1 columns]
And then do the arithmetic:
>>> df // 11
A
0 602400691612421918536387328824478011400331731
1 1010101010101010101010101010101010101010101
[2 rows x 1 columns]
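The session above was recorded on Python 2, where big integers show up as long. A rough Python 3 equivalent, where the built-in int is already arbitrary precision, might look like this (a sketch, assuming a reasonably recent pandas; the variable name result is only for illustration):

```python
import pandas as pd

# Mixed object column: one Python int and one numeric string.
df = pd.DataFrame({"A": [11**44, "11" * 22]})

# Convert every element to a Python int; the values are too large for int64,
# so the column keeps dtype object.
df["A"] = df["A"].apply(int)

# Element-wise arithmetic falls back to Python int operations, so it stays exact.
result = df["A"] // 11
print(result)
```

Because the work happens element by element in Python, this is exact but slower than arithmetic on a native int64 or float64 column.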
Edit: these values can't be represented accurately as floats; the conversion just doesn't raise when you try. It is probably best to use object dtype and longs, as in DSM's answer.
But you can do it imprecisely (using @DSM's example):
In [11]: df = pd.DataFrame({"A": [11**44, "11"*22]}).astype(float)
In [12]: df
Out[12]:
A
0 6.626408e+45
1 1.111111e+43
[2 rows x 1 columns]
In [13]: df.dtypes
Out[13]:
A float64
dtype: object
But that might not be what you want ...
In [21]: df.iloc[0, 0]
Out[21]: 6.6264076077366411e+45
In [22]: long(df.iloc[0, 0])
Out[22]: 6626407607736641089115845702792172379125579776L
In [23]: 11 ** 44
Out[23]: 6626407607736641103900260617069258125403649041L
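The mismatch comes from float64 carrying only 53 bits of significand, so integers beyond 2**53 are rounded to the nearest representable value. A small pure-Python check of that (assuming Python 3, where int plays the role of long; the variable names are illustrative):

```python
exact = 11 ** 44

# Converting to float rounds to the nearest float64, discarding the low-order digits.
approx = int(float(exact))

# Past 2**53, consecutive integers are no longer all representable as float64.
limit = 2 ** 53
collides = float(limit) == float(limit + 1)

print(exact - approx)  # size of the rounding error for this value
print(collides)
```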
As DSM points out, convert to long (keeping object dtype) so you don't lose precision:
In [31]: df = pd.DataFrame({"A": [11**44, "11"*22]}).apply(long, 1)
In [32]: df
Out[32]:
0 6626407607736641103900260617069258125403649041
1 11111111111111111111111111111111111111111111
dtype: object