Pandas (Python): reading and working with Java BigInteger / large numbers
I have a data file (CSV) with Nilsimsa hash values. Some of them are up to 80 characters long. I want to read them in Python for data analysis tasks. Is there a way to import the data into Python without losing information?
EDIT: I've tried the implementation suggested in the comments, but it doesn't work for me. Sample data in the CSV file: 77241756221441762028881402092817125017724447303212139981668021711613168152184106
As @JohnE explained in his answer, we don't lose any information when reading large numbers with Pandas: they are stored with dtype=object. To work with them numerically, we need to convert the data to a numeric type.
For a series:
We apply map(func)
to the column in the dataframe:
df['columnName'].map(int)
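As a minimal sketch (the column name `hash` and the 80-digit value are made up for illustration), reading the hashes as strings and then converting the column with `map(int)`:

```python
import io

import pandas as pd

# Hypothetical CSV with one 80-digit Nilsimsa-style hash value.
csv_data = io.StringIO("hash\n" + "9" * 80 + "\n")
df = pd.read_csv(csv_data, dtype=str)  # force strings so nothing is mangled

df["hash"] = df["hash"].map(int)  # Python ints have arbitrary precision
print(df["hash"].dtype)           # still object: big ints are not a numpy type
```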
For the entire dataframe:
If for some reason our entire dataframe consists of columns with dtype=object
, we look at applymap(func)
in the Pandas documentation:
DataFrame.applymap(func): apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
To convert all columns of the dataframe:
df.applymap(int)
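A short sketch of that whole-frame conversion (the column names and 40-digit values are invented; note that newer pandas deprecates applymap in favour of DataFrame.map):

```python
import pandas as pd

# Two columns of big-number strings, every cell dtype=object.
df = pd.DataFrame({"a": ["1" * 40], "b": ["2" * 40]})

df = df.applymap(int)  # convert every cell to an arbitrary-precision int
total = df["a"][0] + df["b"][0]
print(total)  # exact sum, no overflow or precision loss
```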
Start with a simple text file to read, with just one variable and one line.
%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106
In [268]: df=pd.read_csv('foo.txt')
Pandas will read it as a string because it is too large to store exactly in a native numeric type like int64 (a float64 would lose precision). But the information is there; you have not lost anything.
In [269]: df.x
Out[269]:
0 7724175622144176202888140209281712501772444730...
Name: x, dtype: object
In [270]: type(df.x[0])
Out[270]: str
And you can use plain Python to treat it like a number. Remember the caveats from the links in the comments: it won't be as fast as operations in numpy and pandas where the entire column is stored as int64. This uses a more flexible but slower object mode to handle things.
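For instance, plain Python handles the sample value from the question exactly:

```python
# The 80-digit sample value from the question, as read from the CSV (a string).
s = "77241756221441762028881402092817125017724447303212139981668021711613168152184106"

n = int(s)          # arbitrary-precision Python int, nothing truncated
print(n * 2)        # exact doubling
print(len(str(n)))  # all 80 digits survive the round trip
```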
You can change the column to be stored as longs (long integers) like this. (But note that the dtype is still object, because everything except the basic numpy types (int32, int64, float64, etc.) is stored as an object.)
In [271]: df.x = df.x.map(int)
And then you can more or less treat it as a number.
In [272]: df.x * 2
Out[272]:
0 1544835124428835240577628041856342500354488946...
Name: x, dtype: object
You will need to do some formatting to see the entire number, or go the numpy route, which shows the whole number by default.
In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
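One way to sidestep the `...` truncation in the Series repr (a sketch with a made-up 80-digit value; `display.max_colwidth` is the pandas option that governs how wide object values print):

```python
import pandas as pd

df = pd.DataFrame({"x": [int("7" * 80)]})  # hypothetical 80-digit value
doubled = df.x * 2

# Printing a single element bypasses the Series repr truncation entirely.
print(doubled.iloc[0])

# Or widen the display so the repr shows the whole number.
pd.set_option("display.max_colwidth", None)
print(doubled.to_string())
```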