Pandas (Python) reading and working on Java BigInteger / large numbers

I have a CSV data file with Nilsimsa hash values. Some of them have up to 80 characters. I want to read them in Python for data-analysis tasks. Is there a way to import the data into Python without losing information?

EDIT: I've tried the implementation suggested in the comments, but it doesn't work for me. Sample value in the CSV file: 77241756221441762028881402092817125017724447303212139981668021711613168152184106



2 answers


As @JohnE explained in his answer, we don't lose any information when reading large numbers with Pandas: they are stored as dtype=object. To compute with them numerically, we need to convert the data to a numeric type.

For a single series:

We apply map(func) to the column in the DataFrame:

df['columnName'].map(int)
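A minimal sketch of the whole round trip, assuming an inline CSV and a column named `hash` (both are illustrative, not from the question):

```python
import io

import pandas as pd

# Hypothetical CSV with one 80-digit Nilsimsa hash; the column name "hash" is assumed.
csv_data = "hash\n77241756221441762028881402092817125017724447303212139981668021711613168152184106\n"

# dtype=str guards against any lossy numeric parsing on read.
df = pd.read_csv(io.StringIO(csv_data), dtype=str)

# Python ints have arbitrary precision, so no digits are lost in the conversion.
df["hash"] = df["hash"].map(int)

print(df["hash"].iloc[0])
```

The column keeps dtype=object, but each element is now an exact Python int.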

For the entire DataFrame:

If for some reason our entire DataFrame consists of columns with dtype=object, we turn to applymap(func). From the Pandas documentation:

DataFrame.applymap(func): Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame.

To convert all columns of the DataFrame:

 df.applymap(int)
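For illustration, a hedged sketch converting every column at once (the column names and values here are made up, not from the question):

```python
import pandas as pd

# Two object-dtype columns of big numbers stored as strings (illustrative values).
df = pd.DataFrame({
    "a": ["12345678901234567890123456789012345678901234567890"],
    "b": ["98765432109876543210987654321098765432109876543210"],
})

# Convert every cell to an exact Python int in one call.
# (Note: applymap was renamed to DataFrame.map in pandas 2.1; applymap still
# works there but is deprecated.)
df = df.applymap(int)

# Arithmetic now works without overflow or precision loss.
print(df["a"].iloc[0] + df["b"].iloc[0])
```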



Start with a simple text file containing just one variable and one line.

%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106

In [268]: df=pd.read_csv('foo.txt')

Pandas will read it as a string, because the number is too large to store as a core numeric type such as int64 or float64. But the information is there; you have not lost anything.

In [269]: df.x
Out[269]: 
0    7724175622144176202888140209281712501772444730...
Name: x, dtype: object

In [270]: type(df.x[0])
Out[270]: str

And you can use plain Python to treat it like a number. Remember the caveats from the links in the comments: it won't be as fast as operations in numpy and pandas where the entire column is stored as int64, since this uses the more flexible but slower object mode to handle things.

You can change the column to be stored as Python long integers like this. (But note that the dtype is still object, because everything but the basic numpy types (int32, int64, float64, etc.) is stored as an object.)



In [271]: df.x = df.x.map(int)

And then you can more or less treat it as a number.

In [272]: df.x * 2
Out[272]: 
0    1544835124428835240577628041856342500354488946...
Name: x, dtype: object

You will need to do some formatting to see the entire number, or go the numpy route, which will show the whole number by default.

In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
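As a sketch of that formatting step, pulling the scalar out of the Series prints every digit, unlike the truncated Series repr (the value below is the sample hash from the question):

```python
import pandas as pd

# Object-dtype Series holding one arbitrary-precision Python int.
s = pd.Series([77241756221441762028881402092817125017724447303212139981668021711613168152184106])

doubled = s * 2  # elementwise arithmetic on Python ints, no overflow

# Printing the scalar itself shows the full number.
print(doubled.iloc[0])
```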
