Best way to compare Pandas dataframe with csv file

I have a series of tests where the Pandas output needs to be compared against a static source file. My preferred format for the baseline file is CSV, for readability and easy maintenance in Git. But if I load the CSV file into a dataframe and use

A.equals(B) 

      

where A is the output dataframe and B is the dataframe loaded from the CSV file, mismatches will inevitably occur, since the CSV file does not record data types and the like. So my rather contrived solution is to write dataframe A to a CSV file, load it back the same way as B, and then check whether the two are equal.
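For reference, the round trip described above can be sketched roughly as follows. The helper name `roundtrip_through_csv`, the column names, and the in-memory CSV text are made up for illustration; a real test would read the baseline file from disk.

```python
import io

import numpy as np
import pandas as pd


def roundtrip_through_csv(df: pd.DataFrame) -> pd.DataFrame:
    """Write df to CSV text and read it back so dtypes are re-inferred."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)


# Hypothetical output frame; "id" is int32 on purpose.
A = pd.DataFrame({"id": np.array([1, 2], dtype="int32"), "name": ["x", "y"]})

# Stand-in for the static CSV file kept in Git.
expected_csv = "id,name\n1,x\n2,y\n"
B = pd.read_csv(io.StringIO(expected_csv))

# Direct comparison fails only because of the dtype (int32 vs inferred int64).
direct_equal = A.equals(B)

# After the round trip both frames went through the same type inference.
roundtrip_equal = roundtrip_through_csv(A).equals(B)

print(direct_equal, roundtrip_equal)
```

The trade-off is that the round trip hides any dtype difference, so it only checks what the CSV text can express.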

Does anyone have a better solution that they have been using for some time without any problem?



3 answers


I came across a solution that works for my case using Pandas testing utilities.

from pandas.testing import assert_frame_equal  # pandas.util.testing was deprecated in 1.0 and removed in 2.0

      



Then call it inside a try block, with check_dtype set to False.

try:
    assert_frame_equal(A, B, check_dtype=False)
    print("The dataframes are the same.")
except AssertionError:  # raised by assert_frame_equal on any mismatch
    print("Please verify data integrity.")
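If this check is needed in several tests, it could be wrapped in a small helper. The sketch below is only one way to do it; `frames_match` and the sample data are hypothetical names, and the CSV is read from a string instead of a file to keep the example self-contained.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(actual: pd.DataFrame, expected: pd.DataFrame) -> bool:
    """Return True if the frames hold the same values, ignoring dtypes."""
    try:
        assert_frame_equal(actual, expected, check_dtype=False)
    except AssertionError:
        return False
    return True


# Stand-in for the baseline CSV; "id" is inferred as an integer column.
expected = pd.read_csv(io.StringIO("id,score\n1,0.5\n2,1.5\n"))

# Hypothetical output frame where "id" ended up as float.
actual = pd.DataFrame({"id": [1.0, 2.0], "score": [0.5, 1.5]})

# Same frame with one value changed, to show a real mismatch.
changed = actual.assign(score=[0.5, 9.9])

print(frames_match(actual, expected))
print(frames_match(changed, expected))
```

In a pytest suite it is usually more informative to call assert_frame_equal directly and let the AssertionError message show which column differs.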

      



If you are worried about the datatypes of the csv file, you can load it as a dataframe with specific datatypes as shown below:

import pandas as pd
B = pd.read_csv('path_to_csv.csv', dtype={"col1": "int", "col2": "float64", "col3": "object"})

      

This ensures that each CSV column is read with the data type you specified.

After that, you can easily compare the data with

A.equals(B)
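A small self-contained illustration of the difference explicit types make is below. The column names match the snippet above, but the data and the `column_types` variable are invented for the example, and the CSV is read from a string as a stand-in for the file.

```python
import io

import pandas as pd

# Stand-in for the CSV file; note that col2 contains no decimal points.
csv_text = "col1,col2,col3\n1,2,a\n3,4,b\n"

column_types = {"col1": "int64", "col2": "float64", "col3": "object"}

# Read once with explicit types and once with inferred types.
B = pd.read_csv(io.StringIO(csv_text), dtype=column_types)
B_inferred = pd.read_csv(io.StringIO(csv_text))

# Hypothetical output frame, cast to the same types for a fair comparison.
A = pd.DataFrame(
    {"col1": [1, 3], "col2": [2.0, 4.0], "col3": ["a", "b"]}
).astype(column_types)

print(A.equals(B))           # explicit dtypes line up
print(A.equals(B_inferred))  # col2 was inferred as integer, so this differs
```

Keeping the dtype mapping in one place (for example next to the baseline file) means the test and the loader cannot drift apart.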

      




EDIT:

If you need to compare many pairs, another option is to compare hash values of the data instead of comparing every row and column of the two dataframes.

hashA = hash(A.values.tobytes())
hashB = hash(B.values.tobytes())

      

Now compare these two hash values, which are integers, to check whether the original dataframes were the same.

Be careful: I'm not sure whether the data types of the original dataframes affect the result, so check this for your own data.
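A quick experiment suggests the data types do matter: the raw bytes of an integer column and of the same values stored as floats are different, so their hashes will normally differ too. The sketch below also shows `pandas.util.hash_pandas_object` as a possible alternative that returns one hash per row; the frame contents are made up for illustration.

```python
import pandas as pd

ints = pd.DataFrame({"x": [1, 2, 3]})
floats = ints.astype("float64")  # same values, different dtype

# The underlying buffers differ, so hashing the bytes would differ as well.
same_bytes = ints.values.tobytes() == floats.values.tobytes()
print(same_bytes)

# Per-row hashes from pandas; index=False hashes the values only.
row_hashes_a = pd.util.hash_pandas_object(ints, index=False)
row_hashes_b = pd.util.hash_pandas_object(ints.copy(), index=False)
print(row_hashes_a.equals(row_hashes_b))
```

Note also that Python's built-in hash() of a bytes object is randomized per process unless PYTHONHASHSEED is fixed, so those values are only comparable within a single run.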



(A != B).any(axis=1)

returns a Series of booleans that tells you which rows differ and which are equal.

Booleans are represented internally as 1 and 0, so you can call sum() to count how many rows were not equal.

(A != B).any(axis=1).sum()

      

If you get 0, all rows are equal.
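One caveat with this approach is that NaN never compares equal to NaN, so rows with missing values in the same position are counted as different. A rough sketch of both the plain count and a NaN-aware variant follows; the sample frames and variable names are invented for the example, and both frames are assumed to have identical row and column labels.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": ["x", "y", "z"]})
B = pd.DataFrame({"a": [1.0, 5.0, np.nan], "b": ["x", "y", "z"]})

# Plain comparison: the last row is flagged because NaN != NaN.
differs = (A != B).any(axis=1)
n_naive = int(differs.sum())

# NaN-aware variant: ignore cells that are missing in both frames.
both_missing = A.isna() & B.isna()
differs_nan_aware = ((A != B) & ~both_missing).any(axis=1)
n_nan_aware = int(differs_nan_aware.sum())

print(n_naive, n_nan_aware)
```

The boolean Series can also be used as a mask, for example A[differs_nan_aware], to inspect the rows that actually changed.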
