How can I format my data like this pickle file?

I am trying to use the PyUpset package, which comes with some test data in a pickle file that can be found here.

I can run the following code to view the content and data format.

from pickle import load
with open('./test_data_dict.pckl', 'rb') as f:
   data_dict = load(f)
data_dict

      

This shows that the data is in the following format (only an excerpt is reproduced here):

   [495 rows X 4 columns],
    'adventure':          title rating_avg \
        0                20,000 Leagues Under the Sea (1954)    3.702609    
        1                 7th Voyage of Sinbad, The (1958)      3.616279

             rating_std views
        0     0.869685    575  
        1     0.931531    258  

     [281 rows x 4 columns],
    'romance':          title rating_avg \
        0                'Til There Was You (1997)    2.402609    
        1                 1-900 (1994)                2.411279

             rating_std views
        0     0.669685    575  
        1     0.981310    245  
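
Judging from the printed output, the pickle appears to hold a dict that maps a genre name to a DataFrame. A minimal sketch for confirming that kind of structure without printing every row follows. The `data_dict` here is a hand-built stand-in containing only the rows visible above, and `summarize` is a hypothetical helper, not part of PyUpset.

```python
import pandas as pd

# Stand-in for the unpickled object (assumption: a dict of DataFrames).
# Only the rows visible in the printed excerpt are reproduced.
data_dict = {
    "adventure": pd.DataFrame({
        "title": ["20,000 Leagues Under the Sea (1954)",
                  "7th Voyage of Sinbad, The (1958)"],
        "rating_avg": [3.702609, 3.616279],
        "rating_std": [0.869685, 0.931531],
        "views": [575, 258],
    }),
    "romance": pd.DataFrame({
        "title": ["'Til There Was You (1997)", "1-900 (1994)"],
        "rating_avg": [2.402609, 2.411279],
        "rating_std": [0.669685, 0.981310],
        "views": [575, 245],
    }),
}


def summarize(d):
    """Return {key: (type name, shape, column names)} for each value in d."""
    return {
        key: (type(value).__name__, value.shape, list(value.columns))
        for key, value in d.items()
    }


summary = summarize(data_dict)
for name, info in summary.items():
    print(name, info)
```

Running the same helper on the real `data_dict` loaded from the pickle should show whether every value is a DataFrame with the same four columns.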

      

I am trying to format my CSV data the same way, and the closest I could get was the following, using pandas.

The CSV file has the following format:

Type_A,Type_B,Type_C
x1,x2,x3
y1,y2,y3

      

I read it into a DataFrame with pandas and pull out each column as a list, so that the lists can be merged again after adding an index:

import pandas as pd
df=pd.read_csv(csv_file)
d1=df.Type_A.tolist()
d2=df.Type_B.tolist()
d3=df.Type_C.tolist()

      

Then enumerate() is used to add the index:

d1_df=list(enumerate(d1, 1))
d2_df=list(enumerate(d2, 1))
d3_df=list(enumerate(d3, 1))
d1_df  # this gives me [(1, 'x1'), (2, 'y1')]

      

Next, I added the labels Id and Value to each DataFrame:

labels = ['Id','Value']
d1_df = pd.DataFrame.from_records(d1_df, columns=labels)
d2_df = pd.DataFrame.from_records(d2_df, columns=labels)
d3_df = pd.DataFrame.from_records(d3_df, columns=labels)


d1_df  # this gives me:
       #    Id Value
       # 0   1    x1
       # 1   2    y1

      

Then I merged all three into one DataFrame, using Type_A, Type_B and Type_C as keys:

child_df = [d1_df, d2_df, d3_df]
labels2 = ['Type_A','Type_B','Type_C']

# reuse labels2 as the outer index level instead of repeating the list
parent_df = pd.concat(child_df, keys=labels2)

parent_df  # output below


#          Id Value
#Type_A 0   1    x1
#       1   2    y1
#Type_B 0   1    x2
#       1   2    y2
#Type_C 0   1    x3
#       1   2    y3

      

This is where I am stuck. I think I am using the wrong approach, and there should be an easier way to get the data into the format used by PyUpset.


2 answers


I think you need to rearrange the table so that it is in "long" format. After that, you can use the groupby method in pandas to create the correct dictionary for pyupset.



import pandas as pd
try:
    # for Python 2.x
    from StringIO import StringIO
except ImportError:
    # for Python 3.x
    from io import StringIO

test_string = StringIO("""Type_A,Type_B,Type_C
x1,x2,x3
y1,y2,y3""")

df = pd.read_csv(test_string)
df = pd.melt(df, var_name='type')
# df now looks like this:
#
#    type      value
# 0  Type_A    x1
# 1  Type_A    y1
# 2  Type_B    x2
# 3  Type_B    y2
# 4  Type_C    x3
# 5  Type_C    y3

pyupset_data = {key: df.loc[value] for key, value in df.groupby("type").groups.items()}
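
As an alternative sketch, the same kind of dict can be built straight from the wide table, one single-column DataFrame per CSV column, without melting first. The helper name `columns_to_frames` and the shared column name `value` are assumptions chosen for illustration. Whether PyUpset needs further columns or a specific column name is not confirmed here.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question, read from an in-memory string.
csv_text = StringIO("Type_A,Type_B,Type_C\nx1,x2,x3\ny1,y2,y3")
wide = pd.read_csv(csv_text)


def columns_to_frames(wide_df, value_name="value"):
    """Map each column name to a one-column DataFrame holding its values.

    Every resulting frame uses the same column name (value_name), so the
    frames can later be compared or intersected on that column.
    """
    return {
        col: (wide_df[[col]]
              .rename(columns={col: value_name})
              .dropna()                 # columns of unequal length are padded with NaN
              .reset_index(drop=True))
        for col in wide_df.columns
    }


frames = columns_to_frames(wide)
print(frames["Type_B"])
```

The `dropna()` call only matters when the CSV columns have different lengths; with the sample data above, nothing is dropped.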

      


I think it is actually just a simple Python dict with values as integer data. The key is the title you want on the bottom line.


