How can I format my data like this pickle?
I am trying to use the PyUpset package, and it ships with some test data in a pickle file, which can be found here.
I can run the following code to view its content and data format:
from pickle import load
with open('./test_data_dict.pckl', 'rb') as f:
    data_dict = load(f)
data_dict
This showed that the data is in the following format (this is only an excerpt of the output):
[495 rows x 4 columns],
'adventure': title rating_avg \
0 20,000 Leagues Under the Sea (1954) 3.702609
1 7th Voyage of Sinbad, The (1958) 3.616279
rating_std views
0 0.869685 575
1 0.931531 258
[281 rows x 4 columns],
'romance': title rating_avg \
0 'Til There Was You (1997) 2.402609
1 1-900 (1994) 2.411279
rating_std views
0 0.669685 575
1 0.981310 245
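Reading the printed output, the pickle appears to hold a plain dict that maps a set name (such as 'adventure') to a pandas DataFrame with the columns title, rating_avg, rating_std and views. A minimal sketch of that structure, using only the two rows per set visible above (the name example_dict is made up for illustration, and the real frames have many more rows):

```python
import pandas as pd

# Rows copied from the printed excerpt; the real frames are much longer.
example_dict = {
    "adventure": pd.DataFrame({
        "title": ["20,000 Leagues Under the Sea (1954)",
                  "7th Voyage of Sinbad, The (1958)"],
        "rating_avg": [3.702609, 3.616279],
        "rating_std": [0.869685, 0.931531],
        "views": [575, 258],
    }),
    "romance": pd.DataFrame({
        "title": ["'Til There Was You (1997)", "1-900 (1994)"],
        "rating_avg": [2.402609, 2.411279],
        "rating_std": [0.669685, 0.981310],
        "views": [575, 245],
    }),
}

# Summarise each entry: set name, frame shape, column names.
for name, frame in example_dict.items():
    print(name, frame.shape, list(frame.columns))
```

So the target is one DataFrame per set name, not a single combined frame.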
I am trying to format my CSV data this way, and the closest I could get was using pandas, as follows.
The CSV file is in the following format:
Type_A,Type_B,Type_C
x1,x2,x3
y1,y2,y3
I read it into a DataFrame with pandas and pull each column out as a list:
import pandas as pd
df = pd.read_csv(csv_file)
d1 = df.Type_A.tolist()
d2 = df.Type_B.tolist()
d3 = df.Type_C.tolist()
Then enumerate() is used to add the index:
d1_df=list(enumerate(d1, 1))
d2_df=list(enumerate(d2, 1))
d3_df=list(enumerate(d3, 1))
d1_df # this gives me [(1, 'x1'), (2, 'y1')]
Next I added the labels Id and Value and turned each list into a DataFrame:
labels = ['Id','Value']
d1_df = pd.DataFrame.from_records(d1_df, columns=labels)
d2_df = pd.DataFrame.from_records(d2_df, columns=labels)
d3_df = pd.DataFrame.from_records(d3_df, columns=labels)
d1_df # this gives me Id Value
# 0 1 x1
# 1 2 y1
Then I merged all three into one DataFrame, keyed by Type_A, Type_B and Type_C:
child_df = [d1_df, d2_df, d3_df]
labels2 = ['Type_A', 'Type_B', 'Type_C']
parent_df = pd.concat(child_df, keys=labels2)
parent_df # out below
# Id Value
#Type_A 0 1 x1
# 1 2 y1
#Type_B 0 1 x2
# 1 2 y2
#Type_C 0 1 x3
# 1 2 y3
This is where I am stuck. I think I am using the wrong approach, and it should be easier to get the data into the format used by PyUpset.
I think you need to reshape the table so that it is in "long" format. After that, you can use the groupby method in pandas to build the dictionary that pyupset expects.
import pandas as pd
try:
    # for Python 2.x
    from StringIO import StringIO
except ImportError:
    # for Python 3.x
    from io import StringIO
test_string = StringIO("""Type_A,Type_B,Type_C
x1,x2,x3
y1,y2,y3""")
df = pd.read_csv(test_string)
df = pd.melt(df, var_name='type')
# df now looks like this:
#
# type value
# 0 Type_A x1
# 1 Type_A y1
# 2 Type_B x2
# 3 Type_B y2
# 4 Type_C x3
# 5 Type_C y3
pyupset_data = {key: df.loc[value] for key, value in df.groupby("type").groups.items()}
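A slightly different way to build the same kind of dictionary is to iterate over the groupby object directly, which yields (name, sub-frame) pairs. The sketch below is Python 3 only and also drops the now-redundant type column and resets each index. Whether pyupset needs that cleanup is an assumption, not something verified here, and the names wide, long_df and frames are made up for illustration.

```python
import io

import pandas as pd

# Same sample CSV as in the question.
csv_text = "Type_A,Type_B,Type_C\nx1,x2,x3\ny1,y2,y3\n"
wide = pd.read_csv(io.StringIO(csv_text))

# Wide -> long: one row per (type, value) pair.
long_df = wide.melt(var_name="type", value_name="value")

# One DataFrame per type; drop the grouping column and renumber rows from 0.
frames = {
    name: group.drop(columns="type").reset_index(drop=True)
    for name, group in long_df.groupby("type")
}

for name, frame in frames.items():
    print(name, frame["value"].tolist())
```

Each value in frames is then a small DataFrame with a single value column, keyed by the original column name.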