Convert Pandas DataFrame data format to LIBFM txt format
I want to convert a Pandas DataFrame in Python to a sparse text file in LIBFM format.
The format should look like this:
4 0:1.5 3:-7.9
2 1:1e-5 3:2
-1 6:1
This file contains three cases. The first column states the target of each case: 4 for the first case, 2 for the second and -1 for the third. After the target, each line lists the nonzero elements of x, where an entry like 0:1.5 reads x0 = 1.5, 3:-7.9 means x3 = -7.9, and so on. That is, in INDEX:VALUE the left side is the index within x and the right side is the value of x at that index.
Overall, the data from the example describes the following design matrix X and target vector y:
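\[
X = \begin{pmatrix}
1.5 & 0.0 & 0.0 & -7.9 & 0.0 & 0.0 & 0.0 \\
0.0 & 10^{-5} & 0.0 & 2.0 & 0.0 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 1.0
\end{pmatrix},
\qquad
y = \begin{pmatrix} 4 \\ 2 \\ -1 \end{pmatrix}
\]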
This is also explained in Chapter 2 of the libFM manual.
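To make the format concrete, here is a minimal sketch that turns the dense example above back into the three text lines. The helper name to_libfm_line is made up for illustration, and Python's "g" formatting prints 1e-5 as 1e-05, which is the same number.

```python
def to_libfm_line(target, values):
    """Format one dense row as 'target index:value ...', skipping zeros."""
    parts = [f"{target:g}"]
    for index, value in enumerate(values):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Design matrix and targets from the example above.
X = [
    [1.5, 0.0, 0.0, -7.9, 0.0, 0.0, 0.0],
    [0.0, 1e-5, 0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
y = [4, 2, -1]

lines = [to_libfm_line(target, row) for target, row in zip(y, X)]
print("\n".join(lines))
```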
Now here's my problem: I have a Pandas DataFrame that looks like this:
overall reviewerID asin brand Positive Negative \
0 5.0 A2XVJBSRI3SWDI 0000031887 Boutique Cutie 3.0 -1
1 4.0 A2G0LNLN79Q6HR 0000031887 Boutique Cutie 5.0 -2
2 2.0 A2R3K1KX09QBYP 0000031887 Boutique Cutie 3.0 -2
3 1.0 A19PBP93OF896 0000031887 Boutique Cutie 2.0 -3
4 4.0 A1P0IHU93EF9ZK 0000031887 Boutique Cutie 2.0 -2
LDA_0 LDA_1 ... LDA_98 LDA_99
0 0.000833 0.000833 ... 0.000833 0.000833
1 0.000769 0.000769 ... 0.000769 0.000769
2 0.000417 0.000417 ... 0.000417 0.000417
3 0.000137 0.014101 ... 0.013836 0.000137
4 0.000625 0.000625 ... 0.063125 0.000625
Here "overall" is the target column and all the other 105 columns are features.
The columns "reviewerID", "asin" and "brand" should be replaced with dummy variables, so each unique "reviewerID", "asin" and "brand" value gets its own column. For example, if "reviewerID" has 100 unique values, you get 100 columns, where the value is 1 if that row belongs to that particular reviewer and 0 otherwise.
All other columns do not need reformatting, so the index for these columns can simply be the column number.
So the first 3 rows of the Pandas DataFrame above need to be converted to the following output:
5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.000833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.000769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417
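The index layout behind these lines can be sketched roughly as follows. This assumes that each dummy-encoded column takes one index per unique value, in column order, and that every other column takes a single index. The helper name index_layout is made up, and only two of the 100 LDA columns are included to keep the sample small.

```python
import pandas as pd

# The five sample rows from above, with only LDA_0 and LDA_1 kept.
df = pd.DataFrame({
    "overall": [5.0, 4.0, 2.0, 1.0, 4.0],
    "reviewerID": ["A2XVJBSRI3SWDI", "A2G0LNLN79Q6HR", "A2R3K1KX09QBYP",
                   "A19PBP93OF896", "A1P0IHU93EF9ZK"],
    "asin": ["0000031887"] * 5,
    "brand": ["Boutique Cutie"] * 5,
    "Positive": [3.0, 5.0, 3.0, 2.0, 2.0],
    "Negative": [-1, -2, -2, -3, -2],
    "LDA_0": [0.000833, 0.000769, 0.000417, 0.000137, 0.000625],
    "LDA_1": [0.000833, 0.000769, 0.000417, 0.014101, 0.000625],
})


def index_layout(df, target, categorical):
    """Map each feature column to its (first, last) feature index."""
    layout = {}
    next_index = 0
    for col in df.columns:
        if col == target:
            continue
        # Dummy-encoded columns span one index per unique value.
        width = int(df[col].nunique()) if col in categorical else 1
        layout[col] = (next_index, next_index + width - 1)
        next_index += width
    return layout


layout = index_layout(df, "overall", ["reviewerID", "asin", "brand"])
for col, (first, last) in layout.items():
    print(col, first, last)
```

With these five rows, "reviewerID" covers indices 0 to 4, "asin" is 5, "brand" is 6, "Positive" is 7, "Negative" is 8, and the LDA columns start at 9, so with all 100 LDA columns the last one lands on 108.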
The LIBFM package has a program that can convert User-Item-Rating data to the LIBFM output format. However, that program cannot handle this set of columns.
Is there an easy way to do this? I only have 1 million rows.
The LibFM executable expects input in the libSVM format you described here. If the file converter in the LibFM package does not work for your data, try scikit-learn's sklearn.datasets.dump_svmlight_file function.
Link: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
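A minimal sketch of how that could look, assuming the column names from the question and a small sample of its rows. It uses OneHotEncoder for the dummy columns (my choice, not something the question prescribes) and writes to an in-memory buffer just so the result can be inspected; in practice you would pass a file path instead. Note that OneHotEncoder sorts the categories, so the reviewer indices follow sorted order rather than order of appearance.

```python
import io

import pandas as pd
from scipy import sparse
from sklearn.datasets import dump_svmlight_file
from sklearn.preprocessing import OneHotEncoder

# Sample rows from the question, with only LDA_0 and LDA_1 kept.
df = pd.DataFrame({
    "overall": [5.0, 4.0, 2.0, 1.0, 4.0],
    "reviewerID": ["A2XVJBSRI3SWDI", "A2G0LNLN79Q6HR", "A2R3K1KX09QBYP",
                   "A19PBP93OF896", "A1P0IHU93EF9ZK"],
    "asin": ["0000031887"] * 5,
    "brand": ["Boutique Cutie"] * 5,
    "Positive": [3.0, 5.0, 3.0, 2.0, 2.0],
    "Negative": [-1, -2, -2, -3, -2],
    "LDA_0": [0.000833, 0.000769, 0.000417, 0.000137, 0.000625],
    "LDA_1": [0.000833, 0.000769, 0.000417, 0.014101, 0.000625],
})

target = "overall"
categorical = ["reviewerID", "asin", "brand"]
numeric = [c for c in df.columns if c != target and c not in categorical]

# One-hot encode the ID-like columns; the result is a SciPy sparse matrix.
onehot = OneHotEncoder().fit_transform(df[categorical])

# Append the numeric columns unchanged, so their indices follow the dummies.
X = sparse.hstack(
    [onehot, sparse.csr_matrix(df[numeric].to_numpy(dtype=float))],
    format="csr",
)
y = df[target].to_numpy()

# Write libSVM/libFM text; replace the buffer with a path for real data.
buffer = io.BytesIO()
dump_svmlight_file(X, y, buffer, zero_based=True)
lines = buffer.getvalue().decode().splitlines()
print("\n".join(lines))
```

Because the matrix stays sparse throughout, the one-hot columns for 1 million rows never have to be materialized as a dense array.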