Convert Pandas DataFrame data format to LIBFM txt format

I want to convert a Pandas dataframe in Python to a sparse text file in LIBFM format.

Here the format should look like this:

4   0:1.5   3:-7.9
2   1:1e-5  3:2
-1  6:1

      

This file contains three cases. The first column states the target of each of the three cases: i.e. 4 for the first case, 2 for the second and -1 for the third. After the target, each line contains the nonzero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = -7.9, and so on. That is, the left side of INDEX:VALUE states the index within x, whereas the right side states the value of x at that index.

In total, the data from the example describes the following design matrix X and target vector y:

   1.5  0.0   0.0  -7.9  0.0  0.0  0.0
X: 0.0  1e-5  0.0  2.0   0.0  0.0  0.0
   0.0  0.0   0.0  0.0   0.0  0.0  1.0

   4
y: 2
  -1

      

This is also explained in Chapter 2 of the LIBFM manual.

Now here's my problem: I have a Pandas dataframe that looks like this:
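
For reference, here is a minimal sketch of how such lines map to X and y. The function name `parse_libfm_lines` is made up for illustration and is not part of LIBFM; it only assumes the "target index:value ..." layout described above and builds dense Python lists, which is fine for three lines but not for real data.

```python
def parse_libfm_lines(lines):
    """Parse 'target index:value ...' lines into (X, y) as plain lists."""
    targets, rows = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        targets.append(float(parts[0]))
        row = {}
        for token in parts[1:]:
            index, value = token.split(":")
            row[int(index)] = float(value)
        rows.append(row)
    # The number of columns is one more than the largest index seen.
    n_cols = 1 + max((max(r) for r in rows if r), default=-1)
    # Indices that do not appear on a line are implicit zeros.
    X = [[r.get(j, 0.0) for j in range(n_cols)] for r in rows]
    return X, targets


example = ["4   0:1.5   3:-7.9", "2   1:1e-5  3:2", "-1  6:1"]
X, y = parse_libfm_lines(example)
for target, row in zip(y, X):
    print(target, row)
```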

  overall reviewerID        asin       brand         Positive Negative  \
0  5.0   A2XVJBSRI3SWDI  0000031887  Boutique Cutie     3.0       -1
1  4.0   A2G0LNLN79Q6HR  0000031887  Boutique Cutie     5.0       -2
2  2.0   A2R3K1KX09QBYP  0000031887  Boutique Cutie     3.0       -2
3  1.0   A19PBP93OF896   0000031887  Boutique Cutie     2.0       -3
4  4.0   A1P0IHU93EF9ZK  0000031887  Boutique Cutie     2.0       -2

  LDA_0     LDA_1      ...    LDA_98      LDA_99
0  0.000833  0.000833  ...    0.000833    0.000833
1  0.000769  0.000769  ...    0.000769    0.000769
2  0.000417  0.000417  ...    0.000417    0.000417
3  0.000137  0.014101  ...    0.013836    0.000137
4  0.000625  0.000625  ...    0.063125    0.000625

      

Here "overall" is the target column and all the other 105 columns are features.

The columns "reviewerID", "asin" and "brand" should be replaced with dummy variables. That is, each unique "reviewerID", "asin" and "brand" value gets its own column. This means that if "reviewerID" has 100 unique values, you get 100 columns, where the value is 1 if that row belongs to that particular reviewer, and 0 otherwise.

All other columns do not need reformatting. Thus, the index for these columns can simply be the column number.
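
To make the encoding I have in mind concrete, here is a small sketch using `pandas.get_dummies` on three rows and a subset of the columns. It only illustrates the dummy-variable idea: it produces a dense frame, so I would not expect it to scale to the full data, and as far as I know `get_dummies` orders the new columns by sorted value rather than by first appearance, so the column positions differ from the indices in my expected output.

```python
import pandas as pd

# Toy frame with the column names from the dataframe above (first three rows).
df = pd.DataFrame({
    "reviewerID": ["A2XVJBSRI3SWDI", "A2G0LNLN79Q6HR", "A2R3K1KX09QBYP"],
    "asin": ["0000031887"] * 3,
    "brand": ["Boutique Cutie"] * 3,
    "Positive": [3.0, 5.0, 3.0],
    "Negative": [-1, -2, -2],
})

# One indicator column per unique value; the numeric columns pass through.
encoded = pd.get_dummies(df, columns=["reviewerID", "asin", "brand"], dtype=int)
print(encoded.columns.tolist())
print(encoded.shape)
```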

So, the first 3 rows of the Pandas dataframe above need to be converted to the following output:

5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.000833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.000769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417

      

The LIBFM package has a program that can convert User-Item-Rating data to the LIBFM input format. However, this program cannot handle this set of columns.

Is there an easy way to do this? I only have 1 million rows.

1 answer


The LibFM executable expects input in the libSVM format that you described above. If the file converter in the LibFM package does not work for your data, try scikit-learn's sklearn.datasets.dump_svmlight_file function.



Link: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
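
A rough sketch of how that could look for the dataframe in the question follows. The helper name `to_libfm` is made up for illustration, and the approach rests on a few assumptions: the categorical columns have no missing values, every other non-target column is numeric, and the feature order (categorical blocks first, then the numeric columns) is the one you want. It uses `pandas.factorize` so that dummy indices follow order of first appearance, and a SciPy sparse matrix so the dummy columns are never materialised densely.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.datasets import dump_svmlight_file


def to_libfm(df, target, categorical, f):
    """Write df to f in libSVM text format (hypothetical helper)."""
    n_rows = len(df)
    row_ids = np.arange(n_rows)
    blocks = []
    for col in categorical:
        # factorize assigns codes 0, 1, 2, ... in order of first appearance.
        codes, uniques = pd.factorize(df[col])
        ones = np.ones(n_rows)
        blocks.append(sparse.csr_matrix((ones, (row_ids, codes)),
                                        shape=(n_rows, len(uniques))))
    # Remaining columns are assumed numeric and are copied as-is.
    numeric = [c for c in df.columns if c != target and c not in categorical]
    blocks.append(sparse.csr_matrix(df[numeric].to_numpy(dtype=float)))
    X = sparse.hstack(blocks, format="csr")
    dump_svmlight_file(X, df[target].to_numpy(), f, zero_based=True)


# Three rows shaped like the question's data, with a single LDA column.
df = pd.DataFrame({
    "overall": [5.0, 4.0, 2.0],
    "reviewerID": ["A2XVJBSRI3SWDI", "A2G0LNLN79Q6HR", "A2R3K1KX09QBYP"],
    "asin": ["0000031887"] * 3,
    "brand": ["Boutique Cutie"] * 3,
    "Positive": [3.0, 5.0, 3.0],
    "Negative": [-1, -2, -2],
    "LDA_0": [0.000833, 0.000769, 0.000417],
})

buf = io.BytesIO()  # a file path string should work here as well
to_libfm(df, "overall", ["reviewerID", "asin", "brand"], buf)
print(buf.getvalue().decode())
```

With only three reviewers in this toy frame, the asin column lands at index 3 rather than 5, so the indices will not match the question's expected output exactly. Zero values in the numeric columns are dropped from the output, which is what the sparse format intends.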
