Converting a CSV with categorical data to libsvm format

I am using Spark MLlib to create machine learning models. I need to provide files in libsvm format as input when there are categorical variables in the data.

I tried to convert the CSV file to libsvm using:

1. convert.c, as shown on the libsvm site
2. csv2libsvm.py from the phraug repository on GitHub

But neither of these seems to convert categorical data. I also installed Weka and tried to save the data in libsvm format, but I could not find that option in the Weka Explorer.

Please suggest any other way to convert a CSV with categorical data to libsvm format, or let me know if I'm missing something here.

Thanks in advance for the help.

2 answers


I think you want to train an SVM, which needs an RDD[LabeledPoint] as input:

https://spark.apache.org/docs/1.4.1/api/scala/#org.apache.spark.mllib.classification.SVMWithSGD

I suggest you handle your categorical columns the same way as the second answer here:

How do I convert a categorical variable in Spark to a set of columns encoded as {0,1}?

The LogisticRegression case there is very similar to the SVM one.
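
As a rough pyspark sketch of that approach (the DataFrame df and all column names here are hypothetical; it assumes the Spark 1.x API the link above targets, where OneHotEncoder is a plain Transformer and pyspark.ml still produces mllib-compatible vectors; on Spark 2+ you would pass the features through pyspark.mllib.linalg.Vectors.fromML first):

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# index the string column, then one-hot encode the indices into a {0,1} vector
indexer = StringIndexer(inputCol="country", outputCol="country_idx")
encoder = OneHotEncoder(inputCol="country_idx", outputCol="country_vec")
assembler = VectorAssembler(inputCols=["country_vec"], outputCol="features")

encoded = encoder.transform(indexer.fit(df).transform(df))
assembled = assembler.transform(encoded)

# SVMWithSGD wants an RDD[LabeledPoint]
points = assembled.rdd.map(lambda r: LabeledPoint(r.label, r.features))
model = SVMWithSGD.train(points, iterations=100)

# and if you need actual libsvm files on disk:
MLUtils.saveAsLibSVMFile(points, "categorical_libsvm")  # hypothetical output path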


You can try the hashing trick to convert the categorical features to numbers, and then convert the DataFrame to an RDD and map a conversion function over each row. The following toy example is done with pyspark.

For example, the DataFrame to convert is df:

>> df.show(5)

+------+----------------+-------+-------+
|gender|            city|country|     os|
+------+----------------+-------+-------+
|     M|         chennai|     IN|ANDROID|
|     F|       hyderabad|     IN|ANDROID|
|     M|leighton buzzard|     GB|ANDROID|
|     M|          kanpur|     IN|ANDROID|
|     F|       lafayette|     US|    IOS|
+------+----------------+-------+-------+

      

I want to use the features city, country, and os to predict gender.



import hashlib

from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import SparseVector

spark = SparkSession \
    .builder \
    .appName("Spark-app") \
    .getOrCreate()  # create the spark session

# total number of hash buckets; use a big number if you have many distinct
# categories and many categorical features, but keep memory in mind
NR_BINS = 100000

def hashnum(s):
    # hash a string into a bucket index in [1, NR_BINS]
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16) % NR_BINS + 1

def libsvm_converter(row):
    target = 'gender'
    features = ['city', 'country', 'os']
    if row[target] == 'M':
        lab = 1
    elif row[target] == 'F':
        lab = 0
    else:
        return None  # drop rows whose label is neither M nor F
    sparse_vector = []
    for f in features:
        v = '{}-{}'.format(f, row[f])
        hashv = hashnum(v)  # the index in libsvm
        sparse_vector.append((hashv, 1.0))  # the value is always 1 for a categorical feature
    sparse_vector = list(set(sparse_vector))  # in case there are clashes (NR_BINS not big enough)
    return Row(label=lab, features=SparseVector(NR_BINS, sparse_vector))


libsvm = df.rdd.map(libsvm_converter).filter(lambda r: r is not None)
data = spark.createDataFrame(libsvm)

      

If you check the data, it looks like this:

>> data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(100000,[12626,68...|    1|
|(100000,[59866,68...|    0|
|(100000,[66386,68...|    1|
|(100000,[53746,68...|    1|
|(100000,[6966,373...|    0|
+--------------------+-----+
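
If you also need actual libsvm text files on disk, as the question asks, one option is to map the rows to mllib LabeledPoints and use MLUtils.saveAsLibSVMFile. A minimal sketch, assuming Spark 2+ (for Vectors.fromML), the data DataFrame built above, and a hypothetical output directory:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# saveAsLibSVMFile needs mllib vectors, so convert each ml SparseVector first
points = data.rdd.map(lambda r: LabeledPoint(r.label, Vectors.fromML(r.features)))
MLUtils.saveAsLibSVMFile(points, "gender_libsvm")  # hypothetical output directory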

      







