Converting a CSV with categorical data to libsvm
I am using Spark MLlib to create machine-learning models. I need to provide input files in libsvm format when the data contains categorical variables.
I tried to convert the csv file to libsvm using:
1. convert.c, as shown on the libsvm site
2. csv2libsvm.py from phraug on GitHub
But neither of these seems to handle categorical data. I also installed Weka and tried to save in libsvm format, but I could not find that option in the Weka Explorer.
Please suggest another way to convert a csv with categorical data to libsvm format, or let me know if I'm missing something here.
Thanks in advance for the help.
I think you want to train an SVM. It needs an RDD[LabeledPoint] as input:
https://spark.apache.org/docs/1.4.1/api/scala/#org.apache.spark.mllib.classification.SVMWithSGD
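For reference, training would then look something like this (a minimal sketch in pyspark, assuming you already have an RDD[LabeledPoint] named training; the iteration count is arbitrary):

from pyspark.mllib.classification import SVMWithSGD

# 'training' is assumed to be an RDD of LabeledPoint built from your data
model = SVMWithSGD.train(training, iterations=100)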
I suggest you handle your categorical columns as in the second answer here:
How do I convert a categorical variable in Spark to a set of columns encoded as {0,1}?
The LogisticRegression case there is very similar to SVM.
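For completeness, here is a minimal sketch of that one-hot encoding with pyspark.ml's StringIndexer and OneHotEncoder. It assumes Spark 3.x (where OneHotEncoder takes inputCols/outputCols) and a DataFrame df with the hypothetical categorical columns listed below:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cat_cols = ['city', 'country', 'os']  # hypothetical column names
# map each string column to a numeric index, keeping unseen values
indexers = [StringIndexer(inputCol=c, outputCol=c + '_idx', handleInvalid='keep')
            for c in cat_cols]
# one-hot encode the indices, then assemble everything into one feature vector
encoder = OneHotEncoder(inputCols=[c + '_idx' for c in cat_cols],
                        outputCols=[c + '_vec' for c in cat_cols])
assembler = VectorAssembler(inputCols=[c + '_vec' for c in cat_cols],
                            outputCol='features')
encoded = Pipeline(stages=indexers + [encoder, assembler]).fit(df).transform(df)

The resulting features column can then be turned into LabeledPoint rows and fed to SVMWithSGD.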
You can try the hashing trick to convert categorical features to numbers, then turn the DataFrame into an RDD by mapping a converter function over each row. The following toy example uses pyspark.
For example, the DataFrame to convert is df:
>> df.show(5)
+------+----------------+-------+-------+
|gender| city|country| os|
+------+----------------+-------+-------+
| M| chennai| IN|ANDROID|
| F| hyderabad| IN|ANDROID|
| M|leighton buzzard| GB|ANDROID|
| M| kanpur| IN|ANDROID|
| F| lafayette| US| IOS|
+------+----------------+-------+-------+
I want to use the features city, country, and os to predict gender.
import hashlib

from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import SparseVector

spark = SparkSession \
    .builder \
    .appName("Spark-app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()  # create the spark session

# Total number of hash bins. Make it large if you have many distinct categories
# per feature and many categorical features, but keep memory in mind.
NR_BINS = 100000

def hashnum(s):
    # hash a string to an integer index in [1, NR_BINS]
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16) % NR_BINS + 1

def libsvm_converter(row):
    target = 'gender'
    features = ['city', 'country', 'os']
    if row[target] == 'M':
        lab = 1
    elif row[target] == 'F':
        lab = 0
    else:
        return None
    sparse_vector = []
    for f in features:
        v = '{}-{}'.format(f, row[f])
        hashv = hashnum(v)  # the index in libsvm
        sparse_vector.append((hashv, 1))  # the value is always 1 for a categorical feature
    sparse_vector = list(set(sparse_vector))  # in case of clashes (NR_BINS not big enough)
    return Row(label=lab, features=SparseVector(NR_BINS, sparse_vector))

libsvm = df.rdd.map(libsvm_converter).filter(lambda r: r is not None)
data = spark.createDataFrame(libsvm)
If you check the data, it looks like this:
>> data.show()
+--------------------+-----+
| features|label|
+--------------------+-----+
|(100000,[12626,68...| 1|
|(100000,[59866,68...| 0|
|(100000,[66386,68...| 1|
|(100000,[53746,68...| 1|
|(100000,[6966,373...| 0|
+--------------------+-----+
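If you actually need libsvm-formatted files on disk (as in the question) rather than a DataFrame, one way is to convert the rows to mllib LabeledPoints and write them with MLUtils.saveAsLibSVMFile. A sketch, assuming the data DataFrame from above (the output path is hypothetical):

from pyspark.mllib.linalg import SparseVector as OldSparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# rebuild each row as a LabeledPoint with an mllib (not ml) SparseVector
points = data.rdd.map(lambda r: LabeledPoint(
    r['label'],
    OldSparseVector(r['features'].size, r['features'].indices, r['features'].values)))
MLUtils.saveAsLibSVMFile(points, 'categorical_libsvm')  # hypothetical output directory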