Column features must be of type org.apache.spark.ml.linalg.VectorUDT

I want to run this code in PySpark (Spark 2.1.1):

from pyspark.ml.feature import PCA

bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDf)
pcaResult = pcaModel.transform(bankDf).select("label", "pcaFeatures")
pcaResult.show(truncate=False)


But I am getting this error:

requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.


1 answer


You can find an example here:

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

... other code ...
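
For completeness, the elided part typically applies the fitted model and inspects the result; a minimal sketch continuing the example above, using the same df and model:

# Apply the fitted PCA model and show the projected features
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)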


As you can see above, df is a DataFrame that contains Vectors.sparse() and Vectors.dense(), which are imported from pyspark.ml.linalg.

Your bankDf may contain Vectors imported from pyspark.mllib.linalg.
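
A quick way to check which vector type the column actually holds (a sketch, assuming the DataFrame is called bankDf and the column is named "features"):

from pyspark.ml.linalg import VectorUDT

# True -> the new ml vectors; False -> most likely the old mllib type
isinstance(bankDf.schema["features"].dataType, VectorUDT)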

So you have to change the code that builds your data so that the Vectors are imported with:

from pyspark.ml.linalg import Vectors 


instead of:

from pyspark.mllib.linalg import Vectors
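
If changing the import is not practical (for example, because the vectors come out of an existing spark.mllib pipeline), you can also convert an existing DataFrame column in place; a minimal sketch, assuming the column is named "features":

from pyspark.mllib.util import MLUtils

# Converts the old mllib vector column(s) to the new ml vector type
bankDf = MLUtils.convertVectorColumnsToML(bankDf, "features")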


You may also be interested in this related question on Stack Overflow.
