Column features must be of type org.apache.spark.ml.linalg.VectorUDT

I want to run this code in PySpark (Spark 2.1.1):

from pyspark.ml.feature import PCA

bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDf)
pcaResult = pcaModel.transform(bankDf).select("label", "pcaFeatures")
pcaResult.show(truncate=False)


But I am getting this error:

requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.


1 answer


You can find an example here:

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

... other code ...
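
For completeness, the elided part typically applies the fitted model and inspects the result; a minimal sketch continuing the example above, using the same df and model:

# Apply the fitted PCA model and show the projected features
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)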


As you can see above, df is a DataFrame that contains Vectors.sparse() and Vectors.dense(), which are imported from pyspark.ml.linalg.

Your bankDf may contain Vectors imported from pyspark.mllib.linalg.
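
A quick way to check which vector type the column actually holds (a sketch, assuming the DataFrame is called bankDf and the column is named "features"):

from pyspark.ml.linalg import VectorUDT

# True -> the new ml vectors; False -> most likely the old mllib type
isinstance(bankDf.schema["features"].dataType, VectorUDT)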

So you have to change the code that builds your data so that the Vectors are imported with:

from pyspark.ml.linalg import Vectors 


instead of:

from pyspark.mllib.linalg import Vectors
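
If changing the import is not practical (for example, because the vectors come out of an existing spark.mllib pipeline), you can also convert an existing DataFrame column in place; a minimal sketch, assuming the column is named "features":

from pyspark.mllib.util import MLUtils

# Converts the old mllib vector column(s) to the new ml vector type
bankDf = MLUtils.convertVectorColumnsToML(bankDf, "features")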


You may also be interested in this related question on Stack Overflow.
