Column features must be of type org.apache.spark.ml.linalg.VectorUDT
I want to run this code in pyspark (spark 2.1.1):
from pyspark.ml.feature import PCA

bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDf)
pcaResult = pcaModel.transform(bankDf).select("label", "pcaFeatures")
pcaResult.show(truncate=False)
But I am getting this error:
requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
1 answer
You can find an example here:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
... other code ...
As you can see above, df is a DataFrame containing Vectors.sparse() and Vectors.dense() values, which are imported from pyspark.ml.linalg.
Your bankDf probably contains vectors imported from pyspark.mllib.linalg instead. So you have to build the vectors in your data with

from pyspark.ml.linalg import Vectors

instead of:

from pyspark.mllib.linalg import Vectors
You may also be interested in this related Stack Overflow question.