PySpark: creating a new RDD from an existing LabeledPointsRDD but changing the label
Is there a quick way to create a new RDD from an existing RDD that contains LabeledPoints, but only change the labels for each line?
As an example, let's say I have an RDD containing LabeledPoints, created like this:
RDD = sc.parallelize([ LabeledPoint(1, [1.0, 2.0, 3.0]), LabeledPoint(2, [3.0, 4.0, 5.0]), LabeledPoint(4, [6.0, 7.0, 8.0])])
This RDD contains three LabeledPoints.
I just want to create a new RDD from this, but I want to subtract 10 from each label.
When I try this, it fails:
myRDD = RDD.map(lambda x: x[0].label - 10, x[1].features)
Please help me by also pointing out what is wrong with my reasoning in the attempt above.
What is wrong with the reasoning in the attempt?
First, let's take a look at the entire map:

map(lambda x: x[0].label - 10, x[1].features)

Python parses this as a call to map with two arguments: the function lambda x: x[0].label - 10 and a second positional argument x[1].features. Since x is only defined inside the lambda, evaluating x[1].features raises a NameError before map even runs. Let's start by returning a tuple instead:

map(lambda x: (x[0].label - 10, x[1].features))
The function passed to map receives one LabeledPoint at a time, so indexing with x[0] or x[1] doesn't make sense; just use the point's label and features attributes:

map(lambda x: (x.label - 10, x.features))
Finally, you need to create a new point:
map(lambda x: LabeledPoint(x.label - 10, x.features))
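The transformation logic can be checked without a Spark cluster by using a plain-Python stand-in for LabeledPoint and the builtin map. This is only a local sketch: the namedtuple below is a hypothetical substitute for pyspark.mllib.regression.LabeledPoint, and in real Spark code you would apply the same lambda via RDD.map exactly as above.

```python
from collections import namedtuple

# Minimal stand-in for pyspark.mllib.regression.LabeledPoint,
# used only to test the lambda logic locally (an assumption, not the real class).
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Same data as in the question.
points = [
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0]),
]

# Same lambda as in RDD.map(...), applied with Python's builtin map:
# build a new point with the label shifted by -10, features unchanged.
shifted = list(map(lambda x: LabeledPoint(x.label - 10, x.features), points))

print([p.label for p in shifted])  # [-9, -8, -6]
```

On a real RDD the result of map is lazy; you would call an action such as collect() or take(3) to see the new LabeledPoints.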