PySpark: creating a new RDD from an existing LabeledPointsRDD but changing the label
Is there a quick way to create a new RDD from an existing RDD that contains LabeledPoints, but only change the labels for each line?
As an example, suppose I have an RDD of LabeledPoints like this:
RDD = sc.parallelize([ LabeledPoint(1, [1.0, 2.0, 3.0]), LabeledPoint(2, [3.0, 4.0, 5.0]), LabeledPoint(4, [6.0, 7.0, 8.0])])
This is what the RDD contains (for example, what RDD.take(5) would return).
I just want to create a new RDD from this, but I want to subtract 10 from each label.
When I try this, it fails:
myRDD = RDD.map(lambda x: x[0].label - 10, x[1].features)
Please help me by also pointing out what is wrong with my reasoning in the attempt above.
What is wrong with the reasoning in that attempt?
First, let's look at the whole map call:
map(lambda x: x[0].label - 10, x[1].features)
Python parses this as a call to map with two arguments: the function lambda x: x[0].label - 10 and a second positional argument x[1].features. Since x is only bound inside the lambda, evaluating x[1].features raises NameError: name 'x' is not defined before map ever runs. Let's start by returning a tuple:
map(lambda x: (x[0].label - 10, x[1].features))
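The two-argument parsing problem is independent of Spark, so it can be reproduced in plain Python. This is a minimal sketch using a hypothetical stand-in function (fake_map is not a Spark API; it just has the same two-argument shape as RDD.map):

```python
# The comma inside the call splits the expression into TWO arguments.
# Evaluating the second one, x[1], needs a name `x` that only exists
# inside the lambda, so Python raises NameError before the call happens.
def fake_map(f, preserves_partitioning=False):
    """Stand-in with the same two-argument shape as RDD.map."""
    return f

try:
    fake_map(lambda x: x[0] - 10, x[1])
except NameError as e:
    print(e)  # prints: name 'x' is not defined
```

The same NameError is what the original RDD.map(lambda x: x[0].label - 10, x[1].features) produces.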
The function passed to map receives one LabeledPoint at a time, so indexing makes no sense; just use the label and features attributes directly:
map(lambda x: (x.label - 10, x.features))
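Note that this version yields plain (label, features) tuples, not LabeledPoints. A quick sketch, using a namedtuple stand-in for pyspark.mllib.regression.LabeledPoint and the built-in map so it runs without Spark (assumption: only the .label and .features attributes matter here):

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

pts = [LabeledPoint(1, [1.0, 2.0, 3.0]), LabeledPoint(2, [3.0, 4.0, 5.0])]

# Same lambda as above, applied with the built-in map.
result = list(map(lambda x: (x.label - 10, x.features), pts))
print(result[0])  # (-9, [1.0, 2.0, 3.0]) -- a tuple, not a LabeledPoint
```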
Finally, construct a new LabeledPoint so the result is again an RDD of LabeledPoints:
map(lambda x: LabeledPoint(x.label - 10, x.features))
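Putting it together, a minimal runnable sketch of the final version. It uses a namedtuple stand-in for pyspark.mllib.regression.LabeledPoint and the built-in map in place of RDD.map, so no Spark cluster is needed; with PySpark, the same lambda goes into RDD.map:

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint: exposes the same
# .label and .features attributes used by the lambda below.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

points = [
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0]),
]

# Same lambda that would be passed to RDD.map in PySpark.
shifted = list(map(lambda x: LabeledPoint(x.label - 10, x.features), points))

print([p.label for p in shifted])  # [-9, -8, -6]
```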