PySpark: creating a new RDD from an existing LabeledPointsRDD but changing the label
Is there a quick way to create a new RDD from an existing RDD that contains LabeledPoints, but only change the labels for each line?
As an example, let's say I have an RDD containing LabeledPoints, created like this:
RDD = sc.parallelize([ LabeledPoint(1, [1.0, 2.0, 3.0]), LabeledPoint(2, [3.0, 4.0, 5.0]), LabeledPoint(4, [6.0, 7.0, 8.0])])
This RDD contains three LabeledPoints.
I just want to create a new RDD from this, but I want to subtract 10 from each label.
When I try this, it fails:
myRDD = RDD.map(lambda x: x[0].label - 10, x[1].features)
Please help me by also pointing out what is wrong with my reasoning in the attempt above.
What is wrong with the reasoning in the attempt?
First, let's take a look at the entire map:

map(lambda x: x[0].label - 10, x[1].features)

Python parses this as a call to map with two arguments: the function lambda x: x[0].label - 10 and a second positional argument x[1].features. Since x is only defined inside the lambda, evaluating x[1].features raises a NameError before map even runs. Let's start by returning a tuple instead:

map(lambda x: (x[0].label - 10, x[1].features))
The function passed to map receives one LabeledPoint at a time, so indexing with x[0] or x[1] doesn't make sense; just use the point's label and features attributes:

map(lambda x: (x.label - 10, x.features))
Finally, you need to create a new point:
map(lambda x: LabeledPoint(x.label - 10, x.features))
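The transformation logic can be checked without a Spark cluster by using a plain-Python stand-in for LabeledPoint and the builtin map. This is only a local sketch: the namedtuple below is a hypothetical substitute for pyspark.mllib.regression.LabeledPoint, and in real Spark code you would apply the same lambda via RDD.map exactly as above.

```python
from collections import namedtuple

# Minimal stand-in for pyspark.mllib.regression.LabeledPoint,
# used only to test the lambda logic locally (an assumption, not the real class).
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Same data as in the question.
points = [
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0]),
]

# Same lambda as in RDD.map(...), applied with Python's builtin map:
# build a new point with the label shifted by -10, features unchanged.
shifted = list(map(lambda x: LabeledPoint(x.label - 10, x.features), points))

print([p.label for p in shifted])  # [-9, -8, -6]
```

On a real RDD the result of map is lazy; you would call an action such as collect() or take(3) to see the new LabeledPoints.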