Split an RDD with Spark
I have an RDD that looks like this:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Is there a way to get three separate RDDs, e.g. by filtering on the value of the year column?
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0']
and
[ u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']
and
[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
There may well be a better solution than this, but I learned so much working on it and spent so much time that I couldn't resist posting it.
In [60]: a
Out[60]:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
I'd rather not work with strings, so I converted each row to a list of ints:
In [61]: b=[map(int,elem.split(',')) for elem in a]
In [62]: b
Out[62]:
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
Sort b by year using operator.itemgetter (the year is the sixth field from the end, hence the key itemgetter(-6)):
In [63]: b_s=sorted(b,key=itemgetter(-6))
In [64]: b_s
Out[64]:
[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
Use groupby from the itertools module to group by year (groupby requires its input to be sorted on the grouping key, which is why the sort above matters):
In [65]: [list(g) for k,g in groupby(b_s,key=itemgetter(-6))]
Out[65]:
[[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1]],
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0]],
[[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]]
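Putting the steps above together, here is a self-contained sketch of the same pipeline (plain Python, with the imports the session assumes; list(map(...)) is used so it also runs under Python 3):

```python
from itertools import groupby
from operator import itemgetter

a = [u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
     u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
     u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
     u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

# Parse each row into a list of ints; the year is the sixth field from the end.
b = [list(map(int, line.split(','))) for line in a]

# groupby only groups consecutive equal keys, so sort on the year first.
b_sorted = sorted(b, key=itemgetter(-6))

# Collect one list of rows per year.
groups = {year: list(rows) for year, rows in groupby(b_sorted, key=itemgetter(-6))}

print(sorted(groups))     # [2012, 2013, 2014]
print(len(groups[2012]))  # 2
```

On an actual Spark RDD, one of the sub-RDDs from the original question could be obtained the same way with a filter on the year field, e.g. `rdd.filter(lambda line: line.split(',')[9] == '2012')` (one such filter per year).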