Split rdd with spark

I have an RDD that looks like this:

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
 u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
 u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

      

Is there a way to split it into three separate RDDs, for example by filtering on the value of the year column (the tenth field)?

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0']

      

and

[ u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
     u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']

      

and

  [u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

      



2 answers


Here's one way to do it with groupBy, assuming your original RDD is stored in a variable named rdd:



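# Group the lines by the year field (index 9 of the comma-separated values)
grouped = rdd.groupBy(lambda x: x.split(",")[9])
# collect() pulls every group to the driver; each group is then turned back into an RDD
new_rdds = [sc.parallelize(list(x[1])) for x in grouped.collect()]

for x in new_rdds:
    print(x.collect())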
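
Since the question asks about filtering, here is a rough sketch of a filter-based variant. Unlike the groupBy version, it only collects the distinct year values to the driver and leaves the records distributed. The helper name split_by_year and its year_index parameter are made up for illustration, and it assumes every line has the year at index 9.

```python
def split_by_year(rdd, year_index=9):
    """Return a dict mapping each year string to an RDD of the lines for that year.

    Hypothetical helper: only relies on rdd.map, distinct, collect and filter.
    """
    def year_of(line):
        return line.split(",")[year_index]

    # Only the distinct year values are brought back to the driver.
    years = rdd.map(year_of).distinct().collect()

    # filter is lazy, so bind y as a default argument; otherwise every
    # lambda would see the last year by the time it is evaluated.
    return {y: rdd.filter(lambda line, y=y: year_of(line) == y) for y in years}
```

With the RDD from the question, split_by_year(rdd)['2012'].collect() should give the two 2012 lines. Each filter rescans the source RDD, so calling rdd.cache() first is probably worthwhile when there are many distinct years.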

      



There are better solutions than this one, but I learned a lot working on it and spent so much time that I couldn't resist posting it.

In [60]: a
Out[60]: 
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
 u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
 u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

      

I'm not comfortable working with strings, so I converted them to ints.

In [61]: b=[map(int,elem.split(',')) for elem in a]

In [62]: b
Out[62]: 
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
 [1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
 [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
 [0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]

      



Sort b by year (the sixth field from the end), using itemgetter from the operator module.

In [63]: b_s=sorted(b,key=itemgetter(-6))

In [64]: b_s
Out[64]: 
[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
 [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
 [1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
 [0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]

      

Use groupby from the itertools module (with itemgetter from operator as the key) to group the sorted rows by year.

In [65]: [list(g) for k,g in groupby(b_s,key=itemgetter(-6))]
Out[65]: 
[[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
  [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1]],
 [[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0]],
 [[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]]
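
For anyone who wants to run the steps above outside an IPython session, here they are as one script with the imports the transcript leaves out. This is a sketch in plain Python on a local list, not Spark code; the names rows, parsed and by_year are my own, and index 9 is the same column as -6 for these 15-field rows.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
    u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
    u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
    u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0',
]

# Parse each line into a list of ints (list(...) is needed on Python 3,
# where map returns an iterator).
parsed = [list(map(int, row.split(','))) for row in rows]

# groupby only merges adjacent items, so sort by the same key first.
year = itemgetter(9)
by_year = {k: list(g) for k, g in groupby(sorted(parsed, key=year), key=year)}
```

Keeping the result in a dict keyed by year makes it easy to pick out one group, for example by_year[2012].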

      
