Split an RDD with Spark
I have an RDD that looks like this:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Is there a way to get three separate RDDs, e.g. by filtering on the value of the year column?
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0']
and
[ u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']
and
[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
There may well be a better solution than this, but I learned so much working on it and spent so much time that I couldn't resist posting it.
In [60]: a
Out[60]:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
I'd rather not work with strings, so I converted each row to a list of ints:
In [61]: b=[map(int,elem.split(',')) for elem in a]
In [62]: b
Out[62]:
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
Sort b by year using operator.itemgetter (the year is the sixth field from the end, hence the key itemgetter(-6)):
In [63]: b_s=sorted(b,key=itemgetter(-6))
In [64]: b_s
Out[64]:
[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
Use groupby from the itertools module to group by year (groupby requires its input to be sorted on the grouping key, which is why the sort above matters):
In [65]: [list(g) for k,g in groupby(b_s,key=itemgetter(-6))]
Out[65]:
[[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1]],
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0]],
[[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]]
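Putting the steps above together, here is a self-contained sketch of the same pipeline (plain Python, with the imports the session assumes; list(map(...)) is used so it also runs under Python 3):

```python
from itertools import groupby
from operator import itemgetter

a = [u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
     u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
     u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
     u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

# Parse each row into a list of ints; the year is the sixth field from the end.
b = [list(map(int, line.split(','))) for line in a]

# groupby only groups consecutive equal keys, so sort on the year first.
b_sorted = sorted(b, key=itemgetter(-6))

# Collect one list of rows per year.
groups = {year: list(rows) for year, rows in groupby(b_sorted, key=itemgetter(-6))}

print(sorted(groups))     # [2012, 2013, 2014]
print(len(groups[2012]))  # 2
```

On an actual Spark RDD, one of the sub-RDDs from the original question could be obtained the same way with a filter on the year field, e.g. `rdd.filter(lambda line: line.split(',')[9] == '2012')` (one such filter per year).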