How do I execute groupBy in PySpark?

# Load the CSV, split each line into fields, and drop the header row.
auto = sc.textFile("temp/auto_data.csv")
auto = auto.map(lambda x: x.split(","))
header = auto.first()
autoData = auto.filter(lambda a: a != header)

Now I have the data in autoData:

[[u'', u'ETZ', u'AS1', u'CUT000021', u'THE TU-WHEEL SPARES', u'DIBRUGARH', u'201505', u'LCK   ', u'2WH   ', u'KIT', u'KT-2069CZ', u'18', u'8484'], [u'', u'ETZ', u'AS1', u'CUT000021', u'THE TU-WHEEL SPARES', u'DIBRUGARH', u'201505', u'LCK   ', u'2WH   ', u'KIT', u'KT-2069SZ', u'9', u'5211']]


Now I want to execute groupBy() on the 2nd and 12th (last) values. How do I do it?



1 answer


groupBy takes as an argument a function that generates keys, so you can do something like this:

autoData.groupBy(lambda row: (row[2], row[12]))
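
Note that this returns an RDD of (key, iterable-of-rows) pairs rather than aggregated values. As a minimal sketch (assuming the autoData RDD built in the question), you can materialize and inspect the groups like this:

# Minimal sketch, not from the original post: groupBy yields (key, iterable)
# pairs, so turn each group into a list to look at it on the driver.
grouped = autoData.groupBy(lambda row: (row[2], row[12]))
grouped.mapValues(list).take(2)    # first two groups, each as (key, [rows, ...])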


Edit

Concerning the task described in the comments: groupBy only collects the data into groups, it does not aggregate it. To aggregate, for example to sum the last field for each key, you can use reduceByKey:



from operator import add

def int_or_zero(s):
    # Parse a numeric field, treating anything non-numeric as 0.
    try:
        return int(s)
    except ValueError:
        return 0

# Build (key, value) pairs and sum the values for each key.
autoData.map(lambda row: (row[2], int_or_zero(row[12]))).reduceByKey(add)
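
As a usage sketch (the variable name totals is illustrative, not from the original post), an action such as take() or collect() on that RDD triggers the computation and returns the per-key sums:

# Illustrative usage: sort the per-key sums and look at the three largest.
totals = autoData.map(lambda row: (row[2], int_or_zero(row[12]))).reduceByKey(add)
totals.sortBy(lambda kv: kv[1], ascending=False).take(3)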


A much less efficient version using groupByKey might look like this (groupByKey ships every individual value across the shuffle and only sums afterwards, whereas reduceByKey combines partial sums within each partition first):

(autoData.map(lambda row: (row[2], int_or_zero(row[12])))
     .groupByKey()
     .mapValues(sum))

