How to apply the same function to all columns of a dataset in parallel using Spark (Java)

I have a dataset with some categorical features. I am trying to apply the same function to all of these categorical features in the Spark framework. My first guess was that I could parallelize the work of each function application with the others. However, I couldn't figure out whether this is possible (I was confused after reading this and this).

For example: suppose my dataset looks like this:

feature1, feature2, feature3
blue, apple, snake
orange, orange, monkey
blue, orange, horse

I want to count the number of occurrences of each category for each feature separately. For example, for feature1: (blue = 2, orange = 1).



1 answer


TL;DR: Spark SQL partitions datasets by rows, not columns, so each Spark task processes a group of rows (never a group of columns), unless you first split the original dataset into per-column datasets with a select-like operator.
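For instance, a minimal sketch of that select-based split in the Java API (df is an assumed handle to the question's dataset; the column names come from the example above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df is assumed to be the dataset from the question. Each select()
// produces an independent single-column dataset that Spark still
// partitions by rows and that can be processed as its own job.
Dataset<Row> feature1Only = df.select("feature1");
Dataset<Row> feature2Only = df.select("feature2");
Dataset<Row> feature3Only = df.select("feature3");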

If you want to "count the number of occurrences of each category for each feature, separately", just use groupBy and count (possibly followed by a join), or use windows (with window aggregation functions). A sketch follows.
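Below is a minimal, self-contained sketch of the groupBy/count route in Spark's Java API. The file name features.csv, the CategoryCounts class name, and the local[*] master are assumptions made for illustration; the column names come from the question. The parallelStream() call submits the three independent counting jobs concurrently from the driver, which is the closest you get to the per-column parallelism asked about (each job is itself already parallelized over rows by Spark).

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CategoryCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CategoryCounts")
                .master("local[*]")   // assumption: local run for the sketch
                .getOrCreate();

        // Assumption: the example data sits in features.csv with a header row.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("features.csv");

        // One groupBy/count job per categorical column. Each job is
        // distributed over the rows; parallelStream() merely submits the
        // jobs concurrently from the driver, which Spark's scheduler permits.
        Arrays.asList("feature1", "feature2", "feature3")
                .parallelStream()
                .forEach(column -> df.groupBy(column).count().show());

        spark.stop();
    }
}

If you would rather have the counts attached to every row instead of a separate summary table, the window route mentioned above (Window.partitionBy over each feature column with a count aggregate) yields the same numbers as an extra column.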







