How to apply the same function to all columns of a dataset in parallel using Spark (Java)

I have a dataset with some categorical features. I am trying to apply the same function to all of these categorical features in the Spark framework. My first guess was that I could parallelize the work of each function application with the others. However, I couldn't figure out whether this is possible (I was confused after reading this and this).

For example: suppose my dataset looks like this:

feature1, feature2, feature3
blue, apple, snake
orange, orange, monkey
blue, orange, horse

I want to count the number of occurrences of each category for each feature separately. For example, for feature1: (blue = 2, orange = 1).



1 answer


TL;DR: Spark SQL partitions datasets by rows, not columns, so each Spark task processes a group of rows (never a group of columns), unless you first split the original dataset into per-column datasets with a select-like operator.
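For instance, a minimal sketch of that select-based split in the Java API (df is an assumed handle to the question's dataset; the column names come from the example above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df is assumed to be the dataset from the question. Each select()
// produces an independent single-column dataset that Spark still
// partitions by rows and that can be processed as its own job.
Dataset<Row> feature1Only = df.select("feature1");
Dataset<Row> feature2Only = df.select("feature2");
Dataset<Row> feature3Only = df.select("feature3");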

If you want to "count the number of occurrences of each category for each feature, separately", just use groupBy and count (possibly followed by a join), or use windows (with window aggregation functions). A sketch follows.
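Below is a minimal, self-contained sketch of the groupBy/count route in Spark's Java API. The file name features.csv, the CategoryCounts class name, and the local[*] master are assumptions made for illustration; the column names come from the question. The parallelStream() call submits the three independent counting jobs concurrently from the driver, which is the closest you get to the per-column parallelism asked about (each job is itself already parallelized over rows by Spark).

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CategoryCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CategoryCounts")
                .master("local[*]")   // assumption: local run for the sketch
                .getOrCreate();

        // Assumption: the example data sits in features.csv with a header row.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("features.csv");

        // One groupBy/count job per categorical column. Each job is
        // distributed over the rows; parallelStream() merely submits the
        // jobs concurrently from the driver, which Spark's scheduler permits.
        Arrays.asList("feature1", "feature2", "feature3")
                .parallelStream()
                .forEach(column -> df.groupBy(column).count().show());

        spark.stop();
    }
}

If you would rather have the counts attached to every row instead of a separate summary table, the window route mentioned above (Window.partitionBy over each feature column with a count aggregate) yields the same numbers as an extra column.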







