How to apply the same function to all columns of a dataset in parallel using Spark (Java)
I have a dataset with some categorical features. I am trying to apply the same function to all of these categorical features in the Spark framework. My first guess was that I could parallelize the work on each feature with the work on the other features. However, I couldn't figure out whether this is possible (I was confused after reading this, this).
For example: suppose my dataset looks like this:
feature1, feature2, feature3
blue, apple, snake
orange, orange, monkey
blue, orange, horse
I want to count the number of occurrences of each category for each feature separately. For example, for feature1: (blue = 2, orange = 1).
TL;DR: Spark SQL datasets are partitioned by rows, not columns, so Spark processes a group of rows per task (not a column per task), unless you split the original dataset into per-column datasets using a select-like operator.
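A minimal sketch of what that splitting would look like, assuming a Dataset&lt;Row&gt; named df has already been loaded with the three columns from the question. Each select() carves out a single-column dataset that is still partitioned by rows internally; submitting the per-column jobs from a parallel stream merely runs them concurrently on the Spark scheduler:

```java
import java.util.Arrays;

// Assuming `df` is the dataset from the question.
// Note: show() output from concurrent threads may interleave;
// collect the results instead if you need ordered output.
Arrays.stream(df.columns())
        .parallel()
        .forEach(c -> df.select(c)
                .groupBy(c)
                .count()
                .show());
```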
If you want to:

count the number of occurrences of each category for each feature, separately

just use groupBy and count (possibly with join), or use windows (with window aggregation functions), as sketched below.
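Here is a self-contained sketch of the groupBy/count approach, reconstructing the example dataset from the question in memory. The class name, app name, and local master setting are assumptions made so the example runs standalone:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CategoryCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CategoryCounts")
                .master("local[*]") // assumption: local run for the example
                .getOrCreate();

        // Recreate the dataset from the question.
        StructType schema = new StructType()
                .add("feature1", DataTypes.StringType)
                .add("feature2", DataTypes.StringType)
                .add("feature3", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("blue", "apple", "snake"),
                RowFactory.create("orange", "orange", "monkey"),
                RowFactory.create("blue", "orange", "horse"));
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // One groupBy/count aggregation per categorical column,
        // e.g. feature1 -> (blue = 2, orange = 1).
        for (String col : df.columns()) {
            df.groupBy(col).count().show();
        }

        spark.stop();
    }
}
```

Each groupBy(col).count() is an independent aggregation over the same rows, so Spark still parallelizes each one across row partitions even when the loop itself is sequential.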