How to calculate average of dataframe column and find top 10%

I am very new to Scala and Spark and am working through some homemade exercises using baseball stats. I define a case class, create an RDD, assign a schema to the data, and then convert it into a DataFrame, so I can use Spark SQL to select groups of players whose statistics match certain criteria.

Once I have a subset of players I'm interested in looking at further, I would like to find the average of a column; for example, batting average or RBIs. From there, I would like to break all players into percentile groups based on their performance relative to all players: top 10%, bottom 10%, 40-50%, and so on.

I was able to use the DataFrame.describe() function to return a summary of the desired column (mean, stddev, count, min, and max), but all the values come back as strings. Is there a better way to get just the mean and stddev as a pair, and what is the best way to split the players into 10% groups?

So far, my thoughts have been to find the values that bound the percentile ranges and write a function that groups players via comparators, but that feels like it borders on reinventing the wheel.
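The setup described above might look like the following minimal sketch (the `Batting` case class, column names, and sample numbers are all hypothetical stand-ins for the real data). It also shows `agg(avg(...), stddev(...))`, which returns the two statistics numerically in a single `Row`, rather than as the strings `describe()` produces:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, stddev}

// Hypothetical schema for the exercise data.
case class Batting(player: String, battingAvg: Double, rbi: Int)

val spark = SparkSession.builder().appName("battingStats").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.sparkContext
  .parallelize(Seq(
    Batting("a", 0.310, 95),
    Batting("b", 0.275, 60),
    Batting("c", 0.240, 41)))
  .toDF()

// describe() returns every statistic as a string column:
df.describe("battingAvg").show()

// agg() keeps the values numeric, returned together in one Row:
val stats = df.agg(avg("battingAvg"), stddev("battingAvg")).first()
val (mean, sd) = (stats.getDouble(0), stats.getDouble(1))
```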

1 answer


I was able to get the percentiles using window functions, applying ntile() and cume_dist() over the window. ntile() creates groupings based on the input number: if you want everything grouped into 10% buckets, use ntile(10); for 5% buckets, ntile(20). For more fine-grained results, cume_dist() applied to the window adds a new cumulative-distribution column that can then be filtered via select(), where(), or an SQL query.
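A sketch of the approach described above, using hypothetical player data and column names (note that in the DataFrame API the function is spelled `cume_dist()`; with a descending sort, decile 1 holds the top performers):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, cume_dist, ntile}

val spark = SparkSession.builder().appName("percentileGroups").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical (player, battingAvg) data.
val players = Seq(
  ("p1", 0.340), ("p2", 0.325), ("p3", 0.310), ("p4", 0.298), ("p5", 0.285),
  ("p6", 0.271), ("p7", 0.260), ("p8", 0.252), ("p9", 0.241), ("p10", 0.230)
).toDF("player", "battingAvg")

// Order descending so decile 1 contains the best hitters.
// (No partitionBy: all rows pass through one partition, fine for small data.)
val w = Window.orderBy(col("battingAvg").desc)

val ranked = players
  .withColumn("decile", ntile(10).over(w))     // buckets 1..10, ~10% of rows each
  .withColumn("pctRank", cume_dist().over(w))  // cumulative distribution in (0, 1]

// Top 10% of players:
val top10pct = ranked.where(col("decile") === 1)
top10pct.show()
```

Filtering on `pctRank` instead (e.g. `col("pctRank") <= 0.1`) gives the same top slice but with continuous cutoffs, which is handy for ranges like 40-50%.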


