How do I do stratified sampling with Spark DataFrames?
I am on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I've seen the JIRA "Add Sample Stratified Fetch to DataFrame" ( https://issues.apache.org/jira/browse/SPARK-7157 ), but that is targeted at Spark 1.5. Until that lands, what is the simplest way to do the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks and regards, MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so they are available without a dependency on MLlib. These two functions are defined in PairRDDFunctions and operate on key-value RDDs of type RDD[(K, T)]. DataFrames, however, have no keys, so you need to go through the underlying RDD, something like below:
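val df = ... // your DataFrame
val fractions: Map[K, Double] = ... // sampling fraction desired for each key
// Key each Row by its first column. Note that x(0) is typed Any, so K must match it;
// use a typed getter such as x.getString(0) if your keys are Strings.
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(false, fractions) // false = sample without replacement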
Note that sample is now an RDD, not a DataFrame, but you can easily convert it back to a DataFrame since you already have a schema defined for df.
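To make the round trip concrete, here is a minimal sketch of the whole flow on Spark 1.3, including the conversion back to a DataFrame. The column names, the toy rows, the fractions, and the object name StratifiedSampleSketch are all made up for illustration; it assumes the stratum key is a String in column 0 and that you run in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StratifiedSampleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stratified-sample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical data: column 0 ("label") is the stratum key.
    val schema = StructType(Seq(
      StructField("label", StringType, nullable = false),
      StructField("value", IntegerType, nullable = false)))
    val rows = sc.parallelize(Seq(Row("a", 1), Row("a", 2), Row("b", 3), Row("b", 4)))
    val df = sqlContext.createDataFrame(rows, schema)

    // The map's key type must match what keyBy produces (String here).
    val fractions: Map[String, Double] = Map("a" -> 0.5, "b" -> 1.0)

    // Key each Row by its label, then sample each stratum without replacement.
    val sample = df.rdd
      .keyBy(_.getString(0))
      .sampleByKey(withReplacement = false, fractions = fractions)

    // Drop the keys and reattach the original schema to get a DataFrame again.
    val sampleDf = sqlContext.createDataFrame(sample.values, df.schema)
    sampleDf.show()

    sc.stop()
  }
}
```

sampleByKeyExact takes the same arguments, so you can swap it in on the same keyed RDD if you need the per-key sample sizes to be exact rather than approximate, at the cost of extra passes over the data.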