How do I do stratified sampling with Spark DataFrames?

I am on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I've seen the JIRA "Add Sample Stratified Fetch to DataFrame" ( https://issues.apache.org/jira/browse/SPARK-7157 ), which is targeted at Spark 1.5. Until that lands, what is the simplest way to do the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks and regards, MK

1 answer


Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without the MLlib dependency.

These two functions are defined in PairRDDFunctions and operate on key-value RDDs, i.e. RDD[(K, T)]. DataFrames, however, have no keys, so you have to go through the underlying RDD, something like below:



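val df = ... // your DataFrame
val fractions: Map[K, Double] = ... // sampling fraction desired for each key

// Key each Row by its first column, then sample each stratum without replacement
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(withReplacement = false, fractions = fractions)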

      

Note that sample is now an RDD of (key, Row) pairs, not a DataFrame, but you can easily convert it back since you already have a schema defined for df: drop the keys and pass the remaining rows together with df.schema to sqlContext.createDataFrame.
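To make the round trip concrete, here is a minimal sketch, assuming Spark 1.3 or later and an existing SQLContext. The object StratifiedSample, its two methods, and the parameters keyCol, exact and seed are my own illustration, not part of Spark; only keyBy, sampleByKey, sampleByKeyExact, values and createDataFrame are Spark calls. It also assumes every key that occurs in the data has an entry in fractions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

// Hypothetical helper object; none of these names come from Spark itself.
object StratifiedSample {

  /** Give every distinct value of column `keyCol` the same sampling fraction. */
  def uniformFractions(df: DataFrame, keyCol: String, fraction: Double): Map[Any, Double] = {
    val keyIdx = df.schema.fieldNames.indexOf(keyCol)
    require(keyIdx >= 0, s"no column named $keyCol")
    // Collects the distinct keys to the driver, so this assumes few strata.
    df.rdd.map(row => row.get(keyIdx)).distinct().collect().map(k => k -> fraction).toMap
  }

  /** Stratified sample of `df` by column `keyCol`, returned as a DataFrame. */
  def stratifiedSample(
      sqlContext: SQLContext,
      df: DataFrame,
      keyCol: String,
      fractions: Map[Any, Double],
      exact: Boolean = false,
      seed: Long = 42L): DataFrame = {
    val keyIdx = df.schema.fieldNames.indexOf(keyCol)
    require(keyIdx >= 0, s"no column named $keyCol")

    // DataFrame -> RDD[(key, Row)]; keys are typed Any because Row.get returns Any.
    val keyed: RDD[(Any, Row)] = df.rdd.keyBy(row => row.get(keyIdx))

    val sampled: RDD[(Any, Row)] =
      if (exact) keyed.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = seed)
      else keyed.sampleByKey(withReplacement = false, fractions = fractions, seed = seed)

    // Drop the keys and reattach the original schema.
    sqlContext.createDataFrame(sampled.values, df.schema)
  }
}
```

With a made-up column name "label", a call might look like StratifiedSample.stratifiedSample(sqlContext, df, "label", Map[Any, Double]("a" -> 0.5, "b" -> 0.25)). Passing exact = true switches to sampleByKeyExact, which needs extra passes over the data in exchange for hitting the requested per-key sample size.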
