Filter rows with NaN values for a specific column
I have a dataset, and in some rows an attribute value is NaN. This data is loaded into a DataFrame, and I would like to use only the rows where every attribute has a value. I tried to do it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several options, but I can't seem to get it to work.
Another option would be to convert it to an RDD and then filter it, since filtering the DataFrame by checking whether an attribute isNaN
is not working.
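Part of why the `!= NaN` comparison fails is that, per IEEE 754 semantics, NaN never compares equal to anything, including itself, so equality-style predicates on it don't behave as expected. A minimal plain-Scala sketch of this behavior:

```scala
object NaNDemo extends App {
  val nan = Double.NaN

  // NaN is never equal to anything, not even itself (IEEE 754).
  println(nan == nan)   // false
  println(nan != nan)   // true

  // The reliable check is isNaN.
  println(nan.isNaN)    // true
}
```

This is why the answers below rely on an explicit isNaN / isnan check rather than an equality comparison.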
Here is some sample code that shows one way of doing it -
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id value
1 0.5
2 NaN
doing the filter on df2 will give you what you want -
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
I know you accepted a different answer, but you can do this without explode
(which should perform better, since explode doubles the size of your DataFrame).
Before Spark 1.6, you can use udf
like this:
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))
As of Spark 1.6, you can use the built-in SQL function isnan()
like this:
df.filter(isnan($"value"))
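For completeness, here is a self-contained sketch of the isnan() approach, assuming a SparkContext `sc` and a SQLContext `sqlContext` are in scope as in the question (this cannot run outside a Spark session):

```scala
import org.apache.spark.sql.functions.isnan
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")

// Keep only rows where "value" is not NaN.
df.filter(!isnan($"value")).show

// Equivalent shortcut: drop rows containing null or NaN in any column.
df.na.drop().show
```

Note that `df.na.drop()` handles both null and NaN values across all columns, which matches the original goal of keeping only rows where every attribute has a value.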