Filter rows with NaN values for a specific column

I have a dataset, and in some rows an attribute value is NaN.

This data is loaded into a DataFrame, and I would like to use only the rows where all the attributes have values. I tried to do it via SQL:

val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")

      

I tried several options, but I can't seem to get it to work.

Another option would be to convert the DataFrame to an RDD and then filter it, since filtering the DataFrame itself with an isNaN check is not working.
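For reference, the RDD route mentioned above can be sketched like this. This is an illustrative sketch, not tested against your data: it assumes attribute1 is a Double column in df_data, and the names are placeholders.

```scala
// Sketch: drop down to the RDD and keep only rows whose Double column is not NaN.
// Assumes df_data has a Double-typed column named "attribute1".
val cleanRdd = df_data.rdd.filter { row =>
  !row.getDouble(row.fieldIndex("attribute1")).isNaN
}

// Rebuild a DataFrame with the original schema if you need one back:
val df_clean = sqlContext.createDataFrame(cleanRdd, df_data.schema)
```

Note this pulls every row through a Scala closure, so the DataFrame-level approaches in the answers below it in Spark's optimizer terms are generally preferable.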

+3




3 answers


Here is some sample code that shows my way of doing it -

import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))

      

df will have -



df.show

id value
1  0.5  
2  NaN

      

Doing the filter on df2 will give you what you want -

df2.filter($"isNaN" !== true).show

id value isNaN
1  0.5   false 

      

+2




I know you accepted a different answer, but you can do it without explode (which should perform better than doubling the size of your DataFrame).

Before Spark 1.6, you can use a udf like this:

def isNaNudf = udf[Boolean, Double](d => d.isNaN)
df.filter(!isNaNudf($"value"))

      



As of Spark 1.6, you can use the built-in SQL function isnan() like this:

df.filter(!isnan($"value"))

      

+9




This works:

where isNaN(tau_doc) = false

      

eg.

val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE isNaN(attribute1) = false")

      

+1



