Filter rows with NaN values for a specific column
I have a dataset, and in some rows an attribute value is NaN. This data is loaded into a DataFrame, and I would like to use only the rows where every attribute has a value. I tried to do it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several options, but I can't seem to get it to work.
Another option would be to convert it to an RDD and then filter it, since filtering the DataFrame by checking whether an attribute isNaN
is not working.
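Part of why the `!= NaN` comparison fails is that, per IEEE 754 semantics, NaN never compares equal to anything, including itself, so equality-style predicates on it don't behave as expected. A minimal plain-Scala sketch of this behavior:

```scala
object NaNDemo extends App {
  val nan = Double.NaN

  // NaN is never equal to anything, not even itself (IEEE 754).
  println(nan == nan)   // false
  println(nan != nan)   // true

  // The reliable check is isNaN.
  println(nan.isNaN)    // true
}
```

This is why the answers below rely on an explicit isNaN / isnan check rather than an equality comparison.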
Here is some sample code that shows one way of doing it -
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id value
1 0.5
2 NaN
doing the filter on df2 will give you what you want -
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
I know you accepted a different answer, but you can do this without explode
(which should perform better, since explode doubles the size of your DataFrame).
Before Spark 1.6, you can use udf
like this:
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))
As of Spark 1.6, you can use the built-in SQL function isnan()
like this:
df.filter(isnan($"value"))
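For completeness, here is a self-contained sketch of the isnan() approach, assuming a SparkContext `sc` and a SQLContext `sqlContext` are in scope as in the question (this cannot run outside a Spark session):

```scala
import org.apache.spark.sql.functions.isnan
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")

// Keep only rows where "value" is not NaN.
df.filter(!isnan($"value")).show

// Equivalent shortcut: drop rows containing null or NaN in any column.
df.na.drop().show
```

Note that `df.na.drop()` handles both null and NaN values across all columns, which matches the original goal of keeping only rows where every attribute has a value.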