How to filter rows with NaN values in Hive?

I am running the sum function on a Hive table in Hue and getting NaN as the result.

Here is my code:

select sum(v1) from hivedb.tb1;

I don't know why this gives me a NaN result. I have checked whether any of my v1 values are null:

select * from hivedb.tb1 where v1 is null;

and as it turns out, none of the entries have a null value. The table has 100 million rows, so I cannot do a manual check for every record.

  • Does anyone know why I am getting a NaN result?
  • And if it is because some rows contain abnormal values, how can I find them?

Any help is appreciated. Thank you in advance!

UPDATE 1: I manually checked the first 1000 rows and luckily noticed some anomalous NaN values in tb1. They are due to rounding errors in earlier processing steps. So my question 1 is probably answered. Please feel free to comment if you think there might be other reasons.

I still don't know an efficient way to identify the rows with NaN values, so I am still waiting for answers to my question 2. Please feel free to share. I appreciate your help.

UPDATE 2: The issue is resolved by the accepted answer below and its discussion. There are several ways to deal with it.

  • Use the condition v1 + 1 > v1. It selects the rows whose values are not NaN.
  • Use the condition cast(v1 as String) = 'NaN'. It selects the rows with NaN values.

2 answers


Hive relies on Java (plus SQL-specific semantics for NULL and friends), and Java follows the IEEE standard for number semantics. This means ... NaN is tricky.

To quote: (Float.NaN == Float.NaN) always returns false. In fact, if you look at the JDK implementation of Float.isNaN(), a number is Not-a-Number if it is not equal to itself (which makes sense, because a number must be equal to itself). The same is true for Double.NaN.
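The self-inequality property can be checked directly in plain Java. This is a minimal sketch, not Hive code; the class name NanSelfCompare and the helper isNaNBySelfCompare are made up for illustration.

```java
// Sketch only: class and method names are illustrative, not part of Hive or the JDK.
final class NanSelfCompare {
    // Same idea as the JDK check: NaN is the only value not equal to itself.
    static boolean isNaNBySelfCompare(double v) {
        return v != v;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan == nan);               // false
        System.out.println(isNaNBySelfCompare(nan));  // true
        System.out.println(isNaNBySelfCompare(1.5));  // false
        System.out.println(Double.isNaN(0.0 / 0.0));  // true
    }
}
```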

So there is little point in showing you how to use the (undocumented) Hive function reflect2, which lets you call raw Java methods on Hive columns, e.g.

where v1 is not null and not reflect2(v1, "isNaN")

      

... because, in theory, you can simply write:

where v1 is not null and v1=v1

      

Disclaimer. I've seen cases where the Hive optimizer applies aggressive "optimizations" and produces incorrect results. In other words, if a simple clause such as v1=v1 doesn't filter out NaNs as expected, then look at reflect2 ...

Edit. Indeed, the optimizer ignores the clause v1=v1 in some versions of Hive (see comments), so a more elaborate formula is needed:

  • v1 + 1.0 > v1 should work ... except when rounding errors get in the way, for abs(v1) << 1 or abs(v1) >> 1
  • other "number" tricks will fail in a similar way on edge cases, especially when v1 = 0.0
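The rounding caveat for the v1 + 1.0 > v1 trick can be reproduced in plain Java. This is a minimal sketch that assumes v1 behaves like a Java double (a Hive DOUBLE column); PlusOneTrick and passesPlusOneFilter are hypothetical names.

```java
// Sketch only: PlusOneTrick and passesPlusOneFilter are hypothetical names,
// and v1 is assumed to behave like a Java double (Hive DOUBLE).
final class PlusOneTrick {
    // Java equivalent of the HiveQL predicate "v1 + 1.0 > v1".
    static boolean passesPlusOneFilter(double v) {
        return v + 1.0 > v;
    }

    public static void main(String[] args) {
        System.out.println(passesPlusOneFilter(0.0));        // true
        System.out.println(passesPlusOneFilter(Double.NaN)); // false, as intended
        // Near 1e17 adjacent doubles are 16 apart, so adding 1.0 is rounded
        // away and a perfectly valid value is rejected.
        System.out.println(passesPlusOneFilter(1.0e17));     // false
        // Infinity + 1.0 is still Infinity, so it is rejected as well.
        System.out.println(passesPlusOneFilter(Double.POSITIVE_INFINITY)); // false
    }
}
```

So the predicate does drop NaN, but it can also drop legitimate very large values and infinities, which matters if the result feeds a SUM.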



In the end, the most robust approach is probably to test cast(v1 as String) <> 'NaN' (since all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetic sense).
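The claim that every NaN prints as "NaN" holds for Java's own Double.toString, as the sketch below shows. Whether Hive's cast to String uses exactly this formatting is an assumption taken from the answer, not verified here; NanStringCheck and isNaNByString are hypothetical names.

```java
// Sketch only: NanStringCheck and isNaNByString are hypothetical names.
final class NanStringCheck {
    // Java analogue of the HiveQL predicate "cast(v1 as String) = 'NaN'".
    static boolean isNaNByString(double v) {
        return "NaN".equals(Double.toString(v));
    }

    public static void main(String[] args) {
        // Two NaNs with different bit patterns.
        double quietNaN = Double.longBitsToDouble(0x7ff8000000000000L);
        double payloadNaN = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(quietNaN == payloadNaN);    // false
        System.out.println(isNaNByString(quietNaN));   // true
        System.out.println(isNaNByString(payloadNaN)); // true
        System.out.println(isNaNByString(1.0e17));     // false
    }
}
```

Unlike the arithmetic tricks, the string comparison does not reject large finite values.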


Side note about reflect2: it is not actually mentioned in the official Hive documentation, whereas reflect is mentioned (and even has a dedicated wiki entry). But it was already implemented in Hive V0.11, cf. Hive-4025.

Edit. Java reflection is now disabled by default for ODBC / JDBC / Hue connections (see comments) and cannot be re-enabled when using security plugins such as Ranger or Sentry. Therefore its use is limited to the (deprecated) hive CLI.


You can handle NaN like this:



SELECT SUM(CAST(IF(v1 = 'NaN', 0, v1) AS DOUBLE)) FROM hivedb.tb1;
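To see why a single NaN turns the whole SUM into NaN, and what substituting 0 changes, here is a minimal Java sketch of the same arithmetic. It only illustrates the idea and is not how Hive implements SUM; NanSum, sum, and sumNaNAsZero are hypothetical names.

```java
// Sketch only: NanSum and its methods are hypothetical names.
final class NanSum {
    // Plain sum: one NaN operand makes every later partial sum NaN.
    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Sum that substitutes 0 for NaN, in the spirit of IF(<v1 is NaN>, 0, v1).
    static double sumNaNAsZero(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += Double.isNaN(v) ? 0.0 : v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] v1 = {1.5, Double.NaN, 2.5};
        System.out.println(sum(v1));          // NaN
        System.out.println(sumNaNAsZero(v1)); // 4.0
    }
}
```

Note that replacing NaN with 0 silently hides the bad rows; counting them first may be worthwhile.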

      
