Calculate the exponential moving average in Hive

I am trying to calculate an exponential moving average (EMA) in Hive. The formula is EMA = (K * (C - P)) + P, where K is a smoothing factor (say 0.5), C is the current value, and P is the previous EMA. Suppose the table looks like this:

ID    Value        Date
1      10         2010-05-03
2      15         2010-05-06
3      17         2010-05-13

      

Then the EMA should be:

ID     EMA                             Date
1      10                           2010-05-03
2   0.5*(15 - 10) + 10 = 12.5       2010-05-06
3   0.5*(17 - 12.5) + 12.5 = 14.75  2010-05-13
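
For reference, the recursion can be sketched outside Hive. This is only a minimal Python illustration of the formula, assuming the first value seeds the series; the function name `ema` and the parameter `k` are made up for this example and are not part of any Hive API.

```python
def ema(values, k=0.5):
    """Recursive EMA: the first value seeds the series, then EMA = k * (C - P) + P."""
    result = []
    for current in values:
        if not result:
            # No previous EMA yet, so the first value is used as-is.
            result.append(float(current))
        else:
            previous = result[-1]
            result.append(k * (current - previous) + previous)
    return result


print(ema([10, 15, 17]))  # [10.0, 12.5, 14.75]
```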

      

Rather than writing a UDF in Java, I would like to get the same result using built-in Hive SQL functions. I think the LAG function should apply here, but I am not very experienced with databases. Am I heading in the right direction? Is there a Hive SQL way to do this?

Thank you so much!



1 answer


This is a little tricky because, as you have described the problem, the first two values always get the same coefficient. I would tend to do this:

select v.*,
       -- running weighted average; the weight doubles with each row
       sum(power(2, n)*val) over (order by id) / sum(power(2, n)) over (order by id) as ema
from (select v.*, row_number() over (order by id) - 1 as n
      from vals v
     ) v

      

However, this gives results like 10, 13.33 and 15.42. Compared with what you want, it underweights the first value. That is easily fixed by adding the first value in once more:



select v.*,
       -- count the first value (n = 0) one extra time in numerator and denominator
       (max(case when n = 0 then val else 0 end) over (order by id) +
        sum(power(2, n)*val) over (order by id)
       ) / (1 + sum(power(2, n)) over (order by id)) as ema
from (select v.*, row_number() over (order by id) - 1 as n
      from vals v
     ) v

      

I demonstrated this code with a SQL script on Oracle. I'm not 100% sure the numeric functions have the same names in Hive, but they should be similar. Also, if your sequences are long, you run the risk of numeric overflow with this particular code, because the weights grow as powers of 2.
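
To check the arithmetic of the corrected query without a database, here is a minimal Python sketch of the same weighted sum. It is an illustration only: the function name `ema_weighted` is hypothetical, and the powers-of-2 weights match the recursive EMA only for K = 0.5, which is the smoothing factor assumed in the question.

```python
def ema_weighted(values):
    """Mirror of the corrected query for K = 0.5.

    Row n gets weight 2**n, and the first value is counted one extra time.
    """
    out = []
    weighted_sum = 0   # plays the role of sum(power(2, n) * val) over (order by id)
    weight_total = 0   # plays the role of sum(power(2, n)) over (order by id)
    for n, val in enumerate(values):  # n corresponds to row_number() - 1
        weighted_sum += 2 ** n * val
        weight_total += 2 ** n
        out.append((values[0] + weighted_sum) / (1 + weight_total))
    return out


print(ema_weighted([10, 15, 17]))  # [10.0, 12.5, 14.75]
```

On the sample data this reproduces the values from the question. Python integers do not overflow, so the sketch says nothing about how large `power(2, n)` can get in Hive before precision or range becomes a problem.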
