Hive foreign keys?
I am new to Hive. I've searched various sites, but none gave me a clear idea of the following:
A> Foreign keys: Hive never mentions foreign keys as a general concept. How, then, do we enforce referential constraints? (I am aware of the JOIN ON syntax; does it imply that the two tables have a primary key / foreign key relationship?) Is there a higher goal behind not supporting foreign keys?
B> Equality comparison of floats: there seems to be a problem with this. For example, to check whether A = 3.5, one writes "A > 3.49 and A < 3.51". Is that correct?
Are there any links / materials that can help with the implementation of HQL?
Appreciate any help
Thanks -Shiree
Hive is implemented as schema-on-read, so it performs no inherent referential integrity checks on datasets. Instead, integrity must be enforced by the originating system and, more importantly, by any queries that run on Hive.
Hive does not currently support FK / PK constraints.
But this may happen in the future. It would give the Hive CBO more information to make better cardinality estimates and better query rewrites:
https://issues.apache.org/jira/browse/HIVE-13019
https://issues.apache.org/jira/browse/HIVE-6905
In response to Mo K's answer, constraints don't necessarily mean overhead. Oracle, for example, has the "RELY NOVALIDATE" constraint state, so the CBO (or the Hive CBO in this case) can rely on the constraint to optimize queries without validating it.
Edit 02/18/2016: I created https://issues.apache.org/jira/browse/HIVE-13076 , please vote up if you are interested in this feature.
Edit 07/25/2016: https://issues.apache.org/jira/browse/HIVE-13076 was resolved in 06/2016 and should be landing in Hive 2.1. I don't see any updates in the official documentation yet.
Generally, the best practice in a data warehouse is to skip enforced referential integrity in order to avoid its overhead. If the need arises, you can enforce it in your queries instead.
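One way to enforce integrity in a query is an anti-join that lists child rows with no matching parent. The sketch below is only an illustration: the `orders` and `customers` tables and their columns are made up, and it runs against SQLite through Python's `sqlite3` module so it can be tried without a cluster. The SELECT itself uses only a LEFT OUTER JOIN and an IS NULL filter, which should carry over to HiveQL unchanged.

```python
import sqlite3

# Hypothetical tables: orders.customer_id is meant to reference customers.customer_id.
ORPHAN_QUERY = """
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT OUTER JOIN customers c
      ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY o.order_id
"""

def find_orphans(conn):
    """Return (order_id, customer_id) rows that have no matching customer."""
    return conn.execute(ORPHAN_QUERY).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'a'), (2, 'b');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);
""")

orphans = find_orphans(conn)
print(orphans)  # order 12 points at customer 3, which does not exist
```

Note that rows whose `customer_id` is NULL would also show up as orphans with this query; add `AND o.customer_id IS NOT NULL` if a NULL key should be allowed.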
Support for primary / foreign key constraints is available in Hive 2.1.0. See the 2.1.0 release notes.
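On part B of the question: comparing a floating-point column against a small range is the usual workaround, and "A > 3.49 and A < 3.51" is the same idea as testing ABS(A - 3.5) against a tolerance of 0.01. Below is a minimal sketch of the difference between exact and tolerance-based comparison. The `readings` table and the tolerance value are made up, and it runs on SQLite via Python's `sqlite3` rather than on Hive; Hive also provides `abs()`, so the WHERE clause should translate directly.

```python
import sqlite3

TOLERANCE = 1e-6  # assumed tolerance; pick one that suits the data

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (id INTEGER, a REAL);
    INSERT INTO readings VALUES (1, 3.5), (2, 3.4999999), (3, 3.6);
""")

# Exact equality only matches values stored as exactly 3.5.
exact = conn.execute(
    "SELECT id FROM readings WHERE a = 3.5 ORDER BY id"
).fetchall()

# Tolerance-based comparison also matches values that are merely very close.
close = conn.execute(
    "SELECT id FROM readings WHERE ABS(a - 3.5) < ? ORDER BY id",
    (TOLERANCE,),
).fetchall()

print(exact)
print(close)
```

If the values must compare exactly (money, for example), storing them in a DECIMAL column instead of FLOAT / DOUBLE avoids the problem altogether.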