In Pyspark HiveContext, what is the SQL OFFSET equivalent?
Or, more specifically: how can I handle a large amount of data that doesn't fit into memory all at once? With OFFSET I tried hiveContext.sql("select ... limit 10 offset 10") and incremented the offset to read all the data, but OFFSET doesn't seem to be valid in HiveContext. What alternative is usually used to achieve this?
For context, the PySpark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()
1 answer
The code will look like
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3,
               ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
        FROM tablename)
    SELECT column1, column2, column3 FROM result
    WHERE RowNum >= offsetvalue AND RowNum < (offsetvalue + limitvalue)""").show()
Note: replace column1, column2, column3, columnname, tablename, offsetvalue, and limitvalue with values appropriate for your use case.
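If the goal is to walk through the whole table page by page, as the question describes, one option is to run this query in a loop and bump the offset each pass. Below is a minimal sketch under the same placeholder names (tablename, columnname, column1..column3) and a hypothetical page size; it assumes an existing SparkContext sc:

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
page_size = 10000  # hypothetical page size; tune to what fits in memory
offset = 1         # ROW_NUMBER() starts at 1

while True:
    # Fetch one "page" of rows using ROW_NUMBER() in place of OFFSET
    page = hiveContext.sql("""
        WITH result AS (
            SELECT column1, column2, column3,
                   ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
            FROM tablename)
        SELECT column1, column2, column3 FROM result
        WHERE RowNum >= {0} AND RowNum < {1}""".format(offset, offset + page_size))

    rows = page.collect()  # materialize only this page on the driver
    if not rows:
        break              # no more rows, stop paging
    # ... process `rows` here ...
    offset += page_size

Keep in mind that each iteration re-scans and re-sorts the table, so for very large tables a single pass with DataFrame operations (for example, processing partitions directly instead of collecting pages to the driver) may be cheaper.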