In PySpark's HiveContext, what is the equivalent of SQL's OFFSET?

Or, to ask a more specific question: how can I process an amount of data that doesn't fit into memory all at once? I tried using OFFSET by running hiveContext.sql("select ... limit 10 offset 10") and incrementing the offset to page through all the data, but OFFSET doesn't seem to be valid in HiveContext. What alternative is usually used to achieve this?

For context, the PySpark code starts with:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()

      

1 answer


The code will look like this:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3,
               ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
        FROM tablename)
    SELECT column1, column2, column3 FROM result
    WHERE RowNum >= OFFSETvalue AND RowNum < (OFFSETvalue + limitvalue)
""").show()

      



Note: update the variables column1, column2, column3, columnname, tablename, OFFSETvalue and limitvalue to match your requirements.
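To page through an entire table with this pattern, you can parameterize the row-number window in a loop. Here is a minimal sketch, assuming a hypothetical table tablename ordered by columnname and a chunk size of 10,000 rows:

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
chunk_size = 10000
offset = 1  # ROW_NUMBER() numbering starts at 1

while True:
    chunk = hiveContext.sql("""
        WITH result AS (
            SELECT column1, column2, column3,
                   ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
            FROM tablename)
        SELECT column1, column2, column3 FROM result
        WHERE RowNum >= {0} AND RowNum < {1}
    """.format(offset, offset + chunk_size))
    rows = chunk.collect()  # only this chunk is materialized on the driver
    if not rows:
        break
    # process rows here
    offset += chunk_size

Keep in mind that each iteration rescans and re-sorts the table, so this can get expensive. If your Spark version supports it, df.toLocalIterator() streams results to the driver one partition at a time, which avoids both the repeated scans and loading everything into memory at once.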
