In Pyspark HiveContext, what is the SQL OFFSET equivalent?
Or, more specifically: how can I handle a large amount of data that doesn't fit into memory all at once? With OFFSET I tried hiveContext.sql("select ... limit 10 offset 10") and incremented the offset to read all the data, but OFFSET doesn't seem to be valid in HiveContext. What alternative is usually used to achieve this?
For context, the PySpark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()
1 answer
The code will look like
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3,
               ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
        FROM tablename)
    SELECT column1, column2, column3 FROM result
    WHERE RowNum >= offsetvalue AND RowNum < (offsetvalue + limitvalue)""").show()
Note: replace column1, column2, column3, columnname, tablename, offsetvalue, and limitvalue with values appropriate for your use case.
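If the goal is to walk through the whole table page by page, as the question describes, one option is to run this query in a loop and bump the offset each pass. Below is a minimal sketch under the same placeholder names (tablename, columnname, column1..column3) and a hypothetical page size; it assumes an existing SparkContext sc:

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
page_size = 10000  # hypothetical page size; tune to what fits in memory
offset = 1         # ROW_NUMBER() starts at 1

while True:
    # Fetch one "page" of rows using ROW_NUMBER() in place of OFFSET
    page = hiveContext.sql("""
        WITH result AS (
            SELECT column1, column2, column3,
                   ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
            FROM tablename)
        SELECT column1, column2, column3 FROM result
        WHERE RowNum >= {0} AND RowNum < {1}""".format(offset, offset + page_size))

    rows = page.collect()  # materialize only this page on the driver
    if not rows:
        break              # no more rows, stop paging
    # ... process `rows` here ...
    offset += page_size

Keep in mind that each iteration re-scans and re-sorts the table, so for very large tables a single pass with DataFrame operations (for example, processing partitions directly instead of collecting pages to the driver) may be cheaper.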