How can I delete rows in a table created from the Spark framework?

Basically, I would like to do a simple DELETE using a SQL statement, but when I execute it, it throws the following error:

pyspark.sql.utils.ParseException: u"\nmissing 'FROM' at 'a'(line 2, pos 23)\n\n== SQL ==\n\n    DELETE a.* FROM adsquare a\n-----------------------^^^\n"

This is the script I am using:

from pyspark.sql import SparkSession

sq = SparkSession.builder \
    .config("spark.rpc.message.maxSize", "1536") \
    .config("spark.sql.shuffle.partitions", str(shuffle_value)) \
    .getOrCreate()
adsquare = sq.read.csv(f, schema=adsquareSchemaDevice, sep=";", header=True)
adsquare_grid = adsquare.select("userid", "latitude", "longitude").repartition(1000).cache()
adsquare_grid.createOrReplaceTempView("adsquare")

sql = """
    DELETE a.* FROM adsquare a
    INNER JOIN codepoint c ON a.grid_id = c.grid_explode
    WHERE dis2 > 1 """

sq.sql(sql)

      

Note: the codepoint table is created at runtime.

Is there any other way to delete rows under the above conditions?


3 answers


You cannot delete rows from a DataFrame, but you can create a new DataFrame that excludes the unwanted rows.

sql = """
    Select a.* FROM adsquare a
    INNER JOIN codepoint c ON a.grid_id = c.grid_explode
    WHERE dis2 <= 1 """

new_df = sq.sql(sql)

      



This way you create a new DataFrame. Here I used the opposite condition, dis2 <= 1.



DataFrames in Apache Spark are immutable, so you cannot modify one to remove rows. Instead, filter out the rows you don't want and store the result in a new DataFrame.





You cannot delete rows from a DataFrame because Hadoop follows WORM (write once, read many). Instead, filter out the records you want to "delete" in the SQL statement, which gives you a new DataFrame.







