Spark DataFrame.coalesce(1) or DataFrame.repartition(1) doesn't seem to work for me

Question

Hi, I have a Hive insert-into-partition query that creates new Hive partitions. My table has two Hive partition columns named server and date. I am executing the insert query with the following code and trying to save the result:

DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla"); 
//above query creates orc file at /user/db/a1/20-05-22 
//I want only one part-00000 file at the end of above query so I tried the following and none worked 
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); OR

dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); OR

dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22"); OR

dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");

Regardless of whether I use coalesce or repartition, the above query generates about 200 small files of about 20 MB each at the location /user/db/a1/20-05-22. I only want one part-00000 file when using Hive. I thought that calling coalesce(1) would produce a single final part file, but that doesn't seem to happen. Am I wrong? Please guide. Thanks in advance.

+3

apache-spark apache-spark-sql

u449355 10 jul. 15 at 17:14

2 answers

df.coalesce(1) worked for me in Spark 2.1.1, so anyone who lands on this page need not worry as I did.

df.coalesce(1).write.format("parquet").save("a.parquet")

0

ruseel 02 nov. 17 at 5:41

ApolloFortyNine · Accepted Answer · 2015-07-10T20:01:17+0000

repartition controls how many pieces the data is split into while it runs through the Spark job, but how the output files are actually written is controlled by the Hadoop cluster.

Or at least that is how I understand it. The same question is also answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E

It shouldn't matter, though: why are you set on having a single file? getmerge will combine the part files for you if it's only for your own system.
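To make the getmerge suggestion concrete, here is a minimal sketch of the same idea on a local filesystem: concatenate every part-* file in a directory, in name order, into one target file. The class name PartFileMerger and its merge method are hypothetical helpers written for this illustration; they are not part of Spark or Hadoop. Plain byte concatenation is also an assumption that suits line-oriented text output; it should not be expected to produce a valid ORC file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not a Spark or Hadoop API): a local-filesystem
// analogue of combining part files the way getmerge does.
final class PartFileMerger {

    private PartFileMerger() {}

    /**
     * Concatenates every file in dir whose name starts with "part-",
     * in sorted name order, into target. Other files (for example a
     * _SUCCESS marker) are skipped. Returns the number of files merged.
     */
    static int merge(Path dir, Path target) throws IOException {
        List<Path> parts;
        // Collect the part files before the target is created, so the
        // target is never picked up as an input.
        try (Stream<Path> entries = Files.list(dir)) {
            parts = entries
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        // Append the bytes of each part file to the target, in order.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
        return parts.size();
    }
}
```

Calling PartFileMerger.merge(dir, dir.resolve("merged.txt")) on a directory holding part-00000 and part-00001 writes the contents of part-00000 followed by part-00001 into merged.txt and returns 2.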
