Spark dataFrame.coalesce(1) or dataFrame.repartition(1) doesn't seem to work for me

Hi, I have a Hive insert-into query that creates new Hive partitions. I have two Hive partition columns named server and date. I execute the insert-into query with the following code and then try to save the result:

DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla");
// The above query creates orc files at /user/db/a1/20-05-22
// I want only one part-00000 file at the end of the above query, so I tried each of the following and none worked
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1");
// OR
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
// OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");

      

Regardless of whether I use coalesce or repartition, the above query generates about 200 small files of about 20 MB each at the location /user/db/a1/20-05-22. I want only one part-00000 file when using Hive. I thought that calling coalesce(1) would create a single final part file, but that doesn't seem to happen. Am I wrong? Please guide. Thanks in advance.

apache-spark apache-spark-sql




2 answers


Repartition controls how many pieces the data is split into while the Spark job runs, but how the file is actually written out is handled by the Hadoop cluster.

At least, that is how I understand it. You can also see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E



It shouldn't really matter, though. Why are you set on a single file? getmerge will combine the pieces for you if it's only for your own system.



df.coalesce(1) worked for me in Spark 2.1.1, so anyone who sees this page needn't worry as I did.



df.coalesce(1).write.format("parquet").save("a.parquet") 
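For Java users, here is a rough sketch of the same pattern, assuming the Spark 2.x SparkSession API and a local master. The class name SinglePartFile, the helper writeSingleFile, and the spark.range test data are all made up for illustration and stand in for the SELECT part of the question's query. The point is that coalesce(1) is applied to the DataFrame holding the rows, before write() is called. As far as I understand, the DataFrame returned by an INSERT statement does not hold the inserted rows, so coalescing it afterwards cannot change the files the insert already wrote. Swapping format("parquet") for format("orc") should be analogous, but I have not checked that on every Spark version.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import java.io.File;

public class SinglePartFile {

    // Writes some sample rows to outputDir as Parquet after coalescing to one
    // partition, then returns the number of part files found in outputDir.
    static int writeSingleFile(SparkSession spark, String outputDir) {
        // Hypothetical source data, deliberately spread over 8 partitions so
        // that a plain write would produce several part files.
        Dataset<Row> selected = spark.range(0, 1000).toDF("id").repartition(8);

        // Reduce to a single partition BEFORE writing: one partition, one part file.
        selected.coalesce(1)
                .write()
                .format("parquet")
                .mode(SaveMode.Overwrite)
                .save(outputDir);

        // Count data files only; checksum files start with a dot and are skipped.
        File[] parts = new File(outputDir).listFiles(
                (dir, name) -> name.startsWith("part-") && name.endsWith(".parquet"));
        return parts == null ? 0 : parts.length;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")            // assumption: local run, no cluster
                .appName("single-part-file")
                .getOrCreate();
        String out = System.getProperty("java.io.tmpdir")
                + File.separator + "coalesce-demo-" + System.nanoTime();
        System.out.println("part files written: " + writeSingleFile(spark, out));
        spark.stop();
    }
}
```

Keep in mind that coalesce(1) pushes all the data through a single task, so this is only sensible when the output is small enough for one executor to handle.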

      







