Spark dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work for me
Hi, I have a Hive insert-into query that creates new Hive partitions. My table has two Hive partition columns named server and date. I am running the insert-into query with the following code and trying to save the result:
DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla");
//above query creates orc file at /user/db/a1/20-05-22
//I want only one part-00000 file at the end of above query so I tried the following and none worked
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); // OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); // OR
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22"); // OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
Regardless of whether I use coalesce or repartition, the above query generates around 200 small files of about 20 MB each at the location /user/db/a1/20-05-22. I want only one part-00000 file when using Hive. I thought that if I call coalesce(1) then it would create a single final part file, but that does not seem to happen. Am I wrong? Please guide. Thanks in advance.
Repartition controls how many pieces the data is split into while the Spark job runs, but how the file is actually saved is controlled by the Hadoop cluster.
Or that is how I understand it. You can also see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E
It shouldn't matter though; why are you set on a single file? getmerge will merge the part files together for you if it is only for your own system.
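To make the getmerge suggestion concrete: `hadoop fs -getmerge` concatenates the files under a source directory into one local file. Below is a rough sketch of that idea in plain Java on the local filesystem. It is not the Hadoop implementation; the class name PartFileMerger and the method mergeParts are hypothetical and exist only for illustration. Note that this is a plain byte-for-byte append, so it suits text output; ORC or Parquet part files joined this way would generally not be readable as one file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop or Spark: mimics the idea behind
// `hadoop fs -getmerge` on the local filesystem.
class PartFileMerger {

    // Appends every regular file named part-* in srcDir to target, in name order.
    // Returns the number of part files that were merged.
    static int mergeParts(Path srcDir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(srcDir)) {
            parts = files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        // The target is created, or truncated if it already exists.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // plain byte-for-byte append
            }
        }
        return parts.size();
    }
}
```

Marker files such as _SUCCESS are skipped because only names starting with part- are picked up. The target should live outside srcDir so it is not listed as an input.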
df.coalesce(1)
worked for me in Spark 2.1.1, so anyone who sees this page need not worry as I did.
df.coalesce(1).write.format("parquet").save("a.parquet")