Spark dataFrame.coalesce(1) or dataFrame.repartition(1) doesn't seem to work for me
Hi, I have a Hive insert-into query that creates new Hive partitions. The table has two Hive partition columns named server and date. I am executing the insert query with the following code and trying to save the result:
DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla");
//above query creates orc file at /user/db/a1/20-05-22
//I want only one part-00000 file at the end of above query so I tried the following and none worked
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); OR
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22"); OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
Regardless of whether I use coalesce or repartition, the above query generates about 200 small files of about 20MB each at the location /user/db/a1/20-05-22. I want only one part-00000 file, for performance reasons when using Hive. I thought that if I call coalesce(1) then it will create a single final part file, but that doesn't seem to happen. Am I wrong? Please guide. Thanks in advance.
Repartition controls how many pieces the data is split into while the Spark job runs, but how the file is actually saved is controlled by the Hadoop cluster.
Or at least that is how I understand it. You can also see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ@mail.gmail.com%3E
This shouldn't really matter, though. Why are you set on having a single file? getmerge (hadoop fs -getmerge) will concatenate the part files into one for you if it's only for your own system.
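To make concrete what a getmerge-style merge does, here is a minimal sketch in plain Java that concatenates every part-* file in a local directory into one target file, in file-name order. The class name MergeParts and the method mergeParts are made up for this illustration and are not part of Hadoop or Spark. It is a plain byte-level concatenation, so it suits text-like output; joining ORC files this way is not expected to give a readable ORC file, so for the ORC case in the question treat it only as an illustration of the idea.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Hadoop or Spark API.
public class MergeParts {

    // Concatenates all files matching part-* in partDir into target,
    // in file-name order, and returns how many part files were merged.
    // The target should live outside partDir (or not be named part-*).
    static int mergeParts(Path partDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(partDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        // Directory listing order is unspecified, so sort by path.
        Collections.sort(parts);

        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out); // append the raw bytes of each part file
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: MergeParts <partDir> <targetFile>");
            return;
        }
        int merged = mergeParts(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("merged " + merged + " part files");
    }
}
```

Marker files such as _SUCCESS are skipped because they do not match the part-* pattern.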