Posted to user@spark.apache.org by Mehdi Ben Haj Abbes <me...@gmail.com> on 2016/11/30 15:12:44 UTC

Parallel dynamic partitioning producing duplicated data

Hi Folks,



I have a Spark job reading a CSV file into a DataFrame. I register that
DataFrame as a temp table (tempTable), then I write that DataFrame/tempTable
to a Hive external table (using Parquet format for storage).

I’m using this kind of command:

    hiveContext.sql("INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) SELECT * FROM tempTable");
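
For context, a minimal sketch (Scala, Spark 1.x style) of what the job does;
the table name t, the temp table name and the partition columns statPart/dynPart
are the ones from the command above, while the app name, CSV options and input
path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("csv-to-hive"))
    val hiveContext = new HiveContext(sc)

    // Read the CSV file into a DataFrame (spark-csv data source assumed).
    val df = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///path/to/input.csv")

    // Expose the DataFrame to SQL as a temporary table.
    df.registerTempTable("tempTable")

    // Enable dynamic partitioning, then insert into the external Parquet table.
    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    hiveContext.sql(
      "INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) " +
      "SELECT * FROM tempTable")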



With this integration, each CSV line should produce exactly one Parquet
record, so the total number of lines across the CSV files must equal the row
count of the Parquet dataset produced.
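
As a sanity check, that invariant can be asserted from the driver right after
the insert; this sketch assumes the job is the only writer to its static
partition and reuses the df and hiveContext names from the snippet above:

    // Compare the number of CSV lines read with the number of Parquet rows
    // landed under the static partition this job writes to.
    val csvCount = df.count()
    val parquetCount = hiveContext
      .sql("SELECT COUNT(*) FROM t WHERE statPart = 'string_value'")
      .collect()(0).getLong(0)
    assert(csvCount == parquetCount,
      s"count mismatch: csv=$csvCount parquet=$parquetCount")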



I launch 20 of these jobs in parallel (to take advantage of idle resources).
Sometimes the Parquet count is randomly slightly bigger than the CSV count
(the difference mainly concerns one dynamic partition and one CSV file that
was integrated), but if I launch the jobs sequentially, one after the other,
I never see a count mismatch.



Does anyone have any idea about the cause of this problem (the count
mismatch)? To me it is obvious that the parallel execution is causing the
issue, and I strongly believe it happens when the data is moved from the
hive.exec.stagingdir staging directory to the final Hive table location on HDFS.
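
If the overlap really does happen in the staging directory, one experiment
(an untested idea, not a confirmed fix) would be to give each parallel job
its own staging prefix so the jobs cannot touch each other's temporary files:

    // Hypothetical experiment: isolate each job's staging area by making the
    // staging prefix unique per job; the UUID-based suffix is a placeholder.
    val jobId = java.util.UUID.randomUUID().toString
    hiveContext.setConf("hive.exec.stagingdir", s".hive-staging-$jobId")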



Thanks in advance.