Posted to user@spark.apache.org by Jerry Lam <ch...@gmail.com> on 2015/10/22 18:00:59 UTC

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

Hi Spark users and developers,

I read the ticket [SPARK-8578] (Should ignore user defined output committer
when appending data), which ignores DirectParquetOutputCommitter when append
mode is selected. The rationale was that it is unsafe to use, because a
failed job cannot be reverted in append mode with
DirectParquetOutputCommitter. Wouldn't it be better to allow users to use it
at their own risk? For example, if you use DirectParquetOutputCommitter with
append mode, the job fails immediately when a task fails; the user can then
choose to reprocess the job entirely, which is not a big deal since failure
is rare in most cases. Another approach is to provide at-least-once task
semantics for append mode using DirectParquetOutputCommitter. This would
produce duplicate records, but for some applications that is acceptable.
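For context, this is roughly the kind of configuration that gets silently
overridden today. This is only a sketch against the Spark 1.5-style API; the
bucket, paths, and app name are placeholders, and the committer's package
has moved between releases, so adjust for your version:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setAppName("direct-committer-append"))
val sqlContext = new SQLContext(sc)

// Ask for the direct committer explicitly.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

val df = sqlContext.read.parquet("s3n://my-bucket/new-batch") // placeholder path

// Because of SPARK-8578, the append below falls back to the default
// ParquetOutputCommitter regardless of the setting above.
df.write.mode(SaveMode.Append).parquet("s3n://my-bucket/table")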

The second issue is that the assumption that Overwrite mode works with
DirectParquetOutputCommitter in all cases is wrong, at least when it is used
with S3. S3 provides eventual consistency for overwrite PUTs and DELETEs, so
if you delete a directory and then recreate the same directory a split
second later, the chance of hitting
org.apache.hadoop.fs.FileAlreadyExistsException is very high: the delete
does not take effect immediately, and creating the same file before it is
fully deleted results in the above exception. Might I propose changing the
code so that it actually OVERWRITEs the file instead of doing a delete
followed by a create?
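To make the two behaviours concrete, here is a rough sketch using the Hadoop
FileSystem API. The bucket, directory, and file names are placeholders, and
this is only an illustration of the two sequences of filesystem operations,
not the actual Spark code path:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("s3n://my-bucket"), conf)
val outputDir = new Path("s3n://my-bucket/table")

// What Overwrite mode effectively does today: delete the output, then
// immediately recreate it. With S3's eventual consistency the delete may
// not be visible yet, so the recreate can fail with
// org.apache.hadoop.fs.FileAlreadyExistsException.
fs.delete(outputDir, true)
fs.mkdirs(outputDir)

// The proposal: write each object with an overwriting create (a single PUT
// that replaces the old key) instead of a DELETE followed by a PUT.
val part = new Path(outputDir, "part-r-00000.parquet") // hypothetical file name
val out = fs.create(part, true) // overwrite = true
out.close()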

Best Regards,

Jerry