Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/02/03 08:16:39 UTC

[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Hadoop output committing

GitHub user JoshRosen reopened a pull request:

    https://github.com/apache/spark/pull/4066

    [SPARK-4879] [WIP] Use driver to coordinate Hadoop output committing

    (This is a WIP commit so that Jenkins tests my code; I still need to add tests and think through a few corner-cases.)
    
    I believe that Spark's SparkHadoopWriter misuses Hadoop's OutputCommitter: OutputCommitter.commitTask appears to assume that coordination has already been performed by the AM / driver. Because we currently lack that coordination, redundant copies of a task are allowed to attempt to commit their output even after the job has completed, which can leave task output missing (and, due to some odd Hadoop behaviors, can even lead to a completed job's output being deleted).
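
    To make the failure mode concrete, here is a toy, self-contained Scala model of the uncoordinated path (the names here are illustrative, not Spark's actual code): each attempt decides locally whether to commit, so a speculative or re-launched copy of a partition can commit right alongside the original.

        // Toy model of the duplicate-commit race; not Spark's actual code.
        object UncoordinatedCommitDemo {
          final case class Attempt(partition: Int, attemptId: Int)

          def main(args: Array[String]): Unit = {
            val committed = scala.collection.mutable.ArrayBuffer.empty[Attempt]

            // Each attempt only performs a local check (analogous to
            // OutputCommitter.needsTaskCommit) before "committing", so nothing
            // stops two attempts for the same partition from both succeeding.
            def runAttempt(a: Attempt): Unit = {
              val needsCommit = true               // local decision only
              if (needsCommit) committed += a      // stand-in for committer.commitTask(ctx)
            }

            runAttempt(Attempt(partition = 0, attemptId = 0))  // original task
            runAttempt(Attempt(partition = 0, attemptId = 1))  // speculative / re-launched copy

            // Both attempts committed partition 0: the race described above.
            println(committed)
          }
        }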
    
    The fix here is to add some centralized coordination in the driver for deciding which copy of a task is allowed to commit its task output to HDFS.  The architecture here is a little hacky, since it involves a new RPC from SparkOutputCommitter directly to the DAGScheduler.  The reason that we send the message to DAGScheduler, as opposed to some other actor, is to ensure proper ordering / interleaving with other events.
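
    As a rough illustration of the driver-side arbitration described above (a minimal sketch under assumed names, not the code in this patch), the driver can keep a map from (stage, partition) to the attempt that has been authorized to commit: the first attempt to ask wins, and every later attempt for the same partition is told to skip commitTask. In the actual change this request travels as an RPC handled in the DAGScheduler's event loop, so the decision is ordered with other scheduler events.

        // Sketch only: first-asker-wins arbitration of which attempt may commit.
        class CommitArbiter {
          // (stageId, partition) -> attemptId that has been authorized to commit
          private val authorized = scala.collection.mutable.Map.empty[(Int, Int), Long]

          // Returns true only for the single attempt allowed to call commitTask.
          def canCommit(stageId: Int, partition: Int, attemptId: Long): Boolean = synchronized {
            authorized.get((stageId, partition)) match {
              case Some(winner) => winner == attemptId       // already decided; only the winner may commit
              case None =>
                authorized((stageId, partition)) = attemptId // first asker wins
                true
            }
          }
        }

        object CommitArbiterDemo extends App {
          val arbiter = new CommitArbiter
          println(arbiter.canCommit(stageId = 1, partition = 0, attemptId = 0L))  // true: may commit
          println(arbiter.canCommit(stageId = 1, partition = 0, attemptId = 1L))  // false: skip commitTask
        }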
    
    See https://issues.apache.org/jira/browse/SPARK-4879 for full context.  I'll write a real commit message / description later (the problem is a little subtle and it will take some work to come up with a nice, concise summary).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-4879-sparkhadoopwriter-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4066
    
----
commit dbfed0f81001ac8866f32e1d9edd20a449a8b7e9
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-01-16T01:52:21Z

    WIP commit towards fixing SPARK-4879

commit c25c9972d9878b91ddcbc9c9a32d5453f781191a
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-01-16T01:52:52Z

    Fix scalastyle issue

commit beba16e8bcba493b8de26b065794014b64d23f82
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-01-16T07:30:01Z

    Fix NPE for non-result tasks

commit 8c64d12d2e4f5b7b377cce0f49c941870958cdef
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-01-16T07:33:25Z

    Fix NPE for tasks that complete after stage

commit 63a7707cad01f4dcc2c74c4a6bffded9c887f9d4
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-01-16T21:13:11Z

    Fix DAGScheduler actor path; use more SparkConf retry settings.

----

