Posted to reviews@spark.apache.org by rdblue <gi...@git.apache.org> on 2018/06/13 19:54:13 UTC

[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/21558

    [SPARK-24552][SQL] Use task ID instead of attempt number for v2 writes.

    ## What changes were proposed in this pull request?
    
    This passes the unique task attempt id instead of attempt number to v2 data sources because attempt number is reused when stages are retried. When attempt numbers are reused, sources that track data by partition id and attempt number may incorrectly clean up data because **the same attempt number can be both committed and aborted**.
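
    For illustration, a minimal sketch of the change (the surrounding lines and names, `stageId`/`partId`/`attemptId`, come from `DataWritingSparkTask` in WriteToDataSourceV2.scala, shown in the diff later in this thread):
    
    ```
    val stageId = context.stageId()
    val partId = context.partitionId()
    // Before: attempt numbers restart at 0 for each stage attempt, so a task in a
    // retried stage can collide with a still-running task from the failed attempt.
    // val attemptId = context.attemptNumber()
    // After: the task ID (TID) is a globally unique one-up counter; it is
    // truncated to int to fit the existing DataWriterFactory API.
    val attemptId = context.taskAttemptId().toInt
    ```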
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-24552-v2-source-work-around

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21558.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21558
    
----
commit e9e776a097f5dca1dccdd6e50b3790e6a91873d8
Author: Ryan Blue <bl...@...>
Date:   2018-06-13T19:50:00Z

    SPARK-24552: Use task ID instead of attempt number for v2 writes.

----


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Actually I didn't mean speculation but something like this:
    
    ```
    import org.apache.spark.TaskContext

    sc.parallelize(1 to 10).foreach { i =>
      if (TaskContext.get().attemptNumber() == 0) throw new Exception("Fail")
      else println(i)
    }
    ```
    
    Anyway I ran that and the behavior is the same (different attempt = different task ID) so it's all good.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    **[Test build #92050 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92050/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    @cloud-fan, this is a work-around for SPARK-24552. I'm not sure of the right way to fix this, short of fixing the scheduler so that it doesn't reuse task attempt numbers, but I think this works.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/295/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    FYI, I'm preparing my own version of this PR with the remaining feedback addressed. Ryan was on paternity leave and I don't know whether he's done yet, so he may not be that responsive.
    
    This will conflict with the output commit coordinator change in any case, so one of them needs to wait (and that one is further along).


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    **[Test build #91794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91794/testReport)** for PR 21558 at commit [`e9e776a`](https://github.com/apache/spark/commit/e9e776a097f5dca1dccdd6e50b3790e6a91873d8).


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    **[Test build #92043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92043/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Ah, ok, I was looking at my own version as well. There are other things we should update for v2 too: other functions with these variable names, the description in DataWriterFactory.java, etc.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Retest this please.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    **[Test build #92043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92043/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    So I looked through the code, and it certainly appears to be a bug in the existing code (not just the v2 datasource API). If you have one stage running that gets a fetch failure, and it leaves any tasks running with attempt 0, those tasks could conflict with the restarted stage, since its tasks would all start with attempt 0 as well. When I say it could conflict, I mean there would be a race if they go to commit at about the same time. It's probably more of an issue if one task commits, the job commit starts, and then the other task starts to commit its output; you could end up with an incomplete/corrupt file. We should see the warning "Authorizing duplicate request to commit" in the logs if this occurs, though.
    
    @rdblue does this match what you are seeing?



---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21558#discussion_r196592175
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
    @@ -110,7 +108,7 @@ object DataWritingSparkTask extends Logging {
           useCommitCoordinator: Boolean): WriterCommitMessage = {
         val stageId = context.stageId()
         val partId = context.partitionId()
    -    val attemptId = context.attemptNumber()
    +    val attemptId = context.taskAttemptId().toInt
    --- End diff --
    
    I was going to suggest removing the cast to int, but well, that's in `DataWriterFactory` and would be an API breakage... hopefully it won't cause issues aside from weird output names when the value overflows the int.
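    
    For example, a small sketch of the truncation concern (the values are illustrative):
    
    ```
    val tid: Long = Int.MaxValue.toLong + 1  // TID 2147483648
    val attemptId: Int = tid.toInt           // wraps to -2147483648
    println(s"attempt-$attemptId")           // a "weird" negative output name
    ```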


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    IMO your change is the right fix, not just a workaround. I don't think it's a scheduler bug (though it's definitely unclear). I'll move that discussion to the jira.
    
    An alternative would be using `context.stageAttemptNumber()` and `context.attemptNumber()` together (see the sketch at the end of this message).
    
    
    > the stage ID can change for a different execution of the same stage, IIRC, and that would reset the attempt id.
    
    hmm, the only place I could imagine that happening is with a shared shuffle dependency between jobs, which gets renumbered and then skipped, but then perhaps re-executed on a fetch-failure.  That isn't relevant here, though, since it would only be shuffle map stages, not result stages.
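    
    For reference, a hypothetical sketch of the alternative mentioned above; the bit-packing scheme is an assumption for illustration, not anything Spark does:
    
    ```
    import org.apache.spark.TaskContext
    
    val ctx = TaskContext.get()
    // Both numbers are small ints, so packing them into one Long gives a value
    // that stays unique across stage retries for a given partition.
    val uniqueAttempt: Long =
      (ctx.stageAttemptNumber().toLong << 32) | (ctx.attemptNumber().toLong & 0xffffffffL)
    ```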


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Sorry, trying to catch up on this thread. Are we saying this is a bug in the existing output committer as well when we have a fetch failure?


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    @vanzin, thanks for working on this. I was out most of this week at a conference and I'm still on just half time, which is why I was delayed. Sorry to leave you all waiting. I'll make comments on your PR.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Actually, scratch that, there is really a problem. The output committer only checks the stage ID, not the stage attempt ID, so it will still allow tasks from the failed attempt to commit...


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    **[Test build #91794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91794/testReport)** for PR 21558 at commit [`e9e776a`](https://github.com/apache/spark/commit/e9e776a097f5dca1dccdd6e50b3790e6a91873d8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3998/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    @rdblue would you want to update this for the v1 and hadoop committers? It should be very similar to this change, but in createTaskAttemptContext, which should take the actual task attempt id (type Long) instead of the attempt number as an argument.
    
    I would need to look at the v1 writers to make sure there is nothing else, but perhaps you are more familiar with them.
    
    If not, we can split that into a separate jira.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Yeah, that is the code I was still looking at, to verify whether it can actually happen. On a fetch failure the DAG scheduler removes the stage from the output committer, but when the new stage attempt starts it just puts the stage back keyed by the stage id (not the attempt id).


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4182/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    **[Test build #92050 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92050/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    I'm fine with this; my PR should fix the underlying issue but this still seems like a good idea to me.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    > IMO your change is the right fix, not just a workaround
    
    @squito, part of the problem is that the output commit coordinator -- which ensures only one attempt of a task commits -- relies on the attempt number to allow or deny commits. If this is the correct fix, should we also pass the TID as the attempt id to the commit coordinator?
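    
    A hypothetical sketch of what that might look like in the write path; the `canCommit` signature below is an assumption based on the current coordinator, and this only compiles inside Spark itself since the coordinator is `private[spark]`:
    
    ```
    val coordinator = org.apache.spark.SparkEnv.get.outputCommitCoordinator
    // Key authorization on the truncated TID rather than the stage-local attempt
    // number, so a retried stage cannot reuse an already-authorized value.
    val commitAuthorized = coordinator.canCommit(
      context.stageId(), context.partitionId(), context.taskAttemptId().toInt)
    ```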


---



[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21558#discussion_r196277478
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
    @@ -110,7 +108,7 @@ object DataWritingSparkTask extends Logging {
           useCommitCoordinator: Boolean): WriterCommitMessage = {
         val stageId = context.stageId()
         val partId = context.partitionId()
    -    val attemptId = context.attemptNumber()
    +    val attemptId = context.taskAttemptId().toInt
    --- End diff --
    
    nit: can you rename this variable to `tid`? These names are pretty confusing, but I think that at least "tid" is used consistently and exclusively for this, while "attempt" means a lot of different things.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    @vanzin, the ID that this uses is the TID, which I thought was always unique. It appears to be a one-up counter. Also, I noted on your PR that both are needed because even if we only commit one of the attempts, the writers may use this ID to track and clean up data.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    See link above for the updated PR.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/288/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    > the ID that this uses is the TID, which I thought was always unique. It appears to be a one-up counter
    
    If that's the case, and a different attempt for the same partition will get a different task ID, that's fine. I was under the impression it would have the same task ID.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    I'm not sure, but I'm starting to think the part of the fix for SPARK-18113 that added the "Authorizing duplicate request..." stuff should be removed.
    
    The rest of that change replaces `askWithRetry` in the RPC calls with `ask`, so now there will only be one request per task. So there's no need to handle duplicate requests that way, because the only time you'd have a duplicate is when this situation happens (same task attempt number running concurrently on two stage attempts).


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    I guess https://issues.apache.org/jira/browse/SPARK-24492 is potentially caused by the output committer issue?


---



[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21558


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    > So the problem here is, when we retry a stage, Spark doesn't kill the tasks of the old stage and just launch tasks for the new stage
    
    I think that's something that should be fixed, but it wouldn't entirely fix the problem unless we were very careful about ordering in the driver: the stage would have to fail, then stop allowing commits, then wait for all of the tasks that were allowed to commit to finish, and account for the coordination messages being in flight. Not an easy problem.
    
    I'd like to see a fix that makes the attempt number unique within a job and partition, i.e., no two tasks should have the same (job id, partition id, attempt number) triple as Wenchen said.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Correct, the TaskId is unique. I agree with @squito; can we rename it to tid?
    
    Hmm, good point @rdblue; we need to make sure the output directories/files and such would be unique. That problem may actually exist in the current committers as well; I will comment on the other PR.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91794/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/108/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    I started with some ideas in #21577; I haven't tested that PR aside from the modified unit test, but I think it's in the right direction. I plan to work more on it Monday, but I'm leaving it there in case people are around over the weekend and want to comment.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    But note that the task ID is also not necessarily unique (since you can have multiple attempts of the same task). So perhaps you should consider Imran's suggestion of mixing stage and task attempt numbers.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    For a write job, the writing only happens at the last stage, so the stage id doesn't matter, and the data source v2 API assumes `(job id, partition id, task attempt id)` can uniquely define a write task, even counting the failure cases.
    
    So the problem here is, when we retry a stage, Spark doesn't kill the tasks of the old stage and just launches tasks for the new stage. We may have running write tasks that have the same `(job id, partition id, task attempt id)`.
    
    One solution is, we can just use `(job id, task id)` to uniquely define a write task. But we may lose information like the index of this task and how many times it has been retried.
    
    Or we can use `(job id, partition id, stage attempt id, task attempt id)`, which is a little verbose, but has all the information.
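    
    To make the two options concrete, a hypothetical sketch (these case classes are illustrative only, not Spark API):
    
    ```
    // Option 1: compact, but loses the partition index and the retry count.
    case class WriteTaskId(jobId: String, taskId: Long)
    
    // Option 2: verbose, but carries all the information.
    case class WriteTaskIdVerbose(
        jobId: String,
        partitionId: Int,
        stageAttemptId: Int,
        taskAttemptId: Int)
    ```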


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21558#discussion_r197222759
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
    @@ -110,7 +108,7 @@ object DataWritingSparkTask extends Logging {
           useCommitCoordinator: Boolean): WriterCommitMessage = {
         val stageId = context.stageId()
         val partId = context.partitionId()
    -    val attemptId = context.attemptNumber()
    +    val attemptId = context.taskAttemptId().toInt
    --- End diff --
    
    HadoopWriteConfigUtil has the same issue: it's a public interface and uses an int for the attempt number.
    It seems somewhat unlikely, but it's more plausible to overflow an int with task ids in Spark than in, say, MapReduce. We do have partitionId as an Int, so if partition counts approach the Int range and you have task failures, then task ids could go over Int. Looking at our options.



---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    > If you have one stage running that gets a fetch failure, if it leaves any tasks running
    
    I took a look at the output coordinator code and, depending on how the scheduler behaves, it might be ok.
    
    The coordinator will deny commits for finished stages; so it depends on the order of things. If the failed attempt is marked as "failed" before the next attempt starts, then it's ok, even if tasks for the failed attempt are still running. Looking at the code handling `FetchFailed` failures in `DAGScheduler`, that seems to be the case.



---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    If you are working on this, I'll merge the other one, wait for you, and continue to investigate in parallel.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    @tgravescs, that's exactly what we're seeing. I think it might just be misleading to have a stage-local attempt ID, although it is more friendly for users and matches what MR does.
    
    @jiangxb1987, we see SPARK-24492 occasionally (it has gotten better with later fixes to the coordinator) and haven't tracked down the cause yet. If this were the underlying cause that would be great, but I'm not sure how it could be the cause. If the same attempt number is reused, then two tasks in different stage attempts may both be authorized to commit. That wouldn't cause the retries because it wouldn't cause extra commit denials.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92043/
    Test FAILed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    (Or, alternatively, the output committer could track task IDs instead of attempt numbers. That should result in the same behavior but haven't really looked at that.)


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    ping @rdblue ^. If I don't hear back tomorrow, I will file a separate jira.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Yes, I just checked and speculative attempts do get a different TID. Just turn on speculation, run a large stage, and sort tasks in a stage by TID. There aren't duplicates.
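    
    A sketch of one way to run that check, assuming a spark-shell session (spark.speculation is a real config; the job itself is arbitrary):
    
    ```
    // Start with: spark-shell --conf spark.speculation=true
    // Run a stage with enough slow tasks that some get speculated:
    sc.parallelize(1 to 100000, 1000).map { i => Thread.sleep(10); i }.count()
    // Then sort the stage's tasks by "Task ID" in the web UI: speculative
    // attempts show up with distinct TIDs, so there are no duplicates.
    ```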


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92050/
    Test FAILed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    It's a little weird to use the attempt id here anyway; the stage ID can change for a different execution of the same stage, IIRC, and that would reset the attempt id.
    
    @squito probably remembers a lot more about this off the top of his head.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4189/
    Test PASSed.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    As @squito suggested, we can either use taskAttemptId or combine stageAttemptId and taskAttemptNumber together; either of those can represent a unique task attempt.


---



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21558
  
    I think the right thing to do for this commit is to use the task ID instead of the stage-local attempt number. I've updated the PR with the change so I think this is ready to commit. @vanzin, are you okay committing this?
    
    cc @cloud-fan


---
