Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/31 15:54:24 UTC

[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19623

    [SPARK-22078][SQL] clarify exception behaviors for all data source v2 interfaces

    ## What changes were proposed in this pull request?
    
    Clarify exception behaviors for all data source v2 interfaces.
    
    ## How was this patch tested?
    
    Documentation change only.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark data-source-exception

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19623
    
----
commit 68fe4512dc7baaaf810db4202397fc9d138d2ed7
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-10-31T15:53:24Z

    clarify exception behaviors for all data source v2 interfaces

----


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83300/
    Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148239331
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
    @@ -37,13 +37,19 @@
        * The preferred locations where this read task can run faster, but Spark does not guarantee that
        * this task will always run on these locations. The implementations should make sure that it can
        * be run on any location. The location is a string representing the host name of an executor.
    +   *
    +   * If an exception was thrown, the action would fail and we guarantee that no Spark job was
    +   * submitted.
        */
       default String[] preferredLocations() {
         return new String[0];
    --- End diff --
    
    For `independent-of-location placement`, users can just return `new String[0]`.
    
    I don't know where Spark treats "localhost" as "anywhere", but I'd expect it to be treated as an unrecognized host, and then it's "anywhere".
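
    For illustration, a hedged sketch of both cases (the class and `blockHosts` field are hypothetical, and `createDataReader` is assumed from the v2 reader API, not shown in this diff):

    ```java
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.sources.v2.reader.DataReader;
    import org.apache.spark.sql.sources.v2.reader.ReadTask;

    class LocalityAwareReadTask implements ReadTask<Row> {
      private final String[] blockHosts;  // null or empty when locality is unknown

      LocalityAwareReadTask(String[] blockHosts) {
        this.blockHosts = blockHosts;
      }

      @Override
      public String[] preferredLocations() {
        // An empty array means independent-of-location placement; an
        // unrecognized host name should end up being treated the same way.
        return blockHosts == null ? new String[0] : blockHosts;
      }

      @Override
      public DataReader<Row> createDataReader() {
        throw new UnsupportedOperationException("omitted in this sketch");
      }
    }
    ```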


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148533741
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    -   * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer
    -   * can commit successfully, or have a way to clean up the data of already-committed writers.
    +   * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
    +   * writer can commit, or have a way to clean up the data of already-committed writers.
        */
       void commit(WriterCommitMessage[] messages);
     
       /**
        * Aborts this writing job because some data writers failed to write the records and aborted,
    --- End diff --
    
    I'll update the comment; we do retry failed write tasks before calling this method.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148643598
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    > Your algorithm doesn't work
    
    To be clear: I know you can't reliably guarantee exactly one commit of a task in all conditions, but as long as speculative tasks don't manifest their work in the dest directory until commitTask is called, you can have a stateful driver grant taskCommit exactly once per task attempt. It just can't be confident that anything received that request.
    The driver then needs to react to a failure to return success by failing the job, saying "something went wrong", and letting the layers above decide how to react. The output should have the results of 0-1 commits of that task, plus whatever other tasks had already committed.
    
    Of course, the layers above have to somehow be connected to the driver, otherwise they won't be able to distinguish "job failed" from "job completed"; hence the fact that the `_SUCCESS` manifest is considered the marker of a successful job. (Which of course means that you need a consistent store so that a HEAD of `_SUCCESS` doesn't return an old copy, so a simple probe isn't actually sufficient for S3.)
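
    For illustration, a hedged sketch of that probe (class and method names are mine), which as noted is only trustworthy on a consistent store:

    ```java
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class SuccessMarker {
      // True if the job's _SUCCESS marker exists. On an eventually consistent
      // store, this probe can see a stale copy, so it is not sufficient alone.
      static boolean jobSucceeded(FileSystem fs, Path outputDir) throws IOException {
        return fs.exists(new Path(outputDir, "_SUCCESS"));
      }
    }
    ```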
    
    Oh, and there are a couple of failure modes I forgot to mention: executors doing their work after you think they've stopped, e.g.:
    
    * task attempt 1's commitTask called -> timeout; task attempt 2's commitTask called -> success; then task attempt 1 actually does its commit.
    * task 1 invokes commitTask -> timeout; the job reacts to the failure by aborting the job, something retries, etc., *then task 1 actually does the commit*. That is: even after the entire job has aborted and a new job has executed, it is possible for a partitioned task from the first attempt to commit its work.
    
    That's an interesting one to defend against. 
    
    I do know that the MapReduce AM for retryable jobs tries to minimise this risk by only attempting to commit work if the last time it interacted with YARN was within
    `yarn.app.mapreduce.am.job.committer.commit-window` millis. So a partitioned AM is guaranteed not to attempt to commit its work after that window. Task commit is similar, using the timeout for the communications as the window:
    
    ```scala
    // pseudocode: a task only commits if the driver authorizes it within the window
    requestTaskCommit()
    val canCommit = awaitTaskCommit(timeout = jobCommitWindow)
    if (canCommit) doCommit() else doAbort()
    ```
    Instead, implementations just rely on the assumption that once the worker/manager protocol says "commit" you can go ahead; that is, the interval between retrieving a success from the manager and executing taskCommit() is negligible.
    
    Of course, timeout logic assumes time goes forward at the same rate everywhere, which in a world of VMs isn't something you can consistently rely on: either the JVM or the entire VM can block between awaitTaskCommit() returning true and doCommit() being called.
    
    



---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148240421
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java ---
    @@ -34,11 +34,17 @@
     
       /**
        * Proceed to next record, returns false if there is no more records.
    +   *
    +   * If an exception was thrown, the corresponding Spark task would fail and get retried until
    +   * hitting the maximum retry times.
        */
       boolean next();
     
       /**
        * Return the current record. This method should return same value until `next` is called.
    +   *
    --- End diff --
    
    > This method should return same value until `next` is called.
    
    Are you talking about this requirement? I think this is a strong requirement for the `next/get` pattern, just as `hasNext` should return the same value until `next` is called in the `hasNext/next` pattern.
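
    For illustration, a minimal sketch of that contract (the wrapper class is hypothetical): `get` keeps returning the cached record until `next` advances.

    ```java
    import java.util.Iterator;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.sources.v2.reader.DataReader;

    class IteratorDataReader implements DataReader<Row> {
      private final Iterator<Row> rows;
      private Row current;  // cached so repeated get() calls are stable

      IteratorDataReader(Iterator<Row> rows) {
        this.rows = rows;
      }

      @Override
      public boolean next() {
        if (!rows.hasNext()) {
          return false;
        }
        current = rows.next();  // advance exactly once per next() call
        return true;
      }

      @Override
      public Row get() {
        return current;  // same value until next() is called again
      }

      @Override
      public void close() {
        // nothing to release in this sketch
      }
    }
    ```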


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148505783
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    Are you proposing something like 2PC? I want to keep this write-commit API simple, with only one round trip between the driver and executors: "writer factory sent to executor" -> "executor writes data and commits" -> "commit message sent back to driver" -> "driver does job-level commit". This round trip can easily be implemented with a Spark RDD.
    
    If implementations want something stronger, they can still implement it with their own coordinator, which can probably be more efficient than using the Spark driver as the coordinator.
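
    For illustration, a hedged sketch of that round trip from the driver's point of view (`runSparkJob` is a hypothetical stand-in for the Spark job that runs the writers):

    ```java
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer;
    import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

    class WriteFlowSketch {
      void write(DataSourceV2Writer writer) throws Exception {
        // 1. Driver creates the factory; a failure here means no job is submitted.
        DataWriterFactory<Row> factory = writer.createWriterFactory();
        WriterCommitMessage[] messages = new WriterCommitMessage[0];
        try {
          // 2-3. Executors write and task-commit; messages come back to the driver.
          messages = runSparkJob(factory);
          // 4. Driver performs the single job-level commit.
          writer.commit(messages);
        } catch (Exception e) {
          writer.abort(messages);  // destination state may be undefined at this point
          throw e;
        }
      }

      WriterCommitMessage[] runSparkJob(DataWriterFactory<Row> factory) {
        throw new UnsupportedOperationException("stand-in for the real Spark job");
      }
    }
    ```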


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148133085
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java ---
    @@ -30,6 +30,8 @@
       /**
        * Creates a {@link DataSourceV2Reader} to scan the data from this data source.
        *
    +   * If this method fails(throw exception), the query fails before submitting any Spark jobs.
    --- End diff --
    
    What is the query you mention here? How about:
    > If an exception was thrown, we guarantee that no Spark job was submitted.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83341/
    Test FAILed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83270/testReport)** for PR 19623 at commit [`68fe451`](https://github.com/apache/spark/commit/68fe4512dc7baaaf810db4202397fc9d138d2ed7).


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148553358
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    The only way to guarantee no more than one task can commit is if the underlying storage system guarantees that. There is no way to design something generic; it is simply not possible in a distributed system in the presence of network partitions or message loss.



---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83346/
    Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148325280
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
    @@ -37,13 +37,19 @@
        * The preferred locations where this read task can run faster, but Spark does not guarantee that
        * this task will always run on these locations. The implementations should make sure that it can
        * be run on any location. The location is a string representing the host name of an executor.
    +   *
    +   * If an exception was thrown, the action would fail and we guarantee that no Spark job was
    +   * submitted.
        */
       default String[] preferredLocations() {
         return new String[0];
    --- End diff --
    
    I'm sure there was some specific filter for it, though a quick grep only shows that happening in `ReliableCheckpointRDD`. The reason the filter is needed is that getHostByName(localhost) does return a host, but scheduling gets confused: without the filtering, work can get held back until the driver concludes that localhost isn't free and so assigns it elsewhere in the cluster. (Hive can do this unintentionally, which is why I once traced through Spark's use of getBlockLocations.)


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83266/
    Test FAILed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    retest this please


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148243095
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -75,8 +82,10 @@
       /**
        * Aborts this writing job because some data writers failed to write the records and aborted,
        * or the Spark job fails for some unknown reason, or {@link #commit(WriterCommitMessage[])}
    -   * fails. If this method fails(throw exception), the underlying data source may have garbage that
    -   * need to be cleaned manually, but these garbage should not be visible to data source readers.
    +   * fails.
    +   *
    +   * If an exception was thrown, the underlying data source may have garbage that needs to be
    +   * cleaned manually, but this garbage should not be visible to data source readers.
    --- End diff --
    
    I doubt whether we should call `abort` if `commit` fails. It looks like, if the job commit fails, the state of the destination is undefined and it's hard for `abort` to deal with it.
    
    I'd like to make the action fail if the job commit fails, and tell users that the state of the destination is undefined, as that's important information for users.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83292/
    Test FAILed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148176680
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java ---
    @@ -40,6 +40,9 @@
      *   3. Special scans. E.g, columnar scan, unsafe row scan, etc. These scan interfaces are named
      *      like `SupportsScanXXX`.
      *
    + * If exception happens when applying any of these query optimizations, the action would fail and
    --- End diff --
    
    If an exception happens


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83289/
    Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    cc @rxin @rdblue @steveloughran @gatorsmile @wzhfy 


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148596100
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    @cloud-fan 
    
    > @steveloughran you can still implement this with the current interface. Like you said, we can use HDFS (or other coordinators) to make sure only one speculative task can commit, but I think there may be other ways to do it; we don't need to restrict ourselves to allowing only one speculative task to commit.
    
    I guess the main thing is to decide who has to make the decision, and to make clear what the expectations are and what the driver does for them.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83270/
    Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83492/testReport)** for PR 19623 at commit [`1e7dca3`](https://github.com/apache/spark/commit/1e7dca3a57582dee4431878f21a95e5c8a28e56b).


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83288/
    Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83289/testReport)** for PR 19623 at commit [`2f2596a`](https://github.com/apache/spark/commit/2f2596a7d879aea1d825cd7cd544b7bdf7b58a4d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148849054
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
    @@ -36,14 +36,24 @@
       /**
        * The preferred locations where this read task can run faster, but Spark does not guarantee that
        * this task will always run on these locations. The implementations should make sure that it can
    -   * be run on any location. The location is a string representing the host name of an executor.
    +   * be run on any location. The location is a string representing the host name.
    +   *
    +   * Note that if a host name cannot be recognized by Spark, it will be ignored as it was not in
    +   * the returned locations. By default this method returns empty string, which means this task
    --- End diff --
    
    This isn't the empty string, it is a 0-length array.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83346/testReport)** for PR 19623 at commit [`45a4ddb`](https://github.com/apache/spark/commit/45a4ddbd763e26fc99fc460b40453052e9620c49).


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83346/testReport)** for PR 19623 at commit [`45a4ddb`](https://github.com/apache/spark/commit/45a4ddbd763e26fc99fc460b40453052e9620c49).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83300/testReport)** for PR 19623 at commit [`db45129`](https://github.com/apache/spark/commit/db45129d621f788136e88bfd645f658bee2afd2c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148888237
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    Thinking about this some more (and still reserving the right to be provably wrong), I could imagine a job commit protocol which allowed more than one task attempt to commit, making the decision about which completed task attempts to accept into the final results at job commit.
    
    That is, 
    
    1. Every task attempt could (atomically) promote its output to the job commit dir.
    1. The job commit would enumerate all promoted task attempts, and choose which ones to accept.
    1. The ones it didn't want would be aborted by the job committer.
    
    Example: task attempts rename their output to a dir `dest/_temp/jobId/completed/$taskId_$taskAttemptId`; the job committer would enumerate all directories in the completed dir and, for all dirs sharing the same task ID, pick one to commit and abort the others.
    
    With that variant, you don't need to coordinate task commit across workers, even with speculation enabled. You'd just need to be confident that the promotion of task commit information is atomic, the view of the output consistent, and that job commit is not initiated until at least one attempt per task has succeeded. Job commit is potentially slower, though.
    Because you can turn off use of the output coordinator when writing to Hive/Hadoop tables, a commit protocol like this could work today. Interestingly, it's not that far off the FileOutputCommitter v1 protocol: you'd just rename the task attempt output dir to the promoted dir and let the job commit filter out duplicates from its directory listing. It'd be a bit more expensive in terms of storage use between task commit and job commit, though.
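
    For illustration, a hedged sketch of that job-commit-side de-duplication (`commitAttempt` is hypothetical, and the `$taskId_$taskAttemptId` naming assumes the task ID itself contains no underscore):

    ```java
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class PromotedAttemptCommitter {
      void commitJob(FileSystem fs, Path completedDir) throws IOException {
        Map<String, Path> chosen = new HashMap<>();  // task ID -> chosen attempt dir
        for (FileStatus attempt : fs.listStatus(completedDir)) {
          String taskId = attempt.getPath().getName().split("_")[0];
          if (chosen.putIfAbsent(taskId, attempt.getPath()) != null) {
            // A duplicate (speculative) attempt: abort it by deleting its output.
            fs.delete(attempt.getPath(), true);
          }
        }
        for (Path attemptDir : chosen.values()) {
          commitAttempt(fs, attemptDir);  // hypothetical: promote into the dest dir
        }
      }

      void commitAttempt(FileSystem fs, Path attemptDir) throws IOException {
        throw new UnsupportedOperationException("promotion omitted in this sketch");
      }
    }
    ```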
    
    Maybe a good policy here is "a job must not commit more than one attempt of each task", but give committers the option to bypass the output coordinator if the job committer can handle the conflict resolution and so make that guarantee in its commit phase.



---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148848790
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java ---
    @@ -34,11 +35,17 @@
     
       /**
        * Proceed to next record, returns false if there is no more records.
    +   *
    +   * If this method fails (by throwing an exception), the corresponding Spark task would fail and
    +   * get retried until hitting the maximum retry times.
        */
    -  boolean next();
    +  boolean next() throws IOException;
    --- End diff --
    
    We should clarify with `@throws` when it is okay to throw IOException.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148552166
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    @steveloughran you can still implement this with the current interface. Like you said, we can use HDFS (or other coordinators) to make sure only one speculative task can commit, but I think there may be other ways to do it; we don't need to restrict ourselves to allowing only one speculative task to commit.
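
    For concreteness, a hedged sketch of the HDFS-as-coordinator idea (paths and names are illustrative): the attempt that atomically creates the per-task lock file wins the right to commit.

    ```java
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class HdfsCommitLock {
      // createNewFile is atomic on HDFS, so exactly one speculative attempt of
      // a task observes true and proceeds to commit; the others abort.
      static boolean tryAcquire(FileSystem fs, Path lockDir, int taskId) throws IOException {
        return fs.createNewFile(new Path(lockDir, "task-" + taskId + ".lock"));
      }
    }
    ```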


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83492/testReport)** for PR 19623 at commit [`1e7dca3`](https://github.com/apache/spark/commit/1e7dca3a57582dee4431878f21a95e5c8a28e56b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148325887
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java ---
    @@ -84,9 +86,9 @@
        * This method will only be called if there is one record failed to write, or {@link #commit()}
        * failed.
        *
    -   * If this method fails(throw exception), the underlying data source may have garbage that need
    -   * to be cleaned by {@link DataSourceV2Writer#abort(WriterCommitMessage[])} or manually, but
    -   * these garbage should not be visible to data source readers.
    +   * If an exception was thrown, the underlying data source may have garbage that needs to be
    +   * cleaned by {@link DataSourceV2Writer#abort(WriterCommitMessage[])} or manually, but this
    +   * garbage should not be visible to data source readers.
        */
       void abort();
    --- End diff --
    
    May as well declare `throws IOException` here too, because the things invoked are likely to declare that, and whatever calls this is going to need to catch and downgrade all exceptions anyway.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83288/testReport)** for PR 19623 at commit [`1d12c8e`](https://github.com/apache/spark/commit/1d12c8e075b153917c7dcf6de4b7ff62c8eecacd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148227459
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java ---
    @@ -34,11 +34,17 @@
     
       /**
        * Proceed to next record, returns false if there is no more records.
    +   *
    +   * If an exception was thrown, the corresponding Spark task would fail and get retried until
    +   * hitting the maximum retry times.
        */
       boolean next();
     
       /**
        * Return the current record. This method should return same value until `next` is called.
    +   *
    --- End diff --
    
    ...assuming that the source data is not changed in any way. I'd avoid making any comments about what happens then. Maybe make that a broader requirement up front: Spark assumes that the data does not change in size/structure/value during the query; if it does, any operation may raise an exception or return invalid/inconsistent data.
    
    That makes for a nice disclaimer: if it's a database with the right ACID level, updates to a source may not be visible; if it's a CSV file, nobody knows what will happen.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148875607
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    I haven't read Steve's points here entirely, but I agree that Spark should be primarily responsible for task commit coordination. Most implementations would be fine using the current [output commit coordinator](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala), which does a good job balancing the trade-offs that you've been discussing. It ensures that only one task is authorized to commit and has well-defined failure cases (when a network partition prevents the authorized committer from responding before its commit authorization times out).
    
    I think that Spark should use the current commit coordinator unless an implementation opts out of using it (and I'm not sure that opting out is a use case we care to support at this point). It's fine if Spark documents how its coordinator works and there are some drawbacks, but expecting implementations to handle their own commit coordination (which requires RPC for Spark) is, I think, unreasonable. Let's use the one we have by default, however imperfect.
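
    For illustration, a hedged sketch of the task-side gate this implies; the `CommitCoordinator` interface here is illustrative only (Spark's real coordinator is the internal Scala class linked above, whose exact signature may differ):

    ```java
    import java.io.IOException;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.sources.v2.writer.DataWriter;
    import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

    interface CommitCoordinator {
      boolean canCommit(int stageId, int partitionId, int attemptNumber);
    }

    class CoordinatedCommit {
      static WriterCommitMessage commitOrAbort(
          CommitCoordinator coordinator, DataWriter<Row> writer,
          int stageId, int partitionId, int attemptNumber) throws IOException {
        if (coordinator.canCommit(stageId, partitionId, attemptNumber)) {
          return writer.commit();  // only the authorized attempt makes data visible
        }
        writer.abort();  // lost the race: clean up and fail this attempt
        throw new IOException("commit denied for attempt " + attemptNumber);
      }
    }
    ```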


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148325560
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java ---
    @@ -72,7 +74,7 @@
        * should still "hide" the written data and ask the {@link DataSourceV2Writer} at driver side to
        * do the final commitment via {@link WriterCommitMessage}.
        *
    -   * If this method fails(throw exception), {@link #abort()} will be called and this data writer is
    +   * If an exception was thrown, {@link #abort()} would be called and this data writer was
        * considered to be failed.
        */
       WriterCommitMessage commit();
    --- End diff --
    
    should throw IOE as well, if possible


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148918542
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    It's really out of the scope of this PR...
    
    This PR adds documentation describing exception behavior; please send a new PR to update the documentation for the commit semantics, and we can move our discussion there. Thanks!


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148595625
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    > The only way to guarantee no more than one task can commit is if the underlying storage system guarantees that. There is no way to design something generic. It is simply not possible in a distributed system, when network partitioning or message lost.
    
    I agree that you cannot prevent more than one task from executing. 
    
    I believe that you can stop more than one task committing by having the scheduler decide the exact task to commit & fail the job if that task doesn't report "done" within a bounded time. At that point the final state of the destination is unknown & you are into whatever driver-side algorithm is chosen there, possibly based on information provided by the committer about its ability to recover/retry. That's where it gets complicated, and where I'd need to do a lot more delving into the Spark source with a debugger and some fault-injecting committer underneath to really stress things.
    
    Network partitioning
    
    1. Executor split from driver during execution & does not rejoin: can't request commit, or requests commit & doesn't get a response. Timeout -> failure on executor. Task abort? Driver must eventually consider the task failed too, and presumably has to treat it as an executor timeout & retry. If speculation is enabled & the first committer requests first, either let it through or tell it to abort. If !speculation & the first committer requests, it must abort.
    1. Executor split from driver after asking for commit & before getting the response: same.
    1. Executor requests to commit -> driver; rejection sent & not received. Timeout -> failure. Task should abort; driver: should it just assume it aborted?
    1. Executor requests to commit -> driver; acceptance sent & not received; task times out, doesn't commit, and aborts. But what does the driver do?
    1. Executor requests to commit -> driver; acceptance sent, received, processed & the response lost. The executor knows the difference: it has finished. The driver cannot differentiate this from the previous case.
    1. Executor requests to commit -> driver; acceptance sent, executor fails before executing the commit. No task abort executed on the executor. No direct way for the driver to tell. How to interpret it? Retry vs. job fail?
    1. Executor requests to commit -> driver; acceptance sent, driver fails during the commit. Dest state unknown. Unless the protocol says it can handle recovery here -> abort.
    1. Driver itself fails; job restarted. Maybe: recovery attempted. The Hadoop MR v1 FileOutputCommitter algorithm can do this, as it can infer which tasks committed from the state of the dest FS, relying on all state being in a consistent FS and task commit being atomic. It does make for a slower job commit though... the main benefit is that for jobs big & long enough that the probability of AM failure is tangible, it doesn't require a replay of all committed tasks.
    
    For the cases where the driver tells the executor to commit & the call never returns, the driver has a problem. It can't directly differentiate between (not committed, committed, non-atomic commit failed partway).
    
    Looking into the Hadoop code, `OutputCommitter.recoverTask()` comes into play here, which is run from the AM on a rebuilt task context. If it can recover: fine. If not, the task is considered dead and needs cleanup (abort?). 
    
    To recover: 
    * dest exists ==> old state == committed => no-op + success
    * dest !exists: repeat the commit through rename(). If that rename [on the AM] fails: retry some more times, then abort the task & fail the job. The lack of meaningful error text in rename hurts here. Because this recovery is done on the AM, there's no partitioning going to happen between the scheduler & the commitTask() call, just with the remote FS/store/DB.
    
    Spark does not use the recovery methods of OutputCommitter, AFAIK.
    
    The other way to handle a failed/timed-out task commit is: give up, abort the entire job, say "dest is in an unknown state now". My tests imply that a task commit failure in Spark fails the job; I don't know what happens if the task commit outcome isn't received by the driver & will have to think of a test there. Maybe I'll add another failure mode to the S3A fault injector: "block", & see what happens.
    
    Returning to the point about speculation: it's the same as rerunning a task which failed to report in. They both generate output, exactly one must be committed, and the generated output should not (must not?) be manifest at the dest paths until task commit. That was the issue with the DirectOutputCommitter, wasn't it: you couldn't repeat a task whose executor failed, as it had already modified the output directory.
    
    For the spec, I think you should mandate that output MUST NOT be visible until taskCommit; taskAbort MUST be repeatable & best effort; same for jobAbort. I don't know about saying "can recreate the state of a task attempt and be able to call taskAbort() again": I know the Hadoop output committer requires it, but I already know of one committer which will leak file:/tmp data if rerun on a different host, because it can't get at the remote FS.
    
    Now, should output be visible on taskCommit? It is in the Hadoop v2 commit protocol, which moves all output to the outputPath on task commit, downgrading jobCommit to `touchz("$dest/_SUCCESS")`. It's faster; it just means that if the job fails before commit, you need to clean up by deleting dest/*.  
    
    The S3A committer delays all manifestation of MPUs until job commit because we don't want anything to be observable, and the cost of commit is a small POST of an XML doc per file, which can be done in parallel. taskCommits are not repeatable, and at least currently, any partition/timeout of taskCommit() is going to mean the entire job needs rerunning. If Spark does have the ability to rerun tasks which didn't return from taskCommit, we could do something here, relying on the fact that we PUT the .pendingset of a task's commit metadata atomically: cleanup would just mean doing a HEAD of it. Exists -> the task was committed. !exists -> retry (& rely on job cleanup(), called in jobCommit(), to cancel all pending MPUs under the dest path).
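    
    For illustration, that HEAD-based recovery probe might look like the sketch below; `pendingsetPath` and the surrounding retry policy are assumptions, not the actual S3A committer code:
    
    ```java
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    final class PendingsetProbe {
      private PendingsetProbe() {}
    
      /**
       * The .pendingset file is PUT atomically at task commit, so its existence
       * (a HEAD request against the store) says whether the task attempt committed.
       */
      static boolean taskCommitted(FileSystem fs, Path pendingsetPath) throws IOException {
        return fs.exists(pendingsetPath); // exists -> committed; !exists -> safe to retry
      }
    }
    ```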
    
    



---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148226526
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -75,8 +82,10 @@
       /**
        * Aborts this writing job because some data writers are failed to write the records and aborted,
        * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])}
    -   * fails. If this method fails(throw exception), the underlying data source may have garbage that
    -   * need to be cleaned manually, but these garbage should not be visible to data source readers.
    +   * fails.
    +   *
    +   * If an exception was thrown, the underlying data source may have garbage that need to be
    +   * cleaned manually, but these garbage should not be visible to data source readers.
    --- End diff --
    
    bit optimistic on the "should". Maybe: "state of the destination is undefined". 
    
    The issue here is: what if there's a network failure and abort() fails because of that? It hasn't made any changes to the destination, so its state is whatever the previous state of the dest was. If no task was committed, nothing should be directly observable. If 1+ tasks committed, there *may* be output. If job commit failed and abort() was called after that, nobody is going to have any idea of the final state.
    
    Maybe: "Implementations are required to perform best-effort cleanup irrespective of the current state of the destination".
    



---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83279/testReport)** for PR 19623 at commit [`bffd91e`](https://github.com/apache/spark/commit/bffd91e519b3856e61c007c3aa8eba7485b38661).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83289/testReport)** for PR 19623 at commit [`2f2596a`](https://github.com/apache/spark/commit/2f2596a7d879aea1d825cd7cd544b7bdf7b58a4d).


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148225333
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
    @@ -37,13 +37,19 @@
        * The preferred locations where this read task can run faster, but Spark does not guarantee that
        * this task will always run on these locations. The implementations should make sure that it can
        * be run on any location. The location is a string representing the host name of an executor.
    +   *
    +   * If an exception was thrown, the action would fail and we guarantee that no Spark job was
    +   * submitted.
        */
       default String[] preferredLocations() {
         return new String[0];
    --- End diff --
    
    Somewhere in the Spark code there's that translation of localhost -> anywhere, so that things like object stores which return "localhost" end up with independent-of-location placement. Is this required to have happened by the time `preferredLocations` is called, or can an impl return "localhost" and expect Spark to deal with it? The purist position would be to say "stores which use 'localhost' as a hint to mean 'unplaced' are required to have filtered it out by this point".
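    
    For a store whose block-location API uses "localhost" to mean "unplaced", the filtering could be as simple as the sketch below; `rawBlockHosts` is a made-up stand-in for whatever the store's location lookup returns:
    
    ```java
    import java.util.Arrays;
    
    // Sketch of a ReadTask-style implementation filtering out "localhost" itself,
    // rather than relying on Spark to translate it.
    class ObjectStoreReadTask {
      private final String[] rawBlockHosts;
    
      ObjectStoreReadTask(String[] rawBlockHosts) {
        this.rawBlockHosts = rawBlockHosts;
      }
    
      public String[] preferredLocations() {
        return Arrays.stream(rawBlockHosts)
            .filter(h -> !"localhost".equals(h)) // "localhost" here means "unplaced"
            .toArray(String[]::new);
      }
    }
    ```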


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148848619
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java ---
    @@ -30,6 +30,9 @@
       /**
        * Creates a {@link DataSourceV2Reader} to scan the data from this data source.
        *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    --- End diff --
    
    Are there recommended exceptions to throw? It would be great if this specified what exception classes implementations should use for problems with `@throws` javadoc entries. For example, `@throws AnalysisException If ...`.
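    
    i.e. something along these lines; a sketch of the requested javadoc only, since no exception contract is actually specified today:
    
    ```java
    /**
     * Creates a {@link DataSourceV2Reader} to scan the data from this data source.
     *
     * @throws AnalysisException If the requested schema or options are invalid.
     */
    DataSourceV2Reader createReader(DataSourceV2Options options);
    ```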


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148144342
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java ---
    @@ -30,6 +30,8 @@
       /**
        * Creates a {@link DataSourceV2Reader} to scan the data from this data source.
        *
    +   * If this method fails(throw exception), the query fails before submitting any Spark jobs.
    --- End diff --
    
    We also need to mention that the "action" will fail...


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148918617
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java ---
    @@ -30,6 +30,9 @@
       /**
        * Creates a {@link DataSourceV2Reader} to scan the data from this data source.
        *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    --- End diff --
    
    No particular exception is recommended here: you can throw any exception and Spark will fail the job. This document just describes what will happen if an exception is thrown; no exception is actually expected.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148503478
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    Having >1 committed writers is a failure of the protocol. Speculation & failure handling should allow >1 ready-to-commit writers, but only actually commit one. That's where the stuff about writers reporting ready-to-commit & the driver tracking the state of active tasks comes in. 
    
    What if a commit fails, or the commit operation times out or raises some other error? Hadoop FileOutputFormat v1 commit can actually recover from a failed task commit; v2 can't (assumption: rename() is atomic and will fail if the dest path exists). Hence the `OutputCommitter.isRecoverySupported()` probe to tell the driver whether or not it can recover from a task commit which is perceived as failing.
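    
    The probe itself is part of the real Hadoop `OutputCommitter` API; the control flow around it below is only a sketch of how a driver-side component could use it:
    
    ```java
    import java.io.IOException;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    
    // isRecoverySupported()/recoverTask()/abortTask() are the real Hadoop API;
    // the control flow is only a sketch of how a driver could use the probe.
    class RecoverySketch {
      /** Returns true if the possibly-committed task attempt was recovered. */
      boolean tryRecover(OutputCommitter committer, TaskAttemptContext rebuiltCtx)
          throws IOException {
        if (committer.isRecoverySupported()) {
          committer.recoverTask(rebuiltCtx); // v1 file committer: redo/verify the rename
          return true;
        }
        committer.abortTask(rebuiltCtx);     // attempt is dead; caller reruns or fails the job
        return false;
      }
    }
    ```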


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148542687
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    OK, now I'm confused to the extent that I've been using the IDE to trace method calls, and ideally would like to see any TLA+/PlusCal specs of what's going on. Failing that, some .py pseudocode. 
    
    Otherwise, there's setting breakpoints and stepping through the debugger, which is what I ended up doing to work out WTF MapReduce did. [Committer Architecture](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committer_architecture.md) is essentially the documentation of that debugger-derived analysis.
    
    ### Sequence of events (a pseudocode sketch follows the list)
    1. 1+ task created, scheduled, 
    1. setupTask(taskContext) called in workers for task before execution
    1. task does its work. worker knows task is now ready to be committed
    1. 1+ task declares ready to commit to driver 
    1. driver selects which of these tasks to move to commit
    1. driver sends ok-to-commit(task) message request to executor with task.
    1. Task committed via some `commitTask()` operation
    1. speculative tasks told to abort or left in pending state to see what happens
    1. task commit succeeds: job continues
    1. task commit fail: best-effort attempt to abort the failed task (Hadoop MR may do this in the AM by recreating the taskContext of the failed task & calling `taskAbort(tCtx)` on a committer created for the job; this means that if cleanup wants to delete local file:// data, it may not get deleted)
    1. if the engine supports recovery from failed commits and the commit algorithm does: select another of the pending speculative tasks or retry the entire task, and move to step (2) unless the # of attempts is exceeded.
    1. else: abort entire job
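    
    A Java-flavoured pseudocode rendering of that sequence (`TaskAttempt` and every helper method here are hypothetical stand-ins, not Spark classes):
    
    ```java
    // Java-flavoured pseudocode for the numbered sequence above.
    interface TaskAttempt {
      boolean runToReadyToCommit();          // steps 2-4: set up, run, report ready
      Object commitTask() throws Exception;  // step 7
    }
    
    abstract class CommitProtocolSketch {
      abstract TaskAttempt scheduleAttempt(int attempt); // step 1
      abstract void grantCommit(TaskAttempt ta);         // steps 5-6
      abstract void abortSpeculativeSiblings();          // step 8
      abstract void bestEffortAbort(TaskAttempt ta);     // step 10
      abstract boolean commitRecoverySupported();        // step 11
    
      Object runPartition(int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
          TaskAttempt ta = scheduleAttempt(attempt);
          if (!ta.runToReadyToCommit()) {
            continue;                        // plain task failure: rerun (back to step 1)
          }
          grantCommit(ta);
          try {
            Object msg = ta.commitTask();    // step 7
            abortSpeculativeSiblings();      // step 8
            return msg;                      // step 9: job continues
          } catch (Exception e) {
            bestEffortAbort(ta);             // step 10
            if (!commitRecoverySupported()) {
              break;                         // no recovery support: give up
            }                                // else step 11: retry from step 2
          }
        }
        throw new IllegalStateException("abort entire job"); // step 12
      }
    }
    ```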
    
    A requirement of speculation is "speculative work which is only observable after commit". It's why the direct committers had to run with speculation off, and even there failure handling could only be strictly handled by failing the entire job. 
    
    This is not a full 2PC protocol where the entire job is committed iff all workers are ready to commit. Instead you are allowing >1 task to run in parallel & choose which to commit.
    
    ### If >1 speculated tasks can commit work without coordination, then unless all tasks generate output with the same filenames/paths *and the commit of an individual task is atomic*, you are in danger
    
    1. Task A and Task B both commit different files: as a result, the dest contains >1 copy of each file. Task abort() isn't going to help, as it aborts pre-commit state rather than rolling back output.
    1. Task A and Task B commit different files, but in parallel and with a non-atomic commit of files: you will end up with the output of both.
    
    To avoid that you're going to need some form of coordination between tasks s.t. only one actually commits. This can be done implicitly via a filesystem with some atomic operations (rename(src, dest), create(file, overwrite=false)...) or some other mechanism.
    
    Except, what about [SPARK-4879](https://issues.apache.org/jira/browse/SPARK-4879) and `OutputCommitCoordinator`? That is asking for permission to commit, and it is, AFAIK, turned on by default via the property `spark.hadoop.outputCommitCoordination.enabled`.
    
    Looking at `org.apache.spark.sql.execution.datasources.FileFormatWriter.write()`
    
    ```scala
    /**
     * Basic work flow of this command is:
     * 1. Driver side setup, including output committer initialization and data source specific
     *    preparation work for the write job to be issued.
     * 2. Issues a write job consists of one or more executor side tasks, each of which writes all
     *    rows within an RDD partition.
     * 3. If no exception is thrown in a task, commits that task, otherwise aborts that task;  If any
     *    exception is thrown during task commitment, also aborts that task.
     * 4. If all tasks are committed, commit the job, otherwise aborts the job;  If any exception is
     *    thrown during job commitment, also aborts the job.
     * 5. If the job is successfully committed, perform post-commit operations such as
     *    processing statistics.
     * @return The set of all partition paths that were updated during this write job.
     */
    ```
    
    It's `FileFormatWriter.executeTask`, which:
    1. does the setupTask/exec/commitTask in sequence, 
    1. calls `FileCommitProtocol.commitTask`
    1. `HadoopMapReduceCommitProtocol.commitTask` calls `SparkHadoopMapRedUtil.commitTask`
    1. which, if there's data (`OutputCommitter.needsTaskCommit`) and co-ordination enabled `"spark.hadoop.outputCommitCoordination.enabled" == true`, calls `OutputCommitCoordinator.canCommit(stage, part, attempt)`, which then talks to the driver to get permission to commit. The driver-side `OutputCommitCoordinator` instance gets to manage the state machine of task attempts, which is done in `handleAskPermissionToCommit`
    
    So, unless I've misunderstood the flow of things, Spark guarantees that at most one speculative task attempt for a (stage, part) succeeds; it's done through coordination with the driver inside `FileFormatWriter.executeTask`.
    
    If I have understood correctly, the key thing is: coordination is done in Spark's OutputCommitCoordinator, with that coordinator being called from `FileFormatWriter`, hence the routines to insert into Hadoop FS or Hive. 
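    
    Roughly, the executor-side shape of that path is the snippet below. The method names follow the Spark internals cited above, but the snippet is illustrative, not Spark's actual code; the `Coordinator` interface is a stand-in for the RPC to the driver-side `OutputCommitCoordinator`:
    
    ```java
    import java.io.IOException;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    
    class CoordinatedCommitSketch {
      /** Stand-in for the RPC to the driver-side OutputCommitCoordinator. */
      interface Coordinator {
        boolean canCommit(int stage, int partition, int attemptNumber);
      }
    
      void commitIfAllowed(OutputCommitter committer, TaskAttemptContext ctx,
                           Coordinator coordinator, int stage, int partition,
                           int attemptNumber) throws IOException {
        if (committer.needsTaskCommit(ctx)) {                // is there anything to commit?
          if (coordinator.canCommit(stage, partition, attemptNumber)) {
            committer.commitTask(ctx);                       // driver granted exclusivity
          } else {
            // Spark raises a CommitDeniedException at this point: the attempt
            // fails without failing the whole stage.
            throw new IOException("driver denied commit for partition " + partition);
          }
        }
      }
    }
    ```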
    
    For this API here, where is that coordination going to take place? 


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148507067
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    -   * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer
    -   * can commit successfully, or have a way to clean up the data of already-committed writers.
    +   * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
    +   * writer can commit, or have a way to clean up the data of already-committed writers.
        */
       void commit(WriterCommitMessage[] messages);
     
       /**
        * Aborts this writing job because some data writers are failed to write the records and aborted,
    --- End diff --
    
    A single task failure shouldn't abort the entire job. 
    Job abortion is more likely to be triggered by:
    * The failure count of a task exceeding the configured limit. 
    * A non-recoverable failure of the commit() operation of one or more tasks. I don't see Spark invoking `OutputCommitter.isRecoverySupported()`, as it's focusing more on "faster execution and recovery through retry".
    * Pre-emption of the job (if the engine supports preemption).
    
    Looking in the code, it's called after `sparkContext.runJob()` throws an exception for any reason, & on failure of `FileFormatWriter.write()`, again for any reason. 


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83428/
    Test FAILed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148599528
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    The expectation is well-defined; see `DataWriter#commit`.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19623


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Thanks, merging to master!
    
    We can continue the discussion about the write transaction (which doesn't belong to this PR) in a new ticket.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83279/
    Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148545942
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    You are not taking into account network partitioning. 


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148217006
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -30,6 +30,9 @@
      * It can mix in various writing optimization interfaces to speed up the data saving. The actual
      * writing logic is delegated to {@link DataWriter}.
      *
    + * If exception happens when applying any of these writing optimizations, the action would fail and
    --- End diff --
    
    we only have one, so I don't know how to categorize them... We can do this once we have more optimizations.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83279/testReport)** for PR 19623 at commit [`bffd91e`](https://github.com/apache/spark/commit/bffd91e519b3856e61c007c3aa8eba7485b38661).


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83270/testReport)** for PR 19623 at commit [`68fe451`](https://github.com/apache/spark/commit/68fe4512dc7baaaf810db4202397fc9d138d2ed7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148176992
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -30,6 +30,9 @@
      * It can mix in various writing optimization interfaces to speed up the data saving. The actual
      * writing logic is delegated to {@link DataWriter}.
      *
    + * If exception happens when applying any of these writing optimizations, the action would fail and
    --- End diff --
    
    an exception.
    By the way, not related to this PR: shall we also list the allowed writing optimizations like [in DataSourceV2Reader](https://github.com/apache/spark/pull/19623/files#diff-f7f970e9684796e2120223d1153302e4R35)?


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148263405
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -75,8 +82,10 @@
       /**
        * Aborts this writing job because some data writers are failed to write the records and aborted,
        * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])}
    -   * fails. If this method fails(throw exception), the underlying data source may have garbage that
    -   * need to be cleaned manually, but these garbage should not be visible to data source readers.
    +   * fails.
    +   *
    +   * If an exception was thrown, the underlying data source may have garbage that need to be
    +   * cleaned manually, but these garbage should not be visible to data source readers.
    --- End diff --
    
    Call it, because commit may fail for other reasons, e.g. the committer permits the job to commit if the output all goes to empty partitions, but fails if there is data in any of the dest partitions. The abort() call triggers the cleanup.
    
    1. Exceptions in abort() after a job commit failure must of course be caught/swallowed. Hadoop MR doesn't do that properly, which is something someone should fix. Probably me.
    1. Ideally the committers should be doing their own cleanup if the job fails, so they can be confident the cleanup always takes place. Again: catch & swallow exceptions.
    1. Failure in job commit is very much an "outcome is unknown" state.
    
    Couple of references:
    * [s3a committer lifecycle tests](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/AbstractITCommitProtocol.java), including OOO workflows (abort before commit, 2x commit task, 2x commit job, commit job -> commit task).
    * [SPARK-2984](https://issues.apache.org/jira/browse/SPARK-2984), where the underlying problem of the recurrent issue is people trying to write to the same dest with >1 job, and finding all jobs after the first one failing as the `_temporary` dir has been deleted in cleanup(). Arguably, the job commit/cleanup is being overaggressive about cleanup, but given how partition merging isn't atomic with the classic file output committers, it's really a manifestation of people trying to do something they shouldn't.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148605519
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    @steveloughran it is proven to be impossible. Your algorithm doesn't work.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83492/
    Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    retest this please


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148507385
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    -   * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer
    -   * can commit successfully, or have a way to clean up the data of already-committed writers.
    +   * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
    +   * writer can commit, or have a way to clean up the data of already-committed writers.
        */
       void commit(WriterCommitMessage[] messages);
     
       /**
        * Aborts this writing job because some data writers are failed to write the records and aborted,
        * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])}
    -   * fails. If this method fails(throw exception), the underlying data source may have garbage that
    -   * need to be cleaned manually, but these garbage should not be visible to data source readers.
    +   * fails.
    +   *
    +   * If this method fails (by throwing an exception), the underlying data source may have garbage
    +   * that need to be cleaned manually.
    --- End diff --
    
    "may require manual cleanup". It could be more than just "garbage", which implies filesystem temp data...it could be tables in a database or similar


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83288/testReport)** for PR 19623 at commit [`1d12c8e`](https://github.com/apache/spark/commit/1d12c8e075b153917c7dcf6de4b7ff62c8eecacd).


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    **[Test build #83300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83300/testReport)** for PR 19623 at commit [`db45129`](https://github.com/apache/spark/commit/db45129d621f788136e88bfd645f658bee2afd2c).


---



[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19623
  
    retest this please


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148527451
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark job was
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to to have been
    +   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
    +   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff --
    
    @steveloughran it is not really possible to enforce exclusivity this way in a general commit system. That would just be a best effort thing to avoid duplicate work.


---



[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148145901
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java ---
    @@ -30,6 +30,8 @@
       /**
        * Creates a {@link DataSourceV2Reader} to scan the data from this data source.
        *
    +   * If this method fails(throw exception), the query fails before submitting any Spark jobs.
    --- End diff --
    
    > If an exception was thrown, the action would fail and we guarantee that no Spark job was submitted.


---
