You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/04/13 03:39:39 UTC

[GitHub] spark pull request #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes coll...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/21060

    [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in PySpark as action for a query executor listener

    ## What changes were proposed in this pull request?
    
    This PR proposes to add `collect` to  a query executor as an action.
    
    Seems `collect` / `collect` with Arrow are not recognised via `QueryExecutionListener` as an action. For example, if we have a custom listener as below:
    
    ```scala
    package org.apache.spark.sql
    
    import org.apache.spark.internal.Logging
    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener
    
    class TestQueryExecutionListener extends QueryExecutionListener with Logging {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        logError("Look at me! I'm 'onSuccess'")
      }
    
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { }
    }
    ```
    and set `spark.sql.queryExecutionListeners` to `org.apache.spark.sql.TestQueryExecutionListener`
    
    Other operations in PySpark or Scala side seems fine:
    
    ```python
    >>> sql("SELECT * FROM range(1)").show()
    ```
    ```
    18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
    +---+
    | id|
    +---+
    |  0|
    +---+
    ```
    
    ```scala
    scala> sql("SELECT * FROM range(1)").collect()
    ```
    ```
    18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
    res1: Array[org.apache.spark.sql.Row] = Array([0])
    ```
    
    but ..
    
    **Before**
    
    ```python
    >>> sql("SELECT * FROM range(1)").collect()
    ```
    ```
    [Row(id=0)]
    ```
    
    ```python
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> sql("SELECT * FROM range(1)").toPandas()
    ```
    ```
       id
    0   0
    ```
    
    **After**
    
    ```python
    >>> sql("SELECT * FROM range(1)").collect()
    ```
    ```
    18/04/09 16:57:58 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
    [Row(id=0)]
    ```
    
    ```python
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> sql("SELECT * FROM range(1)").toPandas()
    ```
    ```
    18/04/09 17:53:26 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
       id
    0   0
    ```
    
    ## How was this patch tested?
    
    I have manually tested as described above and unit test was added.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark PR_TOOL_PICK_PR_21007_BRANCH-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21060.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21060
    
----
commit 4656724d27c208d794f99691cfbf93b4bb118d93
Author: hyukjinkwon <gu...@...>
Date:   2018-04-13T03:28:13Z

    [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as action for a query executor listener
    
    This PR proposes to add `collect` to  a query executor as an action.
    
    Seems `collect` / `collect` with Arrow are not recognised via `QueryExecutionListener` as an action. For example, if we have a custom listener as below:
    
    ```scala
    package org.apache.spark.sql
    
    import org.apache.spark.internal.Logging
    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener
    
    class TestQueryExecutionListener extends QueryExecutionListener with Logging {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        logError("Look at me! I'm 'onSuccess'")
      }
    
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { }
    }
    ```
    and set `spark.sql.queryExecutionListeners` to `org.apache.spark.sql.TestQueryExecutionListener`
    
    Other operations in PySpark or Scala side seems fine:
    
    ```python
    >>> sql("SELECT * FROM range(1)").show()
    ```
    ```
    18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
    +---+
    | id|
    +---+
    |  0|
    +---+
    ```
    
    ```scala
    scala> sql("SELECT * FROM range(1)").collect()
    ```
    ```
    18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
    res1: Array[org.apache.spark.sql.Row] = Array([0])
    ```
    
    but ..
    
    **Before**
    
    ```python
    >>> sql("SELECT * FROM range(1)").collect()
    ```
    ```
    [Row(id=0)]
    ```
    
    ```python
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> sql("SELECT * FROM range(1)").toPandas()
    ```
    ```
       id
    0   0
    ```
    
    **After**
    
    ```python
    >>> sql("SELECT * FROM range(1)").collect()
    ```
    ```
    18/04/09 16:57:58 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
    [Row(id=0)]
    ```
    
    ```python
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> sql("SELECT * FROM range(1)").toPandas()
    ```
    ```
    18/04/09 17:53:26 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
       id
    0   0
    ```
    
    I have manually tested as described above and unit test was added.
    
    Author: hyukjinkwon <gu...@apache.org>
    
    Closes #21007 from HyukjinKwon/SPARK-23942.
    
    (cherry picked from commit ab7b961a4fe96ca02b8352d16b0fa80c972b67fc)
    Signed-off-by: hyukjinkwon <gu...@apache.org>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    I am a bit puzzled because `QueryExecutionListener` should call the callback for actions and `collect` triggers it in Scala and R but it doesn't in PySpark specifically. It sounds a bug and this fix is relatively straightforward. The previous behaviour was it was not being called which didn't make sense.
    
    I agree that it's discouraged to make a behaviour change to the maintenance release, sure. However, I was thinking it makes sense to backport if the fix is not complicated and looks a bug quite clearly. I think we shouldn't say it's improvement in this case.
    
    Were actual apps or test cases broken somewhere?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged to branch-2.3.
    
    Thanks for reviewing this @BryanCutler.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    cc @rdblue and @steveloughran too who I guess should be interested in setting up a backporting policy.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Users apps should not be blamed in this case. If they want this change, they should upgrade to the newer release. Basically, we should not introduce any external behavior change in the maintenance release if possible. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2294/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Since this is not a bug fix, I plan to revert this PR. WDYT? @HyukjinKwon @BryanCutler 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89312/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21060

I am okay if there's a specific reason. I think this is the point - if there's a specific reason, that should be mentioned and explained ahead. Actually, I (and @srowen did as well IIUC) asked this many times, see above.

I would have investigated or would have just said that I am okay with reverting. I don't usually get in the way if there's a specific reason. It would be great if we can have more open talks next time.

> for the 2.3.x backport, add a config that so it is possible to turn this off in production, if somebody actually has their job failed because of this? It's a small delta from what this PR already does, and that should alleviate the concerns @gatorsmile has.

I am personally fine with reverting or adding a configuration if that's what you guys feel strongly; however, I should say it sounds unusual to have a config to control this behaviour in branch-2.3 alone and it sounds less worth. The case you mention sounds really unlikely and I wonder if that makes sense tho. It's also experimental as you all said.

Also, I should note that I have been confused about the backporting policy and the bunch of configurations to control each behaviour. If that's just concerns to be addressed, that's fine but sounds what people must follow so far. If this is true, I feel sure this should be documented. I feel sure we shouldn't have such overhead next time. I am pretty sure this isn't the first time.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Like what I said above, we need to be very careful when backporting the PR with the behavior changes, especially when this is **neither a critical issue nor a regression**. Even if this is a bug based on your understanding, we should still not backport such PRs. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89312 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89312/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89327/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Adding a flag just in 2.3 is, at least, an unusual thing to do. By this logic lots of backports should be flag protected but we don't. Why is this special?
    
     I still don't see much argument against this backport. I count about 3-4 committers in favor and 1 against. Let's leave it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Fixing API inconsistency should not be treated as a bug fix. 
    
    Please give me a few days. I need to summarize the Spark 2.3 release and list all the PRs that were backported to the release candidate branches. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/21060

I do not see a problem with the commit message here. Is that really the issue? it accurately describes _what_ changes. The _why_ has always been documented in discussion, and it is here already. Sometimes the _why_ is documented in comments too; I don't see a particular need for that here, but, if that's the issue, why isn't that what we're talking about?

You continue to portray this as a behavior change, and I think you mean "a change in what is considered correct behavior". However all the other comments suggest otherwise; the argument from consistency seems much stronger.

Your proposed criteria for backports sort of align with accepted practice, which is to follow semver.org semantics. I think semver is reasonably clear, in general and in this case. I see broad agreement for this backport, and people simply disagree with your interpretation. It is not a failure to understand criteria.

Believe me, people here have plenty experience with software, versioning, and the impact of changes. I'd put more faith in the judgment of your peers. Your anecdotes are of a type that's familiar to many people, but, I also fail to see how they're relevant here.

You are adopting a 'conservative' position and I think in this case it's out of line with normal practice. I think you should accept that people disagree and move on.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89312/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by rdblue <gi...@git.apache.org>.

Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    I agree with what @srowen said:


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89352/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes coll...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21060#discussion_r181279443
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -185,22 +185,12 @@ def __init__(self, key, value):
             self.value = value
     
     
    -class ReusedSQLTestCase(ReusedPySparkTestCase):
    -    @classmethod
    -    def setUpClass(cls):
    -        ReusedPySparkTestCase.setUpClass()
    -        cls.spark = SparkSession(cls.sc)
    -
    -    @classmethod
    -    def tearDownClass(cls):
    -        ReusedPySparkTestCase.tearDownClass()
    -        cls.spark.stop()
    -
    -    def assertPandasEqual(self, expected, result):
    --- End diff --
    
    This method causes a conflict which I don't really understand why. I compared line by line, character by character and they look identical.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    > We need to be very careful when backporting the PR with the behavior changes, especially when this is neither a critical issue nor a regression. Thus, I do not think we should backport this PR.
    
    I am not saying we shouldn't be careful but affects actual user group and actual scenarios.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    > The behavior consistency among Python/Scala/R/JAVA does not mean a bug, right?
    
    This case specifically `collect` in PySpark doesn't work alone whereas all other actions like `foreach`, `show` and other cases in other languages works in all other APIs. Also, that's what a query execution listener describes. Do you believe you would make this exception for PySpark specifically in any case?
    
    I am seeing `foreach` and etc was fixed https://github.com/apache/spark/commit/154351e6dbd24c4254094477e3f7defcba979b1a and also see `collect` is included in the original commit - https://github.com/apache/spark/commit/15ff85b3163acbe8052d4489a00bcf1d2332fcf0
    
    > I am not against this specific PR. All the committers need to be really careful when they make a decision to backport a behavior change. If any committer does it, we should jump in and stop the backport. This is what we should do.
    
    Let's open a discussion in the mailing list and see if we can see the agreement. I think this was not the first time we talked about this and think it's better to open a proper discussion and make a decision. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89352/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    If this can be treated as a bug to backport, we have many behavior change PRs that can be backported. We are building the system software. We have to be more principled.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    I do think we should clearly document the rule what we can backport. 
    
    I do not think we should make an exception for this PR. cc @rxin @marmbrus @yhuai @cloud-fan @ueshin 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89363/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89363/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21060

I might not explain it well. Sorry for the misunderstanding. Thank you @rxin for helping me clarify my points. It sounds like many of you think this backport is fine. I am not against this specific PR. We do not need to revert the PR but just improve the documentation. That should be fine, although I still personally prefer to adding the configuration.

As what I said in the original PR https://github.com/apache/spark/pull/21007 that was merged to master, let me point out two points here too.

- PR descriptions will be part of the commit log. We need to be very careful before merging the PR. In the past, I also missed a few when I did the merge. To be honest, I am not sure how the native English speakers think. The first paragraph scared me when I reading the PR commit log. @srowen WDYT?

```
This PR proposes to add collect to a query executor as an action.
```

- Document the behavior changes that are visible to the external users/developers. In Spark 2.3, we started to enforce it in every merged PR. I believe many of you got multiple similar comments in the previous PRs. This PR should also upgrade the migration guides. @HyukjinKwon Do you agree?

Before we finalize the backport policy, below is my inputs about the whitelist which we can backport:
- The critical/important bug fixes and security fixes.
- The regression fixes.
- The PRs that do not touch the production code, like test-only patches, documentation fixes, and the log message fixes.

Avoid backporting the PRs if it contains
- The new features
- The minor bug fixes/improvements that have external behavior changes
- The code refactoring
- The code changes with the high/mid risk

In the OSS community, I believe no committer will be fired just because we merged/introduced a bug, right? If the users application failed due to an upgrade, normally we blame our users or the bug are just accidentally introduced. However, this is not acceptable in my first team. Let me share what I experienced. Just various customer accidents in my related product teams.

- One director got demoted (almost fired) due to a bad release. She is a very nice lady. We really like her. That release had many cool features but the quality is not controlled well. Many customers are not willing to upgrade.
- There is a famous system upgrade failure a few years ago. The whole system became very slow after the upgrade. It took 10s hours to recover the system. After a few days, the GM went to the customer site and got blamed in the whole day. Multiple architects and VPs were forced to write apology letters. Customers planned to sue us. In the customer side, the CTO got fired later and the upgrade accident was also on the national TV news because it affects many people.
- A few directors were on call with me 10+ nights to resolve one Japanese customer data corruption issue. The client teams ran multiple systems at the same time to reproduce the issue. After a few weeks, it was finally resolved after reading the memory dump. The root cause is the code merge from one branch to another branch many years ago.

If all the above people believes Spark is the best product in Big Data, we need to be more conservative. Our decisions could affect many people. This is not the first time I argued with the other committers/contributors about the PR quality. In one previous PR, I left almost 100 comments just because the documents are not accurate.

If my above comments offend anyone, I apologize. Everyone has different understanding about the software development because we have different work experience. The whole community already did a wonderful job compared with the other open source projects. I still believe we can do a better job, right? Let us formalize the backport policy and enforce them in each release.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    > This case specifically collect in PySpark doesn't work alone whereas all other actions like foreach, show and other cases in other languages works in all other APIs. Also, that's what a query execution listener describes. Do you believe you would make this exception for PySpark specifically in any case?
    
    To improve the usability, we should change it in the master branch. My point is we should not backport this PR to 2.3 release. 
    
    > Let's open a discussion in the mailing list and see if we can see the agreement. I think this was not the first time we talked about this and think it's better to open a proper discussion and make a decision.
    
    Sure, let me lead the discussion in the dev channel and welcome you to add the inputs there. Next, we should also discuss the rule which PRs can be backported to RC branches when we do the release. In Spark 2.3 release, we backported many PRs that should not be merged to the release candidate branches. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    It looks to me this is a bug fix that can merit backporting, as QueryExecutionListener is also marked as experimental,
    
    In this case, I think @gatorsmile is worried one might have written a listener that enumerates the possible function names, and that listener will fail now with a new action name. I feel this is quite unlikely, but I also appreciate @gatorsmile's concern for backward compatibility, and I've certainly been wrong before when our fixes break existing workloads.
    
    (On the spectrum of being extremely conservative to extremely liberal, I think I'm in general more on the middle, whereas @gatorsmile probably leans more to the conservative side. There isn't really anything wrong with this, and it's good to have balancing forces in a project.)
    
    How about this, @HyukjinKwon -- for the 2.3.x backport, add a config that so it is possible to turn this off in production, if somebody actually has their job failed because of this? It's a small delta from what this PR already does, and that should alleviate the concerns @gatorsmile has. I'd also change the function doc for onSuccess/onFailure to make it clear that we will add new function names in the future, and users shouldn't expect a fixed list of function names.
    
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2306/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    hm I would say it's a bug since the action is not detected which is supposed to call the callback. The test is a bit complicated but the fix is relatively straightforward.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    cc @BryanCutler 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This is just the basic backport rule we follow for each PR. We should not make an exception for this PR. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This will introduce the behavior change and it is not a regression. The changes we made in this PR could break the external app. We should not do it in the maintenance release. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    @steveloughran Agree. We always can make an exception if most need it. 
    
    - For any external behavior change (even in experimental APIs), we need to document it and also mention it in the release note. This can simplify the version upgrade and let our users trust us.  
    - Need to have more discussion regarding the backport policy (to the maintenance branches and release candidate branches). This should be discussed and finalized at least in the dev list. When the policy is completed, we should also send it to the user list and the committers are responsible for enforcing it. 
    
    I am trying to summarize what we did in the Spark 2.3 release.  It took almost 2 month to release it.  Will send the postmortem to the community with some proposal about the backport policy.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This certainly looks like a bug fix. I don't know this area well, but I don't see an argument here that the current behavior is correct. Right?
    
    When we say we don't back-port behavior changes, we mean "changes in what is meant to be correct behavior". All bug fixes change behavior, but to restore correct behavior. So I don't see an argument against back-porting because it's a behavior change. 
    
    Of course, sometimes practical concerns override that. If we thought programs were relying on the 'wrong' behavior then we'd have to think twice about correcting it. I don't see that argument being made here, but, I'm not sure? There is evidence the 'wrong' behavior is impacting users though?
    
    @gatorsmile I must say I don't understand your position here, can you clarify? So far standard practice here says this is a reasonable backport. What's different here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by steveloughran <gi...@git.apache.org>.

Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This is one of those great problems in software engineering: no good answer. I think case-by-case is generally the best tactic, with a bias against feature backport, though my track record is a bit mixed. 
    
    Patches which fix security issues at the expense of compatibility are real problems here: they need to go in even knowing stuff will break —especially when you quietly push it out with an innocuous JIRA title until you actually do the releases. People start complaining that XML entity expansion has has stopped working, REST APIs failing if unauthed, when that is the exact outcome intended,
    
    Talk to @templedf for a good policy here



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2328/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89352/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    > The callback works for collect in R and Scala but Python doesn't. I think we should at least match the behaviour. I wonder why it's hard to say a bug when collect is detected in some APIs but not in some APIs.
    
    The behavior consistency among Python/Scala/R/JAVA does not mean a bug, right? 
    
    > That's because the change was big and invasive. I wouldn't backport it too; however, this fix is relatively small.
    
    This is the reason why we did not backport that PR. We still can backport the minimal changes to the previous releases.
    
    > I think we usually use committer's judgement when we make an exception. I already have been seeing many backports that actually causes behaviour changes and I did this because it looks being backported in general. This is the reason why we should formally document it if this is actually the rule.
    
    I am not against this specific PR. All the committers need to be really careful when they make a decision to backport a behavior change. If any committer does it, we should jump in and stop the backport. This is what we should do.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This is not an new feature addition .. this fixes an exiting functionality to work as expected and consistently .. 
    Sure, that'd be great. Will join in the discussion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes coll...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/21060


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89363/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89327/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Yup, that should reduce some overhead like this. I would like to listen what you guys think cc @srowenn, @vanzin, @felixcheung, @holdenk too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21060

I am not saying we shouldn't be careful. I am trying to be careful when I backport. So, your reasons are:

- any behaviour changes shouldn't be backported and it's the basic backport rule

I disagree unless it's clearly documented as a rule. Even if so, I would like to make this as an exception because it's less invasive, looks a bug, affects an actual user group and fixes the case to make it sense. That's what I have been used to so far.

- the query execution listener is not clearly defined

I am seeing `collect` is included in the original commit - https://github.com/apache/spark/commit/15ff85b3163acbe8052d4489a00bcf1d2332fcf0. I don't see a reason to specifically exclude PySpark's case since Scala and R also work. I don't think we would exclude this on purpose.

- It's not a critical issue nor a regression.

I don't think we should only make a backport for a critical issue or a regression. That's a strong reason to backport but there are still other cases that can be backported based on my understanding and observations. If it's a bug quite clearly and it affects an actual user group, I would guess it can be valuable for a backport. The fix is straightforward, less invasive and small.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    `withCallback` was added in Spark 1.6 release https://issues.apache.org/jira/browse/SPARK-11068 Since then, my understanding is we never clearly define which should be part of `withCallback`. Thus, it is hard to say this is a bug fix. 
    
    We hit the similar issue in https://github.com/apache/spark/pull/18064. At that time, we did not backport the PR to the previous releases too. Thus, I do not think we should make an exception for this PR just because the customers of @HyukjinKwon hit this issue. If we make an exception, it becomes harder to decide which PRs are qualified for a backport. 
    
    We need to be very careful when backporting the PR with the behavior changes, especially when this is **neither a critical issue nor a regression**. Thus, I do not think we should backport this PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by steveloughran <gi...@git.apache.org>.

Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    * from the ASF process-police perspective, something like versioning/backport policy is something which should be done on the ASF dev list...consider asking in user@ to see what people's preferences are. Worthwhile mentioning in the project report too.
    * from a personal perspective: its good to have a policy, but really good to leave a little bit of wiggle-room, even if its something like 'a vote on the developer list can override any policy on a case-by-case basis". That is: you can do more than just fixes, but its something where the decision is opened up. This makes clear the cost *and* avoids the "why did you cherry-pick this without asking me" conversations.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    I guess the behaviour changes here is that a custom query execution listener now can recognise the action `collect` in PySpark which other APIs have detected. Mind explaining how it breaks external apps? If the callback should not be called specifically `collect` but not other actions like `show` in PySpark, I would say it should be to blame yours apps.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This is not just about just inconsistency but a bug. The previous behaivour doesn't make sense.
    
    Sure, no need to rush.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    **[Test build #89327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89327/testReport)** for PR 21060 at commit [`4656724`](https://github.com/apache/spark/commit/4656724d27c208d794f99691cfbf93b4bb118d93).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    How about we formally document that in the guide?
    
    I have been always putting more importance on practice and I personally think we are fine to make a backport if it's a bug and the fix is straightforward. IMHO, principal is a base but we should put more importance on practice. 
    
    Even if I take your words, I would then like to make this as an exception since this fixes actual usecases from our customers.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    I agree that It's better to avoid a behaviour change but this one is a clearly a bug and the fix is straightforward. I am puzzled why this specifically prompted you. I wouldn't revert if there's not specific worry about this patch.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    This was a bug fix from my perspective and looked to be low risk.  I don't think this changes any behavior for the user, except if you do a `collect` from pyspark and have a `QueryExecutionListener`, then it will now get the expected callback instead of nothing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2323/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    I am fine to accept different opinions for this specific PR. Reverting this PR is not my goal here. This is a public community. It sounds like the commit message clearly delivers what this PR does to you: `This PR proposes to add collect to a query executor as an action.`, although I still have different opinions. We need to collect and accept different opinions always. 
    
    I am also glad you agree on the backport policy I proposed above. Hopefully, everyone is on the same page for avoiding unnecessary overhead. 
    
    > The minor bug fixes/improvements that have external behavior changes
    
    I personally thought this PR fits this category. No matter whether the behavior changes are correct or not, we should still not backport it if the issue is neither critical nor a regression. That is what I emphasized in the above argument multiple times. The API inconsistency is not rare between our APIs. We did not backport these PRs. Now, I am fine to backport it because it is an experimental API. Thus, we can say we do not guarantee the backport compatibility. If it were a public API, I would insist my original opinion. 
    
    I am also glad many community members have a lot of experience with mission critical software development. This can help improve documentation, code quality and test coverage. Development of application/mobile software is completely different from development of system software. We are in the right direction. We need to enforce it with stricter discipline. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21060
  
    > withCallback was added in Spark 1.6 release https://issues.apache.org/jira/browse/SPARK-11068 Since then, my understanding is we never clearly define which should be part of withCallback. Thus, it is hard to say this is a bug fix.
    
    The callback works for `collect` in R and Scala but Python doesn't. I think we should at least match the behaviour. I wonder why it's hard to say a bug when `collect` is detected in some APIs but not in some APIs.
    
    > We hit the similar issue in #18064. At that time, we did not backport the PR to the previous releases too.
    
    That's because the change was big and invasive. I wouldn't backport it too; however, this fix is relatively small.
    
    > Thus, I do not think we should make an exception for this PR just because the customers of @HyukjinKwon hit this issue
    
    It's not because my customers but I am saying it fixes an actual usecase and it affects actual users.
    
    > If we make an exception, it becomes harder to decide which PRs are qualified for a backport.
    
    I think we usually use committer's judgement when we make an exception. I already have been seeing many backports that actually causes behaviour changes and I did this because it looks being backported in general. This is the reason why we should formally document it if this is actually the rule.
    
    What I am less sure is, why this one specifically prompted you.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org