Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2018/05/24 19:32:37 UTC

[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/21427

    [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name

    ## What changes were proposed in this pull request?
    
    Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` assigns the resulting columns based on the position (index) of the columns in the returned pandas.DataFrame.  If a new DataFrame is returned and constructed using a dict, then the order of the columns could be arbitrary and differ from the schema defined for the UDF.  If the schema types still match, no error is raised and the user sees column names and column data mixed up.
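    
    For illustration, a minimal sketch of the mix-up (the schema, column names, and UDF below are hypothetical, not taken from the JIRA):
    
    ```
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Both fields are doubles, so if the dict-built result comes back in the
    # order ('y', 'x'), the types still match and the positional assignment
    # silently puts the wrong data under each column name.
    @pandas_udf("x double, y double", PandasUDFType.GROUPED_MAP)
    def shift(pdf):
        return pd.DataFrame({'x': pdf.x + 1.0, 'y': pdf.y * 2.0})
    ```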
    
    ## How was this patch tested?
    
    Added a test that returns a new DataFrame with a column order different from the schema.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-grouped-map-mixesup-cols-SPARK-24324

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21427
    
----
commit 0641c5a0cd690fec905829b70006de3f8a4902fc
Author: Bryan Cutler <cu...@...>
Date:   2018-05-24T19:03:02Z

    added test for diff column order

commit 8484647113144958c8ebcf3611c222119047cc96
Author: Bryan Cutler <cu...@...>
Date:   2018-05-24T19:18:47Z

    needed to adjust expected values to compare results

commit d67a8a5987d6ba4bdd65f5d5decafca2d22291ad
Author: Bryan Cutler <cu...@...>
Date:   2018-05-24T19:21:36Z

    for grouped map results, get columns based on name instead of position

----


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191596459
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    +                # Assign result columns by position if they are not named with strings
    +                return [(result[result.columns[i]], to_arrow_type(field.dataType))
    +                        for i, field in enumerate(return_type)]
    +            else:
    +                raise
    --- End diff --
    
    Ah, I saw you added documentation for this behavior. Looks good.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I think it's pretty common that users will construct a pd.DataFrame from a dict and see this issue.  So I think we have to use the column name instead of the index.  I can't remember if this was previously discussed or not.  cc @icexelloss @HyukjinKwon @ueshin 


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191016970
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    --- End diff --
    
    One potential issue is that if `to_arrow_type(field.dataType)` ever throws a KeyError, this can lead to unintended behavior. If we want to use KeyError, maybe limit the try block?
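    
    A minimal sketch of what limiting the try block could look like (illustrative only, not the actual patch; `result`, `return_type`, and `to_arrow_type` are the names already used in worker.py):
    
    ```
    # Keep only the column lookup inside the try block so that a KeyError
    # raised anywhere else would not be mistaken for a missing result column.
    try:
        pdf_columns = [result[field.name] for field in return_type]
    except KeyError:
        raise  # or fall back to positional assignment, as discussed in the review
    return [(col, to_arrow_type(field.dataType))
            for col, field in zip(pdf_columns, return_type)]
    ```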


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197509567
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
    @@ -120,4 +121,19 @@ object ArrowUtils {
           StructField(field.getName, dt, field.isNullable)
         })
       }
    +
    +  /** Return Map with conf settings to be used in ArrowPythonRunner */
    +  def getPythonRunnerConfMap(conf: SQLConf): Map[String, String] = {
    +    val timeZoneConf = if (conf.pandasRespectSessionTimeZone) {
    +      Seq(SQLConf.SESSION_LOCAL_TIMEZONE.key -> conf.sessionLocalTimeZone)
    +    } else {
    +      Nil
    +    }
    +    val pandasColsByPosition = if (conf.pandasGroupedMapAssignColumnssByPosition) {
    --- End diff --
    
    I think it's better to just omit the config for the default case; that way it's easier to process in the worker.
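    
    A rough sketch of the worker-side reading this refers to (`runner_conf` and the conf key string are illustrative assumptions, not necessarily the exact names used in the patch):
    
    ```
    # If the JVM side omits the entry for the default case, the worker only
    # needs a dict lookup with a default value and no extra parsing.
    assign_cols_by_pos = runner_conf.get(
        "spark.sql.execution.pandas.groupedMap.assignColumnsByPosition",
        "false").lower() == "true"
    ```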


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92223/
    Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/418/
    Test PASSed.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191040210
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4931,6 +4931,63 @@ def foo3(key, pdf):
             expected4 = udf3.func((), pdf)
             self.assertPandasEqual(expected4, result4)
     
    +    def test_column_order(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +        df = self.data
    +
    +        # Function returns a pdf with required column names, but order could be arbitrary using dict
    +        def change_col_order(pdf):
    +            # Constructing a DataFrame from a dict should result in the same order,
    +            # but use from_items to ensure the pdf column order is different than schema
    +            return pd.DataFrame.from_items([
    +                ('id', pdf.id),
    +                ('u', pdf.v * 2),
    +                ('v', pdf.v)])
    +
    +        ordered_udf = pandas_udf(
    +            change_col_order,
    +            'id long, v int, u int',
    +            PandasUDFType.GROUPED_MAP
    +        )
    +
    +        def positional_col_order(pdf):
    --- End diff --
    
    would it be more natural/common to `zip(range(3))` the columns, or just name them one by one explicitly?
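    
    For example, a hypothetical variant of the test helper that labels the columns positionally via `zip(range(3))` might look like this (a sketch, not the code in the PR):
    
    ```
    from collections import OrderedDict
    import pandas as pd

    def positional_col_order(pdf):
        # Columns get the integer labels 0, 1, 2, so they are matched to the
        # schema fields by position rather than by name.
        return pd.DataFrame(OrderedDict(zip(range(3), [pdf.id, pdf.v * 2, pdf.v])))
    ```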


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91275/testReport)** for PR 21427 at commit [`e322e1a`](https://github.com/apache/spark/commit/e322e1a1caa6cf422ed8b33244656345e4c13bb3).


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196442152
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
    @@ -120,4 +121,19 @@ object ArrowUtils {
           StructField(field.getName, dt, field.isNullable)
         })
       }
    +
    +  /** Return Map with conf settings to be used in ArrowPythonRunner */
    +  def getPythonRunnerConfMap(conf: SQLConf): Map[String, String] = {
    +    val timeZoneConf = if (conf.pandasRespectSessionTimeZone) {
    +      Seq(SQLConf.SESSION_LOCAL_TIMEZONE.key -> conf.sessionLocalTimeZone)
    +    } else {
    +      Nil
    +    }
    +    val pandasColsByPosition = if (conf.pandasGroupedMapAssignColumnssByPosition) {
    --- End diff --
    
    Can we do:
    ```
    val pandasColByPosition = Seq(SQLConf.PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION.key -> conf.pandasGroupedMapAssignColumnssByPosition)
    ```


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92066/testReport)** for PR 21427 at commit [`59972d6`](https://github.com/apache/spark/commit/59972d6a9ab3d8a95f5b5eed5c30d73421dbe140).


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I do think the current default behavior might be confusing to users and hard to debug. I have also received similar complaints. 
    
    I think at the very least, when the column names of the schema and the return value match but the orders are different, we should match by column name, as it is extremely unlikely the user wants any other behavior in this case. This mostly keeps the current behavior unchanged, with the exception of the "same column names, different order" case, where the new behavior is strictly better.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I would prefer if we could do this without a config because, while the current behavior can work if the user knows what they are doing, it can also fail very easily and not obviously.  So to me that seems like a bug, and we should just fix it so the feature cannot be used in a potentially dangerous way.
    
    If we need to make a config though, can it be such that it falls back to the current behavior (assigning by position) only if there is a `KeyError` and the switch is set for backwards compatibility?  Otherwise it would raise the `KeyError`.  If we did this, then (1) and (2) from https://github.com/apache/spark/pull/21427#issuecomment-392070950 could continue to work, but the following would no longer work (this seems pretty silly though):
    ```
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("a string, b float", PandasUDFType.GROUPED_MAP)
    def foo(pdf):
        return pd.DataFrame({'b': ['hi'], 'a': [1.0]})
    ```


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Thanks all for the discussion and reviews!


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196437348
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -58,18 +58,18 @@ class ArrowPythonRunner(
         new WriterThread(env, worker, inputIterator, partitionIndex, context) {
     
           protected override def writeCommand(dataOut: DataOutputStream): Unit = {
    -        PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
    -        if (respectTimeZone) {
    -          PythonRDD.writeUTF(timeZoneId, dataOut)
    -        } else {
    -          dataOut.writeInt(SpecialLengths.NULL)
    +        dataOut.writeInt(conf.size)
    --- End diff --
    
    maybe put this in a  `writeConf` method to be more explicit?


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    SGTM if it works.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    BTW, what do you think about adding a blocker to mark this feature as experimental @rxin? I think it's a pretty new feature and it should be reasonable to call it experimental.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Yea, agree with not backporting and agree with a configuration. Thing is, the configuration is inaccessible on the worker.py side. That's why I was hesitant. The safest way is just to target 3.0.0, but on the other hand there are currently many complaints too.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91275/testReport)** for PR 21427 at commit [`e322e1a`](https://github.com/apache/spark/commit/e322e1a1caa6cf422ed8b33244656345e4c13bb3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    btw, we should really mark this as experimental and allow a bit of behaviour change. I guess that's what we meant by Experimental:
    https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/common/tags/src/main/java/org/apache/spark/annotation/Experimental.java#L25



---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @icexelloss @HyukjinKwon It is always simple to deprecate the confs in the Spark 3.0 release. Let us make it configurable in the next release, which is Spark 2.4. 


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I'm sorry for the late review, but I think the current fix is still a behavior change.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I opened https://issues.apache.org/jira/browse/SPARK-24392 to continue the discussion about changing this to experimental.  IMO it was a bit of an oversight to not do so initially, but I agree that it is a little strange to change it after it has been released.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Okay, but I get that it can be smoother to go ahead. I am okay with it.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I think the config switch is for maintaining backward compatibility in case someone is hit by this, so I think it's a good idea.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I agree with the choice here. 


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    because it needs a change to our pickle protocol to access the configuration, if I remember this correctly. cc @ueshin too.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197527013
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
    @@ -120,4 +121,19 @@ object ArrowUtils {
           StructField(field.getName, dt, field.isNullable)
         })
       }
    +
    +  /** Return Map with conf settings to be used in ArrowPythonRunner */
    +  def getPythonRunnerConfMap(conf: SQLConf): Map[String, String] = {
    +    val timeZoneConf = if (conf.pandasRespectSessionTimeZone) {
    +      Seq(SQLConf.SESSION_LOCAL_TIMEZONE.key -> conf.sessionLocalTimeZone)
    +    } else {
    +      Nil
    +    }
    +    val pandasColsByPosition = if (conf.pandasGroupedMapAssignColumnssByPosition) {
    --- End diff --
    
    I am sorry, can you explain why it's easier to process in the worker?
    
    Isn't it just removing the default value here:
    https://github.com/apache/spark/pull/21427/files#diff-d33eea00c68dfd120f4ceae6381f34cdR99
    
    Also, one thing that is not great about omitting the conf for the default case is that you need to put the default value in two places (both Python and Java).


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    but as I said, it's difficult to have a configuration there. Shall we just target 3.0.0 and mark this as experimental, as I suggested in the first place? That should be the safest way.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197508262
  
    --- Diff: python/pyspark/worker.py ---
    @@ -110,9 +116,20 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +
    +        if not assign_cols_by_pos:
    +            try:
    +                # Assign result columns by schema name
    +                return [(result[field.name], to_arrow_type(field.dataType))
    +                        for field in return_type]
    +            except KeyError:
    --- End diff --
    
    This seems ok to me since it's basically the same, but I don't think we need to worry about `to_arrow_type` throwing a `KeyError`.  Is there any particular reason you suggested handling position like this?
    
    ```
    [(result.iloc[:,i], to_arrow_type(field.dataType)) for i, field in enumerate(return_type)]
    ```
    
    To me it seems better to look up by column labels, as it is done currently:
    
    ```
    [(result[result.columns[i]], to_arrow_type(field.dataType))
                    for i, field in enumerate(return_type)]
    ```


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged to master.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    retest this please


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92077/testReport)** for PR 21427 at commit [`59972d6`](https://github.com/apache/spark/commit/59972d6a9ab3d8a95f5b5eed5c30d73421dbe140).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191037717
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    +                # Assign result columns by position if they are not named with strings
    +                return [(result[result.columns[i]], to_arrow_type(field.dataType))
    +                        for i, field in enumerate(return_type)]
    +            else:
    +                raise
    --- End diff --
    
    Why do we limit this to just result columns not named with strings?
    
    In the case where we return a pd.DataFrame with matching field types but non-matching field names, do we not want to allow it?
    
    If the returned pd.DataFrame doesn't match return_type's column names, shouldn't we follow the current behavior?


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @HyukjinKwon I agree with you that 99% of people will almost certainly not use the config. I think @gatorsmile 's concern is that in the rare case that some people are actually depending on the existing behavior, they at least have an easy way to fall back. I think this is a similar situation to:
    
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1156
    
    Going forward we should probably remove all these configurations at some point (maybe when pandas_udf and toPandas are out of experimental?). @gatorsmile WDYT?


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191511343
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    +                # Assign result columns by position if they are not named with strings
    +                return [(result[result.columns[i]], to_arrow_type(field.dataType))
    +                        for i, field in enumerate(return_type)]
    +            else:
    +                raise
    --- End diff --
    
    @viirya I think it's just that it is very common for users to create a DataFrame with a dict using names as keys and not know that this can change the order of columns.  So even if the field types all match (in the case of this JIRA they were all StringTypes), there could be a mix-up between the data and column names.  This is really weird and makes it hard to figure out what is going on from the user's perspective.
    
    When defining the pandas_udf, the return type requires the field names, so if the returned DataFrame has columns indexed by strings, I think it's fair to assume that if they do not match, it was a mistake.  If the user wants to use positional columns, they can index by integers - and I'll add this to the docs.
    
    That being said, I do suppose that this slightly changes the behavior if by chance the user had gone out of their way to make a pandas_udf by specifying columns with different names than the return type schema, but still with the same field type order.  That seems pretty unlikely to me though.
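    
    For instance, a user who genuinely wants positional assignment can return a DataFrame whose columns are the default integer RangeIndex (hypothetical example):
    
    ```
    import pandas as pd

    def by_position(pdf):
        # Columns are 0, 1, 2 (not strings), so they would be matched to the
        # schema fields by position.
        return pd.DataFrame([[1, 2.0, "hello"]])
    ```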


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Please let users decide whether columns are resolved by name or by position. 


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I think so:
    
    ```
    >>> type(pd.DataFrame({'1': [1], '2': [2]}).columns)
    <class 'pandas.core.indexes.base.Index'>
    
    >>> type(pd.DataFrame([[1, 2.0, "hello"]]).columns)
    <class 'pandas.core.indexes.range.RangeIndex'>
    ```


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Yup, my impression was that there could be a corner case too, but I wasn't sure how much the corner case makes sense, and I haven't checked it closely yet. I believe elaborating the case might be helpful to judge whether we should block this or not. The current approach looks fine in general to me though. I think it's fine if it's a bit of a behaviour change as long as we mention it in the migration guide. cc @cloud-fan too.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @gatorsmile @ueshin Thanks for joining the discussion!
    
    I wonder if you agree that at least for case (1) here https://github.com/apache/spark/pull/21427#issuecomment-392070950, we should match by name rather than index by default?


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191015609
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    --- End diff --
    
    Can we just do `isinstance(name, str)` here to deal with python2/3?


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/314/
    Test PASSed.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191070228
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    --- End diff --
    
    I believe he's trying to deal with the `unicode` case in Python 2 too. `isinstance(name, basestring)` should be safer.
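    
    A small sketch of the Python 2/3 concern (the `basestring` alias below is an illustrative shim, not the code in pyspark):
    
    ```
    import sys

    if sys.version_info[0] >= 3:
        basestring = str  # Python 3 has no separate unicode type

    # Under Python 2, pandas column labels may be unicode, which a plain
    # isinstance(name, str) check would miss; basestring covers both.
    def is_string_label(name):
        return isinstance(name, basestring)
    ```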


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92048 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92048/testReport)** for PR 21427 at commit [`59972d6`](https://github.com/apache/spark/commit/59972d6a9ab3d8a95f5b5eed5c30d73421dbe140).


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191502476
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    --- End diff --
    
    Yeah, we still need to check for the possibility that python 2 uses unicode.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91124/testReport)** for PR 21427 at commit [`d67a8a5`](https://github.com/apache/spark/commit/d67a8a5987d6ba4bdd65f5d5decafca2d22291ad).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91166/testReport)** for PR 21427 at commit [`5591881`](https://github.com/apache/spark/commit/559188151c133377bbce34cbd930bc49ac166bf2).


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91167 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91167/testReport)** for PR 21427 at commit [`f50120f`](https://github.com/apache/spark/commit/f50120ffaeaa020d68144945de4b25840d5dcb1b).


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Also, I really think we should mark this feature as experimental.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I'm okay if that's the only way to get through here, but I must say I wonder who's going to intentionally switch this off. This now sounds more like a bug or a design issue to be fixed, and we rightly marked this as experimental. Not sure if it's only me feeling this way.
    
    Let's make sure that the configuration's description makes sense if we go ahead with a configuration.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/293/
    Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @BryanCutler could you also add to the documentation of grouped_map to explain the behavior?


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197525704
  
    --- Diff: python/pyspark/worker.py ---
    @@ -110,9 +116,20 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +
    +        if not assign_cols_by_pos:
    +            try:
    +                # Assign result columns by schema name
    +                return [(result[field.name], to_arrow_type(field.dataType))
    +                        for field in return_type]
    +            except KeyError:
    --- End diff --
    
    I think `result.iloc[:,i]` and `result[result.columns[i]]` are the same; you don't have to change it if you prefer `result.columns[i]`.
    
    I agree `to_arrow_type` doesn't throw `KeyError`, but in general I feel it's more robust not to assume the implementation details of `to_arrow_type`. I think the code is more concise and readable with if/else too (compared to except KeyError).


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Thing is, we already do a similar thing via createDataFrame with a list and a dictionary. I believe @icexelloss borrowed this idea from there:
    
    ```
    >>> spark.createDataFrame([["a", 1], ["c", 2]], "a: string, b: int").show()
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  c|  2|
    +---+---+
    
    >>> spark.createDataFrame([{"b":1, "a": "a"}, {"b": 2, "a": "c"}], "a: string, b: int").show()
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  c|  2|
    +---+---+
    ```


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    If we can fix it without breaking existing behavior that would be awesome.
    
    On Fri, May 25, 2018 at 9:59 AM Bryan Cutler <no...@github.com>
    wrote:
    
    > I've been thinking about this and came to the same conclusion as
    > @icexelloss <https://github.com/icexelloss> here #21427 (comment)
    > <https://github.com/apache/spark/pull/21427#issuecomment-392070950> that
    > we could really support both names and position, and fix this without
    > changing behavior.
    >
    > When the user defines as grouped map udf, the StructType has field names
    > so if the returned DataFrame has column names they should match. If the
    > user returned a DataFrame with positional columns only, pandas will name
    > the columns with an integer index (not an integer string). We could change
    > the logic to do the following:
    >
    > 1. Assign columns by name, catching a KeyError exception
    > 2. If the column names are all integers, then fall back to assigning by position
    > 3. Else raise the KeyError (most likely the user has a typo in the column name)
    >
    > I think that will solve this issue and not change the behavior, but I
    > would need to check that this will hold for different pandas versions. How
    > does that sound?



---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    @rxin @gatorsmile thanks for joining the discussion!
    
    On the configuration side, we have already some mechanism to do so for the "timezone" config:
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L48
    I'd imagine we could extend that mechanism to support an arbitrary configuration map.
    
    On the behavior side, I thought more about this and I feel a desirable behavior is to support both matching by name and by index, i.e.
    (1) If the output dataframe has the same column names as the schema, we match by column name; this is the desirable behavior when the user does:
    ```
    return pd.DataFrame({'a': ..., 'b': ...})
    ```
    (2) If the output dataframe has column names "0, 1, 2, ...", we match by indices; this is because those are the default column names when the user doesn't specify any while creating a pd.DataFrame, e.g.
    ```
    >>> pd.DataFrame([[1, 2.0, "hello"], [4, 5.0, "xxx"]])
       0    1      2
    0  1  2.0  hello
    1  4  5.0    xxx
    ``` 
    (3) throw an exception otherwise
    
    What do you think of having the new configuration support this behavior?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191502180
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4931,6 +4931,63 @@ def foo3(key, pdf):
             expected4 = udf3.func((), pdf)
             self.assertPandasEqual(expected4, result4)
     
    +    def test_column_order(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +        df = self.data
    +
    +        # Function returns a pdf with required column names, but order could be arbitrary using dict
    +        def change_col_order(pdf):
    +            # Constructing a DataFrame from a dict should result in the same order,
    +            # but use from_items to ensure the pdf column order is different than schema
    +            return pd.DataFrame.from_items([
    +                ('id', pdf.id),
    +                ('u', pdf.v * 2),
    +                ('v', pdf.v)])
    +
    +        ordered_udf = pandas_udf(
    +            change_col_order,
    +            'id long, v int, u int',
    +            PandasUDFType.GROUPED_MAP
    +        )
    +
    +        def positional_col_order(pdf):
    --- End diff --
    
    Yeah, I'll add a test for an integer index. I don't think we need to explicitly support only string or int. Only if the index is not string-based will position be used.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196437623
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala ---
    @@ -77,7 +78,7 @@ case class FlatMapGroupsInPandasExec(
         val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
         val chainedFunc = Seq(ChainedPythonFunctions(Seq(pandasFunction)))
         val sessionLocalTimeZone = conf.sessionLocalTimeZone
    -    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +    val runnerConf = ArrowUtils.getPythonRunnerConfMap(conf)
    --- End diff --
    
    nit: pythonRunnerConf?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197510171
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -58,18 +58,18 @@ class ArrowPythonRunner(
         new WriterThread(env, worker, inputIterator, partitionIndex, context) {
     
           protected override def writeCommand(dataOut: DataOutputStream): Unit = {
    -        PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
    -        if (respectTimeZone) {
    -          PythonRDD.writeUTF(timeZoneId, dataOut)
    -        } else {
    -          dataOut.writeInt(SpecialLengths.NULL)
    +        dataOut.writeInt(conf.size)
    --- End diff --
    
    I think it's fine, but I will add some comments


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91166/testReport)** for PR 21427 at commit [`5591881`](https://github.com/apache/spark/commit/559188151c133377bbce34cbd930bc49ac166bf2).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91124/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92077/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    How about we start to classify configurations by version, if it sounds better to have a configuration for each behaviour change? For example, we could have a postfix like spark_23 in the configuration name. Once we set this up properly, I think it should get rid of overhead like this.
    
    For now, I am less sure why we are making it configurable. Do you really think https://github.com/apache/spark/pull/21427#issuecomment-393351811 makes any sense? This isn't really purely positional vs. named now. See https://github.com/apache/spark/pull/21427#issuecomment-392070950; this actually covers both cases.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Here are some examples that currently work, but would no longer work under the proposed fix. These are all cases where columns are named with strings, but the names do not match the schema (let me know if I've missed any cases):
    
    1) DataFrame constructed with a dict, where column order happens to match field type order in schema
    ```python
    @pandas_udf("a string, b float", GROUPED_MAP)
    def foo(pdf):
        return pd.DataFrame({'x': ['hi'], 'y': [1.0]})
    ```
    
    2) Data used positionally and columns specified as list of strings that don't match schema
    ```python
    @pandas_udf("a string, b float", GROUPED_MAP)
    def foo(pdf):
        return pd.DataFrame([('hi', 1.0)], columns=['x', 'y'])
    ```
    
    Both of these currently work, but I think (1) is very error-prone because the dict could have reordered the cols (as reported in this JIRA), so we should not allow this under any circumstance. (2) is not problematic, but I'm not sure why anyone would do this. Unfortunately, I don't think there is any way to distinguish between the two. If we decide this should be done with a config, I'm OK with that, but if it's positional by default then a lot of people will hit this problem and not be able to tell why.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Thanks @ueshin and @gatorsmile for taking a look. I agree the proposed fix changes some behavior, but I think that behavior is either error-prone or doesn't make much sense. Let me put up some examples of cases this fix would break and we can discuss. Btw, as another point of reference, the `Row` class allows construction by position or by name, which is expected to match the schema.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92048 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92048/testReport)** for PR 21427 at commit [`59972d6`](https://github.com/apache/spark/commit/59972d6a9ab3d8a95f5b5eed5c30d73421dbe140).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Also, if this fix looks OK, should we backport? I can see a lot of people hitting this and getting wrong data, so it seems like a correctness bug to me.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4208/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/308/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21427


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r190793613
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4931,6 +4931,33 @@ def foo3(key, pdf):
             expected4 = udf3.func((), pdf)
             self.assertPandasEqual(expected4, result4)
     
    +    def test_column_order(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, col, PandasUDFType
    --- End diff --
    
    seems `col` is not used btw.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I guess sending configurations is not that difficult.
    We can write configs (as `Map[String, String]` for further configurations in the future?) before `PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)` in `ArrowPythonRunner.writeCommand()` (and `PythonUDFRunner.writeCommand()`?), and read them before reading the UDFs in `worker.py`. The `timezone` can be included in the configs.
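    As a rough, illustrative sketch of the worker-side read (assuming the existing `read_int` and `UTF8Deserializer` helpers in `pyspark.serializers`; the function name here is hypothetical):
    
    ```python
    from pyspark.serializers import read_int, UTF8Deserializer
    
    utf8_deserializer = UTF8Deserializer()
    
    def read_runner_conf(infile):
        # Mirror the JVM side: first the number of entries, then UTF-8 key/value pairs.
        runner_conf = {}
        for _ in range(read_int(infile)):
            k = utf8_deserializer.loads(infile)
            v = utf8_deserializer.loads(infile)
            runner_conf[k] = v
        return runner_conf
    ```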


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I agree with you @HyukjinKwon, but we have had lots of discussion here and are still split, so let's just add the config. There is still the off chance that someone upgrading from 2.3.0 (without the experimental tag) has one of these corner cases. I'll try to implement it the way @ueshin suggested so that we have a general way to pass configs to the worker. That way, it won't be a disruptive change when we decide to remove it in the future.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    > we can't just change the behavior. We think the old behavior doesn't make sense and users should change their code, but users may not think in this way.
    
    I think this basically means we will have a configuration for every behaviour change, whether it's a bug or not.
    
    If we have failed to explain why users might think it makes sense, how about elaborating on that rather than hypothesizing that there might be such users.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92223/testReport)** for PR 21427 at commit [`2d2ced6`](https://github.com/apache/spark/commit/2d2ced626f3ed9abf8100e5a83b007b8c8cfad99).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196437909
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala ---
    @@ -97,7 +98,7 @@ case class WindowInPandasExec(
         val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
         val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
         val sessionLocalTimeZone = conf.sessionLocalTimeZone
    -    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +    val runnerConf = ArrowUtils.getPythonRunnerConfMap(conf)
    --- End diff --
    
    nit: pythonRunnerConf?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3595/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Yea, that's what I meant. Having a configuration should be desirable, but I doubt if we should extend it that way further. A one-time thing should be fine too. I roughly guess that's going to be a minor fix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    On the config part, I haven’t looked at the code but can’t we just reorder
    the columns on the JVM side? Why do we need to reorder them on the Python
    side?
    
    On Fri, May 25, 2018 at 12:31 AM Hyukjin Kwon <no...@github.com>
    wrote:
    
    > I believe it was just a mistake to correct - we forgot to mark this as
    > experimental. It's pretty unstable and many JIRAs are being opened.
    > @BryanCutler <https://github.com/BryanCutler> mind if I ask you to go ahead
    > if you find some time? If you are busy, I will do it myself.
    >
    > cc @vanzin <https://github.com/vanzin> FYI.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/21427#issuecomment-391967423>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AATvPD8iBRMXvmS7vVSIidwnZxK1BaQ4ks5t17NlgaJpZM4UM2oZ>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191006873
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    --- End diff --
    
    I think it's possible for the column index to be many things; the user could even assign it themselves with `pdf.columns = ...`, right?
    
    As far as I can tell, using a string as a key should always result in a KeyError if it's not there. If a MultiIndex is involved, it's a little more complicated, but I don't think that's allowed anyway.
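    A quick standalone pandas check of that claim (illustrative only, not part of the patch):
    
    ```python
    import pandas as pd
    
    pdf = pd.DataFrame([[1, 2.0]])   # default integer column labels: 0 and 1
    
    try:
        pdf["id"]                    # a string key that is not a column label
    except KeyError:
        print("KeyError, as expected")
    
    pdf.columns = ["id", "v"]        # user-assigned string labels
    print(pdf["id"])                 # now resolves by name
    ```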


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191015105
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4931,6 +4931,63 @@ def foo3(key, pdf):
             expected4 = udf3.func((), pdf)
             self.assertPandasEqual(expected4, result4)
     
    +    def test_column_order(self):
    +        import pandas as pd
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType
    +        df = self.data
    +
    +        # Function returns a pdf with required column names, but order could be arbitrary using dict
    +        def change_col_order(pdf):
    +            # Constructing a DataFrame from a dict should result in the same order,
    +            # but use from_items to ensure the pdf column order is different than schema
    +            return pd.DataFrame.from_items([
    +                ('id', pdf.id),
    +                ('u', pdf.v * 2),
    +                ('v', pdf.v)])
    +
    +        ordered_udf = pandas_udf(
    +            change_col_order,
    +            'id long, v int, u int',
    +            PandasUDFType.GROUPED_MAP
    +        )
    +
    +        def positional_col_order(pdf):
    --- End diff --
    
    Can we test these two cases too? (using integer index)
    
    This should work
    ```
    pd.DataFrame(OrderedDict(zip(range(3), [pdf.id, pdf.v * 3, pdf.v])))
    ```
    
    This should not be supported. (We should only support string index and int index)
    ```
    pd.DataFrame(OrderedDict(zip(np.arange(3.0), [pdf.id, pdf.v * 3, pdf.v])))
    ```



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91124/testReport)** for PR 21427 at commit [`d67a8a5`](https://github.com/apache/spark/commit/d67a8a5987d6ba4bdd65f5d5decafca2d22291ad).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4202/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196438132
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1161,6 +1161,16 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION =
    +    buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
    +      .internal()
    +      .doc("When true, a grouped map Pandas UDF will assign columns from the returned " +
    +        "Pandas DataFrame based on position, regardless of column label type. When false, " +
    +        "columns will be looked up by name if labeled with a string and fallback to use" +
    --- End diff --
    
    Yup, I think so.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    How about making it configurable? Users can choose either to resolve by name or to resolve by position. It is hard to say which one is right. If the names do not match when users want to resolve by name, we should issue an error.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Why is it difficult?
    
    On Fri, May 25, 2018 at 12:03 AM Hyukjin Kwon <no...@github.com>
    wrote:
    
    > but as I said it's difficult to have a configuration there. Shall we just
    > target 3.0.0 and mark this as experimental as I suggested in the first
    > place? That should be the safest way.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/21427#issuecomment-391961189>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AATvPJ-ym2CEM9e_hHJxlvOwTlE-UADIks5t16yxgaJpZM4UM2oZ>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191004141
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    --- End diff --
    
    I wonder whether we should rely on KeyError or the type of the column index?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    It's at least not as trivial as the Scala side's. I am okay, but please make sure what cases we will allow with this configuration.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196435526
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
    @@ -120,4 +121,19 @@ object ArrowUtils {
           StructField(field.getName, dt, field.isNullable)
         })
       }
    +
    +  /** Return Map with conf settings to be used in ArrowPythonRunner */
    +  def getPythonRunnerConfMap(conf: SQLConf): Map[String, String] = {
    --- End diff --
    
    Maybe move this function out of `ArrowUtils`? Doesn't seem to be Arrow specific.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191503646
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    +                # Assign result columns by position if they are not named with strings
    +                return [(result[result.columns[i]], to_arrow_type(field.dataType))
    +                        for i, field in enumerate(return_type)]
    +            else:
    +                raise
    --- End diff --
    
    I think when the user specifies column names explicitly on the returned pd.DataFrame but they don't match the schema, it's most likely a bug / typo, so throwing an exception makes sense to me.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197524629
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -58,18 +58,18 @@ class ArrowPythonRunner(
         new WriterThread(env, worker, inputIterator, partitionIndex, context) {
     
           protected override def writeCommand(dataOut: DataOutputStream): Unit = {
    -        PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
    -        if (respectTimeZone) {
    -          PythonRDD.writeUTF(timeZoneId, dataOut)
    -        } else {
    -          dataOut.writeInt(SpecialLengths.NULL)
    +        dataOut.writeInt(conf.size)
    --- End diff --
    
    Ok, SGTM.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    > But we marked this as experimental ...
    
    That's also special in this case: we marked it as experimental in 2.3.1.
    
    Not a lot of behavior changes are similar to this one. To highlight:
    1. it's not marked as experimental in the first release.
    2. it missed 2.3.1, so the old behavior will be there for some time, until the next release (2.3.2 or 2.4.0).
    3. it turns runnable code into failures, and the old behavior is kind of self-consistent (by-position match). It's not like turning failures into runnable code or fixing a correctness bug.
    
    To summarize:
    1. I agree the new behavior makes more sense; we should have done that in the first place.
    2. This is a special case, as I mentioned above. We should be a little more conservative here.
    3. Adding a config is not hard. Maybe @ueshin can build the framework first for passing configs to the Python worker?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    But we marked this as experimental. If we treat old APIs and new experimental APIs in the same way, I wonder why we have the distinction. One thing I am less clear about is what kind of scenario we are worried about. I reread the discussion here and I still don't know which case we are worried about breaking.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I updated to use a conf that reverts to column assignment by position, regardless of the type of column labels in the Pandas DataFrame, and added a test for this. I also put this in a generic `Map[String, String]` so it will be less trouble to add additional confs in the future, or remove deprecated ones.
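    For reference, opting back into the old behavior would look something like this (a sketch assuming an active SparkSession named `spark`; the conf is internal, so treat the name as subject to change):
    
    ```python
    # Fall back to positional column assignment for grouped map Pandas UDFs
    spark.conf.set(
        "spark.sql.execution.pandas.groupedMap.assignColumnsByPosition", "true")
    ```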


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Pandas UDF has already been in 2 releases (2.3.0 and 2.3.1); we can't just change the behavior. We think the old behavior doesn't make sense and users should change their code, but users may not think in this way.
    
    Educating users takes time; I think providing a config can make it smoother.
    
    Note that this case is special: we turn runnable code into failures, and the old behavior is explainable (by-position match). I don't think there are a lot of behavior changes like this, so having version-specific configs seems like overkill.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91275/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I need to take a look too, but it sounds possible. WDYT @BryanCutler? BTW, that should be the most appropriate place to fix since that's actually where the problem started.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Seems fine and I am okay with it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4187/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196242012
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
    @@ -58,18 +58,18 @@ class ArrowPythonRunner(
         new WriterThread(env, worker, inputIterator, partitionIndex, context) {
     
           protected override def writeCommand(dataOut: DataOutputStream): Unit = {
    -        PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
    -        if (respectTimeZone) {
    -          PythonRDD.writeUTF(timeZoneId, dataOut)
    -        } else {
    -          dataOut.writeInt(SpecialLengths.NULL)
    +        dataOut.writeInt(conf.size)
    +        for ((k, v) <- conf) {
    +          PythonRDD.writeUTF(k, dataOut)
    +          PythonRDD.writeUTF(v, dataOut)
             }
    +        PythonUDFRunner.writeUDFs(dataOut, funcs, argOffsets)
           }
     
           protected override def writeIteratorToStream(dataOut: DataOutputStream): Unit = {
    -        val arrowSchema = ArrowUtils.toArrowSchema(schema, timeZoneId)
             val allocator = ArrowUtils.rootAllocator.newChildAllocator(
               s"stdout writer for $pythonExec", 0, Long.MaxValue)
    +        val arrowSchema = ArrowUtils.toArrowSchema(schema, timeZoneId)
    --- End diff --
    
    change this back, accidental


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    > what will happen if the given schema names are numbers? could we recognise it?
    
    The schema names will be integers as strings, so they are treated the same as any string. This only applies when the pandas column names are not strings, e.g. integers.
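    To illustrate that point, a hypothetical sketch (not from this PR's tests):
    
    ```python
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType
    
    # The schema field names are the strings "0" and "1", even though they look numeric.
    schema = StructType([StructField("0", LongType()), StructField("1", DoubleType())])
    
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def numeric_names(pdf):
        # pandas gives the returned columns the integer labels 0 and 1 (not the
        # strings "0" and "1"), so they fall back to assignment by position.
        return pd.DataFrame([[1, 2.0]])
    ```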


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    cc @rxin 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92066/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91167/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    If this has been released, you can't just change it like this; it will break users' programs immediately. At the very least, introduce a flag so it can be set by the user to avoid breaking their code.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    For configuration, I wasn't sure if we should send the whole configuration map to the worker.py side, or if we should change the way the command is written, and I was also thinking the current timezone approach is kind of a one-time thing. I was just wondering if we really should do that. I am okay if that's the only way and I should add a configuration for 2.4. Just for clarification, we should probably remove this configuration in 3.0.0 too.
    
    For the current approach, I thought we had better check whether other cases could possibly be broken and see if that makes sense, rather than just blocking this only because there is a bit of a behaviour change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Do not backport this to 2.3. This is a behavior change. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I ran into @ueshin and @gatorsmile at the Summit. It seems the preferable way to move forward is to have a configuration to fall back to the existing behavior and change the default behavior to what's proposed here.
    
    @BryanCutler @gatorsmile @ueshin what do you think?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    2.3.1 wouldn't have this behaviour change, and we marked this as experimental. So, on the other hand, it probably gives us more time to communicate that this is discouraged in production and that there might be a bit of behaviour change. Actually, it isn't a long time compared to other APIs we have, on the other hand ...
    
    > it turns runnable code into failures, and the old behavior is kind of self-consistent (by-position match). It's not like turning failures into runnable code or fixing a correctness bug.
    
    It still sounds like we are treating this API as an old, stable API. It doesn't replace the self-consistent way completely. This PR partially fixes its behaviour to make more sense, causing some corner-case behaviour changes which are quite unlikely and make no sense (IMHO).
    
    We should be relatively less conservative with new and experimental APIs, to promote making them more stable and coherent as soon as possible, until we remove the experimental note ..
    
    The only special reason I see is that it's not a correctness bug but it changes the existing behaviour (which I actually don't completely agree with, but I get what you mean at least). But then what can we do for experimental APIs specifically .. ?
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I agree it should have started as experimental. It is pretty weird to mark
    something experimental after the fact, though.
    
    On Fri, May 25, 2018 at 12:23 AM Hyukjin Kwon <no...@github.com>
    wrote:
    
    > BTW, what do you think about adding a blocker to set this feature as
    > experimental @rxin <https://github.com/rxin>? I think it's a pretty new
    > feature and it should be reasonable to call it experimental.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/21427#issuecomment-391965470>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AATvPI2-nftoelNAPqgn19vurlYolkG8ks5t17FjgaJpZM4UM2oZ>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92223 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92223/testReport)** for PR 21427 at commit [`2d2ced6`](https://github.com/apache/spark/commit/2d2ced626f3ed9abf8100e5a83b007b8c8cfad99).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I made https://issues.apache.org/jira/browse/SPARK-24444 to improve the documentation of the current behavior, until we can resolve this.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    yea, I generally agree with that and feel the same way. I think I have talked about this with @gatorsmile and @cloud-fan multiple times. Here is my thought: we should not add a configuration only because something is going to break. We should consider whether the usage makes sense, whether it's a bug, whether the workaround is easy, whether the API is experimental or evolving, and whether the change is trivial.

    My concern is ending up with too many configurations, one for every behaviour change, without considering other options.

    I am still kind of okay with having a configuration in most cases, but let's think about the current behaviour, the proposed change, and what exactly the configuration would need to control.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196436658
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala ---
    @@ -63,7 +64,7 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi
     
       private val batchSize = conf.arrowMaxRecordsPerBatch
       private val sessionLocalTimeZone = conf.sessionLocalTimeZone
    -  private val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +  private val runnerConf = ArrowUtils.getPythonRunnerConfMap(conf)
    --- End diff --
    
    nit: runnerConf  -> pythonRunnerConf?


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92077/testReport)** for PR 21427 at commit [`59972d6`](https://github.com/apache/spark/commit/59972d6a9ab3d8a95f5b5eed5c30d73421dbe140).


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I believe it was just a mistake to correct - we forgot to mark it experimental. It's pretty unstable and many JIRAs are being opened. @BryanCutler, mind going ahead with this if you find some time? If you are busy, I will do it myself.
    
    cc @vanzin FYI.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r197509839
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala ---
    @@ -63,7 +64,7 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi
     
       private val batchSize = conf.arrowMaxRecordsPerBatch
       private val sessionLocalTimeZone = conf.sessionLocalTimeZone
    -  private val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +  private val runnerConf = ArrowUtils.getPythonRunnerConfMap(conf)
    --- End diff --
    
    ok


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Okay .. but let's make sure this case was special ...


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #92066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92066/testReport)** for PR 21427 at commit [`59972d6`](https://github.com/apache/spark/commit/59972d6a9ab3d8a95f5b5eed5c30d73421dbe140).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Just for clarification, I am okay with that, @BryanCutler, if you feel this way too.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    if adding the config is trivial, let's add it. We can pick the new behavior by default.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    I've been thinking about this and came to the same conclusion as @icexelloss here https://github.com/apache/spark/pull/21427#issuecomment-392070950 that we could really support both names and position, and fix this without changing behavior.
    
    When the user defines a grouped map udf, the StructType has field names, so if the returned DataFrame has column names they should match.  If the user returns a DataFrame with positional columns only, pandas will name the columns with an integer index (not an integer string).  We could change the logic to do the following:
    ```
    Assign columns by name, catching a KeyError exception
    If the column names are all integers, then fallback to assign by position
    Else raise the KeyError (most likely the user has a typo in the column name)
    ```
    I think that will solve this issue and not change the behavior, but I would need to check that this holds across different pandas versions.  How does that sound?
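
    A minimal sketch of that lookup order in plain pandas (the helper name and the sample frames are just for illustration, and `str` stands in for the Python 2 `basestring` check used in the worker):
    ```
    import pandas as pd

    def select_result_columns(result, field_names):
        """Pick result columns in schema order: by name first, by position as a fallback."""
        try:
            # Assign result columns by schema field name
            return [result[name] for name in field_names]
        except KeyError:
            if all(not isinstance(label, str) for label in result.columns):
                # Columns carry pandas' default integer labels, so the frame was
                # built positionally; fall back to assignment by position.
                return [result.iloc[:, i] for i in range(len(field_names))]
            # String-labeled columns that don't match the schema: likely a typo.
            raise

    # Built from a dict: columns are matched by name even though the order differs.
    named = pd.DataFrame({'v': [1.5], 'id': [1]})
    # Built from tuples: columns get integer labels 0 and 1, so position is used.
    positional = pd.DataFrame([(1, 1.5)])

    print([s.name for s in select_result_columns(named, ['id', 'v'])])       # ['id', 'v']
    print([s.name for s in select_result_columns(positional, ['id', 'v'])])  # [0, 1]
    ```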


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3684/
    Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    At first glance, I thought this issue was slightly different from https://issues.apache.org/jira/browse/SPARK-23929, but yeah, it seems to be the same.  After reading through that discussion, I guess we need to be careful about any changes.  I'm not used to creating DataFrames by position, but it is possible to do so with a list of tuples, like the example from the doctest:
    
    ```
           >>> @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)  # doctest: +SKIP
           ... def mean_udf(key, pdf):
           ...     # key is a tuple of one numpy.int64, which is the value
           ...     # of 'id' for the current group
           ...     return pd.DataFrame([key + (pdf.v.mean(),)])
      
    ```
    Then this would be a breaking change... so maybe it would be best to add better documentation for now like @HyukjinKwon mentioned in SPARK-23929, and target a change for Spark 3.0?
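
    For contrast, a rough sketch of the by-name case this change is meant to help (the UDF name and columns here are hypothetical, and it needs an active SparkSession to actually run):
    ```
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def demean_udf(pdf):
        # Built from a dict, so the column order ('v' first) can differ from the
        # declared schema order; assigning by name keeps the data aligned.
        return pd.DataFrame({'v': pdf.v - pdf.v.mean(), 'id': pdf.id})

    # e.g. df.groupBy('id').apply(demean_udf) on a DataFrame with 'id' and 'v' columns
    ```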


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196243235
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1161,6 +1161,16 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION =
    +    buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
    +      .internal()
    +      .doc("When true, a grouped map Pandas UDF will assign columns from the returned " +
    +        "Pandas DataFrame based on position, regardless of column label type. When false, " +
    +        "columns will be looked up by name if labeled with a string and fallback to use" +
    --- End diff --
    
    This can also be marked as deprecated right?
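
    If the flag lands as written in this diff, a user who still wants positional matching could presumably flip it at runtime, e.g. (assuming `spark` is an active SparkSession and the option keeps this name):
    ```
    spark.conf.set(
        "spark.sql.execution.pandas.groupedMap.assignColumnsByPosition", "true")
    ```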


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    what will happen if the given schema names are numbers? could we recognise it?


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91166/
    Test FAILed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3562/
    Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Hmmmm .. I get that this one is preferable, and I don't think we have had a discussion about this so far, if I remember correctly.

    Do you feel strongly about this, @icexelloss and @BryanCutler? If so, let's update the migration guide for 2.4.0 ... and I hope we can document this feature as experimental. I think I could be okay with that.

    Otherwise, I prefer to target this for 3.0.0 and just document the behaviour for now .. Another option is to add a configuration to control this behaviour, but I remember it's tricky to inject the configuration there.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196457433
  
    --- Diff: python/pyspark/worker.py ---
    @@ -110,9 +116,20 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +
    +        if not assign_cols_by_pos:
    +            try:
    +                # Assign result columns by schema name
    +                return [(result[field.name], to_arrow_type(field.dataType))
    +                        for field in return_type]
    +            except KeyError:
    --- End diff --
    
    I think we want to be a little more careful here; for example, a `KeyError` raised inside `to_arrow_type` could lead to unexpected behavior.
    
    How about something like this:
    ```
    if any(isinstance(name, basestring) for name in result.columns):
        return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    else:
        return [(result.iloc[:,i], to_arrow_type(field.dataType)) for i, field in enumerate(return_type)]
    ```
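
    A small plain-pandas aside on why `iloc` is the safer positional lookup here: label-based `result[result.columns[i]]` can return a DataFrame when column labels are duplicated, while `iloc` always yields a single Series:
    ```
    import pandas as pd

    df = pd.DataFrame([[1, 2]], columns=['x', 'x'])  # duplicate labels
    print(type(df[df.columns[0]]).__name__)  # DataFrame (both 'x' columns)
    print(type(df.iloc[:, 0]).__name__)      # Series
    ```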
    
    



---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92048/
    Test FAILed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    **[Test build #91167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91167/testReport)** for PR 21427 at commit [`f50120f`](https://github.com/apache/spark/commit/f50120ffaeaa020d68144945de4b25840d5dcb1b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r191076477
  
    --- Diff: python/pyspark/worker.py ---
    @@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
                     "Number of columns of the returned pandas.DataFrame "
                     "doesn't match specified schema. "
                     "Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
    -        arrow_return_types = (to_arrow_type(field.dataType) for field in return_type)
    -        return [(result[result.columns[i]], arrow_type)
    -                for i, arrow_type in enumerate(arrow_return_types)]
    +        try:
    +            # Assign result columns by schema name
    +            return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
    +        except KeyError:
    +            if all(not isinstance(name, basestring) for name in result.columns):
    --- End diff --
    
    Ah, I see. 


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3594/
    Test PASSed.


---



[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21427
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21427#discussion_r196241595
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1161,6 +1161,16 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION =
    +    buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
    +      .internal()
    +      .doc("When true, a grouped map Pandas UDF will assign columns from the returned " +
    +        "Pandas DataFrame based on position, regardless of column label type. When false, " +
    +        "columns will be looked up by name if labeled with a string and fallback to use" +
    --- End diff --
    
    need a space at end


---
