You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/02/16 07:51:40 UTC

[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20625

    [SPARK-23446][PYTHON] Explicitly check supported types in toPandas

    ## What changes were proposed in this pull request?
    
    This PR explicitly specifies the types we supported in `toPandas`. This was a hole. For example, we haven't finished the binary type support in Python side yet but now it allows as below:
    
    ```python
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    df = spark.createDataFrame([[bytearray("a")]])
    df.toPandas()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    df.toPandas()
    ```
    
    ```
         _1
    0  [97]
      _1
    0  a
    ```
    
    This should be disallowed. I think the same things also apply to nested timestamps too.
    
    I also added some nicer message about `spark.sql.execution.arrow.enabled` in the error message.
    
    ## How was this patch tested?
    
    Manually tested and tests added in `python/pyspark/sql/tests.py`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark pandas_convertion_supported_type

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20625.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20625
    
----
commit c79c6df7284b9717fe4e4c26090dcb51bf7712da
Author: hyukjinkwon <gu...@...>
Date:   2018-02-16T07:45:52Z

    Explicitly specify supported types in toPandas

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/931/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    **[Test build #87502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87502/testReport)** for PR 20625 at commit [`c79c6df`](https://github.com/apache/spark/commit/c79c6df7284b9717fe4e4c26090dcb51bf7712da).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Thanks for the fast fix! We need to merge it to SPARK-2.3.0 before RC4. Will merge it now. We can improve the fix later if anybody has better ideas. 
    
    Thanks! Merged to master/2.3
    
    Happy Lunar New Year!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20625#discussion_r168825608
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -2000,10 +2001,12 @@ def toPandas(self):
                         return _check_dataframe_localize_timestamps(pdf, timezone)
                     else:
                         return pd.DataFrame.from_records([], columns=self.columns)
    -            except ImportError as e:
    -                msg = "note: pyarrow must be installed and available on calling Python process " \
    -                      "if using spark.sql.execution.arrow.enabled=true"
    -                raise ImportError("%s\n%s" % (_exception_message(e), msg))
    +            except Exception as e:
    +                msg = (
    +                    "Note: toPandas attempted Arrow optimization because "
    +                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    +                    "to disable this.")
    +                raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
    --- End diff --
    
    Yup, please open a PR if you have a better idea.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20625#discussion_r168718112
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -2000,10 +2001,12 @@ def toPandas(self):
                         return _check_dataframe_localize_timestamps(pdf, timezone)
                     else:
                         return pd.DataFrame.from_records([], columns=self.columns)
    -            except ImportError as e:
    -                msg = "note: pyarrow must be installed and available on calling Python process " \
    -                      "if using spark.sql.execution.arrow.enabled=true"
    -                raise ImportError("%s\n%s" % (_exception_message(e), msg))
    +            except Exception as e:
    +                msg = (
    +                    "Note: toPandas attempted Arrow optimization because "
    +                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    +                    "to disable this.")
    --- End diff --
    
    Oh, that should be part of the original message. For example, I don't have PyArrow in `pypy` in my local. it shows the error like:
    
    ```
    RuntimeError: PyArrow >= 0.8.0 must be installed; however, it was not found.
    Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
    ```



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    I think `RuntimeError` is fine for now and we can improve this later with logic to fallback too - best not to try and get too clever so close to the release :)  Thanks for catching this and the quick fix @HyukjinKwon !


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87502/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    **[Test build #87504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87504/testReport)** for PR 20625 at commit [`4e5708c`](https://github.com/apache/spark/commit/4e5708ca01f048f2408ded0b039ae724b806977c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87504/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    **[Test build #87502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87502/testReport)** for PR 20625 at commit [`c79c6df`](https://github.com/apache/spark/commit/c79c6df7284b9717fe4e4c26090dcb51bf7712da).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Thank you @BryanCutler!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/932/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    **[Test build #87504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87504/testReport)** for PR 20625 at commit [`4e5708c`](https://github.com/apache/spark/commit/4e5708ca01f048f2408ded0b039ae724b806977c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20625#discussion_r168709505
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -2000,10 +2001,12 @@ def toPandas(self):
                         return _check_dataframe_localize_timestamps(pdf, timezone)
                     else:
                         return pd.DataFrame.from_records([], columns=self.columns)
    -            except ImportError as e:
    -                msg = "note: pyarrow must be installed and available on calling Python process " \
    -                      "if using spark.sql.execution.arrow.enabled=true"
    -                raise ImportError("%s\n%s" % (_exception_message(e), msg))
    +            except Exception as e:
    +                msg = (
    +                    "Note: toPandas attempted Arrow optimization because "
    +                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    +                    "to disable this.")
    --- End diff --
    
    hmm, this says why it's trying arrow and how to turn it off, but doesn't say why I have to turn it off? perhaps say something like pyarrow is not found (if it is the cause if we know)?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20625#discussion_r168823994
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -2000,10 +2001,12 @@ def toPandas(self):
                         return _check_dataframe_localize_timestamps(pdf, timezone)
                     else:
                         return pd.DataFrame.from_records([], columns=self.columns)
    -            except ImportError as e:
    -                msg = "note: pyarrow must be installed and available on calling Python process " \
    -                      "if using spark.sql.execution.arrow.enabled=true"
    -                raise ImportError("%s\n%s" % (_exception_message(e), msg))
    +            except Exception as e:
    +                msg = (
    +                    "Note: toPandas attempted Arrow optimization because "
    +                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    +                    "to disable this.")
    +                raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
    --- End diff --
    
    Should the same type of error be raised instead of `RuntimeError`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20625
  
    This was my best for a small and safe fix as possible as I could. Thanks for mering it @gatorsmile sincirely. This was my last concern about PyArrow abd Pandas.
    
    I don't mind at all if anyone opens another PR with a better idea to be clear.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20625


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org