You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2017/11/13 19:16:59 UTC

[GitHub] spark pull request #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/19738

    [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column names in createDataFrame with Arrow

    ## What changes were proposed in this pull request?
    
    If schema is passed as a list of unicode strings for column names, they should be re-encoded to 'utf-8' to be consistent.  This is similar to the #13097 but for creation of DataFrame using Arrow.
    
    ## How was this patch tested?
    
    Added new test of using unicode names for schema.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-createDataFrame-followup-unicode-SPARK-20791

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19738.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19738
    
----
commit 1be220036cc405eaef5acb77802a15bceb81c314
Author: Bryan Cutler <cu...@gmail.com>
Date:   2017-11-13T19:11:19Z

    moved re-encoding for unicode schema names to cover createDataFrame with Arrow also

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19738#discussion_r150641930
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -593,6 +593,10 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             if isinstance(schema, basestring):
                 schema = _parse_datatype_string(schema)
     
    +        # If schema is a list of unicode strings, must change encoding
    +        if isinstance(schema, (list, tuple)):
    --- End diff --
    
    This should be an `elif`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Yup, it looks so. Could we add another small test case for it as well?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Good catch @HyukjinKwon , this was an issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    **[Test build #83807 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83807/testReport)** for PR 19738 at commit [`a27aaab`](https://github.com/apache/spark/commit/a27aaab14e3146872913a604510989699875f55e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83804/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    **[Test build #83807 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83807/testReport)** for PR 19738 at commit [`a27aaab`](https://github.com/apache/spark/commit/a27aaab14e3146872913a604510989699875f55e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19738


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    I saw some cases for ML input using unicode chars, but that was it.  I think for the purposes here, it is maybe not necessary.  I manually verified that without converting unicode Pandas column names the test fails, and so that might be good enough.  What do you think @HyukjinKwon ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    > Yup, it looks so. Could we add another small test case for it as well?
    
    There is already, it's just that `str()` will convert from unicode as long as it's only ASCII characters so it passes.  Do you think we should use non ASCII in tests?  I'll check to see if that's done elsewhere in Spark..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83857/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Thanks @HyukjinKwon !


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    **[Test build #83857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83857/testReport)** for PR 19738 at commit [`611cbf9`](https://github.com/apache/spark/commit/611cbf98856907e2d44fba73715e010333230ddd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Should this https://github.com/apache/spark/pull/19738/files#diff-3b5463566251d5b09fd328738a9e9bc5R608 be this instead?
    
    ```
    # If no schema supplied by user then get the names of columns only
    if schema is None:
        schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83807/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    **[Test build #83804 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83804/testReport)** for PR 19738 at commit [`1be2200`](https://github.com/apache/spark/commit/1be220036cc405eaef5acb77802a15bceb81c314).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    **[Test build #83804 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83804/testReport)** for PR 19738 at commit [`1be2200`](https://github.com/apache/spark/commit/1be220036cc405eaef5acb77802a15bceb81c314).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    **[Test build #83857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83857/testReport)** for PR 19738 at commit [`611cbf9`](https://github.com/apache/spark/commit/611cbf98856907e2d44fba73715e010333230ddd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19738: [SPARK-20791][PYTHON][FOLLOWUP] Check for unicode column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19738
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org