Posted to reviews@spark.apache.org by ahirreddy <gi...@git.apache.org> on 2014/08/13 03:18:04 UTC

[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

GitHub user ahirreddy opened a pull request:

    https://github.com/apache/spark/pull/1914

    [SQL] Python JsonRDD UTF8 Encoding Fix

    Only encode unicode objects to UTF-8, and not strings
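
The idea in the description can be sketched roughly as follows. This is a Python 3 analogue and not the patch itself: the diff quoted in this thread is truncated after the new `isinstance` line, so the rest of the body is an assumption, and the function name `to_utf8_lines` is hypothetical. Here Python 3 `str` stands in for Python 2 `unicode`, and `bytes` stands in for Python 2 `str`.

```python
from typing import Iterable, Iterator


def to_utf8_lines(items: Iterable[object]) -> Iterator[bytes]:
    """Yield each item as bytes, encoding only text objects to UTF-8."""
    for x in items:
        # Python 2 original: `if not isinstance(x, basestring): x = unicode(x)`
        if not isinstance(x, (str, bytes)):
            x = str(x)
        # Python 2 original: `if isinstance(x, unicode):`
        if isinstance(x, str):
            x = x.encode("utf-8")
        # Byte strings pass through untouched, whatever their encoding.
        yield x


result = list(to_utf8_lines(["é", b"\xd6\xd0", 42]))
```

Under these assumptions, text is encoded, non-string values are stringified and then encoded, and existing byte strings are left alone.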


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ahirreddy/spark json-rdd-unicode-fix1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1914.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1914
    
----
commit ca4e9bad177a98abc0e7e634b3c3c47fa443877f
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-08-13T01:13:42Z

    Encoding Fix

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52135573
  
    QA results for PR 1914:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18501/consoleFull




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52131090
  
    QA tests have started for PR 1914. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18501/consoleFull




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52117952
  
    test this please




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52130741
  
    Jenkins, test this please.




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52120767
  
    QA tests have started for PR 1914. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18479/consoleFull




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1914#discussion_r16198127
  
    --- Diff: python/pyspark/sql.py ---
    @@ -1267,7 +1267,9 @@ def func(iterator):
                 for x in iterator:
                     if not isinstance(x, basestring):
                         x = unicode(x)
    -                yield x.encode("utf-8")
    +                if isinstance(x, unicode):
    --- End diff --
    
    Yes. If x is a str encoded as "GBK", it will fail, because calling x.encode("utf-8") on a str means Python will first try x.decode("ascii").
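
A minimal sketch of that failure mode, written for Python 3 so it can be run today. It only simulates the Python 2 behaviour: `bytes` stands in for Python 2 `str`, the helper `py2_style_encode` is hypothetical, and the implicit decode is assumed to use the default `ascii` codec.

```python
def py2_style_encode(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Mimic Python 2 str.encode(): implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode(encoding)


# Non-ASCII text stored as GBK bytes, as in the case described above.
gbk_bytes = "中文".encode("gbk")

try:
    py2_style_encode(gbk_bytes)
    failed = False
except UnicodeDecodeError:
    # The implicit ASCII decode rejects every byte above 0x7F.
    failed = True

# Pure-ASCII byte strings survive the round trip unchanged.
ascii_ok = py2_style_encode(b"plain")
```

The round trip only works for pure-ASCII byte strings, which is why unconditionally calling `encode` can go unnoticed until non-ASCII input arrives.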




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52009623
  
    lgtm




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52120342
  
    Jenkins, test this please.




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-51999172
  
    QA tests have started for PR 1914. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18402/consoleFull




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52217488
  
    I've merged this to master and 1.1.  Thanks!
    
    Have we created the followup JIRA issue for `saveAsTextFile`?




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52126239
  
    test this please




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52525420
  
    I created a followup JIRA here: https://issues.apache.org/jira/browse/SPARK-3103




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1914#discussion_r16198278
  
    --- Diff: python/pyspark/sql.py ---
    @@ -1267,7 +1267,9 @@ def func(iterator):
                 for x in iterator:
                     if not isinstance(x, basestring):
                         x = unicode(x)
    -                yield x.encode("utf-8")
    +                if isinstance(x, unicode):
    --- End diff --
    
    Ah, okay.  Let's fix this in `saveAsTextFile`, too.  Should we open a JIRA for this, since it's also a bug in existing code that users might encounter?




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52126304
  
    Jenkins, test this please.




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by loveconan1988 <gi...@git.apache.org>.
Github user loveconan1988 commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-51999134
  
    ------------------ Original Message ------------------
      From: "Ahir Reddy";<no...@github.com>;
      Sent: Wednesday, August 13, 2014, 9:18 AM
      To: "apache/spark"<sp...@noreply.github.com>;

      Subject: [spark] [SQL] Python JsonRDD UTF8 Encoding Fix (#1914)

    Only encode unicode objects to UTF-8, and not strings
     
     
    You can merge this Pull Request by running
      git pull https://github.com/ahirreddy/spark json-rdd-unicode-fix1 
    Or view, comment on, or merge it at:
     
      https://github.com/apache/spark/pull/1914
     
    Commit Summary
      
    Encoding Fix
     
    File Changes
      
    M python/pyspark/sql.py (4) 
     
    Patch Links:
      
    https://github.com/apache/spark/pull/1914.patch
     
    https://github.com/apache/spark/pull/1914.diff
     
    —
    Reply to this email directly or view it on GitHub.




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1914#discussion_r16197946
  
    --- Diff: python/pyspark/sql.py ---
    @@ -1267,7 +1267,9 @@ def func(iterator):
                 for x in iterator:
                     if not isinstance(x, basestring):
                         x = unicode(x)
    -                yield x.encode("utf-8")
    +                if isinstance(x, unicode):
    --- End diff --
    
    Why do we need the `isinstance` here?  In `saveAsTextFile`, we just unconditionally encode strings as UTF-8:
    
    ```python
            def func(split, iterator):
                for x in iterator:
                    if not isinstance(x, basestring):
                        x = unicode(x)
                    yield x.encode("utf-8")
            keyed = self.mapPartitionsWithIndex(func)
            keyed._bypass_serializer = True
    ```
    
    Is there a bug in this `saveAsTextFile` code?




[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1914#issuecomment-52124969
  
    QA results for PR 1914:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18479/consoleFull

