Posted to reviews@spark.apache.org by ahirreddy <gi...@git.apache.org> on 2014/08/13 03:18:04 UTC
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
GitHub user ahirreddy opened a pull request:
https://github.com/apache/spark/pull/1914
[SQL] Python JsonRDD UTF8 Encoding Fix
Only encode unicode objects to UTF-8, and not strings
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ahirreddy/spark json-rdd-unicode-fix1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1914.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1914
----
commit ca4e9bad177a98abc0e7e634b3c3c47fa443877f
Author: Ahir Reddy <ah...@gmail.com>
Date: 2014-08-13T01:13:42Z
Encoding Fix
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52135573
QA results for PR 1914:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18501/consoleFull
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52131090
QA tests have started for PR 1914. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18501/consoleFull
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52117952
test this please
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52130741
Jenkins, test this please.
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52120767
QA tests have started for PR 1914. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18479/consoleFull
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/1914#discussion_r16198127
--- Diff: python/pyspark/sql.py ---
@@ -1267,7 +1267,9 @@ def func(iterator):
     for x in iterator:
         if not isinstance(x, basestring):
             x = unicode(x)
-        yield x.encode("utf-8")
+        if isinstance(x, unicode):
--- End diff --
Yes. If x is a str holding GBK-encoded bytes, it will fail, because in Python 2 x.encode("utf-8") on a str first tries an implicit x.decode("ascii").
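To illustrate the failure mode described in this comment: the thread's code is Python 2, where calling `.encode("utf-8")` on a byte `str` first decodes it with the default codec (ASCII). The sketch below is a Python 3 emulation, an assumption made so that it runs on current interpreters, with the implicit step written out explicitly. The helper name `py2_style_encode` and the sample text are hypothetical and are not part of the patch.

```python
def py2_style_encode(raw_bytes):
    """Emulate Python 2's str.encode("utf-8") applied to a byte string.

    Python 2 first decodes the bytes with the default codec (ASCII) and
    only then encodes the resulting text as UTF-8.
    """
    text = raw_bytes.decode("ascii")  # the implicit step that can fail
    return text.encode("utf-8")


# Hypothetical sample: two Chinese characters stored as GBK bytes.
gbk_bytes = "\u4e2d\u6587".encode("gbk")

try:
    py2_style_encode(gbk_bytes)
    failed = False
except UnicodeDecodeError:
    # Non-ASCII bytes cannot pass through the implicit ASCII decode.
    failed = True

# Pure-ASCII bytes survive the round trip unchanged.
ascii_ok = py2_style_encode(b'{"a": 1}')
```

Here `failed` ends up `True` for the GBK input, while the ASCII-only input comes back unchanged, which is why the unconditional encode only breaks on byte strings containing non-ASCII data.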
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52009623
lgtm
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52120342
Jenkins, test this please.
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-51999172
QA tests have started for PR 1914. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18402/consoleFull
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52217488
I've merged this to master and 1.1. Thanks!
Have we created the followup JIRA issue for `saveAsTextFile`?
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52126239
test this please
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52525420
I created a followup JIRA here: https://issues.apache.org/jira/browse/SPARK-3103
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/1914#discussion_r16198278
--- Diff: python/pyspark/sql.py ---
@@ -1267,7 +1267,9 @@ def func(iterator):
     for x in iterator:
         if not isinstance(x, basestring):
             x = unicode(x)
-        yield x.encode("utf-8")
+        if isinstance(x, unicode):
--- End diff --
Ah, okay. Let's fix this in `saveAsTextFile`, too. Should we open a JIRA for this, since it's also a bug in existing code that users might encounter?
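A minimal sketch of the guard under discussion, written for Python 3 so that it runs on current interpreters (the thread's code is Python 2, where `unicode` and `str` play the roles that `str` and `bytes` play here). The function name `encode_records` and the sample inputs are hypothetical, and this is not the actual patch: text is encoded to UTF-8, byte strings pass through untouched, and any other object is converted to text first.

```python
def encode_records(iterator):
    """Yield each record as bytes, encoding only text to UTF-8."""
    for x in iterator:
        if not isinstance(x, (str, bytes)):
            # Non-string objects are converted to text first.
            x = str(x)
        if isinstance(x, str):
            # Only text needs encoding.
            x = x.encode("utf-8")
        # Byte strings are assumed to be encoded already and pass through.
        yield x


# Hypothetical inputs: text, GBK-encoded bytes, and a non-string object.
gbk_bytes = "\u4e2d\u6587".encode("gbk")
out = list(encode_records(["caf\u00e9", gbk_bytes, 42]))
```

With this guard the GBK bytes come out exactly as they went in, so no implicit decode is attempted on data that is already a byte string.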
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52126304
Jenkins, test this please.
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by loveconan1988 <gi...@git.apache.org>.
Github user loveconan1988 commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-51999134
------------------ Original Message ------------------
From: "Ahir Reddy";<no...@github.com>;
Sent: Wednesday, August 13, 2014, 9:18 AM
To: "apache/spark"<sp...@noreply.github.com>;
Subject: [spark] [SQL] Python JsonRDD UTF8 Encoding Fix (#1914)
Only encode unicode objects to UTF-8, and not strings
You can merge this Pull Request by running
git pull https://github.com/ahirreddy/spark json-rdd-unicode-fix1
Or view, comment on, or merge it at:
https://github.com/apache/spark/pull/1914
Commit Summary
Encoding Fix
File Changes
M python/pyspark/sql.py (4)
Patch Links:
https://github.com/apache/spark/pull/1914.patch
https://github.com/apache/spark/pull/1914.diff
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/1914#discussion_r16197946
--- Diff: python/pyspark/sql.py ---
@@ -1267,7 +1267,9 @@ def func(iterator):
     for x in iterator:
         if not isinstance(x, basestring):
             x = unicode(x)
-        yield x.encode("utf-8")
+        if isinstance(x, unicode):
--- End diff --
Why do we need the `isinstance` here? In `saveAsTextFile`, we just unconditionally encode strings as UTF-8:
```python
def func(split, iterator):
    for x in iterator:
        if not isinstance(x, basestring):
            x = unicode(x)
        yield x.encode("utf-8")
keyed = self.mapPartitionsWithIndex(func)
keyed._bypass_serializer = True
```
Is there a bug in this `saveAsTextFile` code?
[GitHub] spark pull request: [SQL] Python JsonRDD UTF8 Encoding Fix
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1914#issuecomment-52124969
QA results for PR 1914:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18479/consoleFull