Posted to commits@spark.apache.org by gu...@apache.org on 2018/01/22 01:43:18 UTC

spark git commit: [SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

Repository: spark
Updated Branches:
  refs/heads/master 12faae295 -> 602c6d82d


[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

## What changes were proposed in this pull request?

The pipe action converted objects to strings in a way that depended on the default encoding setting of the Python environment, so non-ASCII input could raise an encoding error.

This patch fixes the problem. A detailed description is available here:

https://issues.apache.org/jira/browse/SPARK-20947
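
As a rough illustration of the conversion this patch changes, the sketch below mirrors the patched line outside of Spark. The names `text_type` and `to_pipe_line` are hypothetical and not part of PySpark; the sketch only assumes that each object is converted to text first and then encoded explicitly as UTF-8, so the result does not depend on the interpreter's default encoding.

```python
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
# `text_type` is a hypothetical alias used only in this sketch.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_pipe_line(obj):
    """Hypothetical helper: build the bytes written to the child's stdin."""
    # Convert to text first, normalise the trailing newline,
    # then encode explicitly as UTF-8.
    s = text_type(obj).rstrip('\n') + '\n'
    return s.encode('utf-8')


line = to_pipe_line(u'\u6d4b\u8bd5')
```

On Python 2, calling `str()` on the same value would go through the default (typically ASCII) codec instead, which is where the original error came from.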

## How was this patch tested?

Run the following statement in the pyspark shell; with this patch applied, it does NOT raise an exception:

```python
sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
```
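
For readers without a Spark shell at hand, the same round trip can be approximated with the standard library alone. This is only a sketch under stated assumptions: it targets Python 3, `pipe_round_trip` and `ECHO_CMD` are hypothetical names, a child Python process stands in for `cat`, and decoding the output as UTF-8 is an assumption of the sketch rather than a statement about PySpark internals.

```python
import subprocess
import sys

# Stand-in for `cat`: a child process that copies stdin to stdout byte for byte.
ECHO_CMD = [
    sys.executable, '-c',
    'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())',
]


def pipe_round_trip(items):
    """Hypothetical helper: write UTF-8 lines to a child process, decode its output."""
    payload = b''.join(
        (str(x).rstrip('\n') + '\n').encode('utf-8') for x in items
    )
    out = subprocess.run(
        ECHO_CMD, input=payload, stdout=subprocess.PIPE, check=True
    ).stdout
    # Split the echoed bytes into lines and decode each one as UTF-8.
    return [line.decode('utf-8') for line in out.splitlines()]


data = [u'\u6d4b\u8bd5', '1']
result = pipe_round_trip(data)
```

If encoding and decoding agree on UTF-8, `result` should equal `data`.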

Author: 王晓哲 <wx...@linkdoc.com>

Closes #18277 from chaoslawful/fix_pipe_encoding_error.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/602c6d82
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/602c6d82
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/602c6d82

Branch: refs/heads/master
Commit: 602c6d82d893a7f34b37d674642669048eb59b03
Parents: 12faae2
Author: 王晓哲 <wx...@linkdoc.com>
Authored: Mon Jan 22 10:43:12 2018 +0900
Committer: hyukjinkwon <gu...@gmail.com>
Committed: Mon Jan 22 10:43:12 2018 +0900

----------------------------------------------------------------------
 python/pyspark/rdd.py   | 2 +-
 python/pyspark/tests.py | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/602c6d82/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 340bc3a..1b39155 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -766,7 +766,7 @@ class RDD(object):
 
             def pipe_objs(out):
                 for obj in iterator:
-                    s = str(obj).rstrip('\n') + '\n'
+                    s = unicode(obj).rstrip('\n') + '\n'
                     out.write(s.encode('utf-8'))
                 out.close()
             Thread(target=pipe_objs, args=[pipe.stdin]).start()

http://git-wip-us.apache.org/repos/asf/spark/blob/602c6d82/python/pyspark/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index da99872..5115857 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1239,6 +1239,13 @@ class RDDTests(ReusedPySparkTestCase):
         self.assertRaises(Py4JJavaError, rdd.pipe('grep 4', checkCode=True).collect)
         self.assertEqual([], rdd.pipe('grep 4').collect())
 
+    def test_pipe_unicode(self):
+        # Regression test for SPARK-20947
+        data = [u'\u6d4b\u8bd5', '1']
+        rdd = self.sc.parallelize(data)
+        result = rdd.pipe('cat').collect()
+        self.assertEqual(data, result)
+
 
 class ProfilerTests(PySparkTestCase):
 

