Posted to user@spark.apache.org by Julaiti Alafate <ar...@gmail.com> on 2014/02/12 09:24:29 UTC
utf8 encoding error in serializers on pyspark 0.9.0
Hi,
I am getting this error (copied from the stderr of the worker that reports exceptions) while processing text files encoded in UTF8:
14/02/11 22:26:15 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "(base)/spark-0.9.0/python/pyspark/worker.py", line 77, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 182, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 117, in dump_stream
for obj in iterator:
File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 171, in _batched
for item in iterator:
File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 276, in load_stream
yield self.loads(stream)
File "(base)/spark-0.9.0/python/pyspark/serializers.py", line 271, in loads
return stream.read(length).decode('utf8')
File "(base)/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte
I am using PySpark. “SPARK_MEM” is set to 30g. The system is deployed in standalone mode over 17 computers. Scala is version 2.10.3. Python is version 2.7.3.
I ran the same code on the previous release (Spark 0.8.1, with Scala 2.9.3) and it executed successfully, but it fails on the latest release (0.9.0), so I am not sure whether this is a bug introduced in the latest version.
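One way to narrow this down is to confirm, outside Spark, that the input really is valid UTF-8. The sketch below is only an illustration: the helper name `first_invalid_utf8_offset` is hypothetical and not part of PySpark. It applies the same strict decode that appears in the traceback and reports where it fails. The byte 0xc0 can never occur in well-formed UTF-8, so if the source files pass this check the bad byte is more likely introduced somewhere between the JVM and the Python worker than by the data itself.

```python
def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that strict UTF-8 decoding
    rejects, or None if the whole buffer decodes cleanly.

    Hypothetical helper for illustration; not a PySpark API.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte in `data`.
        return exc.start
    return None


# 0xc0 is rejected wherever it appears, as in the traceback above.
bad_offset = first_invalid_utf8_offset(b"ok\xc0")

# Properly encoded non-ASCII text decodes without error.
good_offset = first_invalid_utf8_offset(u"h\u00e9llo".encode("utf-8"))
```

Running the helper over the raw bytes of each input file (opened in binary mode) would show whether any file contains such a byte.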
Any help would be appreciated. Thanks!
Julaiti
Re: utf8 encoding error in serializers on pyspark 0.9.0
Posted by Josh Rosen <ro...@gmail.com>.
This is probably a side effect of a bug introduced when I added custom
serialization support to PySpark
(https://spark-project.atlassian.net/browse/SPARK-1043). The fix for this
bug (https://github.com/apache/incubator-spark/pull/523) wasn't included in
Spark 0.9, but it will be in 0.9.1. It's a single commit, so you can
cherry-pick it on top of 0.9 if you don't want to wait for the next bugfix
release.
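As background on how a 0xc0 byte can reach a strict UTF-8 decoder at all, here is one possible mechanism. It is an assumption, not a confirmed diagnosis of this report. Java's `DataOutputStream.writeUTF` writes "modified UTF-8", in which U+0000 is encoded as the two bytes 0xC0 0x80 rather than a single 0x00. Python's `utf8` codec rejects that sequence with exactly the "invalid start byte" message shown in the traceback. The function name `decode_null_tolerant` below is hypothetical, and the sketch handles only the null-character difference, not the surrogate-pair encoding that modified UTF-8 uses for characters outside the Basic Multilingual Plane.

```python
def decode_null_tolerant(data):
    """Decode bytes that may use Java's two-byte encoding of U+0000.

    Hypothetical helper for illustration; not part of PySpark.
    0xC0 never appears in well-formed UTF-8, so rewriting the pair
    0xC0 0x80 to 0x00 cannot corrupt otherwise valid input.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


java_style = b"a\xc0\x80b"  # "a", U+0000, "b" in modified UTF-8

# The strict decode fails on the 0xC0 byte at offset 1.
try:
    java_style.decode("utf-8")
    strict_failed_at = None
except UnicodeDecodeError as exc:
    strict_failed_at = exc.start

# The tolerant decode recovers the embedded null character.
recovered = decode_null_tolerant(java_style)
```

If the failing records turn out to contain null characters or other unusual bytes, that would be consistent with this explanation, but applying the fix from the pull request above is still the recommended route.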