Posted to issues@spark.apache.org by "Brad Miller (JIRA)" <ji...@apache.org> on 2014/09/29 17:50:33 UTC
[jira] [Created] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
Brad Miller created SPARK-3721:
----------------------------------
Summary: Broadcast Variables above 2GB break in PySpark
Key: SPARK-3721
URL: https://issues.apache.org/jira/browse/SPARK-3721
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.1.0
Reporter: Brad Miller
Attempting to reproduce the bug in isolation in an IPython notebook, I've observed the following three distinct failure modes, all of which seem to be related to broadcast variable size. Note that I'm running Python 2.7.3 on all machines and using the Spark 1.1.0 binaries.
**BLOCK 1** [no problem]
import cPickle
from pyspark import SparkContext

def check_pre_serialized(size):
    msg = cPickle.dumps(range(2 ** size))
    print 'serialized length:', len(msg)
    bvar = sc.broadcast(msg)
    print 'length recovered from broadcast variable:', len(bvar.value)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

def check_unserialized(size):
    msg = range(2 ** size)
    bvar = sc.broadcast(msg)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

SparkContext.setSystemProperty('spark.executor.memory', '15g')
SparkContext.setSystemProperty('spark.cores.max', '5')
sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')
**BLOCK 2** [no problem]
check_pre_serialized(20)
> serialized length: 9374656
> length recovered from broadcast variable: 9374656
> correct value recovered: True
**BLOCK 3** [no problem]
check_unserialized(20)
> correct value recovered: True
**BLOCK 4** [no problem]
check_pre_serialized(27)
> serialized length: 1499501632
> length recovered from broadcast variable: 1499501632
> correct value recovered: True
**BLOCK 5** [no problem]
check_unserialized(27)
> correct value recovered: True
**BLOCK 6** [ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]
check_pre_serialized(28)
.....
> /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
> 354
> 355 def dumps(self, obj):
> --> 356 return cPickle.dumps(obj, 2)
> 357
> 358 loads = cPickle.loads
>
> SystemError: error return without exception set
**BLOCK 7** [no problem]
check_unserialized(28)
> correct value recovered: True
**BLOCK 8** [ERROR 2: no error occurs and *incorrect result* is returned]
check_pre_serialized(29)
> serialized length: 6331339840
> length recovered from broadcast variable: 2036372544
> correct value recovered: False
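As a quick sanity check (my own arithmetic, not from the original report): the reported lengths bracket the signed 32-bit limit of 2**31 - 1 bytes, which is consistent with all three failures being 2GB/int-overflow symptoms. The truncated length in BLOCK 8 is also suspiciously close to that limit.

```python
# Illustration only: the serialized lengths reported above, compared
# against the signed 32-bit maximum (the usual C "int" limit).
INT_MAX = 2 ** 31 - 1    # 2147483647

size_27 = 1499501632     # BLOCK 4: fits in a signed 32-bit int, works
size_29 = 6331339840     # BLOCK 8: exceeds the limit, silently truncated
recovered_29 = 2036372544  # BLOCK 8: the (wrong) length recovered

assert size_27 < INT_MAX < size_29
assert recovered_29 < INT_MAX  # truncated value is back under the limit
```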
**BLOCK 9** [ERROR 3: unhandled error from zlib.compress inside sc.broadcast]
check_unserialized(29)
......
> /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
> 418
> 419 def dumps(self, obj):
> --> 420 return zlib.compress(self.serializer.dumps(obj), 1)
> 421
> 422 def loads(self, obj):
>
> OverflowError: size does not fit in an int
**BLOCK 10** [ERROR 1]
check_pre_serialized(30)
...same as above...
**BLOCK 11** [ERROR 3]
check_unserialized(30)
...same as above...
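Until this is fixed, one possible client-side workaround (a hypothetical sketch, not part of the PySpark API — `chunk_bytes` and `broadcast_chunked` are names I've made up) would be to split the pre-serialized payload into sub-2GB pieces and broadcast each piece separately, so no single buffer handed to cPickle or zlib exceeds the int limit:

```python
def chunk_bytes(payload, chunk_size=2 ** 31 - 1):
    """Split a byte string into pieces no larger than chunk_size,
    keeping each piece below the signed 32-bit int limit."""
    return [payload[i:i + chunk_size]
            for i in range(0, len(payload), chunk_size)]

def broadcast_chunked(sc, payload, chunk_size=2 ** 31 - 1):
    """Hypothetical helper: broadcast each sub-2GB chunk separately.
    Workers would reassemble with b''.join(b.value for b in bvars)."""
    return [sc.broadcast(chunk) for chunk in chunk_bytes(payload, chunk_size)]
```

The split is lossless: joining the chunks in order reproduces the original payload, and every chunk stays within the size each serializer call can handle.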
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)