Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:37:51 UTC

[jira] [Resolved] (SPARK-7682) Size of distributed grids still limited by cPickle

     [ https://issues.apache.org/jira/browse/SPARK-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-7682.
---------------------------------
    Resolution: Incomplete

> Size of distributed grids still limited by cPickle 
> ---------------------------------------------------
>
>                 Key: SPARK-7682
>                 URL: https://issues.apache.org/jira/browse/SPARK-7682
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.1
>         Environment: Red Hat Enterprise Linux 6.5, Spark 1.3.1 standalone in cluster mode, 2 nodes each running a 64 GB Spark slave, Python 2.7.6
>            Reporter: Toby Potter
>            Priority: Minor
>              Labels: Serializer, bulk-closed
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I'm exploring the possibility of writing a fault-tolerant distributed computing engine for multidimensional arrays. I'm finding that the Python cPickle serializer limits the size of the NumPy arrays I can distribute over the cluster.
> My example code is below:
>
> #!/usr/bin/env python
> # Python app to distribute a large NumPy array with Spark
> from pyspark import SparkContext, SparkConf
> import numpy
> appName = "Spark Test App"
> # Create a Spark context, setting the executor memory to 32 GB
> conf = SparkConf().setAppName(appName).set("spark.executor.memory", "32g")
> sc = SparkContext(conf=conf)
> # Make an 8 GB array of zeros (1024^3 doubles)
> grid = numpy.zeros((1024, 1024, 1024))
> # Now parallelise the data as a single (key, array) record
> rdd = sc.parallelize([("srcw", grid)])
> # Make the data persist in memory
> rdd.persist()
>
> When I run the code, I get the following error:
>
> Traceback (most recent call last):
>   File "test_app.py", line 20, in <module>
>     rdd = sc.parallelize([("srcw", grid)])
>   File "/spark/1.3.1/python/pyspark/context.py", line 341, in parallelize
>     serializer.dump_stream(c, tempFile)
>   File "/spark/1.3.1/python/pyspark/serializers.py", line 208, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/spark/1.3.1/python/pyspark/serializers.py", line 127, in dump_stream
>     self._write_with_length(obj, stream)
>   File "/spark/1.3.1/python/pyspark/serializers.py", line 137, in _write_with_length
>     serialized = self.dumps(obj)
>   File "/spark/1.3.1/python/pyspark/serializers.py", line 403, in dumps
>     return cPickle.dumps(obj, 2)
> SystemError: error return without exception set
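>
> A possible workaround, sketched below and untested, is to split the grid along its first axis before parallelising it, so that each pickled record stays well under the object-size limit that cPickle's protocol 2 appears to hit here. The chunk count of 64 is an arbitrary choice for illustration, giving pieces of roughly 128 MB each:
>
> # Split the 8 GB grid into 64 chunks of roughly 128 MB each
> chunks = numpy.array_split(grid, 64)
> # Distribute the chunks as separate (key, array) records
> rdd = sc.parallelize([("srcw-%d" % i, chunk) for i, chunk in enumerate(chunks)])
> rdd.persist()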


