Posted to issues@spark.apache.org by "Jeremy Freeman (JIRA)" <ji...@apache.org> on 2014/10/18 00:08:34 UTC
[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremy Freeman updated SPARK-3995:
----------------------------------
Description:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); because Anaconda is widely used, this bug is likely to affect many users.
Steps to reproduce are:
{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}
Returns:
{code}
PythonException: Traceback (most recent call last):
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func
    if self.getUniformSample(split) <= self._fraction:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample
    self.initRandomGenerator(split)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator
    self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}
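The failure can be demonstrated outside Spark entirely. A minimal check (a sketch, assuming NumPy >= 1.9 is installed) shows that RandomState now rejects any seed outside the 32-bit range:
{code}
import numpy

# NumPy >= 1.9 validates the seed range instead of silently truncating it.
numpy.random.RandomState(2 ** 32 - 1)   # largest accepted seed: fine
try:
    numpy.random.RandomState(2 ** 32)   # one past the limit
except ValueError as e:
    print(e)  # Seed must be between 0 and 4294967295
{code}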
In previous versions of NumPy, a random seed larger than 2 ** 32 would be silently truncated to 32 bits (i.e., taken modulo 2 ** 32). This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c), which instead raises a ValueError for out-of-range seeds. As a result, PySpark's code now causes an error, because the RDDSamplerBase class in pyspark.rddsampler uses:
{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}
And this often yields ints larger than 2 ** 32: sys.maxint is 2 ** 63 - 1 on 64-bit systems, so nearly every draw falls outside the 32-bit range. Effectively, this reliably breaks any sampling operation in PySpark with this NumPy version.
I am putting a PR together now (the fix is very simple!).
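For illustration, a minimal sketch of the kind of change involved (an assumption about the fix's shape, not the PR itself; the helper name is hypothetical) is to keep the generated seed within the range NumPy accepts:
{code}
import random

def _bounded_seed(seed=None):
    # Hypothetical helper: produce a seed within [0, 2 ** 32 - 1], the range
    # numpy.random.RandomState accepts as of NumPy 1.9.
    if seed is not None:
        return seed % (2 ** 32)
    return random.randint(0, 2 ** 32 - 1)
{code}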
> pyspark's sample methods do not work with NumPy 1.9
> ---------------------------------------------------
>
> Key: SPARK-3995
> URL: https://issues.apache.org/jira/browse/SPARK-3995
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 1.1.0
> Reporter: Jeremy Freeman
> Priority: Critical