Posted to dev@spark.apache.org by Jeremy Freeman <fr...@gmail.com> on 2014/10/18 00:23:15 UTC

sampling broken in PySpark with recent NumPy

Hi all,

I found a significant bug in PySpark's sampling methods, due to a recent NumPy change (as of v1.9). I created a JIRA (https://issues.apache.org/jira/browse/SPARK-3995), but wanted to share here as well in case anyone hits it.

Steps to reproduce are:

> foo = sc.parallelize(range(1000),5)
> foo.takeSample(False, 10)

Which returns:

> PythonException: Traceback (most recent call last):
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream
>     for obj in iterator:
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched
>     for item in iterator:
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func
>     if self.getUniformSample(split) <= self._fraction:
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample
>     self.initRandomGenerator(split)
>   File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator
>     self._random = numpy.random.RandomState(self._seed)
>   File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397)
>   File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697)
> ValueError: Seed must be between 0 and 4294967295
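The NumPy behavior change can be reproduced directly, without Spark. This is a minimal sketch (the exact seed values are illustrative, not the ones Spark generates):

```python
import numpy

# NumPy >= 1.9 rejects seeds outside [0, 2**32 - 1] instead of
# silently truncating them, so an out-of-range seed now raises.
try:
    numpy.random.RandomState(2 ** 32)
    raised = False
except ValueError:
    raised = True

# A seed at the top of the accepted range still works as before.
rng = numpy.random.RandomState(2 ** 32 - 1)
value = rng.random_sample()
```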

The problem is that NumPy used to silently truncate random seeds at or above 2 ** 32, but as of v1.9 it throws an error instead (due to this patch: https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). Since PySpark's rddsampler.py can generate seeds outside that range, this reliably breaks our sampling. I'll put a PR in shortly with the fix.
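One way to work around it (a sketch only, not necessarily the patch that will land; `bounded_rng` is a hypothetical helper) is to fold the seed into NumPy's accepted range before constructing the generator:

```python
import numpy

def bounded_rng(seed):
    # Hypothetical helper: reduce an arbitrary Python integer seed
    # modulo 2**32 so it lands in NumPy's accepted range
    # [0, 2**32 - 1], mimicking the old silent-truncation behavior.
    return numpy.random.RandomState(seed % (2 ** 32))

# A seed far larger than 2**32 no longer raises.
rng = bounded_rng(7 ** 30)
sample = rng.random_sample()
```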

— Jeremy

-------------------------
jeremyfreeman.net
@thefreemanlab