Posted to dev@spark.apache.org by dstuck <da...@gmail.com> on 2017/01/03 23:15:16 UTC

DataFrame Distinct Sample Bug?

I ran into an issue where I'm getting unstable results after sampling a
dataframe that has had the distinct function called on it. The following
code should be deterministic given the fixed seed, yet it prints a
different answer each time the sample is recomputed.

from pyspark.sql import functions as F
d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]), ['t'])
sampled = d.distinct().sample(False, 0.01, 478)
print sampled.select(F.min('t').alias('t')).collect()
print sampled.select(F.min('t').alias('t')).collect()
print sampled.select(F.min('t').alias('t')).collect()

Either removing the distinct call or caching after sampling fixes the
problem (as does using a smaller dataframe). The Spark bug-reporting docs
dissuaded me from creating a JIRA issue without first checking with this
mailing list that it is reproducible.

I'm not familiar enough with the Spark code to fix this myself :\



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-Distinct-Sample-Bug-tp20439.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: DataFrame Distinct Sample Bug?

Posted by Reynold Xin <rx...@databricks.com>.
I get the same result every time on Spark 2.1:


Using Python version 2.7.12 (default, Jul  2 2016 17:43:17)
SparkSession available as 'spark'.
>>> from pyspark.sql import functions as F
>>>
>>> d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]),
... ['t'])
>>> sampled = d.distinct().sample(False, 0.01, 478)
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]

>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]


On Wed, Jan 4, 2017 at 8:15 AM, dstuck <da...@gmail.com> wrote:

> I ran into an issue where I'm getting unstable results after sampling a
> dataframe that has had the distinct function called on it. [...]