Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/07/23 10:21:20 UTC
[jira] [Assigned] (SPARK-16686) Dataset.sample with seed: result seems to depend on downstream usage
[ https://issues.apache.org/jira/browse/SPARK-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-16686:
------------------------------------
Assignee: (was: Apache Spark)
> Dataset.sample with seed: result seems to depend on downstream usage
> --------------------------------------------------------------------
>
> Key: SPARK-16686
> URL: https://issues.apache.org/jira/browse/SPARK-16686
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.2, 2.0.0
> Environment: Spark 1.6.2 and Spark 2.0 - RC4
> Standalone
> Single-worker cluster
> Reporter: Joseph K. Bradley
> Attachments: DataFrame.sample bug - 2.0.html
>
>
> Summary to reproduce bug:
> * Create a DataFrame DF, and sample it with a fixed seed.
> * Collect that DataFrame -> result1
> * Call a particular UDF on that DataFrame -> result2
> Expected: result1 and result2 are computed from the same sampled rows of DF. Actual: they appear not to be.
> Note: result1 and result2 are each deterministic.
> See the attached notebook for details. Cells in the notebook were executed in order.
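
The reproduction steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the code from the attached notebook: the row count, sampling fraction, seed, the helper names `sampled_ids` and `reproduce`, and the identity-style UDF are all assumptions made for illustration. The report does not say which UDF was used, so this particular UDF may or may not trigger the discrepancy.

```python
def sampled_ids(rows):
    """Return the sorted `id` values from a list of collected rows."""
    return sorted(row["id"] for row in rows)


def reproduce(spark, n=1000, fraction=0.1, seed=42):
    """Collect a seeded sample directly, and again after applying a UDF.

    `spark` is a SparkSession (2.0) or SQLContext (1.6). The values of
    n, fraction and seed are arbitrary and not taken from the notebook.
    """
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    df = spark.range(0, n)                      # DataFrame DF with one `id` column
    sampled = df.sample(False, fraction, seed)  # fixed seed, without replacement

    # Step 2 of the report: collect the sampled DataFrame.
    result1 = sampled.collect()

    # Step 3: apply a UDF to the same sampled DataFrame.
    # Hypothetical stand-in for the "particular UDF" in the report.
    plus_zero = udf(lambda x: x + 0, LongType())
    result2 = sampled.select("id", plus_zero(sampled["id"]).alias("out")).collect()

    return sampled_ids(result1), sampled_ids(result2)
```

Calling `ids1, ids2 = reproduce(spark)` and comparing the two lists is one way to check the expectation stated in the report: if the sample does not depend on downstream usage, `ids1 == ids2` should hold.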