Posted to commits@spark.apache.org by rx...@apache.org on 2016/02/19 06:19:45 UTC
spark git commit: [SPARK-13380][SQL][DOCUMENT] Document Rand(seed) and Randn(seed) Return Indeterministic Results When Data Partitions are not fixed.

Repository: spark
Updated Branches:
  refs/heads/master 95e1ab223 -> c776fce99


[SPARK-13380][SQL][DOCUMENT] Document Rand(seed) and Randn(seed) Return Indeterministic Results When Data Partitions are not fixed.

The `rand` and `randn` functions with a `seed` argument are commonly used. Intuitively, the results of `rand` and `randn` should be deterministic when a `seed` value is provided. MS SQL Server, for example, also has a `rand` function, and its documentation describes the `seed` parameter as follows: "Seed is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same."

Update: the current implementation is unable to generate deterministic results when the partitions are not fixed. This PR documents this issue in the function descriptions.

jkbradley hit an issue and provided an example in the following JIRA: https://issues.apache.org/jira/browse/SPARK-13333
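
The partition dependence described above can be sketched without Spark. The snippet below is a simplified simulation, not the actual implementation: it assumes each partition seeds its own generator from the user seed plus the partition index, and it uses `scala.util.Random` as a stand-in generator. The names `RandSeedSketch` and `seededColumn` are hypothetical and exist only for this illustration.

```scala
import scala.util.Random

object RandSeedSketch {
  // Hypothetical stand-in for a seeded random column: every partition gets its
  // own generator seeded with (seed + partition index), and rows draw values
  // in order within their partition. This is an assumption for illustration.
  def seededColumn(partitions: Seq[Seq[Int]], seed: Long): Map[Int, Double] =
    partitions.zipWithIndex.flatMap { case (rows, partitionIndex) =>
      val rng = new Random(seed + partitionIndex)
      rows.map(row => row -> rng.nextDouble())
    }.toMap

  def main(args: Array[String]): Unit = {
    val rows = 0 until 6
    val seed = 42L

    // The same six rows, laid out in two different ways.
    val twoPartitions = rows.grouped(3).map(_.toSeq).toSeq
    val threePartitions = rows.grouped(2).map(_.toSeq).toSeq

    val a = seededColumn(twoPartitions, seed)
    val b = seededColumn(twoPartitions, seed)
    val c = seededColumn(threePartitions, seed)

    // Compare the row -> value maps produced by each layout.
    println(s"same layout, same seed:      ${a == b}")
    println(s"different layout, same seed: ${a == c}")
  }
}
```

Under these assumptions, a fixed layout reproduces the same value for every row, while re-splitting the same rows changes which generator and which draw each row receives. This suggests that callers who need reproducible output would have to keep the partitioning stable in addition to passing a seed.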

Author: gatorsmile <ga...@gmail.com>

Closes #11232 from gatorsmile/randSeed.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c776fce9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c776fce9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c776fce9

Branch: refs/heads/master
Commit: c776fce99b496a789ffcf2cfab78cf51eeea032b
Parents: 95e1ab2
Author: gatorsmile <ga...@gmail.com>
Authored: Thu Feb 18 21:19:36 2016 -0800
Committer: Reynold Xin <rx...@databricks.com>
Committed: Thu Feb 18 21:19:36 2016 -0800

----------------------------------------------------------------------
 .../spark/sql/catalyst/expressions/randomExpressions.scala       | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala     | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/c776fce9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala
index 2e70367..6be3cbc 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala
@@ -85,7 +85,7 @@ case class Randn(seed: Long) extends RDG {
 
   def this(seed: Expression) = this(seed match {
     case IntegerLiteral(s) => s
-    case _ => throw new AnalysisException("Input argument to rand must be an integer literal.")
+    case _ => throw new AnalysisException("Input argument to randn must be an integer literal.")
   })
 
   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {

http://git-wip-us.apache.org/repos/asf/spark/blob/c776fce9/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index e4ab6b4..97c6992 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1052,6 +1052,8 @@ object functions extends LegacyFunctions {
   /**
    * Generate a random column with i.i.d. samples from U[0.0, 1.0].
    *
+   * Note that this is indeterministic when data partitions are not fixed.
+   *
    * @group normal_funcs
    * @since 1.4.0
    */
@@ -1068,6 +1070,8 @@ object functions extends LegacyFunctions {
   /**
    * Generate a column with i.i.d. samples from the standard normal distribution.
    *
+   * Note that this is indeterministic when data partitions are not fixed.
+   *
    * @group normal_funcs
    * @since 1.4.0
    */

