You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by rx...@apache.org on 2016/11/02 18:36:22 UTC
spark git commit: [SPARK-17895] Improve doc for rangeBetween and
rowsBetween
Repository: spark
Updated Branches:
refs/heads/master 4af0ce2d9 -> 742e0fea5
[SPARK-17895] Improve doc for rangeBetween and rowsBetween
## What changes were proposed in this pull request?
Copied description for row and range based frame boundary from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56
Added examples to show different behavior of rangeBetween and rowsBetween when involving duplicate values.
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Author: buzhihuojie <re...@gmail.com>
Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/742e0fea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/742e0fea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/742e0fea
Branch: refs/heads/master
Commit: 742e0fea5391857964e90d396641ecf95cac4248
Parents: 4af0ce2
Author: buzhihuojie <re...@gmail.com>
Authored: Wed Nov 2 11:36:20 2016 -0700
Committer: Reynold Xin <rx...@databricks.com>
Committed: Wed Nov 2 11:36:20 2016 -0700
----------------------------------------------------------------------
.../apache/spark/sql/expressions/Window.scala | 55 ++++++++++++++++++++
.../spark/sql/expressions/WindowSpec.scala | 55 ++++++++++++++++++++
2 files changed, 110 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/742e0fea/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
index 0b26d86..327bc37 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala
@@ -121,6 +121,32 @@ object Window {
* and [[Window.currentRow]] to specify special boundary values, rather than using integral
* values directly.
*
+ * A row based boundary is based on the position of the row within the partition.
+ * An offset indicates the number of rows above or below the current row, the frame for the
+ * current row starts or ends. For instance, given a row based sliding frame with a lower bound
+ * offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from
+ * index 4 to index 6.
+ *
+ * {{{
+ * import org.apache.spark.sql.expressions.Window
+ * val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+ * .toDF("id", "category")
+ * df.withColumn("sum",
+ * sum('id) over Window.partitionBy('category).orderBy('id).rowsBetween(0,1))
+ * .show()
+ *
+ * +---+--------+---+
+ * | id|category|sum|
+ * +---+--------+---+
+ * | 1| b| 3|
+ * | 2| b| 5|
+ * | 3| b| 3|
+ * | 1| a| 2|
+ * | 1| a| 3|
+ * | 2| a| 2|
+ * +---+--------+---+
+ * }}}
+ *
* @param start boundary start, inclusive. The frame is unbounded if this is
* the minimum long value ([[Window.unboundedPreceding]]).
* @param end boundary end, inclusive. The frame is unbounded if this is the
@@ -144,6 +170,35 @@ object Window {
* and [[Window.currentRow]] to specify special boundary values, rather than using integral
* values directly.
*
+ * A range based boundary is based on the actual value of the ORDER BY
+ * expression(s). An offset is used to alter the value of the ORDER BY expression, for
+ * instance if the current order by expression has a value of 10 and the lower bound offset
+ * is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts a
+ * number of constraints on the ORDER BY expressions: there can be only one expression and this
+ * expression must have a numerical data type. An exception can be made when the offset is 0,
+ * because no value modification is needed, in this case multiple and non-numeric ORDER BY
+ * expression are allowed.
+ *
+ * {{{
+ * import org.apache.spark.sql.expressions.Window
+ * val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+ * .toDF("id", "category")
+ * df.withColumn("sum",
+ * sum('id) over Window.partitionBy('category).orderBy('id).rangeBetween(0,1))
+ * .show()
+ *
+ * +---+--------+---+
+ * | id|category|sum|
+ * +---+--------+---+
+ * | 1| b| 3|
+ * | 2| b| 5|
+ * | 3| b| 3|
+ * | 1| a| 4|
+ * | 1| a| 4|
+ * | 2| a| 2|
+ * +---+--------+---+
+ * }}}
+ *
* @param start boundary start, inclusive. The frame is unbounded if this is
* the minimum long value ([[Window.unboundedPreceding]]).
* @param end boundary end, inclusive. The frame is unbounded if this is the
http://git-wip-us.apache.org/repos/asf/spark/blob/742e0fea/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala b/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
index 1e85b6e..4a8ce69 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala
@@ -89,6 +89,32 @@ class WindowSpec private[sql](
* and [[Window.currentRow]] to specify special boundary values, rather than using integral
* values directly.
*
+ * A row based boundary is based on the position of the row within the partition.
+ * An offset indicates the number of rows above or below the current row, the frame for the
+ * current row starts or ends. For instance, given a row based sliding frame with a lower bound
+ * offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from
+ * index 4 to index 6.
+ *
+ * {{{
+ * import org.apache.spark.sql.expressions.Window
+ * val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+ * .toDF("id", "category")
+ * df.withColumn("sum",
+ * sum('id) over Window.partitionBy('category).orderBy('id).rowsBetween(0,1))
+ * .show()
+ *
+ * +---+--------+---+
+ * | id|category|sum|
+ * +---+--------+---+
+ * | 1| b| 3|
+ * | 2| b| 5|
+ * | 3| b| 3|
+ * | 1| a| 2|
+ * | 1| a| 3|
+ * | 2| a| 2|
+ * +---+--------+---+
+ * }}}
+ *
* @param start boundary start, inclusive. The frame is unbounded if this is
* the minimum long value ([[Window.unboundedPreceding]]).
* @param end boundary end, inclusive. The frame is unbounded if this is the
@@ -111,6 +137,35 @@ class WindowSpec private[sql](
* and [[Window.currentRow]] to specify special boundary values, rather than using integral
* values directly.
*
+ * A range based boundary is based on the actual value of the ORDER BY
+ * expression(s). An offset is used to alter the value of the ORDER BY expression, for
+ * instance if the current order by expression has a value of 10 and the lower bound offset
+ * is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts a
+ * number of constraints on the ORDER BY expressions: there can be only one expression and this
+ * expression must have a numerical data type. An exception can be made when the offset is 0,
+ * because no value modification is needed, in this case multiple and non-numeric ORDER BY
+ * expression are allowed.
+ *
+ * {{{
+ * import org.apache.spark.sql.expressions.Window
+ * val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
+ * .toDF("id", "category")
+ * df.withColumn("sum",
+ * sum('id) over Window.partitionBy('category).orderBy('id).rangeBetween(0,1))
+ * .show()
+ *
+ * +---+--------+---+
+ * | id|category|sum|
+ * +---+--------+---+
+ * | 1| b| 3|
+ * | 2| b| 5|
+ * | 3| b| 3|
+ * | 1| a| 4|
+ * | 1| a| 4|
+ * | 2| a| 2|
+ * +---+--------+---+
+ * }}}
+ *
* @param start boundary start, inclusive. The frame is unbounded if this is
* the minimum long value ([[Window.unboundedPreceding]]).
* @param end boundary end, inclusive. The frame is unbounded if this is the
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org