Posted to issues@spark.apache.org by "Maciej Szymkiewicz (JIRA)" <ji...@apache.org> on 2015/08/14 16:14:45 UTC
[jira] [Updated] (SPARK-9978) Window functions require partitionBy to work as expected
[ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Szymkiewicz updated SPARK-9978:
--------------------------------------
Description:
I am trying to reproduce the following SQL query:
{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
using the PySpark API. Unfortunately, it doesn't work as expected:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
+----+--+
|   x|rn|
+----+--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
+----+--+
{code}
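For reference, the semantics expected from row_number() OVER (ORDER BY x) can be sketched in plain Python, without Spark. The helper row_number and its partition_key parameter are hypothetical names introduced only for this illustration; they are not part of the PySpark API. With no partitioning, all rows form a single partition and are numbered 1..n in sort order. If every row ends up in its own partition, every row gets 1, which is the numbering shown above. Whether that is what actually happens inside PySpark is an assumption and is not established by this report.

```python
def row_number(rows, order_key, partition_key=lambda row: None):
    """Emulate row_number() OVER (PARTITION BY ... ORDER BY ...) on a list of dicts.

    The default partition_key maps every row to the same partition,
    which corresponds to a window with no PARTITION BY clause.
    """
    counters = {}  # running row count per partition
    result = []
    for row in sorted(rows, key=order_key):
        part = partition_key(row)
        counters[part] = counters.get(part, 0) + 1
        result.append(dict(row, rn=counters[part]))
    return result


rows = [{"x": 0.5}, {"x": 0.25}, {"x": 0.75}]

# Single partition: the numbering the SQL query produces.
single_partition = row_number(rows, order_key=lambda r: r["x"])

# One partition per row (hypothetical): every row is numbered 1.
partition_per_row = row_number(
    rows, order_key=lambda r: r["x"], partition_key=lambda r: r["x"]
)

print(single_partition)
print(partition_per_row)
```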
As a workaround, it is possible to call partitionBy without any arguments:
{code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
but, as far as I can tell, this is not documented and is rather counterintuitive given the SQL syntax. Moreover, this problem doesn't affect the Scala API:
{code}
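import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import sqlContext.implicits._  // enables the $"x" column syntax (already in scope in spark-shell)
case class Record(x: Double)
// The list must be terminated with Nil for the :: chain to compile
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show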
+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
> Window functions require partitionBy to work as expected
> --------------------------------------------------------
>
> Key: SPARK-9978
> URL: https://issues.apache.org/jira/browse/SPARK-9978
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.4.1
> Reporter: Maciej Szymkiewicz