Posted to issues@spark.apache.org by "Maciej Szymkiewicz (JIRA)" <ji...@apache.org> on 2015/08/14 16:14:45 UTC

[jira] [Updated] (SPARK-9978) Window functions require partitionBy to work as expected

     [ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz updated SPARK-9978:
--------------------------------------
    Description: 
I am trying to reproduce the following SQL query:

{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()

+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}

using the PySpark API. Unfortunately, it doesn't work as expected:

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()

+----+--+
|   x|rn|
+----+--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
+----+--+
{code}
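
For what it's worth, the output above is exactly what explicit partitioning on x would produce: each distinct value of x ends up in its own single-row partition, so rowNumber is 1 for every row. That makes me suspect the ordering columns are somehow being treated as partitioning columns (just a guess):

{code}
# For comparison: one single-row partition per value of x gives
# the same all-ones output as the broken call above.
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy("x").orderBy("x")).alias("rn")
).show()
{code}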

As a workaround, it is possible to call partitionBy without any arguments:

{code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()

+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
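
For convenience, the workaround can be wrapped in a small helper so that callers don't have to remember the empty partitionBy() (the helper name is mine, just a sketch):

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

def row_number_over(*order_cols):
    # Globally ordered row_number(), routed through the empty
    # partitionBy() workaround described above.
    return rowNumber().over(Window.partitionBy().orderBy(*order_cols))

df.select(df["x"], row_number_over("x").alias("rn")).show()
{code}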

As far as I can tell, though, the workaround is not documented, and it is rather counterintuitive given the SQL syntax. Moreover, the problem doesn't affect the Scala API:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import sqlContext.implicits._  // provides the $"..." column syntax

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show

+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
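
Purely a guess at the cause (I have not verified this against the PySpark source): the behaviour above would be explained if the static Window.orderBy wrapper forwarded its columns to the JVM partitionBy method instead of orderBy, i.e. a copy-paste slip along these lines:

{code}
# HYPOTHETICAL sketch of the suspected slip in pyspark/sql/window.py,
# not a verified excerpt from the actual source:
@staticmethod
def orderBy(*cols):
    sc = SparkContext._active_spark_context
    # Bug: should call ...Window.orderBy(...), not partitionBy(...)
    jspec = sc._jvm.org.apache.spark.sql.expressions.Window.partitionBy(
        _to_java_cols(cols))
    return WindowSpec(jspec)
{code}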


> Window functions require partitionBy to work as expected
> --------------------------------------------------------
>
>                 Key: SPARK-9978
>                 URL: https://issues.apache.org/jira/browse/SPARK-9978
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1
>            Reporter: Maciej Szymkiewicz