You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/08/14 21:49:45 UTC
[jira] [Commented] (SPARK-9978) Window functions require partitionBy to work as expected

    [ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697642#comment-14697642 ] 

Herman van Hovell commented on SPARK-9978:
------------------------------------------

There is a small bug in the Python code: https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L67

It is calling {{partitionBy}} instead of {{orderBy}}. I never work with Python, and I have never build Spark with Python; so I don't think it the best idea to start my journey into the world of Python this way.

cc [~rxin]/[~davies]: I know you are both super busy, but can you assign this to someone?



> Window functions require partitionBy to work as expected
> --------------------------------------------------------
>
>                 Key: SPARK-9978
>                 URL: https://issues.apache.org/jira/browse/SPARK-9978
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1
>            Reporter: Maciej Szymkiewicz
>            Assignee: Davies Liu
>
> I am trying to reproduce following SQL query:
> {code}
> df.registerTempTable("df")
> sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
> +----+--+
> |   x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
> using PySpark API. Unfortunately it doesn't work as expected:
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import rowNumber
> df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
> df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
> +----+--+
> |   x|rn|
> +----+--+
> | 0.5| 1|
> |0.25| 1|
> |0.75| 1|
> +----+--+
> {code}
> As a workaround It is possible to call partitionBy without additional arguments:
> {code}
> df.select(
>     df["x"],
>     rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
> ).show()
> +----+--+
> |   x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
> but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API:
> {code}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.rowNumber
> case class Record(x: Double)
> val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75))
> df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
> +----+--+
> |   x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org