You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/08/14 21:49:45 UTC
[jira] [Commented] (SPARK-9978) Window functions require
partitionBy to work as expected
[ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697642#comment-14697642 ]
Herman van Hovell commented on SPARK-9978:
------------------------------------------
There is a small bug in the Python code: https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L67
It is calling {{partitionBy}} instead of {{orderBy}}. I never work with Python, and I have never build Spark with Python; so I don't think it the best idea to start my journey into the world of Python this way.
cc [~rxin]/[~davies]: I know you are both super busy, but can you assign this to someone?
> Window functions require partitionBy to work as expected
> --------------------------------------------------------
>
> Key: SPARK-9978
> URL: https://issues.apache.org/jira/browse/SPARK-9978
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.4.1
> Reporter: Maciej Szymkiewicz
> Assignee: Davies Liu
>
> I am trying to reproduce following SQL query:
> {code}
> df.registerTempTable("df")
> sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
> +----+--+
> | x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
> using PySpark API. Unfortunately it doesn't work as expected:
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import rowNumber
> df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
> df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
> +----+--+
> | x|rn|
> +----+--+
> | 0.5| 1|
> |0.25| 1|
> |0.75| 1|
> +----+--+
> {code}
> As a workaround It is possible to call partitionBy without additional arguments:
> {code}
> df.select(
> df["x"],
> rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
> ).show()
> +----+--+
> | x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
> but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API:
> {code}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.rowNumber
> case class Record(x: Double)
> val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75))
> df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
> +----+--+
> | x|rn|
> +----+--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> +----+--+
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org