Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:15:34 UTC

[jira] [Resolved] (SPARK-19310) PySpark Window over function changes behaviour regarding Order-By

     [ https://issues.apache.org/jira/browse/SPARK-19310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19310.
----------------------------------
    Resolution: Incomplete

> PySpark Window over function changes behaviour regarding Order-By
> -----------------------------------------------------------------
>
>                 Key: SPARK-19310
>                 URL: https://issues.apache.org/jira/browse/SPARK-19310
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, PySpark
>    Affects Versions: 1.6.2, 2.0.2
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: bulk-closed, documentation
>
> I do not know whether I overlooked this in the release notes (I suspect it is intentional) or whether it is a bug. There are many Window-function-related changes and tickets, but I have not found this behaviour change documented anywhere (I searched for "text ~ "requires window to be ordered" AND created >= -40w").
> So, should I change my syntax, or will this be patched to restore the pre-2.0 behaviour?
> *Problem:*
> This code works in Spark 1.6.2:
> {noformat}
> from pyspark.sql import Window
> import pyspark.sql.functions as pysqlf
>
> test = sqlContext.createDataFrame([("a", 3), ("a", 1), ("b", 2)], ("col1", "col2"))
> testAgg = test.orderBy("col1", "col2").withColumn("col3", pysqlf.lead(pysqlf.col("col2")).over(Window.partitionBy(["col1"])))
> print(testAgg.collect())
> ### [Row(col1=u'a', col2=3, col3=1), Row(col1=u'a', col2=1, col3=None), Row(col1=u'b', col2=2, col3=None)]
> {noformat}
> But it fails in Spark 2.0.2:
> {noformat}
> Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6661935005109893810.py", line 267, in <module>
>     raise Exception(traceback.format_exc())
> Exception: Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6661935005109893810.py", line 260, in <module>
>     exec(code)
>   File "<stdin>", line 2, in <module>
>   File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 1370, in withColumn
>     return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
>   File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> AnalysisException: u"Window function lead('col2, 1, null) requires window to be ordered, please add ORDER BY clause. For example SELECT lead('col2, 1, null)(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;"
> {noformat}
> *Work around:*
> {noformat}
> from pyspark.sql import Window
> import pyspark.sql.functions as pysqlf
>
> test = sqlContext.createDataFrame([("a", 3), ("a", 1), ("b", 2)], ("col1", "col2"))
> testAgg = test.orderBy("col1", "col2").withColumn("col3", pysqlf.lead(pysqlf.col("col2")).over(Window.partitionBy(["col1"]).orderBy("col1", "col2")))
> print(testAgg.collect())
> ### [Row(col1=u'a', col2=3, col3=1), Row(col1=u'a', col2=1, col3=None), Row(col1=u'b', col2=2, col3=None)]
> {noformat}
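
For readers unfamiliar with what the workaround computes: `lead(col)` over a window returns, for each row, the value of `col` on the next row within the same partition, according to the window's ORDER BY (the DataFrame-level `orderBy` does not carry into the window, which is why Spark 2.x demands an explicit window ordering). The following is a plain-Python sketch of that semantics, not PySpark code; the function `lead_over_partition` is a hypothetical name introduced here for illustration.

```python
from itertools import groupby

def lead_over_partition(rows, partition_key, order_key, value_key):
    """Sketch of lead(value) OVER (PARTITION BY partition_key ORDER BY order_key)."""
    out = []
    # Group rows by the partition key (sorting first so groupby sees contiguous groups).
    for _, group in groupby(sorted(rows, key=lambda r: r[partition_key]),
                            key=lambda r: r[partition_key]):
        # Order rows within the partition, as the window's ORDER BY clause does.
        part = sorted(group, key=lambda r: r[order_key])
        for i, row in enumerate(part):
            # lead = value on the next row in this partition, None at the boundary.
            nxt = part[i + 1][value_key] if i + 1 < len(part) else None
            out.append({**row, "col3": nxt})
    return out

rows = [{"col1": "a", "col2": 3}, {"col1": "a", "col2": 1}, {"col1": "b", "col2": 2}]
print(lead_over_partition(rows, "col1", "col2", "col2"))
```

With the window ordered by `col2` ascending, the row ("a", 1) sees lead 3, and the last row of each partition gets None, mirroring the NULL that Spark's `lead` produces at partition boundaries.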



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org