Posted to user@spark.apache.org by mathewwicks <ma...@gmail.com> on 2017/04/04 02:21:34 UTC
Do we support excluding the CURRENT ROW in PARTITION BY windowing
functions?
Here is an example to illustrate my question.
In this toy example, we are collecting a list of the other products that
each user has bought, and appending it as a new column. (Also note that we
are filtering on an arbitrary column 'good_bad'.)
I would like to know if we support NOT including the CURRENT ROW in the
OVER(PARTITION BY xxx) windowing function.
For example, transaction 1 would have `other_purchases = [prod2, prod3]`
rather than `other_purchases = [prod1, prod2, prod3]`.
*------------------- Code Below -------------------*
df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (6, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()
df = df.selectExpr(
    "trans_id",
    "user_id",
    """COLLECT_LIST(CASE WHEN good_bad = 'good' THEN prod_id END)
       OVER (PARTITION BY user_id) AS other_purchases"""
)
df.show()
*----------------------------------------------------*
Here is a stackoverflow link:
https://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org