You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/10/08 21:10:21 UTC

[jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum

    [ https://issues.apache.org/jira/browse/SPARK-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558724#comment-15558724 ] 

Reynold Xin commented on SPARK-10496:
-------------------------------------

I think there are two separate issues here:

1. The API to run cumulative sum right now is fairly awkward. Either do it through a complicated join, or through window functions that still look fairly verbose. I've created a notebook that contains two short examples to do this in SQL and in DataFrames: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html

It would make sense to me to create a simpler API for this case, since it is very common. This API under the hood can just call the existing window function API.

2. The implementation, for cases when there is a single window partition, is slow, because it requires shuffling all the data. This can technically be run just a prefix scan. In this case, I'd have an optimizer rule or physical plan changes to improve this.



> Efficient DataFrame cumulative sum
> ----------------------------------
>
>                 Key: SPARK-10496
>                 URL: https://issues.apache.org/jira/browse/SPARK-10496
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Goal: Given a DataFrame with a numeric column X, create a new column Y which is the cumulative sum of X.
> This can be done with window functions, but it is not efficient for a large number of rows.  It could be done more efficiently using a prefix sum/scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org