Posted to issues@spark.apache.org by "Seth Hendrickson (JIRA)" <ji...@apache.org> on 2015/09/21 19:06:05 UTC

[jira] [Comment Edited] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

    [ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900991#comment-14900991 ] 

Seth Hendrickson edited comment on SPARK-10602 at 9/21/15 5:05 PM:
-------------------------------------------------------------------

[~sabyasachi.nayak] My branch is here: [SPARK-10641|https://github.com/sethah/spark/tree/SPARK-10641], and the implementation is mostly [here|https://github.com/sethah/spark/blob/SPARK-10641/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala]. {{Skewness}} and {{Kurtosis}} both use the {{StatisticalMoments}} abstract class.

Note that this is a WIP: I wrote a few tests, but they aren't passing currently, and scalastyle will also fail. The update rules are based on the descriptions [here|https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics]. I also made a pass at implementing a numerically stable update for skewness and kurtosis following the [Kahan update|https://en.wikipedia.org/wiki/Kahan_summation_algorithm]; those versions are wrapped under the {{StableMoments}} abstract class and follow the algorithms described [here|http://researcher.watson.ibm.com/researcher/files/us-ytian/stability.pdf]. I am not sure that they are working, and I haven't yet found a good way to test for numerical stability, but I'll try to get that worked out soon.
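
For readers who want to see the shape of the single-pass update rules, here is a minimal sketch in Python of the higher-order update from the Wikipedia page linked above. It is not the code in the branch: the class name {{RunningMoments}} and the helper {{moments_of}} are hypothetical, it does not include the Kahan-style compensation, and it assumes the population (biased) definitions of skewness and excess kurtosis.

```python
import math


class RunningMoments:
    """Hypothetical accumulator for the running central-moment sums of a stream."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of (x - mean)^2
        self.m3 = 0.0     # sum of (x - mean)^3
        self.m4 = 0.0     # sum of (x - mean)^4

    def update(self, x):
        """Fold one value into the running sums without revisiting earlier data."""
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Order matters: m4 needs the old m3 and m2, and m3 needs the old m2.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2
                    - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def skewness(self):
        """Population skewness; undefined (division by zero) when the variance is zero."""
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self):
        """Population excess kurtosis; undefined when the variance is zero."""
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0


def moments_of(values):
    """Run the single-pass update over an iterable and return the accumulator."""
    acc = RunningMoments()
    for v in values:
        acc.update(v)
    return acc
```

One simple correctness check, separate from the stability question, is to compare these running sums against a direct two-pass computation on a small data set.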

Also note that the legacy way of implementing aggregates is still required, so for now I have simply added placeholders for the {{AggregateFunction1}} implementations.


> Univariate statistics as UDAFs: single-pass continuous stats
> ------------------------------------------------------------
>
>                 Key: SPARK-10602
>                 URL: https://issues.apache.org/jira/browse/SPARK-10602
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, SQL
>            Reporter: Joseph K. Bradley
>            Assignee: Seth Hendrickson
>
> See parent JIRA for more details.  This subtask covers statistics for continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new JIRA.


