Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2022/01/06 17:05:00 UTC

[jira] [Comment Edited] (BEAM-12181) Implement parallelized (approximate) mode

    [ https://issues.apache.org/jira/browse/BEAM-12181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470050#comment-17470050 ] 

Brian Hulette edited comment on BEAM-12181 at 1/6/22, 5:04 PM:
---------------------------------------------------------------

I looked into this approach a little bit; it's discussed in the wikipedia article: https://en.wikipedia.org/wiki/Mode_(statistics)#Mode_of_a_sample

{quote}In order to estimate the mode of the underlying distribution, the usual practice is to discretize the data by assigning frequency values to intervals of equal distance, as for making a histogram, effectively replacing the values by the midpoints of the intervals they are assigned to. The mode is then the value where the histogram reaches its peak. *For small or middle-sized samples the outcome of this procedure is sensitive to the choice of interval width if chosen too narrow or too wide*; typically one should have a sizable fraction of the data concentrated in a relatively small number of intervals (5 to 10), while the fraction of the data falling outside these intervals is also sizable. An alternate approach is kernel density estimation, which essentially blurs point samples to produce a continuous estimate of the probability density function which can provide an estimate of the mode.{quote}

(emphasis mine)

I think we could do this discretization in a distributed way, but how would we select the number of bins to use to minimize the error? It might be worth looking into the kernel density estimation approach.
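To make the discretization idea concrete, here is a minimal, Beam-free sketch in plain Python (the function name and the choice of fixed equal-width bins are illustrative assumptions, not Beam API). Because the bin edges are fixed up front, per-bin counts are associative and could be computed per partition and merged by summing, which is what would make this parallelizable; the loop at the end also illustrates the sensitivity to bin count that the quoted passage warns about.

```python
# Illustrative sketch of histogram-based approximate mode.
# NOTE: not Beam code; a real implementation would compute per-bin
# counts per partition and merge them with an elementwise sum.
import random

def approximate_mode(values, num_bins=10):
    """Estimate the mode as the midpoint of the fullest equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so v == hi lands in the last bin.
        i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    peak = counts.index(max(counts))
    return lo + (peak + 0.5) * width

random.seed(0)
sample = [random.gauss(5.0, 1.0) for _ in range(10_000)]

# The estimate shifts with the bin count, showing why choosing
# num_bins well matters for the error.
for bins in (5, 10, 100):
    print(bins, approximate_mode(sample, bins))
```

The open question from above remains: the per-bin counts merge cheaply, but the accuracy of the final answer still hinges on picking num_bins (or a bin width) before seeing the data.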



> Implement parallelized (approximate) mode
> -----------------------------------------
>
>                 Key: BEAM-12181
>                 URL: https://issues.apache.org/jira/browse/BEAM-12181
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>              Labels: dataframe-api
>
> Currently we require Singleton partitioning to compute mode(). We should provide an option to compute approximate mode() which can be parallelized.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)