Posted to dev@datafu.apache.org by "Matthew Hayes (JIRA)" <ji...@apache.org> on 2014/01/15 00:43:19 UTC

[jira] [Updated] (DATAFU-2) UDFs for entropy and weighted sampling algorithms

     [ https://issues.apache.org/jira/browse/DATAFU-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Hayes updated DATAFU-2:
-------------------------------

    Description: 
Jian Wang has suggested that we add UDFs for entropy and weighted random sampling and has implementations for each of these ready.

In Jian's words:

"In the real world, there are occasions when we need to calculate the entropy of discrete random variables, for instance, to calculate the mutual information between variables X and Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). I would suggest implementing a UDF to calculate the entropy of given input samples, following the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

This is the reference paper I used to learn about the weighted sampling algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

The present WeightedSample.java implements Algorithm D.

We may try Algorithm A, A-res and A-expJ since they could be used in data stream and distributed environments. These algorithms could be implemented based on ReservoirSample.java (inherit from this class?) since they also need a reservoir to store the selected items."
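
For illustration, the entropy calculation proposed above could look roughly like the following. This is only a minimal sketch in plain Java, not the implementation Jian has ready and not DataFu's API: the class name EntropySketch and both method names are hypothetical, it works on in-memory lists rather than Pig bags, and the choice of log base 2 (bits) is an assumption. The mutual information helper uses the entropy-based identity I(X;Y) = H(X) + H(Y) - H(X,Y).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of DataFu.
final class EntropySketch {

    /** Empirical Shannon entropy, in bits, of a list of discrete samples. */
    static <T> double entropy(List<T> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        // Count how often each distinct value occurs.
        Map<T, Long> counts = new HashMap<>();
        for (T s : samples) {
            counts.merge(s, 1L, Long::sum);
        }
        double n = samples.size();
        double h = 0.0;
        for (long c : counts.values()) {
            double p = c / n;                      // empirical probability
            h -= p * (Math.log(p) / Math.log(2));  // -p * log2(p)
        }
        return h;
    }

    /** Mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y), for paired samples. */
    static <A, B> double mutualInformation(List<A> xs, List<B> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        // Represent each joint observation (x, y) as a two-element list,
        // which has value-based equals/hashCode.
        List<List<Object>> joint = new ArrayList<>(xs.size());
        for (int i = 0; i < xs.size(); i++) {
            joint.add(Arrays.<Object>asList(xs.get(i), ys.get(i)));
        }
        return entropy(xs) + entropy(ys) - entropy(joint);
    }
}
```

For example, EntropySketch.entropy(Arrays.asList("a", "b")) is 1 bit (two equally likely values), and the mutual information of a variable with itself equals its entropy. A real UDF would presumably need to consume samples in a streaming fashion rather than hold them in a list.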

  was:
Jian Wang has suggested that we add UDFs for entropy and weighted random sampling and has implementations for each of these ready.

In his words:

"In the real world, there are occasions when we need to calculate the entropy of discrete random variables, for instance, to calculate the mutual information between variables X and Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). I would suggest implementing a UDF to calculate the entropy of given input samples, following the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

This is the reference paper I used to learn about the weighted sampling algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

The present WeightedSample.java implements Algorithm D.

We may try Algorithm A, A-res and A-expJ since they could be used in data stream and distributed environments. These algorithms could be implemented based on ReservoirSample.java (inherit from this class?) since they also need a reservoir to store the selected items."


> UDFs for entropy and weighted sampling algorithms
> -------------------------------------------------
>
>                 Key: DATAFU-2
>                 URL: https://issues.apache.org/jira/browse/DATAFU-2
>             Project: DataFu
>          Issue Type: Task
>            Reporter: Matthew Hayes
>
> Jian Wang has suggested that we add UDFs for entropy and weighted random sampling and has implementations for each of these ready.
> In Jian's words:
> "In the real world, there are occasions when we need to calculate the entropy of discrete random variables, for instance, to calculate the mutual information between variables X and Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). I would suggest implementing a UDF to calculate the entropy of given input samples, following the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
> This is the reference paper I used to learn about the weighted sampling algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
> The present WeightedSample.java implements Algorithm D.
> We may try Algorithm A, A-res and A-expJ since they could be used in data stream and distributed environments. These algorithms could be implemented based on ReservoirSample.java (inherit from this class?) since they also need a reservoir to store the selected items."
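
To make the reservoir idea concrete, here is a rough sketch of how A-res could work as I understand it from the referenced paper: each item with weight w draws a uniform random u in [0, 1) and gets the key u^(1/w), and the reservoir keeps the items with the largest keys. Everything named here (WeightedReservoirSketch, offer, sample) is hypothetical and written as standalone Java for illustration; it does not extend ReservoirSample.java and makes no claim about how the existing WeightedSample.java is written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical sketch of an A-res style weighted reservoir; not DataFu code.
final class WeightedReservoirSketch<T> {

    /** An item paired with its random key. */
    private static final class Keyed<E> {
        final double key;
        final E item;

        Keyed(double key, E item) {
            this.key = key;
            this.item = item;
        }
    }

    private final int capacity;
    private final Random rng;
    // Min-heap on key, so the smallest key in the reservoir is at the head.
    private final PriorityQueue<Keyed<T>> heap;

    WeightedReservoirSketch(int capacity, Random rng) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.rng = rng;
        this.heap = new PriorityQueue<Keyed<T>>(
            capacity, (a, b) -> Double.compare(a.key, b.key));
    }

    /** Offers one streamed item with a strictly positive weight. */
    void offer(T item, double weight) {
        if (!(weight > 0)) {
            throw new IllegalArgumentException("weight must be positive");
        }
        // Key u^(1/w): larger weights push the key towards 1.
        double key = Math.pow(rng.nextDouble(), 1.0 / weight);
        if (heap.size() < capacity) {
            heap.add(new Keyed<>(key, item));
        } else if (key > heap.peek().key) {
            heap.poll();                       // evict the smallest key
            heap.add(new Keyed<>(key, item));
        }
    }

    /** Returns the currently selected items, in no particular order. */
    List<T> sample() {
        List<T> out = new ArrayList<>(heap.size());
        for (Keyed<T> k : heap) {
            out.add(k.item);
        }
        return out;
    }
}
```

Because selection depends only on the per-item keys, reservoirs built on separate partitions could in principle be combined by keeping the largest keys overall, which may be what makes this family attractive for a distributed setting. That would require exposing the keys, which this sketch does not do.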


