Posted to dev@datafu.apache.org by "jian wang (JIRA)" <ji...@apache.org> on 2014/02/16 14:22:19 UTC
[jira] [Commented] (DATAFU-26) Resolve entropy UDF naming conventions

    [ https://issues.apache.org/jira/browse/DATAFU-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902708#comment-13902708 ] 

jian wang commented on DATAFU-26:
---------------------------------

OK, I agree that we should change the naming and update the documentation accordingly.

Entropy is meant to calculate a data set's entropy given per-item counts, so maybe we could rename it to EmpiricalCountEntropy? If we add new count-based entropies in the future, they would follow the same convention, xxCountEntropy.

Btw, the original naming is similar to R's entropy package, which takes item counts as input. That is why I used the Streaming* prefix to distinguish the raw-data-based entropy from the count-based one.

StreamingEntropy could be renamed to Entropy, but that name may not directly reflect its purpose, which is to handle sorted raw data. I have not thought of a better name yet.

EmpiricalEntropy is used to handle sorted raw data, especially in a nested foreach after a group-by, while Entropy is used to handle item counts in a distributed fashion that leverages the combiner. The description of each class includes a short note on its typical usage and a sample showing how to use it to calculate entropy. Maybe we should give more detailed explanations so that users know the typical usage scenario for each kind of entropy.
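
To make the count-based versus raw-data-based distinction concrete, here is a minimal Python sketch. It is not the DataFu implementation: the function names are hypothetical, it uses the natural logarithm, and it covers only the empirical (maximum-likelihood) estimator. The first function takes precomputed per-item counts; the second takes raw items that are already sorted, derives the counts from runs of equal items, and then reuses the first.

```python
import math
from itertools import groupby


def entropy_from_counts(counts):
    """Empirical entropy (in nats) from per-item counts.

    Hypothetical helper for illustration only.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = -sum(p_i * ln(p_i)) with p_i = count_i / total
    return -sum((c / total) * math.log(c / total) for c in counts)


def entropy_from_sorted_items(sorted_items):
    """Same estimate from raw items, assuming equal items are adjacent.

    Because the input is sorted, each run of equal items yields one count,
    so no lookup table of distinct items is needed.
    """
    counts = [sum(1 for _ in run) for _, run in groupby(sorted_items)]
    return entropy_from_counts(counts)
```

For example, entropy_from_sorted_items(sorted(["a", "b", "a", "c", "a", "b"])) and entropy_from_counts([3, 2, 1]) return the same value, since the sorted input reduces to those counts. The count-based form is the one that lends itself to partial aggregation, because it only needs the counts and their total.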



> Resolve entropy UDF naming conventions
> --------------------------------------
>
>                 Key: DATAFU-26
>                 URL: https://issues.apache.org/jira/browse/DATAFU-26
>             Project: DataFu
>          Issue Type: Task
>            Reporter: Matthew Hayes
>            Assignee: jian wang
>             Fix For: 1.3.0
>
>
> There are a couple of issues with the naming of entropy UDFs that we should work out before the next release.
> StreamingEntropy supports multiple estimation methods.  Entropy, however, only supports empirical.  The supported constructors are also different as a result.  Although Entropy's documentation states it computes the empirical entropy, I think the name itself may lead to confusion.
> StreamingEntropy takes the data in sorted order.  Using this sorted data it computes counts, which are then used to compute entropy.  Entropy, on the other hand, takes counts directly and computes entropy.  These counts need to be computed before calling it.  Our convention in DataFu has been that "Streaming" implies that the data does not need to be sorted.  So StreamingEntropy is in conflict with this.
> My proposal is:
> 1) Rename Entropy to EmpiricalEntropy
> 2) Rename StreamingEntropy to Entropy
> 3) Clearly document why you would use EmpiricalEntropy over Entropy.  It will be more efficient in some scenarios and we should explain this.
> One open question I have is whether we should distinguish in the name somehow that EmpiricalEntropy accepts counts, not the actual items themselves.  EmpiricalCountBasedEntropy seems verbose.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)