Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2016/08/04 17:10:20 UTC

[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

    [ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408148#comment-15408148 ] 

Ted Dunning commented on MAHOUT-1853:
-------------------------------------

First, I think the root LLR function would be more appropriate. It is signed, so you can avoid keeping indicators that occur less often than expected.
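A rough sketch of what a root LLR score looks like, written from the usual entropy form of the G^2 statistic on a 2x2 contingency table. The function names and the k11/k12/k21/k22 argument layout are assumptions made for illustration here, not a copy of Mahout's code.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: A and B together, k12: A without B,
    # k21: B without A,     k22: neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp small negative values caused by round-off.
    return max(0.0, 2.0 * (row + col - mat))

def root_llr(k11, k12, k21, k22):
    score = math.sqrt(llr(k11, k12, k21, k22))
    # Negative sign when the pair occurs together less often than expected.
    # Cross-multiplied form of k11/(k11+k12) < k21/(k21+k22) avoids dividing by zero.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        score = -score
    return score
```

With a signed score like this, any positive cutoff automatically drops the pairs that cooccur less often than expected, which plain LLR cannot distinguish from the pairs that cooccur more often.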

Regarding the threshold, significance is monotonic in LLR score, so thresholding on either is equivalent. The only question is which value to pick. Picking it from a significance level is not well motivated, because a vast number of repeated and correlated comparisons are in play.
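To illustrate the equivalence, here is a small sketch that assumes the textbook asymptotic approximation (unsigned LLR distributed as chi-squared with one degree of freedom). The function names are made up for this example, and the p-values themselves should not be read as meaningful given the repeated comparisons noted above.

```python
import math

def asymptotic_p_value(llr_score):
    # Upper tail of chi-squared with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)). Assumed approximation, not a calibrated test.
    return math.erfc(math.sqrt(llr_score / 2.0))

def keep_by_llr(scores, llr_cutoff):
    # Keep scores at or above the LLR cutoff.
    return [s for s in scores if s >= llr_cutoff]

def keep_by_p_value(scores, llr_cutoff):
    # Same cutoff expressed as a p-value; smaller p means larger LLR.
    p_cutoff = asymptotic_p_value(llr_cutoff)
    return [s for s in scores if asymptotic_p_value(s) <= p_cutoff]
```

Because the mapping from score to p-value is monotonically decreasing, both filters select the same scores; only the units of the knob differ.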

As such, I would simply use something like t-digest (available in Mahout as part of the OnlineSummarizer if that has survived, otherwise available as a simple dependency) to aggregate the scores you get in these cases and pick, say, the top 1-10%. The knob should be turned based on how sparse you want the indicators to be on average. If you have the distribution of all the scores available, then picking the cutoff is trivial.
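A minimal sketch of the cutoff selection. To stay self-contained it sorts the exact scores instead of using t-digest, so it only stands in for the streaming sketch on data that fits in memory. The function names and the `fraction` parameter are hypothetical.

```python
import math

def cutoff_for_top_fraction(scores, fraction):
    # Exact stand-in for a t-digest quantile query: returns the smallest
    # score that still belongs to the top `fraction` of all scores.
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    if not ranked:
        raise ValueError("no scores to summarize")
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[keep - 1]

def sparsify(scored_pairs, fraction):
    # scored_pairs: iterable of (pair, score); keeps the top fraction by score.
    scored_pairs = list(scored_pairs)
    cutoff = cutoff_for_top_fraction([s for _, s in scored_pairs], fraction)
    return [(p, s) for p, s in scored_pairs if s >= cutoff]
```

Turning `fraction` down makes the indicators sparser; with a streaming quantile sketch the same cutoff would come from a single quantile query instead of a sort.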

Note that this isn't really n^2. Instead, it is k n = O(n), where k is the number of categories. This differs from text or general viewing behaviors, where the vocabulary is unbounded and grows with n. The computation of the indicators is therefore only O(k n) for the counting and O(k^2) for the cooccurrence counting. If k_max is the interaction cut in some other behavior that has unbounded size, then the cost is O(k k_max n) for counting and scoring. Both are scalable because k is finite and the interaction cut imposes an artificial limit.
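The bound can be seen in a toy counting loop. This is only a sketch: the input layout (one `(categories, items)` tuple per user) is an assumption, and the cut here simply keeps the first k_max items, whereas a real interaction cut would more likely downsample.

```python
from collections import Counter
from itertools import islice

def cross_occurrence_counts(users, k_max):
    # users: iterable of (categories, items) per user.
    # Each user has at most k categories, and items are cut to k_max,
    # so the inner loops do at most k * k_max work per user.
    counts = Counter()
    for categories, items in users:
        kept = list(islice(items, k_max))  # simplified interaction cut
        for category in set(categories):
            for item in kept:
                counts[(category, item)] += 1
    return counts
```

With n users the total work is bounded by k * k_max * n increments, which is linear in n for fixed k and k_max.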




> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold calculation for LLR downsampling, and possible multiple fixed thresholds for A’A, A’B etc. This is to account for the vast difference in dimensionality between indicator types.


