Posted to dev@mahout.apache.org by "Pat Ferrel (JIRA)" <ji...@apache.org> on 2016/08/04 16:16:20 UTC

[jira] [Comment Edited] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

    [ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ] 

Pat Ferrel edited comment on MAHOUT-1853 at 8/4/16 4:15 PM:
------------------------------------------------------------

To reword this issue...

The CCO analysis code currently applies a single limit on the number of values kept per row to all of the P’X matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. For a user * item input matrix, which becomes an item * item output, a fixed number per row is fine, but the same limit is close to meaningless when the X matrix has only 20 columns. For instance, if X = C (category preferences), there may be only 20 possible categories. With a threshold of 100, and given that users often have enough usage to trigger preference events on all categories (though resulting in small LLR values), the P’C matrix is almost completely full. This removes most of the value in P’C.
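
To make the problem concrete, here is a minimal sketch in plain Python. The names (keep_top_k_per_row, density, p_c) and the toy scores are hypothetical and are not taken from Mahout; the point is only that a per-row cap larger than the number of columns filters nothing.

```python
def keep_top_k_per_row(rows, k):
    """Keep only the k highest-scoring (column, score) pairs in each row."""
    return [sorted(row, key=lambda cs: cs[1], reverse=True)[:k] for row in rows]


def density(rows, n_cols):
    """Fraction of possible cells that still hold a value."""
    return sum(len(row) for row in rows) / (len(rows) * n_cols)


# Hypothetical P'C: 5 items x 20 categories, every cell has a small nonzero score.
n_items, n_categories = 5, 20
p_c = [[(c, 0.01 * (c + 1)) for c in range(n_categories)] for _ in range(n_items)]

# A cap of 100 per row cannot remove anything from a 20-column matrix.
capped_100 = keep_top_k_per_row(p_c, 100)
print(density(capped_100, n_categories))  # 1.0

# A cap chosen for this matrix actually thins it out.
capped_5 = keep_top_k_per_row(p_c, 5)
print(density(capped_5, n_categories))  # 0.25
```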

There are several ways to address this:
1) have a separate number-of-indicators-per-row threshold for every P'X matrix, rather than one for all (the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small.
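
Options 1 and 2 could be sketched roughly as follows. This is an illustration under assumptions, not the Mahout API: DownsampleParams, downsample_row, and the settings in params_by_matrix are made-up names and values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownsampleParams:
    """Hypothetical per-matrix settings; not an existing Mahout class."""
    max_per_row: Optional[int] = None   # option 1: cap on indicators per row
    min_llr: Optional[float] = None     # option 2: fixed LLR threshold


def downsample_row(row, params):
    """row is a list of (column, llr) pairs; returns survivors, strongest first."""
    kept = sorted(row, key=lambda cs: cs[1], reverse=True)
    if params.min_llr is not None:
        kept = [cs for cs in kept if cs[1] >= params.min_llr]
    if params.max_per_row is not None:
        kept = kept[:params.max_per_row]
    return kept


# Assumed values, one entry per cross-occurrence matrix.
params_by_matrix = {
    "P'P": DownsampleParams(max_per_row=100),
    "P'C": DownsampleParams(max_per_row=5, min_llr=2.0),
}

row = [(0, 0.3), (1, 7.5), (2, 2.4), (3, 0.1)]
print(downsample_row(row, params_by_matrix["P'C"]))  # [(1, 7.5), (2, 2.4)]
print(downsample_row(row, params_by_matrix["P'P"]))  # all four pairs, strongest first
```

Keeping both settings optional lets a wide matrix such as P'P keep its row cap while a narrow one such as P'C relies mainly on the score threshold.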

#1 and #2 are easy in the extreme; #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout.
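
One possible reading of "calculated after the fact" is to derive the threshold for #2 from the scores a matrix already contains. The sketch below assumes a simple empirical quantile; the comment above does not define the calculation, so the function name, the keep_fraction parameter, and the sample scores are all hypothetical.

```python
import math


def llr_threshold_from_quantile(scores, keep_fraction):
    """Return the score at or above which roughly keep_fraction of the scores fall."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ordered = sorted(scores, reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ordered)))
    return ordered[n_keep - 1]


# Hypothetical LLR scores collected from every cell of one P'C matrix.
scores = [0.1, 0.2, 0.2, 0.4, 0.5, 0.9, 1.3, 2.4, 6.0, 7.5]

threshold = llr_threshold_from_quantile(scores, 0.2)
print(threshold)  # 6.0
print([s for s in scores if s >= threshold])  # [6.0, 7.5]
```

The resulting number can then be passed in as the fixed per-matrix LLR threshold of #2.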

I've started work on #1 and #2.

[~ssc][~tdunning] I'm especially looking for comments on #3 above, calculating a % confidence of correlation. The function we use for LLR scoring is https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L210
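
For readers without the source at hand, here is a sketch of the standard log-likelihood ratio (G^2) test on a 2x2 contingency table, written in Python. It is not a copy of the linked Scala code. The cell meanings in the docstring (k11 = users who interacted with both items, and so on) are the usual convention and are stated here as an assumption.

```python
import math


def x_log_x(x):
    """x * ln(x), with the limit value 0 at x = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table of counts.

    Assumed meaning: k11 = both A and B, k12 = A without B,
    k21 = B without A, k22 = neither.
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Floating-point rounding can push the sum slightly below zero.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


independent = log_likelihood_ratio(10, 10, 10, 10)    # no association, close to 0
correlated = log_likelihood_ratio(100, 0, 0, 100)     # perfect association, large
print(independent, correlated)
```

A score near zero means the two events co-occur about as often as chance predicts, which is the "small LLR value" case described above for category preferences.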


was (Author: pferrel):
To reword this issue...

The CCO analysis code currently only employs a single # of values per row of the P’? matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. The problem is that for a user * item input matrix, which becomes an item * item output a fixed # per row is fine but the implementation is a bit meaningless when there are only 20 columns of the ? matrix. For instance if ? = C category preferences, there may be only 20 possible categories and with a threshold of 100 and the fact that users often have enough usage to trigger preference events on all categories (though resulting in a small LLR value), the P’C matrix is almost completely full. This reduces any value in P’C.

There are several ways to address:
1) have a # of indicators per row threshold for every matrix, not one for all (the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small.

1 and 2 are easy in the extreme, #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout.

starting work on #1 and #2

> Improvements to CCO (Correlated Cross-Occurrence)
> -------------------------------------------------
>
>                 Key: MAHOUT-1853
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1853
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Andrew Palumbo
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold calculation for LLR downsampling, and possible multiple fixed thresholds for A’A, A’B etc. This is to account for the vast difference in dimensionality between indicator types.


