Posted to user@mahout.apache.org by Nikaash Puri <ni...@gmail.com> on 2015/12/15 05:04:08 UTC

root LLR support in org.apache.mahout.math.cf.SimilarityAnalysis

Hi,

Just wondering whether there is support for using root Log Likelihood Ratio
via some sort of flag in the cooccurrencesIDSs function
in org.apache.mahout.math.cf.SimilarityAnalysis. If not, I can create an
issue and work on adding said support.

Thank you,
Nikaash Puri

Re: root LLR support in org.apache.mahout.math.cf.SimilarityAnalysis

Posted by Pat Ferrel <pa...@occamsmachete.com>.
No. If you want to work on that, feel free; it should be pretty easy to add that option. However, be aware that LLR is used in the downsampling step, so you don’t get all elements of llr(A’A). To keep the calculation at O(n), downsampling is based on the number of non-zero elements in a row of both A and A’A, keeping the highest LLR-scoring elements. These are params that you can control in the current implementation.
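To make the scoring and per-row downsampling concrete, here is a minimal, self-contained sketch of LLR and signed root LLR on a 2x2 contingency table, plus a helper that keeps the k highest-scoring entries of a row. The class and method names (RootLlrSketch, llr, rootLlr, topK) are hypothetical and this is not the SimilarityAnalysis code; it only illustrates the standard entropy-based formula under the usual count layout (k11 = both events, k12 and k21 = one without the other, k22 = neither).

```java
import java.util.Arrays;

final class RootLlrSketch {
    // x * ln(x), with the convention 0 * ln(0) = 0
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio (G^2) for a 2x2 contingency table
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by round-off
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    // Signed square root of LLR: negative when the events co-occur
    // less often than independence would predict
    static double rootLlr(long k11, long k12, long k21, long k22) {
        double root = Math.sqrt(llr(k11, k12, k21, k22));
        boolean below = (double) k11 / (k11 + k12) < (double) k21 / (k21 + k22);
        return below ? -root : root;
    }

    // Indices of the k largest scores in one row (illustrative downsampling)
    static int[] topK(double[] scores, int k) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
        int n = Math.min(k, idx.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = idx[i];
        }
        return out;
    }
}
```

One practical difference to keep in mind: plain LLR is large for both strongly correlated and strongly anti-correlated pairs, while signed root LLR pushes anti-correlated pairs to the bottom of the ranking, so a top-k cut on root LLR may keep a different set of elements than a top-k cut on LLR.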

For some types of analysis, where you would like A’A downsampled based on a purely probabilistic metric like confidence in non-correlation, it might be nice to have a threshold-based downsampler where the threshold is some fraction of all elements or some confidence value, rather than a fixed value of LLR (a fixed-value threshold is trivial to add but not very useful). This requires that we find a way to calculate the distribution parameters of LLR in A’A so a confidence threshold can be derived. I haven’t put a lot of thought into this, but iirc LLR is asymptotically Chi-square with 1 degree of freedom for a 2x2 contingency table (going from old brain cells here) and signed root LLR is approximately standard normal. If there is some clever way to find the threshold without calculating all of rllr(A’A), which would be O(n^2), then the confidence-threshold downsampling could be kept O(n), and this would be a very useful contribution.
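As a rough illustration of the confidence idea, the sketch below maps a one-sided confidence level to a root LLR cutoff, assuming (and this is only an assumption, per the recollection above) that root LLR is approximately standard normal when two items are independent. The class and method names are hypothetical, the error function uses the Abramowitz and Stegun 7.1.26 approximation because the Java standard library has no erf, and none of this addresses the harder question of estimating the actual score distribution in A’A in O(n).

```java
final class ConfidenceThresholdSketch {
    // Abramowitz-Stegun 7.1.26 approximation of erf (abs error ~1.5e-7)
    static double erf(double x) {
        double sign = x < 0 ? -1.0 : 1.0;
        double ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        double poly = ((((1.061405429 * t - 1.453152027) * t
                + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
        return sign * (1.0 - poly * Math.exp(-ax * ax));
    }

    // Standard normal cumulative distribution function
    static double normalCdf(double z) {
        return 0.5 * (1.0 + erf(z / Math.sqrt(2.0)));
    }

    // Smallest root LLR value z with P(Z <= z) >= confidence,
    // found by bisection; ASSUMES root LLR ~ N(0, 1) under independence
    static double rootLlrThreshold(double confidence) {
        if (!(confidence > 0.0 && confidence < 1.0)) {
            throw new IllegalArgumentException("confidence must be in (0, 1)");
        }
        double lo = -10.0;
        double hi = 10.0;
        for (int i = 0; i < 100; i++) {
            double mid = 0.5 * (lo + hi);
            if (normalCdf(mid) < confidence) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return 0.5 * (lo + hi);
    }
}
```

Under that normality assumption, rootLlrThreshold(0.95) comes out near the familiar one-sided value of about 1.645, and squaring it gives the corresponding cutoff on plain LLR. Because every pair of items is effectively a separate test, a real downsampler would probably want a much stricter confidence level than 0.95.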

