Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2017/03/08 16:18:48 UTC

LLR thresholds

The CCO algorithm now supports a couple of ways to limit indicators by “quality”. The new way is by the value of LLR. We built a t-digest mechanism to look at the overall density produced with different thresholds. The higher the threshold, the lower the number of indicators and the lower the density of the resulting indicator matrix, but also the higher the MAP score (of the full recommender). So MAP seems to increase monotonically until it breaks down.
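
As a rough stand-in for the t-digest mechanism, here is a tiny self-contained sketch of the density check itself: given the LLR scores of all candidate indicator pairs, it reports what fraction would survive each candidate threshold. Names and structure are mine, purely for illustration.

// scores: LLR values for every candidate indicator pair in the model
def survivingDensity(scores: Seq[Double], thresholds: Seq[Double]): Seq[(Double, Double)] =
  thresholds.map { t =>
    val kept = scores.count(_ >= t)
    (t, kept.toDouble / scores.size) // fraction of indicators kept at this threshold
  }

// e.g. survivingDensity(allScores, Seq(1.0, 2.0, 5.0, 10.0, 20.0, 50.0))
// shows how density falls as the threshold rises.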

This didn’t match my understanding of LLR, which is actually a test for non-correlation. I was expecting high scores to mean a high likelihood of non-correlation. So the formulation in the code must be reversing that, so that the higher the score, the higher the likelihood that non-correlation is *false* (i.e. the score is treated as evidence of correlation).
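
For reference, here is a minimal self-contained sketch of the G^2 (LLR) statistic on a 2x2 contingency table in the raw-count entropy form; the object and method names are mine and this is a sketch of the math, not a claim about the exact code in Mahout. The key point is the direction: a higher score means more evidence against independence, which is why raising the threshold keeps only the stronger indicators.

// Sketch of the log-likelihood ratio (G^2) statistic for a 2x2 table of counts.
// k11 = A and B co-occur, k12 = A without B, k21 = B without A, k22 = neither.
object LlrSketch {
  private def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // "Raw" (unnormalized) entropy of a list of counts: sum*ln(sum) - sum of x*ln(x)
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // Higher value = more evidence that the two events are NOT independent.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val colEntropy    = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy)) // clamp round-off below zero
  }
}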

The next observation is that with high thresholds we get higher MAP scores from the recommender (expected), but this increases monotonically until it breaks down because there are so few indicators left. This leads us to the conclusion that MAP is not a good way to set the threshold. We tried looking at precision (MAP) vs recall (number of people who get recs) and this gave ambiguous results with the data we had.

Given my questions about how LLR is actually formulated in Mahout, I’m unsure how to convert it into something like a confidence score or some other measure that would give us a principled way to choose a threshold. Any ideas or illumination about how it’s being calculated or how to judge the threshold?
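
One partial answer to my own question, offered only as a sketch: under the null hypothesis of independence the LLR statistic for a 2x2 table is asymptotically chi-squared with one degree of freedom, so the chi-squared CDF converts a raw score into something that reads like a confidence (1 - p-value), and the inverse CDF converts a desired confidence into a threshold. Whether that is actually a good way to pick the threshold is exactly what I'm unsure about; commons-math is used here purely for illustration and the names are mine.

import org.apache.commons.math3.distribution.ChiSquaredDistribution

// Map an LLR score to an approximate "confidence" that the pair is NOT independent,
// and map a desired confidence back to an LLR threshold. Asymptotic only; very
// small counts will violate the chi-squared approximation.
object LlrConfidence {
  private val chiSq1 = new ChiSquaredDistribution(1.0) // 1 degree of freedom for a 2x2 table

  def confidence(llrScore: Double): Double = chiSq1.cumulativeProbability(llrScore)

  def thresholdFor(confidenceLevel: Double): Double =
    chiSq1.inverseCumulativeProbability(confidenceLevel)
}

For what it's worth, this would put a 95% confidence at an LLR threshold of about 3.84 and 99.9% at about 10.8, which at least gives the threshold units people already have some intuition for.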



Long description of motivation:

LLR thresholds are needed when comparing conversion events to things that have very small dimensionality, where maxIndicatorsPerItem does not work well. For example, location by state, where there are only 50 values: maxIndicatorsPerItem defaults to 50, so you may end up with 50 very weak indicators. If there are strong indicators in the data, thresholds should be the way to find them. This might leave only a few per item if the data supports it, and those should then be useful. The question above is how to choose a threshold.
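
To make the contrast with maxIndicatorsPerItem concrete, here is a toy sketch of the two filtering strategies (my own names, not the actual CCO code): top-k keeps a fixed number of indicators no matter how weak they are, while a threshold keeps only what the data supports, possibly nothing.

// candidates: indicator id -> LLR score for a single item
def topK(candidates: Map[String, Double], maxIndicatorsPerItem: Int): Map[String, Double] =
  candidates.toSeq.sortBy(-_._2).take(maxIndicatorsPerItem).toMap

def byThreshold(candidates: Map[String, Double], minLlr: Double): Map[String, Double] =
  candidates.filter { case (_, score) => score >= minLlr }

// With 50 states and maxIndicatorsPerItem = 50, topK keeps all 50, weak or not;
// byThreshold might keep two or three strong ones, or none at all.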

Re: LLR thresholds

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Is this a case where LLR is not the best method for testing correlation? LLR is really nice when the input matrix has large dimensionality and is very sparse, but as it becomes small (1M x 2) and dense, where we have 1M users and 2 genders, one of which is known for everyone, might there be better correlation tests that would not be onerous given the small dimensionality and denser data?
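
For what it's worth, the obvious textbook alternative for a dense two-column case like gender is Pearson's chi-squared (or Fisher's exact test) on the same 2x2 table, but with expected cell counts this large the two statistics usually point to the same conclusions, so the choice of test may matter less than how we turn the score into a cutoff. A self-contained sketch for comparison, again my own code rather than anything in Mahout:

// Pearson chi-squared statistic for a 2x2 table of counts; with large, dense data
// it generally tells the same story as the LLR (G^2) statistic.
def chiSquared(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val n  = (k11 + k12 + k21 + k22).toDouble
  val r1 = k11 + k12; val r2 = k21 + k22 // row totals
  val c1 = k11 + k21; val c2 = k12 + k22 // column totals
  def term(obs: Long, row: Long, col: Long): Double = {
    val expected = row * col / n // expected count under independence
    val d = obs - expected
    d * d / expected
  }
  term(k11, r1, c1) + term(k12, r1, c2) + term(k21, r2, c1) + term(k22, r2, c2)
}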


Re: LLR thresholds

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Couldn’t agree more, and I was arguing the same thing in my original post.

To illustrate the issue I’m thinking about, let’s use an extreme ecom case where we have conversions and user gender as indicators. The question is: does user gender correlate with purchasing a certain item? In some data there would be a strong correlation for certain items. If we set maxIndicatorsPerItem to 1 (one or the other gender) we will always get a highest-scoring gender, since we know everyone’s gender and one or the other always cross-occurs with a conversion. But the scores themselves may be quite low for many items, or the difference between M and F may be minimal. This is an extreme case, but some form of it comes up often with secondary indicators, since they can have very high or very low dimensionality and density. In this example we have 50% density and dimensionality of users x 2. Other indicators will have drastically different characteristics.

Therefore I conclude that maxIndicatorsPerItem is not sufficient for dealing with certain data, and yet I know the data is worthwhile. Just looking for ideas for how to mine it.
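
To put rough numbers on the gender example (my own back-of-the-envelope arithmetic, purely illustrative): take 1M users, 500k of each gender, and two items each bought by 100 people. For an item with a 60 women / 40 men split the 2x2 table is (60, 499,940 / 40, 499,960) and the LLR comes out around 4; for a 95 / 5 split it comes out around 99. With maxIndicatorsPerItem = 1 both items keep a "best" gender indicator, but any threshold in between (say 10 or 15) keeps the strongly gendered item's indicator and drops the weak one, which is the behavior we actually want here.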


Re: LLR thresholds

Posted by Ted Dunning <te...@gmail.com>.
MAP is dangerous, as are all off-line comparisons.

The problem is that it tends to over-emphasize precision over recall and it tends to emphasize replicating what has been seen before.

Increasing the threshold increases precision and decreases recall. But MAP mostly only cares about the top hit. In practice, you want lots of good hits in the results page.


