Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/05/23 02:31:25 UTC

Re: Hoeffding Bound or Additive Chernoff Bound as a Similarity algorithm?

Yup. The Hoeffding D statistic requires two passes over the preference-vector
pair: one to pull the mean of each input vector, and a second to compute the
statistic against those means.
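
Roughly the two-pass shape I mean, with plain Pearson standing in for
concreteness since its two-pass form is the simplest (just a sketch for
illustration, not real Mahout code):

  // Pass 1: compute the mean of each vector.
  // Pass 2: accumulate centered products and form the correlation.
  static double twoPassPearson(double[] x, double[] y) {
    int n = x.length;
    double meanX = 0.0, meanY = 0.0;
    for (int i = 0; i < n; i++) {          // pass 1
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (int i = 0; i < n; i++) {          // pass 2
      double dx = x[i] - meanX;
      double dy = y[i] - meanY;
      sxy += dx * dy;
      sxx += dx * dx;
      syy += dy * dy;
    }
    return sxy / Math.sqrt(sxx * syy);
  }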

I'm using these correlations for a somewhat different purpose (evaluating
sampler outputs) and coded a Correlator on top of commons-math. It seems to
have its own personality but still gives useful results. By which I mean,
each input pair gives similar outputs from Spearman, Pearson and Hoeffding D.
For example, input pair X gives (1.2, 1.1, 1.3) and input pair Y gives
(3.5, 3.3, 3.9).
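
For anyone who wants to try the same comparison: commons-math (2.x as of this
writing) already ships Pearson and Spearman, so only Hoeffding's D has to be
hand-rolled. A minimal sketch, assuming commons-math 2.x and toy data:

  import org.apache.commons.math.stat.correlation.PearsonsCorrelation;
  import org.apache.commons.math.stat.correlation.SpearmansCorrelation;

  public class CorrelationDemo {
    public static void main(String[] args) {
      // toy vectors, just to show the calls
      double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
      double[] y = {2.0, 1.0, 4.0, 3.0, 5.0};
      PearsonsCorrelation pearson = new PearsonsCorrelation();
      SpearmansCorrelation spearman = new SpearmansCorrelation();
      System.out.println("Pearson:  " + pearson.correlation(x, y));
      System.out.println("Spearman: " + spearman.correlation(x, y));
      // Hoeffding's D is not in commons-math; the rank formula appears in the
      // quoted part of the thread below.
    }
  }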

So, it is not interesting for user/item recommendation, but it does
have its own utility in other problems.

On Sun, Apr 24, 2011 at 10:39 AM, Ted Dunning <te...@gmail.com> wrote:
> Given that ratings have only a small effect on recommendations (i.e. you can
> reduce to the binary case without much loss), and given that rank-based
> statistics don't like tied scores, I wouldn't expect much gain from
> Hoeffding's D.
>
> On Sun, Apr 24, 2011 at 1:20 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Yes, this Hoeffding *bound*, and one flavor of the Chernoff bound, are not
>> really similarity metrics. They construct loose bounds on what the true
>> mean of a random variable must be, given some observations.
>>
>> (These bounds have a use, but in another part of the recommender
>> computation. And in those parts, you know the distribution of the random
>> variable, so you can use a tighter bound.)
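
For concreteness, the bound in question (as used on page 2 of the KDD paper
linked further down): if a random variable with range R is observed n times
independently, then with probability 1 - delta its true mean lies within

    epsilon = sqrt( R^2 * ln(1/delta) / (2 * n) )

of the observed mean. It bounds the estimation error of a single mean rather
than comparing two variables, which is why it is not a similarity measure.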
>>
>> The Hoeffding D statistic (which, admittedly, I had to look up again) is what
>> you want. It is spiritually like the Spearman correlation in that it is based
>> on the relative ordering of ratings rather than the rating values. I have
>> never played with it, but it is an interesting possibility. I imagine it will
>> be slow to compute, but it could be a useful addition. You would probably
>> want to refactor SpearmanCorrelationSimilarity a little, since it already
>> does the conversion to ranks, and then subclass it to implement this formula
>> if you're interested.
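
For reference, one standard rank formulation of Hoeffding's D (assuming no
tied ratings and at least 5 co-rated items): with R_i and S_i the ranks of
x_i and y_i, and Q_i = 1 + the number of points j having both x_j < x_i and
y_j < y_i,

    D1 = sum_i (Q_i - 1) * (Q_i - 2)
    D2 = sum_i (R_i - 1) * (R_i - 2) * (S_i - 1) * (S_i - 2)
    D3 = sum_i (R_i - 2) * (S_i - 2) * (Q_i - 1)

    D  = 30 * ( (n-2)*(n-3)*D1 + D2 - 2*(n-2)*D3 )
            / ( n*(n-1)*(n-2)*(n-3)*(n-4) )

This is what a subclass of the rank-converting similarity would have to
compute from the two rank vectors.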
>>
>> Negative correlation just means two series are in fact related, but one goes
>> up when the other goes down and vice versa, rather than trending together,
>> which would be positive correlation.
>>
>> On Sun, Apr 24, 2011 at 4:58 AM, Lance Norskog <go...@gmail.com> wrote:
>>
>> > This paper compares Pearson, Spearman and Hoeffding's D measure as
>> > similarity measures for DNA matching. It claims Hoeffding is the best.
>> > http://www.ncbi.nlm.nih.gov/pubmed/19634197
>> >
>> > Chasing down Hoeffding as a similarity measure, the closest I've come
>> > is the Hoeffding Bound or Additive Chernoff Bound. Page 2, right-hand
>> > column has a description of the algorithm:
>> > http://www.cs.washington.edu/homes/pedrod/papers/kdd00.pdf
>> >
>> > Is this the right base math? Given this formula for acceptable errors,
>> > what would be the algorithm for a similarity measure?
>> >
>> > Also, what does a negative correlation value mean? Should I just look
>> > at the absolute value?
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com