Posted to user@mahout.apache.org by Ian Upright <ia...@upright.net> on 2011/07/13 23:09:38 UTC

similarity metrics?

Hello,

I'm looking for more similarity metrics, such as Hellinger distance.
Wouldn't they be implemented as a subclass of DistributedVectorSimilarity?
Does anyone have more implementations?

Thanks, Ian

Re: similarity metrics?

Posted by Ted Dunning <te...@gmail.com>.
If you need this distance, please go for it!

The procedure for publishing the results (or the first attempts) is to file
a JIRA (see issues.apache.org/jira/browse/MAHOUT) and attach patches to the
JIRA for review or comment.

On Wed, Jul 13, 2011 at 2:55 PM, Ian Upright <ia...@upright.net> wrote:

> I found this:
>
> http://www.utdallas.edu/~herve/Abdi-Distance2007-pretty.pdf
>
> Which seems to explain it pretty simply.  Seems like these measures should
> be fairly easy to implement.  I could take a stab at it and publish the
> results.
>
> Ian
>
> >What's in the project now is all I know about. Yes if you want to use it
> >with the Hadoop-based similarity calculator, that's what you would extend.
> >
> >How do you apply this metric to vectors?
> >
> >On Wed, Jul 13, 2011 at 10:09 PM, Ian Upright <ian-public@upright.net
> >wrote:
> >
> >> Hello,
> >>
> >> I'm looking for more similarity metrics, such as Hellinger distance.
> >> Wouldn't they be implemented as a subclass of
> DistributedVectorSimilarity?
> >> Does anyone have more implementations?
> >>
> >> Thanks, Ian
> >>
>

Re: similarity metrics?

Posted by Ian Upright <ia...@upright.net>.
I found this:

http://www.utdallas.edu/~herve/Abdi-Distance2007-pretty.pdf

It seems to explain it pretty simply. These measures look fairly easy to
implement, so I could take a stab at it and publish the results.

Ian

>What's in the project now is all I know about. Yes if you want to use it
>with the Hadoop-based similarity calculator, that's what you would extend.
>
>How do you apply this metric to vectors?
>
>On Wed, Jul 13, 2011 at 10:09 PM, Ian Upright <ia...@upright.net>wrote:
>
>> Hello,
>>
>> I'm looking for more similarity metrics, such as Hellinger distance.
>> Wouldn't they be implemented as a subclass of DistributedVectorSimilarity?
>> Does anyone have more implementations?
>>
>> Thanks, Ian
>>

Re: similarity metrics?

Posted by Sean Owen <sr...@gmail.com>.
Yes, that's it, according to the reference -- or rather, I suppose, construe
the vector as encoding a discrete distribution. Element i has probability
proportional to the value at i. (It can't have negative values, of course.)

It would seem to be what you write below, but the square root of the sum of
squares of those differences. (Not that I've ever encountered it before.) Is
L0.5 the Minkowski 0.5 distance? This looks more like the old Euclidean
distance in that sense.

I wonder out loud what effect taking differences of square roots, rather than
of the vector element values themselves, has intuitively... since it seems so
close to Euclidean distance.

Yes, in any event this is easy to implement.
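
For reference, the standard form (per the Wikipedia article cited below) squares
the differences and takes a square root of the sum, with a 1/\sqrt{2} factor:

  H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum_i (\sqrt{p_i} - \sqrt{q_i})^2 }

so it is exactly the Euclidean distance between the element-wise square roots of
the two distributions, scaled by 1/\sqrt{2}, and it always lies in [0, 1].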


On Wed, Jul 13, 2011 at 10:53 PM, Ted Dunning <te...@gmail.com> wrote:

> You would have to encode the distributions as vectors.
>
> For discrete distributions, I think that this is relatively trivial since
> you could interpret each vector entry as the probability for an element i
> of
> the domain of the distribution.  I think that would result in the Hellinger
> distance [1] being defined as:
>
>  HD(P, Q) = \sum_i (\sqrt(p_i) - \sqrt(q_i) )
>
> This makes it look a lot like L_0.5 which we already have.  Perhaps the
> original poster can clarify if this is what they want?
>
> [1] http://en.wikipedia.org/wiki/Hellinger_distance
>
>
>
> On Wed, Jul 13, 2011 at 2:14 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > How do you apply this metric to vectors?
> >
>

Re: similarity metrics?

Posted by Ted Dunning <te...@gmail.com>.
You would have to encode the distributions as vectors.

For discrete distributions, I think that this is relatively trivial since
you could interpret each vector entry as the probability for an element i of
the domain of the distribution.  I think that would result in the Hellinger
distance [1] being defined as:

  HD(P, Q) = \sum_i (\sqrt(p_i) - \sqrt(q_i) )

This makes it look a lot like L_0.5 which we already have.  Perhaps the
original poster can clarify if this is what they want?

[1] http://en.wikipedia.org/wiki/Hellinger_distance
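
For concreteness, here is a minimal standalone sketch of that computation in
Java, using the squared-differences form of the Wikipedia definition. It works
on plain double arrays rather than any particular Mahout class, and the class
and method names are purely illustrative:

  /** Standalone sketch: Hellinger distance between two discrete distributions. */
  public class HellingerDistanceSketch {

    /**
     * Computes the Hellinger distance between two discrete distributions,
     * given as equal-length arrays of non-negative values summing to 1.
     * The result always lies in [0, 1].
     */
    public static double hellingerDistance(double[] p, double[] q) {
      if (p.length != q.length) {
        throw new IllegalArgumentException("Distributions must have the same length");
      }
      double sumOfSquares = 0.0;
      for (int i = 0; i < p.length; i++) {
        double diff = Math.sqrt(p[i]) - Math.sqrt(q[i]);
        sumOfSquares += diff * diff;
      }
      // 1/sqrt(2) * sqrt(sum of squared differences of square roots)
      return Math.sqrt(sumOfSquares / 2.0);
    }

    public static void main(String[] args) {
      double[] p = {0.5, 0.3, 0.2};
      double[] q = {0.1, 0.4, 0.5};
      System.out.println(hellingerDistance(p, q));
    }
  }

Since the distributed similarity job wants a similarity rather than a distance,
and this distance is bounded by [0, 1], one simple option would be to return
1 - hellingerDistance(p, q) from whatever similarity implementation wraps it.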



On Wed, Jul 13, 2011 at 2:14 PM, Sean Owen <sr...@gmail.com> wrote:

> How do you apply this metric to vectors?
>

Re: similarity metrics?

Posted by Sean Owen <sr...@gmail.com>.
What's in the project now is all I know about. Yes, if you want to use it
with the Hadoop-based similarity calculator, that's what you would extend.

How do you apply this metric to vectors?

On Wed, Jul 13, 2011 at 10:09 PM, Ian Upright <ia...@upright.net> wrote:

> Hello,
>
> I'm looking for more similarity metrics, such as Hellinger distance.
> Wouldn't they be implemented as a subclass of DistributedVectorSimilarity?
> Does anyone have more implementations?
>
> Thanks, Ian
>
