
Posted to dev@mahout.apache.org by Daniel McEnnis <dm...@gmail.com> on 2011/03/29 23:16:14 UTC

new distance metric

Dear,

Here is a patch adding a new distance metric, CityBlockDistance, to the
collaborative filtering modules.  With the 0 - 1 binary split on
preference, KLDistance, AHDistance, and Symmetric KLDistance don't
make sense.

Daniel McEnnis.

Re: new distance metric

Posted by Sean Owen <sr...@gmail.com>.

He's referring to implementations to plug into the Hadoop-based all
item-item similarity job, in org.apache.mahout.cf.taste.hadoop.item.
It's in the patch in JIRA.

On Wed, Mar 30, 2011 at 10:47 PM, Lance Norskog <go...@gmail.com> wrote:
> "a "distributed" implementation of this new metric"
> What would this do?
>

Re: new distance metric

Posted by Lance Norskog <go...@gmail.com>.

"a "distributed" implementation of this new metric"
What would this do?

On Wed, Mar 30, 2011 at 7:55 AM, Daniel McEnnis <dm...@gmail.com> wrote:
> Sebastian,
>
> It will be in the next patch.  Thanks for the heads up.
>
> Daniel.
>
> On Wed, Mar 30, 2011 at 1:35 AM, Sebastian Schelter <ss...@apache.org> wrote:
>> Hi Daniel,
>>
>> We would also need a "distributed" implementation of this new metric. Could
>> you do that too?
>>
>> Shouldn't be too hard, just have a look at the other implementations in
>> org.apache.mahout.math.hadoop.similarity.vector.
>>
>> --sebastian
>>
>>
>> On 30.03.2011 00:40, Sean Owen wrote:
>>>
>>> Great, the best place for this would be a JIRA issue:
>>> https://issues.apache.org/jira/browse/MAHOUT
>>> I think it needs a bit of style work. For example, it ought to be very
>>> much like TanimotoCoefficientSimilarity. If you copied that and edited
>>> a few key methods, you'd be a lot closer I think.
>>> I guess I find the core computation a little quirky:
>>>
>>>             double distance = preferring1+preferring2 - 2*intersection;
>>>            if(distance<  1.0){
>>>                distance=1.0-distance;
>>>            }else{
>>>                distance = -1.0 + 1.0 / distance;
>>>            }
>>>
>>> distance is an int, so I think it's
>>>
>>>             int distance = preferring1+preferring2 - 2*intersection;
>>>            if(distance == 0){
>>>                distance=1;
>>>            }else{
>>>                distance = -1.0 + 1.0 / distance;
>>>            }
>>>
>>> The resulting values are a little odd then -- it can return values in
>>> [-1,0], or 1.
>>>
>>> By default I'd go with something more like "1.0 / (1.0 + distance)" I
>>> suppose, though that's not somehow the one right way to map a distance
>>> to a similarity -- though it would be consistent with
>>> EuclideanDistanceSimilarity.
>>>
>>>
>>> I'd actually welcome you to expand this idea and not just make a
>>> "boolean pref" version of this but one that computes an actual
>>> city-block distance for prefs with ratings too, for completeness.
>>>
>>>
>>> I know this as "Manhattan distance". Is that an Americanism or is that
>>> actually the more common name to anyone?
>>>
>>>
>>>
>>> On Tue, Mar 29, 2011 at 10:16 PM, Daniel McEnnis<dm...@gmail.com>
>>>  wrote:
>>>>
>>>> Dear,
>>>>
>>>> Here is a patch of a new distance metric for the collaborative
>>>> filtering modules - CityBlockDistance.  With the 0 - 1 binary split on
>>>> preference. KLDistance, AHDistance, and Symmetric KLDistance don't
>>>> make sense.
>>>>
>>>> Daniel McEnnis.
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: new distance metric

Posted by Daniel McEnnis <dm...@gmail.com>.

Sebastian,

It will be in the next patch.  Thanks for the heads up.

Daniel.

On Wed, Mar 30, 2011 at 1:35 AM, Sebastian Schelter <ss...@apache.org> wrote:
> Hi Daniel,
>
> We would also need a "distributed" implementation of this new metric. Could
> you do that too?
>
> Shouldn't be too hard, just have a look at the other implementations in
> org.apache.mahout.math.hadoop.similarity.vector.
>
> --sebastian
>
>
> On 30.03.2011 00:40, Sean Owen wrote:
>>
>> Great, the best place for this would be a JIRA issue:
>> https://issues.apache.org/jira/browse/MAHOUT
>> I think it needs a bit of style work. For example, it ought to be very
>> much like TanimotoCoefficientSimilarity. If you copied that and edited
>> a few key methods, you'd be a lot closer I think.
>> I guess I find the core computation a little quirky:
>>
>>             double distance = preferring1+preferring2 - 2*intersection;
>>            if(distance<  1.0){
>>                distance=1.0-distance;
>>            }else{
>>                distance = -1.0 + 1.0 / distance;
>>            }
>>
>> distance is an int, so I think it's
>>
>>             int distance = preferring1+preferring2 - 2*intersection;
>>            if(distance == 0){
>>                distance=1;
>>            }else{
>>                distance = -1.0 + 1.0 / distance;
>>            }
>>
>> The resulting values are a little odd then -- it can return values in
>> [-1,0], or 1.
>>
>> By default I'd go with something more like "1.0 / (1.0 + distance)" I
>> suppose, though that's not somehow the one right way to map a distance
>> to a similarity -- though it would be consistent with
>> EuclideanDistanceSimilarity.
>>
>>
>> I'd actually welcome you to expand this idea and not just make a
>> "boolean pref" version of this but one that computes an actual
>> city-block distance for prefs with ratings too, for completeness.
>>
>>
>> I know this as "Manhattan distance". Is that an Americanism or is that
>> actually the more common name to anyone?
>>
>>
>>
>> On Tue, Mar 29, 2011 at 10:16 PM, Daniel McEnnis<dm...@gmail.com>
>>  wrote:
>>>
>>> Dear,
>>>
>>> Here is a patch of a new distance metric for the collaborative
>>> filtering modules - CityBlockDistance.  With the 0 - 1 binary split on
>>> preference. KLDistance, AHDistance, and Symmetric KLDistance don't
>>> make sense.
>>>
>>> Daniel McEnnis.
>
>

Re: new distance metric

Posted by Sebastian Schelter <ss...@apache.org>.

Hi Daniel,

We would also need a "distributed" implementation of this new metric. 
Could you do that too?

Shouldn't be too hard, just have a look at the other implementations in 
org.apache.mahout.math.hadoop.similarity.vector.

--sebastian


On 30.03.2011 00:40, Sean Owen wrote:
> Great, the best place for this would be a JIRA issue:
> https://issues.apache.org/jira/browse/MAHOUT
> I think it needs a bit of style work. For example, it ought to be very
> much like TanimotoCoefficientSimilarity. If you copied that and edited
> a few key methods, you'd be a lot closer I think.
> I guess I find the core computation a little quirky:
>
>              double distance = preferring1+preferring2 - 2*intersection;
> 	    if(distance<  1.0){
> 	    	distance=1.0-distance;
> 	    }else{
> 	    	distance = -1.0 + 1.0 / distance;
> 	    }
>
> distance is an int, so I think it's
>
>              int distance = preferring1+preferring2 - 2*intersection;
> 	    if(distance == 0){
> 	    	distance=1;
> 	    }else{
> 	    	distance = -1.0 + 1.0 / distance;
> 	    }
>
> The resulting values are a little odd then -- it can return values in
> [-1,0], or 1.
>
> By default I'd go with something more like "1.0 / (1.0 + distance)" I
> suppose, though that's not somehow the one right way to map a distance
> to a similarity -- though it would be consistent with
> EuclideanDistanceSimilarity.
>
>
> I'd actually welcome you to expand this idea and not just make a
> "boolean pref" version of this but one that computes an actual
> city-block distance for prefs with ratings too, for completeness.
>
>
> I know this as "Manhattan distance". Is that an Americanism or is that
> actually the more common name to anyone?
>
>
>
> On Tue, Mar 29, 2011 at 10:16 PM, Daniel McEnnis<dm...@gmail.com>  wrote:
>> Dear,
>>
>> Here is a patch of a new distance metric for the collaborative
>> filtering modules - CityBlockDistance.  With the 0 - 1 binary split on
>> preference. KLDistance, AHDistance, and Symmetric KLDistance don't
>> make sense.
>>
>> Daniel McEnnis.

Re: new distance metric

Posted by Ted Dunning <te...@gmail.com>.

http://en.wikipedia.org/wiki/Taxicab_geometry

On Tue, Mar 29, 2011 at 4:10 PM, Lance Norskog <go...@gmail.com> wrote:

> Dennis, is there a cite somewhere explaining this algorithm?
>
> On Tue, Mar 29, 2011 at 3:55 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > City block and Manhattan and L_1 metric are the names that I know for it.
> >
> > On Tue, Mar 29, 2011 at 3:40 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> I know this as "Manhattan distance". Is that an Americanism or is that
> >> actually the more common name to anyone?
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: new distance metric

Posted by Lance Norskog <go...@gmail.com>.

Dennis, is there a cite somewhere explaining this algorithm?

On Tue, Mar 29, 2011 at 3:55 PM, Ted Dunning <te...@gmail.com> wrote:
> City block and Manhattan and L_1 metric are the names that I know for it.
>
> On Tue, Mar 29, 2011 at 3:40 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I know this as "Manhattan distance". Is that an Americanism or is that
>> actually the more common name to anyone?
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: new distance metric

Posted by Ted Dunning <te...@gmail.com>.

City block and Manhattan and L_1 metric are the names that I know for it.
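
For reference, a short statement of the L_1 distance and how it reduces to the set-based expression used in the patch discussed in this thread (preferring1 + preferring2 - 2*intersection). The symbols x, y, A and B are notation introduced here for illustration, where A and B are the sets of items each of the two users prefers:

```latex
d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|
\qquad\text{and, for } x, y \in \{0,1\}^n:\qquad
d_1(x, y) = |A \setminus B| + |B \setminus A| = |A| + |B| - 2\,|A \cap B|
```

For boolean preferences each term |x_i - y_i| is 1 exactly when one user prefers item i and the other does not, so the sum counts the symmetric difference of the two sets.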

On Tue, Mar 29, 2011 at 3:40 PM, Sean Owen <sr...@gmail.com> wrote:

> I know this as "Manhattan distance". Is that an Americanism or is that
> actually the more common name to anyone?
>

Re: new distance metric

Posted by Sean Owen <sr...@gmail.com>.

Great, the best place for this would be a JIRA issue:
https://issues.apache.org/jira/browse/MAHOUT
I think it needs a bit of style work. For example, it ought to be very
much like TanimotoCoefficientSimilarity. If you copied that and edited
a few key methods, you'd be a lot closer I think.
I guess I find the core computation a little quirky:

    double distance = preferring1 + preferring2 - 2 * intersection;
    if (distance < 1.0) {
        distance = 1.0 - distance;
    } else {
        distance = -1.0 + 1.0 / distance;
    }

distance is an int, so I think it's

    int distance = preferring1 + preferring2 - 2 * intersection;
    double similarity;  // separate double: the result is fractional
    if (distance == 0) {
        similarity = 1.0;
    } else {
        similarity = -1.0 + 1.0 / distance;
    }

The resulting values are a little odd then -- it can return values in
(-1,0], or 1.

By default I'd go with something more like "1.0 / (1.0 + distance)", I
suppose. That's not the one right way to map a distance to a
similarity, but it would be consistent with
EuclideanDistanceSimilarity.
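
A minimal sketch of that mapping for boolean preferences, assuming each user's preferences are available as a set of item IDs. The class and method names (CityBlockSketch, cityBlockDistance, similarity) are hypothetical and are not Mahout API; this only illustrates the arithmetic, not how the patch or TanimotoCoefficientSimilarity is structured:

```java
import java.util.HashSet;
import java.util.Set;

class CityBlockSketch {

    // Hypothetical helper: city-block distance between two boolean preference
    // sets, i.e. the number of items preferred by exactly one of the two users.
    static int cityBlockDistance(Set<Long> prefs1, Set<Long> prefs2) {
        Set<Long> intersection = new HashSet<>(prefs1);
        intersection.retainAll(prefs2);
        return prefs1.size() + prefs2.size() - 2 * intersection.size();
    }

    // Maps a distance in [0, infinity) to a similarity in (0, 1]:
    // identical sets give 1.0, and the value shrinks toward 0 as distance grows.
    static double similarity(int distance) {
        return 1.0 / (1.0 + distance);
    }
}
```

Unlike the original computation, this never returns a negative value and has no special case for a distance of zero.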

I'd actually welcome you to expand this idea and not just make a
"boolean pref" version of this but one that computes an actual
city-block distance for prefs with ratings too, for completeness.
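
One way the rated variant could look, as a rough sketch only. It assumes ratings are held in a map from item ID to rating and that only items rated by both users contribute to the distance; both choices, and the names RatedCityBlockSketch, cityBlockDistance and similarity, are assumptions made here for illustration rather than anything from the patch or from Mahout:

```java
import java.util.Map;

class RatedCityBlockSketch {

    // Hypothetical helper: sums |r1 - r2| over the items both users have rated.
    // Items rated by only one user are ignored, which is an assumption.
    static double cityBlockDistance(Map<Long, Double> ratings1, Map<Long, Double> ratings2) {
        double distance = 0.0;
        for (Map.Entry<Long, Double> entry : ratings1.entrySet()) {
            Double other = ratings2.get(entry.getKey());
            if (other != null) {
                distance += Math.abs(entry.getValue() - other);
            }
        }
        return distance;
    }

    // Same distance-to-similarity mapping as suggested above: 1 / (1 + d).
    static double similarity(double distance) {
        return 1.0 / (1.0 + distance);
    }
}
```

How to treat items rated by only one user is a real design question: ignoring them, as here, makes users with little overlap look similar, so an actual implementation might want to weight by overlap size.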

I know this as "Manhattan distance". Is that an Americanism or is that
actually the more common name to anyone?

On Tue, Mar 29, 2011 at 10:16 PM, Daniel McEnnis <dm...@gmail.com> wrote:
>
> Dear,
>
> Here is a patch of a new distance metric for the collaborative
> filtering modules - CityBlockDistance.  With the 0 - 1 binary split on
> preference. KLDistance, AHDistance, and Symmetric KLDistance don't
> make sense.
>
> Daniel McEnnis.