Posted to user@mahout.apache.org by James Endicott <en...@gmail.com> on 2013/04/07 15:54:40 UTC

I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

I might be wrong about this, because I'm just a grad student and my
knowledge of statistics and ability to read French leave something to be
desired, but I think that the TanimotoSimilarity scorer actually uses the
Jaccard similarity measure instead of the Tanimoto similarity measure. The
javadocs say that the tool uses the Tanimoto similarity, which is also known
as the extended Jaccard similarity, but the source code shows that it only
uses the regular Jaccard similarity.

As far as I can tell, the difference between the two is that the Jaccard
similarity can only be used to compare two items, using the formula:

    items appearing in both documents /
        (items just appearing in one + items just appearing in the other
         + items appearing in both)

But the Tanimoto similarity measure allows for comparing between any number
of items by generalizing the formula to:

    items appearing in all documents /
        (items just appearing in one + items just appearing in another + ...
         + items appearing in some but not all + ... + items appearing in all)
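
To make the two formulas above concrete, here is a minimal Python sketch. It is
not Mahout code: the function names `jaccard` and `n_way_similarity` and the
sample documents are made up for illustration, and returning 0.0 when there are
no items at all is just a convention picked here. It treats each document as a
set of items and shows that the n-way form gives the same number as the two-item
form when only two documents are compared.

```python
from typing import Hashable, Iterable, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Two-item form: items in both / items in either."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


def n_way_similarity(sets: Iterable[Set[Hashable]]) -> float:
    """Generalized form: items in all sets / items in at least one set."""
    sets = list(sets)
    if not sets:
        return 0.0  # nothing to compare
    common = set.intersection(*sets)
    union = set.union(*sets)
    if not union:
        return 0.0  # same empty-set convention as above
    return len(common) / len(union)


# Hypothetical documents, each reduced to the set of items it contains.
doc1 = {"a", "b", "c"}
doc2 = {"b", "c", "d"}
doc3 = {"c", "d", "e"}

print(jaccard(doc1, doc2))                   # 2 shared items out of 4 total
print(n_way_similarity([doc1, doc2]))        # same value as jaccard(doc1, doc2)
print(n_way_similarity([doc1, doc2, doc3]))  # 1 shared item out of 5 total
```

With only two sets, the intersection and union in `n_way_similarity` are exactly
the ones used by `jaccard`, so the two measures only differ once a third set is
added.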

I think the class could be generalized to implement the full Tanimoto
similarity without too much difficulty (though I don't think it's a high
priority) but at the moment it does not do so. While I realize this is
probably a trivial matter, I hope the docs get updated at some point so
another grad student doesn't have to muddle through a botany article in a
Swiss journal from 1901 again.

--James

Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

Posted by Ted Dunning <te...@gmail.com>.

To my mind, you as the reader have a major voice here.

So if you were confused/not happy with the doc, then there is a problem.
You will know best how to fix that when you get done.

So let us know how!



On Mon, Apr 8, 2013 at 2:16 PM, James Endicott <en...@gmail.com> wrote:

> I didn't want to file a suggestion for a javadoc patch without hearing from
> someone who knows a bit more about the math history behind it because I
> didn't want to suggest something that may be in error. When I checked the
> Wikipedia article on it, the article noted that there was confusion and
> inconsistency between papers as to what Tanimoto actually was and how it
> compared to Jaccard. So, I went to the primary source for Jaccard and am
> getting the primary source for Tanimoto when/if interlibrary loan comes
> through.

Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

Posted by James Endicott <en...@gmail.com>.

I didn't want to file a suggestion for a javadoc patch without hearing from
someone who knows a bit more about the math history behind it because I
didn't want to suggest something that may be in error. When I checked the
Wikipedia article on it, the article noted that there was confusion and
inconsistency between papers as to what Tanimoto actually was and how it
compared to Jaccard. So, I went to the primary source for Jaccard and am
getting the primary source for Tanimoto when/if interlibrary loan comes
through.

On Mon, Apr 8, 2013 at 12:04 PM, Ted Dunning <te...@gmail.com> wrote:

> I don't see the problem here.  We only want to compare two items so
> Jaccard and Tanimoto are identical.
>
> Could you file a JIRA and suggest a javadoc patch?
>
> Why did this take you to an ancient journal instead of Wikipedia?

Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

Posted by Ted Dunning <te...@gmail.com>.

I don't see the problem here.  We only want to compare two items so Jaccard and Tanimoto are identical.

Could you file a JIRA and suggest a javadoc patch?

Why did this take you to an ancient journal instead of Wikipedia?


On Apr 7, 2013, at 6:54 AM, James Endicott wrote:

> As far as I can tell, the difference between the two is that the Jaccard
> similarity can only be used to compare two items using the formula:
> items appearing in both documents/(items just appearing in one + items just
> appearing in the other + items appearing in both)
> But the Tanimoto similarity measure allows for comparing between any number
> of items by generalizing the formula to:
> items appearing in all documents/(items just appearing in one + items just
> appearing in another + ... + items appearing in some but not all + ... +
> items appearing in all)
> 
> I think the class could be generalized to implement the full Tanimoto
> similarity without too much difficulty (though I don't think it's a high
> priority) but at the moment it does not do so. While I realize this is
> probably a trivial matter, I hope the docs get updated at some point so
> another grad student doesn't have to muddle through a botany article in a
> Swiss journal from 1901 again.

Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

Posted by Sean Owen <sr...@gmail.com>.

I had not heard of Tanimoto being generalized to n-way similarity, but
then again, I can't say I know anything authoritative about the
term. The Wikipedia page says it's incorrectly used to describe a lot
of things. Here, we're only looking at 2-way comparisons, i.e. pair-wise
similarity. As far as I know, these are all synonymous in this case, so
I don't think it's incorrect. There's no need for n-way similarity
anywhere that I know of here.
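
As a rough illustration of the pair-wise case, the sketch below compares the
vector formula that is often cited as the Tanimoto coefficient,
dot(x, y) / (|x|^2 + |y|^2 - dot(x, y)), with the set-based Jaccard ratio on 0/1
indicator vectors. This is not Mahout's implementation, and whether that vector
formula is what Tanimoto originally defined is exactly the open question in this
thread. The function names and the two sample users are hypothetical.

```python
from typing import Sequence


def tanimoto_coefficient(x: Sequence[float], y: Sequence[float]) -> float:
    """Vector form often cited as Tanimoto: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x)
    norm_y = sum(b * b for b in y)
    denom = norm_x + norm_y - dot
    return dot / denom if denom else 0.0  # 0.0 for two all-zero vectors (a convention)


def jaccard_from_indicators(x: Sequence[int], y: Sequence[int]) -> float:
    """Set form on 0/1 vectors: positions set in both / positions set in either."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


# Hypothetical boolean preferences of two users over the same five items.
user_a = [1, 1, 1, 0, 0]
user_b = [0, 1, 1, 1, 0]

print(tanimoto_coefficient(user_a, user_b))     # 2 / (3 + 3 - 2)
print(jaccard_from_indicators(user_a, user_b))  # 2 / 4
```

For 0/1 vectors the dot product counts the shared items and each squared norm
counts one user's items, so the denominator is the size of the union and the two
functions agree. They can differ once the vectors hold non-binary values.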

On Sun, Apr 7, 2013 at 2:54 PM, James Endicott <en...@gmail.com> wrote:
> I might be wrong about this because I'm just a grad student and my
> knowledge of statistics and ability to read French leave something to be
> desired but I think that the TanimotoSimilarity scorer actually uses the
> Jaccard similarity measure instead of the TanimotoSimilarity measure. The
> javadocs say that the tool uses the Tanimoto similarity which is also known
> as the extended Jaccard similarity but the source code shows that it only
> uses the regular Jaccard similarity.
>
> As far as I can tell, the difference between the two is that the Jaccard
> similarity can only be used to compare two items using the formula:
> items appearing in both documents/(items just appearing in one + items just
> appearing in the other + items appearing in both)
> But the Tanimoto similarity measure allows for comparing between any number
> of items by generalizing the formula to:
> items appearing in all documents/(items just appearing in one + items just
> appearing in another + ... + items appearing in some but not all + ... +
> items appearing in all)
>
> I think the class could be generalized to implement the full Tanimoto
> similarity without too much difficulty (though I don't think it's a high
> priority) but at the moment it does not do so. While I realize this is
> probably a trivial matter, I hope the docs get updated at some point so
> another grad student doesn't have to muddle through a botany article in a
> Swiss journal from 1901 again.
>
> --James