Posted to user@mahout.apache.org by "Robin M. E. Swezey" <ro...@toralab.org> on 2011/03/05 05:56:27 UTC

Problem with CNB implementation?

Hello,

My name is Robin Swezey.

We have a paper accepted to an international conference, which
mentions the use of Mahout and its Complementary Naive Bayes (CNB)
algorithm.

The deadline for submitting the final version of this paper is set to
March 5, 23:59 PST (GMT - 8), which is today.

Yet, I have reported what we believe is an issue with the CNB implementation:
https://issues.apache.org/jira/browse/MAHOUT-605?focusedCommentId=13000838&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13000838
(please look at this comment from 01/Mar/11 11:42, my previous ones
aren't so clear)

Basically, Mahout developers claim that weights decrease with class
affinity (as in a real CNB), but my professor claims that this is not
the case. So we conducted a test to check this. The test is easy to
run, so I suggest you run it as well if you need to verify.

The point is that we want to know if it is a _real_ CNB or not. We
cannot really write false statements in a paper.

Actually, we don't really need to solve the issue right now, just
confirmation that there is one (or not).
- If there is an issue, we can correct the paper and say that it was
an improved NB (or whatever this is) instead of a real CNB.
- If there is no issue, we will leave the paper as it is now.

The source is quite difficult and complex to navigate, and honestly I
don't want to write claims based on my sole understanding of it.

Can I kindly ask for your help on this one?

Btw, the paper relates to a nationwide governmental project in Japan,
and could have a positive impact for Mahout when published. We also
intend to use Mahout in further papers.

We really thank you for your work and efforts on the Mahout platform
and hope to contribute to it as much as we can.

Best regards,
Robin S

PS: We have a Mahout course/training in our lab, I will share the
documentation as soon as I translate it (I originally wrote it in
Japanese).

-- 
Robin M. E. Swezey <ro...@toralab.org>

Re: Problem with CNB implementation?

Posted by "Robin M. E. Swezey" <ro...@toralab.org>.
Ted,
Robin A,

Thank you so much for your quick replies.

As I said, my statements weren't very clear at the beginning of the
issue report. The reason is that I have to rely on my own
understanding of the source code, and it is not always easy for me to
follow what my professor tells me. (I am quite fluent in Japanese, but
it gets harder when talking about math and algorithms.)

So I first thought the array inversion was the problem, but it is not,
which is why I rewrote a whole comment where I hoped everything was
clear. I double-checked the paper that Robin A uses and also read many
of his posts on JIRA.

I totally understand the point about increasing weights and decreasing
class affinity.

The problem was that, when we actually tested it, we got an array of
positive doubles which increased with class affinity.
So my prof thought that it was actually outputting some kind of
improved NB instead of CNB, and that there might be an issue in the
Mahout source. Hence our worry.

Now, Robin A just made the point that those weights are to be thought
of as _negative_, which I think we had totally missed so far.

Thus, I think the issue is resolved now.

Btw my nameless professor is Dr Shiramatsu, a graduate of Kyoto University.

Thank you so much for your help and your work!

Best regards,
Robin S

On Sat, Mar 5, 2011 at 3:05 PM, Robin Anil <ro...@gmail.com> wrote:
> Robin S, The CNB implementation in Mahout is almost a direct
> implementation of the paper, and the results, at least with 20newsgroups,
> match the results in Jrennie's paper. The calculation shows that the sum
> of the weights of the complement class is used to normalize it.
> Now whether you think that is a real CNB or not depends on how you want
> to interpret it. The theta is calculated as a negative number, so
> whatever relevance value documentWeight returns is multiplied to get a
> negative score, which is ranked and returned.
> Let me understand what you find to be the issue. If there is indeed a bug
> in implementing the paper, I am happy to correct it. But if you want to
> "claim" it's not "CNB" and is a normalized NB, whatever we say is moot.
> You have to take it up with the author of the paper.
>
>



-- 
=============================================
Robin M. E. Swezey <ro...@toralab.org>
Web/AI PhD Candidate @ Nagoya Institute of Technology
  SCOPE Large-Scale Citizen Participation Platform Project
  Research: http://www-toralab.ics.nitech.ac.jp
IT Project Manager @ Wisdom Web Co. Ltd
  Resume & Achievements: http://www.swezey.fr
[Tel] +81-90-1785-1337 (Cell)   [Fax] +81-52-735-5584 (Lab)
=============================================

Re: Problem with CNB implementation?

Posted by Robin Anil <ro...@gmail.com>.
Robin S, The CNB implementation in Mahout is almost a direct implementation
of the paper, and the results, at least with 20newsgroups, match the results
in Jrennie's paper. The calculation shows that the sum of the weights of the
complement class is used to normalize it.

Now whether you think that is a real CNB or not depends on how you want to
interpret it. The theta is calculated as a negative number, so whatever
relevance value documentWeight returns is multiplied to get a negative
score, which is ranked and returned.

Let me understand what you find to be the issue. If there is indeed a bug
in implementing the paper, I am happy to correct it. But if you want to
"claim" it's not "CNB" and is a normalized NB, whatever we say is moot. You
have to take it up with the author of the paper.
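[Editor's note: for readers unfamiliar with the sign convention being
discussed, here is a minimal from-scratch sketch of the CNB scoring rule as
described in Jrennie's paper. This is NOT Mahout's actual code; the toy
corpus and all names are illustrative. The per-term complement weights are
log-probabilities and therefore negative; negating their weighted sum gives
a positive score that increases with class affinity, which matches the
behavior Robin S observed.]

```python
import math
from collections import Counter

# Toy corpus: class label -> list of documents (token lists).
# Purely illustrative data, not from any real experiment.
corpus = {
    "sports": [["ball", "goal", "team"], ["team", "win", "goal"]],
    "tech":   [["code", "bug", "team"], ["code", "compile", "bug"]],
}

vocab = sorted({w for docs in corpus.values() for d in docs for w in d})
alpha = 1.0  # Laplace smoothing

def complement_counts(label):
    """Token counts over every class EXCEPT `label` (the complement)."""
    counts = Counter()
    for other, docs in corpus.items():
        if other != label:
            for d in docs:
                counts.update(d)
    return counts

def complement_weights(label):
    """Smoothed log-probability of each term under the complement of
    `label`. Probabilities are < 1, so every weight is negative."""
    counts = complement_counts(label)
    total = sum(counts.values())
    return {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab}

def cnb_score(doc, label):
    """CNB decision value: minus the complement-weighted term counts.
    Because the weights are negative, the result is positive, and it
    grows as the document looks LESS like the complement, i.e. MORE
    like `label`."""
    w = complement_weights(label)
    f = Counter(doc)
    return -sum(f[t] * w[t] for t in f if t in w)

doc = ["code", "bug", "bug"]
scores = {label: cnb_score(doc, label) for label in corpus}
best = max(scores, key=scores.get)
```

Run on the toy document above, the "tech" score exceeds the "sports" score,
so the document is assigned to "tech" even though every raw score is a
positive double, exactly the situation described in the test.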

Re: Problem with CNB implementation?

Posted by Ted Dunning <te...@gmail.com>.
Robin (S),

As far as I know, your issue concerned the fact that results from the CNB
in Mahout use the convention that increasing score indicates decreasing
relevance.

The Mahout Robin (Robin A, I will call him) claimed that this was, in
fact, correct, because the score from CNB really is just the relevance
score for the complementary class.  Thus, an increasing score means that
the document being scored is more like the complementary class than the
class being scored.  This position is internally consistent at least, and,
since Robin A wrote that code, there is some credibility to the thought
that the Mahout code really does work this way.

It sounds like your professor (who is nameless and whom I will thus
anoint "the professor") feels that this score should somehow be inverted,
so that less relevance means a lower score.

I am not at all clear about who is claiming what, so could you say whether
I have the right idea about what everybody is saying?

On Fri, Mar 4, 2011 at 8:56 PM, Robin M. E. Swezey <ro...@toralab.org> wrote:

> Hello,
>
> My name is Robin Swezey.
>
> We have a paper accepted to an international conference, which
> mentions the use of Mahout and its Complementary Naive Bayes (CNB)
> algorithm.
>
> The deadline for submitting the final version of this paper is set to
> March 5, 23:59 PST (GMT - 8), which is today.
>
> Yet, I have reported what we believe is an issue with the CNB
> implementation:
>
> https://issues.apache.org/jira/browse/MAHOUT-605?focusedCommentId=13000838&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13000838
> (please look at this comment from 01/Mar/11 11:42, my previous ones
> aren't so clear)
>
> Basically, Mahout developers claim that weights decrease with class
> affinity (as in a real CNB), but my professor claims that this is not
> the case. So we conducted a test to check this. The test is easy to
> run, so I suggest you run it as well if you need to verify.
>
> The point is that we want to know if it is a _real_ CNB or not. We
> cannot really write false statements in a paper.
>
> Actually, we don't really need to solve the issue right now, just
> confirmation that there is one (or not).
> - If there is an issue, we can correct the paper and say that it was
> an improved NB (or whatever this is) instead of a real CNB.
> - If there is no issue, we will leave the paper as it is now.
>
> The source is quite difficult and complex to navigate, and honestly I
> don't want to write claims based on my sole understanding of it.
>
> Can I kindly ask for your help on this one?
>
> Btw, the paper relates to a nationwide governmental project in Japan,
> and could have a positive impact for Mahout when published. We also
> intend to use Mahout in further papers.
>
> We really thank you for your work and efforts on the Mahout platform
> and hope to contribute to it as much as we can.
>
> Best regards,
> Robin S
>
> PS: We have a Mahout course/training in our lab, I will share the
> documentation as soon as I translate it (I originally wrote it in
> Japanese).
>
> --
> Robin M. E. Swezey <ro...@toralab.org>
>