Posted to dev@mahout.apache.org by "Maurizio (JIRA)" <ji...@apache.org> on 2008/06/28 04:15:45 UTC

[jira] Issue Comment Edited: (MAHOUT-9) Implement MapReduce BayesianClassifier

    [ https://issues.apache.org/jira/browse/MAHOUT-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608964#action_12608964 ] 

maurizio316 edited comment on MAHOUT-9 at 6/27/08 7:14 PM:
--------------------------------------------------------

Hi Grant,
I'm developing something similar to your application, and I found your code really interesting.
I may be missing something, but I don't think your Bayesian approach works correctly.
Specifically, weightedFeatureProbability computes:
 
((weight * defaultProb) + (totalNumSeen * unweighted)) / (weight + totalNumSeen)

where
  unweighted = numSeen / labelCount
  numSeen    = number of times the feature has been seen within the given label
  labelCount = number of features under the label

If you look at how this curve behaves, you can see that:
- terms never seen before are "heavier" than the others (for such a term totalNumSeen is 0, so the result is simply defaultProb).
- unweighted is a very small number, so its contribution to the probability is insignificant. Moreover, for a widespread term the numerator grows more slowly than the denominator.
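
To make the two points above concrete, here is a minimal Python sketch of the formula exactly as written in this comment. It is not the code from the patch. The defaults weight=1.0 and defaultProb=0.5, the sample counts, and the reading of totalNumSeen as the number of times the feature was seen across all labels are assumptions made only for illustration.

```python
def weighted_feature_probability(num_seen, label_count, total_num_seen,
                                 weight=1.0, default_prob=0.5):
    """Formula as quoted in the comment; weight and default_prob are assumed defaults."""
    # Share of the label's features accounted for by this feature.
    unweighted = num_seen / label_count
    # Blend of the prior (default_prob) and the observed ratio.
    return ((weight * default_prob) + (total_num_seen * unweighted)) / (weight + total_num_seen)


# Hypothetical counts: a label holding 10000 features.
label_count = 10000

# A term never seen at all: the result falls back to default_prob.
unseen = weighted_feature_probability(0, label_count, 0)

# A rare term: seen 5 times under the label, 20 times overall.
rare = weighted_feature_probability(5, label_count, 20)

# A widespread term: seen 500 times under the label, 2000 times overall.
common = weighted_feature_probability(500, label_count, 2000)

print("unseen:", unseen)
print("rare:  ", rare)
print("common:", common)
```

With these assumed numbers the unseen term scores higher than both seen terms, and the widespread term ends up very close to its own unweighted value (500 / 10000), because each extra observation adds 1 to the denominator but only unweighted to the numerator.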

What do you think?
 
P.S.: sorry for my bad English

> Implement MapReduce BayesianClassifier
> --------------------------------------
>
>                 Key: MAHOUT-9
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-9
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.1
>
>         Attachments: MAHOUT-9.patch, MAHOUT-9.patch, MAHOUT-9.patch, MAHOUT-9.patch, MAHOUT-9.patch
>
>
> Implement a Bayesian classifier using M/R.
> I have a simple trainer done (not M/R) and will implement the classifier soon, then will upgrade it to use Hadoop.
