Posted to commits@mahout.apache.org by co...@apache.org on 2008/02/03 15:43:00 UTC

[CONF] Apache Lucene Mahout: NaiveBayes (page edited)

NaiveBayes (MAHOUT) edited by Isabel Drost
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/NaiveBayes
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=75078&originalVersion=1&revisedVersion=2

Comment:
---------------------------------------------------------------------

added general information on naive bayes

Change summary:
---------------------------------------------------------------------

added general information on naive bayes

Content:
---------------------------------------------------------------------


h1. Naive Bayes

Naive Bayes is an algorithm for classifying objects into categories, most commonly two (binary classification). It is one of the most common learning algorithms in spam filters. Despite its simplicity and rather naive assumptions, it has proven to work surprisingly well in practice.

Before applying the algorithm, the objects to be classified need to be represented by numerical features. In the case of e-mail spam, each feature might indicate whether some specific word is present in or absent from the mail to be classified. The algorithm has two phases: learning and application.
During learning, a set of feature vectors is given to the algorithm, each vector labelled with the class of the object it represents. From these, the algorithm estimates how likely each feature is to appear in spam messages and in non-spam messages. Given this information, during application one can easily compute the probability that a new message is spam or not.
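
The two phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mahout code: the function names `train` and `posterior`, the toy vocabulary and messages, and the use of add-one (Laplace) smoothing are all hypothetical choices made for this example. Each message is represented by the set of vocabulary words it contains, i.e. binary present/absent features.

```python
import math
from collections import defaultdict


def train(examples, vocabulary):
    """Learning phase: estimate class priors and per-class word probabilities.

    examples is a list of (set_of_words, label) pairs.
    """
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, label in examples:
        class_counts[label] += 1
        for w in vocabulary:
            if w in words:
                word_counts[label][w] += 1

    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    # Add-one smoothing (an assumption of this sketch) keeps every
    # probability strictly between 0 and 1, so the logarithms below are defined.
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (class_counts[c] + 2) for w in vocabulary}
        for c in class_counts
    }
    return priors, likelihoods


def posterior(words, priors, likelihoods):
    """Application phase: probability of each class for a new message."""
    log_scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w, p in likelihoods[c].items():
            # Every feature contributes, whether the word is present or absent.
            score += math.log(p if w in words else 1.0 - p)
        log_scores[c] = score
    # Normalise in log space to avoid underflow with many features.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}


# Toy data, invented for illustration only.
vocabulary = ["lottery", "meeting", "free"]
examples = [
    ({"lottery", "free"}, "spam"),
    ({"free"}, "spam"),
    ({"meeting"}, "ham"),
    ({"meeting", "free"}, "ham"),
]
priors, likelihoods = train(examples, vocabulary)
probs = posterior({"lottery", "free"}, priors, likelihoods)
```

With this toy data, `probs["spam"]` is considerably larger than `probs["ham"]`, because "lottery" was only seen in spam and "meeting" only in non-spam messages.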

The algorithm makes several assumptions that are not true for most datasets but make the computations easier. Probably the worst is that all features of an object are considered independent of one another, given the object's class. In practice that means that having already found the phrase "Statue of Liberty" in a text is assumed not to influence the probability of also seeing the phrase "New York".
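
Written out, the independence assumption is what lets the class probability factor into one term per feature. The notation below is an assumption of this note, not taken from the page: c denotes a class (for example spam) and x_1, ..., x_n the feature values of one object.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The first equality is Bayes' rule and holds in general. The final step is the naive part: it replaces the joint probability of all features given the class by a product of per-feature probabilities, each of which can be estimated separately from the labelled training data.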

h2. Strategy for a parallel Naive Bayes

h2. Design information
