Posted to user@mahout.apache.org by Lingxiang Cheng <li...@yahoo.com> on 2012/01/02 04:02:22 UTC

Bayes and Unicode?

It's interesting that the Bayes algorithms in Mahout strongly favor text data over numeric data. I am thinking about using them to categorize Chinese websites. Has anyone used them to process Unicode text?

Re: Bayes and Unicode?

Posted by Ted Dunning <te...@gmail.com>.

The Bayes algorithms favor sparse data with large numbers of potential
features.  Text is one example of such data.

Using Naive Bayes with Unicode should be fine.  The simplest method for
processing CJK text is to use character unigrams and bigrams.  This works
very well in retrieval systems.  I haven't heard whether it works as well
for classification, but I expect it would.
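
To make the unigram/bigram idea concrete, here is a minimal sketch in
plain Python.  It is not Mahout code, and the function name char_ngrams
is only a hypothetical helper for illustration.  It assumes the text has
already been decoded to Unicode and that whitespace is the only thing
that needs to be skipped.

```python
def char_ngrams(text, sizes=(1, 2)):
    """Return character n-grams of the given sizes (hypothetical helper).

    Assumes `text` is an already-decoded Unicode string.
    """
    tokens = []
    # Split on whitespace first so no n-gram spans a space or line break.
    for run in text.split():
        for n in sizes:
            # Slide a window of width n across the run of characters.
            for i in range(len(run) - n + 1):
                tokens.append(run[i:i + n])
    return tokens


if __name__ == "__main__":
    # Each token would become one sparse feature for the classifier.
    print(char_ngrams("中文网站"))
    # ['中', '文', '网', '站', '中文', '文网', '网站']
```

The resulting tokens could then be fed to the classifier in place of
whitespace-separated words, which CJK text generally does not have.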

On Sun, Jan 1, 2012 at 7:02 PM, Lingxiang Cheng <li...@yahoo.com> wrote:

> It's interesting that the Bayes algorithms in Mahout strongly favor text
> data over numeric data. I am thinking about using them to categorize
> Chinese websites. Has anyone used them to process Unicode text?