Posted to user@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2013/03/01 08:46:52 UTC

Re: mahout for web page categorization

You should derive several kinds of text and mark them differently. Your
suggestions of keywords, title, and anchor text are all good. Putting
them into separate fields in the style of Lucene would work. Adding a
field that combines all the different kinds of text would also be helpful.
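
As a rough illustration of the per-field layout, here is a minimal sketch in
plain Java. It does not use Lucene or Mahout; the class name PageFields and
the field names "title", "keywords", "anchor", "body", and "all" are
assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical holder for the different kinds of text taken from one page.
final class PageFields {
    // Returns one entry per kind of text, plus an "all" entry that
    // concatenates every non-empty field with single spaces.
    static Map<String, String> build(String title, String keywords,
                                     String anchorText, String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title);
        fields.put("keywords", keywords);
        fields.put("anchor", anchorText);
        fields.put("body", body);

        StringJoiner all = new StringJoiner(" ");
        for (String text : fields.values()) {
            if (text != null && !text.isEmpty()) {
                all.add(text);
            }
        }
        fields.put("all", all.toString());
        return fields;
    }
}
```

Keeping the fields apart until the last moment makes it cheap to try the
classifier on any subset of them later.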

Once you have that, you can experiment with including or excluding different
fields for the classifier to see what works best on held-out data. For
Naive Bayes, if you need a textual encoding, you can simply prefix each
token with its field name. For SGD or any system based on the hashed
vector encoding, you can have a separate encoder with a different name for
each field.
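
Both encodings can be sketched in a few lines of plain Java. This is only an
illustration of the idea, not Mahout's encoder classes: the class name
FieldEncoding, the ":" separator, and the use of String.hashCode as the hash
function are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers showing how a field name can be folded into features.
final class FieldEncoding {
    // Textual encoding: "cheap" in the title becomes "title:cheap", so the
    // same word in different fields is a different feature.
    static List<String> prefix(String field, List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(field + ":" + token);
        }
        return out;
    }

    // Hashed encoding: the field name is mixed into the hash, which plays
    // the role of a separately named encoder per field.
    static int indexFor(String field, String token, int vectorSize) {
        return Math.floorMod((field + ":" + token).hashCode(), vectorSize);
    }

    // Adds the token's weight at its hashed position in a dense vector.
    static void addToVector(String field, String token, double weight,
                            double[] vector) {
        vector[indexFor(field, token, vector.length)] += weight;
    }
}
```

With this layout, dropping a field from an experiment just means not calling
prefix or addToVector for that field's tokens.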

On Wed, Feb 27, 2013 at 5:33 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> Hi,
>
> I am looking at how to use mahout for web page categorization.
>
> Idea is to have various categories like
>
> Adult
> Arts
> Business
> Computers
> Games
> Health
> Home
> Kids
> News
> Recreation
> Reference
> Science
> Shopping
> Society
> Sports
>
> and classify given web page into specific category.
>
> After going through some papers on related topics, they suggest to do
>
> Pre processing
> - remove html tags
> - remove stop words : "stop list"
> - remove rare words :
> - perform word stemming: Porter stemmer is well known algo
>
> Research papers suggest various approaches, like subject
> classification based on title and functional classification based on
> contents, even html elements, image alt text, link anchor texts etc.
>
> I need to first finalize on list of categories and then prepare list of
> keywords for each category.
>
> My question is how mahout could be used for this purpose. I see an example
> with mahout that shows classification of 20 newsgroups using naive Bayes.
> However, I am not sure how I could make use of keywords in this case.
>
> Are there some examples that show how mahout could be used to preprocess
> and do stemming?
>
> Thanks,
> Rajesh
>