Posted to user@mahout.apache.org by Arjun Kumar Reddy <ch...@iiitb.net> on 2011/03/01 12:35:13 UTC

Regarding classification of URL's

Hi list,

I am new to Mahout and would like to know some details about this
project.

I need a classification tool that returns the category to which a URL
or its content belongs.

For example, if I give it this URL:

http://www.espncricinfo.com/icc_cricket_worldcup2011/content/current/player/49764.html

it should give me the category "cricket".

I was able to do this with existing APIs such as alchemy, evri, and textwise,
and I am looking for something better in terms of performance.

Could anyone please explain how I can use Mahout to classify
documents?


Thanks and regards,
Ch. Arjun Kumar Reddy,
International Institute of Information Technology – Bangalore (IIITB),
26/C, Electronics City, Hosur Road,
Bangalore 560 100
Ph: 8800710999

Re: Regarding classification of URL's

Posted by Ted Dunning <te...@gmail.com>.
Scraping or spidering is, indeed, the first step.  Associated with each URL,
you should retain the plain text (without markup), the domain name, all
anchor text for links pointing to each page and a small neighborhood of text
around each link.
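
As a rough illustration of the per-URL data described above, here is a minimal Python sketch using only the standard library. It is not part of Mahout; the names `PageRecord`, `add_page`, and the `window` parameter are hypothetical, and a real crawler would need error handling and better text extraction.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


@dataclass
class PageRecord:
    """What is retained for one URL (field names are illustrative)."""
    url: str
    domain: str
    text: str = ""
    inbound_anchors: list = field(default_factory=list)  # anchor text of links pointing here
    inbound_context: list = field(default_factory=list)  # words around those links


class _Extractor(HTMLParser):
    """Collects visible words and the word span covered by each link."""

    def __init__(self):
        super().__init__()
        self.words = []    # visible words in document order
        self.links = []    # [href, start_index, end_index] per link
        self._skip = 0     # depth inside <script>/<style>
        self._open = None  # link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, len(self.words), len(self.words)]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a" and self._open:
            self._open[2] = len(self.words)
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.split())


def add_page(records, url, html, window=5):
    """Store plain text for `url`; push anchor text and context to link targets."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    page = records.setdefault(url, PageRecord(url, urlparse(url).netloc))
    page.text = " ".join(parser.words)
    for href, start, end in parser.links:
        target = urljoin(url, href)
        rec = records.setdefault(target, PageRecord(target, urlparse(target).netloc))
        rec.inbound_anchors.append(" ".join(parser.words[start:end]))
        lo, hi = max(0, start - window), end + window
        rec.inbound_context.append(" ".join(parser.words[lo:hi]))
    return page
```

Calling `add_page(records, url, html)` for every crawled page fills a shared `records` dictionary, so each target URL accumulates the anchor text and surrounding words of the links that point to it.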

From there, you can use the Naive Bayes classifiers as Vineet suggests or
you can use the SGD classifiers.  The SGD classifiers are more flexible but
performance in terms of accuracy should be similar.  The SGD classifiers are
significantly easier to integrate into other code.
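
To make the SGD idea concrete, here is a small plain-Python sketch of a multinomial logistic regression trained one example at a time over hashed word features. It illustrates the general technique only and does not use Mahout's API; the class name `HashedSGDClassifier`, the feature count, and the learning rate are assumptions chosen for the example.

```python
import math
import zlib


class HashedSGDClassifier:
    """Multinomial logistic regression updated by stochastic gradient steps."""

    def __init__(self, categories, num_features=2 ** 16, learning_rate=0.5):
        self.categories = list(categories)
        self.num_features = num_features
        self.learning_rate = learning_rate
        # One sparse weight table per category, keyed by hashed feature index.
        self.weights = [dict() for _ in self.categories]

    def _encode(self, tokens):
        """Hash tokens into a sparse {index: count} vector."""
        vec = {}
        for tok in tokens:
            idx = zlib.crc32(tok.lower().encode("utf-8")) % self.num_features
            vec[idx] = vec.get(idx, 0.0) + 1.0
        return vec

    def _probabilities(self, vec):
        """Softmax over the per-category linear scores."""
        scores = [sum(w.get(i, 0.0) * v for i, v in vec.items()) for w in self.weights]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def train(self, tokens, label):
        """Take one gradient step on a single labelled example."""
        vec = self._encode(tokens)
        probs = self._probabilities(vec)
        target = self.categories.index(label)
        for k, w in enumerate(self.weights):
            error = (1.0 if k == target else 0.0) - probs[k]
            for i, v in vec.items():
                w[i] = w.get(i, 0.0) + self.learning_rate * error * v

    def classify(self, tokens):
        """Return the most probable category and its probability."""
        probs = self._probabilities(self._encode(tokens))
        best = max(range(len(probs)), key=probs.__getitem__)
        return self.categories[best], probs[best]
```

Because `train` takes one example at a time and keeps no other state than the weight tables, a model like this can be updated incrementally as new labelled pages arrive, which is part of what makes SGD-style classifiers easy to embed in other code.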

You will need to have labels on a fair number of pages from each category.
If you can have users tag these pages, that might be helpful.

If you have user interaction logs, you can also use those.

On Tue, Mar 1, 2011 at 3:57 AM, vineet yadav <vi...@gmail.com> wrote:

> Hi Arjun,
> You need to scrape the content from the website for a given URL, and then
> prepare training datasets from the scraped content for Bayesian
> classification.
> Also check out the Mahout twenty newsgroups example for reference:
> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
> Thanks
> Vineet Yadav

Re: Regarding classification of URL's

Posted by vineet yadav <vi...@gmail.com>.
Hi Arjun,
You need to scrape the content from the website for a given URL, and then
prepare training datasets from the scraped content for Bayesian
classification.
Also check out the Mahout twenty newsgroups example for reference:
https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
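
For orientation, a training dataset here is just a collection of (tokens, category) pairs built from the scraped text. The following plain-Python sketch of a multinomial Naive Bayes classifier with add-one smoothing shows how such pairs are used; it is an illustration of the technique, not Mahout's implementation, and the class name `NaiveBayes` and the `alpha` parameter are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()

    def train(self, tokens, label):
        """Add one labelled document to the counts."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        """Return the category with the highest log posterior."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Each scraped page is tokenized and passed to `train` with its category; `classify` is then called with the tokens of a new page.
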
Thanks
Vineet Yadav
