Posted to user@nutch.apache.org by chee wu <ch...@gmail.com> on 2007/02/04 14:58:49 UTC

Any successful experiences for text classification ?

Hi,
  I am trying to sort all the crawled web pages into predefined categories. Has anybody successfully implemented classification based on Nutch? I did find some threads discussing this, but none of them is clear enough. Below are some possible solutions mentioned in past threads:
  1. Using SVM-Light, but it seems to be a C-based program?
  2. Can I do this with Carrot2?
  3. Other open source software packages like Rainbow or IBM UIMA?
I want to research the three options above in more depth. Which one should I study first? Any other hints or experiences are also welcome!

Thanks
-Chee

Re: Any successful experiences for text classification ?

Posted by chee wu <ch...@gmail.com>.

I am trying to classify web pages in Chinese.
Thank you, Ashish. Yes, LibSVM might satisfy my requirement. I am also considering LingPipe:
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
It seems LingPipe can categorize Chinese documents without word segmentation. Has anyone tried this?
I'll write an index filter based on LingPipe.
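To illustrate why classification without word segmentation is plausible, here is a minimal sketch of a character n-gram classifier. It is not LingPipe's API; it is a toy naive Bayes model over character bigrams, and the class name, training sentences, and category labels are all made up for illustration. Because features are overlapping character pairs, the text never has to be split into words.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    text = "".join(text.split())  # drop whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramNaiveBayes:
    """Toy multinomial naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # category -> n-gram counts
        self.doc_counts = Counter()               # category -> training docs seen
        self.vocab = set()                        # all n-grams seen in training

    def train(self, text, category):
        grams = char_ngrams(text, self.n)
        self.ngram_counts[category].update(grams)
        self.doc_counts[category] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for category, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[category] / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best


# Hypothetical training data: two categories, two short documents each.
clf = CharNgramNaiveBayes(n=2)
clf.train("足球比赛今晚开始", "sports")
clf.train("篮球联赛决赛", "sports")
clf.train("股票市场今日上涨", "finance")
clf.train("银行利率下调", "finance")

print(clf.classify("足球联赛"))  # sports
print(clf.classify("股票上涨"))  # finance
```

A real system would need far more training text per category and a held-out set to measure accuracy; the sketch only shows that the feature extraction step does not depend on word boundaries.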


Re: Any successful experiences for text classification ?

Posted by Shay Lawless <se...@gmail.com>.

Hi Chee,

Are you looking to perform this classification on a collection of local
documents or on a collection of web pages?

Shay


Re: Any successful experiences for text classification ?

Posted by kauu <ba...@gmail.com>.

That's right, Carrot2 is only used for clustering documents, not for
classifying them.


-- 
www.babatu.com

Re: Any successful experiences for text classification ?

Posted by Stanislaw Osinski <st...@gmail.com>.

Hi,

Carrot2 performs document clustering, which, as opposed to document
classification, is an unsupervised technique (no predefined categories).
Therefore, Carrot2 doesn't seem suitable for this particular problem.

Thanks,

Stanislaw
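The distinction can be sketched in a few lines. This is not how Carrot2 works internally; the similarity measure (Jaccard overlap of whitespace tokens), the threshold, and the sample documents are all assumptions chosen to keep the example small. The point is only that classification needs labeled examples and returns one of the predefined labels, while clustering receives no labels and returns unnamed groups.

```python
def jaccard(a, b):
    """Token-set overlap between two whitespace-tokenized strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def classify(doc, labeled_examples):
    """Supervised: return the label of the most similar labeled example."""
    return max(labeled_examples, key=lambda ex: jaccard(doc, ex[0]))[1]


def cluster(docs, threshold=0.2):
    """Unsupervised: greedy single-pass grouping; the groups have no names."""
    clusters = []
    for doc in docs:
        for group in clusters:
            # compare against the first document of each existing group
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


# Hypothetical data for illustration.
labeled = [
    ("football match score goal", "sports"),
    ("stock market shares price", "finance"),
]
print(classify("goal in the football match", labeled))  # sports

docs = [
    "football match goal",
    "football goal score",
    "stock market price",
    "stock price shares",
]
print(len(cluster(docs)))  # 2
```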


-- 
Stanislaw Osinski, stachoo@carrot-search.com
http://www.carrot-search.com

Re: Any successful experiences for text classification ?

Posted by kauu <ba...@gmail.com>.

Hi Chee Wu,
  Carrot2 is the easiest way to reach your goal, I think, but Carrot2's
performance is not very good. Another important thing is that you
should feed in data with as little spam as possible, or you will get
useless results.


-- 
www.babatu.com

Re: Any successful experiences for text classification ?

Posted by The Golden Condor ! <tg...@gmail.com>.

On 2/5/07, Vlador <te...@gmail.com> wrote:
>
> I've seen a lot of discussions about implementing the above-mentioned
> algorithm (SVM); however, I couldn't find any live examples designed for
> multi-class tasks in which you have to assign a document to one of
> 10,000+ categories.
> It seems impossible to me.
>

You should use hierarchical classification for that many classes. But
even then, you'd need way too many classifiers.

TAA
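To put rough numbers on the hierarchical approach, here is a small sketch. The branching factor of 10, the keyword-overlap scorer, and the category tree are all assumptions for illustration; a real system would put a trained classifier (for example an SVM) at each internal node. With a full tree of branching factor 10, 10,000 leaf categories sit at depth 4, which takes 1 + 10 + 100 + 1000 = 1111 local classifiers in total, although each document only passes through 4 of them.

```python
class Node:
    """A category in the hierarchy; keywords stand in for a trained classifier."""

    def __init__(self, name, children=None, keywords=()):
        self.name = name
        self.children = children or []
        self.keywords = set(keywords)

    def score(self, tokens):
        # Placeholder scoring: count of keyword hits (not a real classifier).
        return len(self.keywords & tokens)


def classify(root, text):
    """Descend from the root, picking the best-scoring child at each level."""
    tokens = set(text.lower().split())
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda child: child.score(tokens))
        path.append(node.name)
    return path


def count_internal_nodes(branching, depth):
    """Local classifiers needed in a full tree: one per internal node."""
    return sum(branching ** level for level in range(depth))


# Hypothetical two-level hierarchy.
root = Node("root", children=[
    Node("sports", keywords=["football", "goal", "tennis", "racket"], children=[
        Node("football", keywords=["football", "goal"]),
        Node("tennis", keywords=["tennis", "racket"]),
    ]),
    Node("science", keywords=["quantum", "particle", "cell", "gene"], children=[
        Node("physics", keywords=["quantum", "particle"]),
        Node("biology", keywords=["cell", "gene"]),
    ]),
])

print(classify(root, "The quantum particle experiment"))  # ['science', 'physics']
print(count_internal_nodes(10, 4))                        # 1111
```

One known weakness of this top-down scheme is that a wrong decision near the root cannot be corrected further down the tree.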

RE: Any successful experiences for text classification ?

Posted by Vlador <te...@gmail.com>.

I've seen a lot of discussions about implementing the above-mentioned
algorithm (SVM); however, I couldn't find any live examples designed for
multi-class tasks in which you have to assign a document to one of
10,000+ categories.
It seems impossible to me.



RE: Any successful experiences for text classification ?

Posted by Ashish Saharia <as...@smarteinc.com>.

Hi Chee Wu, 

If you're looking for a Java-based solution, you might find it worthwhile to
look at LibSVM. You can use this open source package to train a Support
Vector Machine based classifier, which can then be used to classify the
documents that Nutch crawls for you. In general, the more training
documents you have, the better the accuracy. Keep in mind that training
documents must be carefully hand-picked to minimize misclassification.
You can use LibSVM for 2-class as well as multi-class classification tasks.

--

Regards....

~ Ashish Saharia ~
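As a rough sketch of the data preparation step, the snippet below turns labeled text into LibSVM's sparse training format, where each line is a numeric label followed by index:value pairs in ascending index order. The helper names, the whitespace tokenizer, the raw term-count features, and the category ids are assumptions for illustration; they are not part of LibSVM or Nutch.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to a 1-based feature index."""
    vocab = {}
    for text in docs:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def to_libsvm_line(text, label_id, vocab):
    """Format one document as '<label> <index>:<value> ...' with sorted indices."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    features = sorted((vocab[t], c) for t, c in counts.items())
    return " ".join([str(label_id)] + [f"{i}:{c}" for i, c in features])


# Hypothetical hand-picked training documents with numeric category ids.
training = [
    ("nutch crawls web pages", 1),
    ("svm trains on pages", 2),
]
vocab = build_vocab(text for text, _ in training)

for text, label in training:
    print(to_libsvm_line(text, label, vocab))

print(to_libsvm_line("pages pages svm", 2, vocab))  # 2 4:2 5:1
```

Lines in this format can be written to a file and passed to LibSVM's svm-train tool; the resulting model is then applied to new documents with svm-predict. In practice, term counts are usually scaled or replaced with TF-IDF weights before training.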


