Posted to user@mahout.apache.org by Joyce Babu <jo...@joycebabu.com> on 2011/02/03 14:21:34 UTC

Mahout for Keyword Extraction

Hi,

I am new to Java and to machine learning concepts. I was searching for a method to extract keywords (names of people, organizations, places, etc.) from news stories, ranked by relevance. I found several web services, such as OpenCalais, that provide a similar service, but they don't detect most of my terms. I have a list of approved keywords, and I only need to detect terms from that list.
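
A minimal sketch of the list-based detection described above, assuming plain case-insensitive whole-word matching and using the number of occurrences as a crude stand-in for relevance. It is written in Python for brevity and does not use Mahout; the function name, the sample story, and the keyword list are all hypothetical.

```python
import re
from collections import Counter


def find_approved_keywords(text, approved_keywords):
    """Return (keyword, count) pairs for approved keywords found in text,
    most frequent first. Occurrence count is only a rough relevance proxy."""
    lowered = text.lower()
    counts = Counter()
    for keyword in approved_keywords:
        # \b anchors keep "Cairo" from matching inside a longer word.
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[keyword] = hits
    return counts.most_common()


# Hypothetical story and keyword list, for illustration only.
story = ("Protesters gathered in Cairo on Tuesday. Officials in Cairo said "
         "the United Nations had been briefed on events in Egypt.")
approved = ["Cairo", "Egypt", "United Nations", "London"]
print(find_approved_keywords(story, approved))
```

This kind of dictionary lookup needs no training data, which may be enough when the keyword list is fixed; it will not handle spelling variants or ambiguous names.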

I found out about machine learning and got interested in the concept. I read somewhere that the classification feature of Mahout can be used for detecting keywords by classifying terms as keywords and non-keywords. I have been trying to learn Mahout for the past 30 hours, but haven't gotten anywhere. There is no point in spending more time learning it if Mahout is not the right tool for my problem.

Can someone provide details on using Mahout for term extraction? Is it possible to do this with little to moderate knowledge of Java? Is it overkill to use Mahout for this? Should I go for an NLP solution?

Thanks,
Joyce

Re: Mahout for Keyword Extraction

Posted by Joyce Babu <jo...@joycebabu.com>.

Thanks for the details, Vineet.

I have already tried KEA with a training set of 300 stories and keywords generated using OpenCalais, but the output was of very low quality (I did not use any vocabulary or stop words). When I tried it with the linked open data from data.nytimes.com, the output quality was good. I think it has potential with a good vocabulary. However, KEA doesn't return a relevance value.

I will go through the provided links on the different algorithms. It will take me some time to digest it completely :)

Can I use clustering to detect similar documents? For example, over the past week there were several news stories on the unrest in Egypt. I need to detect and group them. Is it possible to do this with Mahout?
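
To make the grouping question concrete, here is a rough sketch of one common approach: represent each story as a TF-IDF vector and put stories whose cosine similarity exceeds a threshold into the same group. This is plain Python, not Mahout, and it uses a simple greedy single-pass grouping rather than k-means; the function names, the 0.2 threshold, and the sample headlines are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms that occur in fewer documents get a larger weight.
        vectors.append({t: c * (math.log(n / doc_freq[t]) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(docs, threshold=0.2):
    """Greedy grouping: a document joins the first group whose seed
    (first member) is similar enough, otherwise it starts a new group."""
    vectors = tfidf_vectors(docs)
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical headlines, for illustration only.
docs = [
    "Protests continue in Cairo as Egypt unrest grows",
    "Egypt unrest: Cairo protests enter second week",
    "New smartphone released with faster processor",
    "Faster processor powers the new smartphone release",
]
print(group_similar(docs))
```

The threshold controls how strict the grouping is and would need tuning on real stories; a proper clustering algorithm also avoids the dependence on document order that this greedy pass has.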

Joyce
On Thursday 3 February 2011 at 7:07 PM, vineet yadav wrote: 
> Hi Joyce,
> Mahout uses clustering algorithms to extract top terms or topics from
> document sets. It basically offers three types of algorithm for keyword
> extraction:
> 1) Collocation extraction:
> https://cwiki.apache.org/confluence/display/MAHOUT/Collocations
> 2) Clustering algorithms: it supports clustering algorithms such as k-means,
> fuzzy k-means, canopy, etc.
> 3) Latent Dirichlet Allocation:
> https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation
> Mahout uses simple unsupervised (clustering) algorithms for keyword
> extraction, whereas I think OpenCalais uses supervised and deep semantic
> approaches. I think you are looking for a supervised (classification)
> algorithm for keyphrase extraction. I suggest looking at KEA
> (http://www.nzdl.org/Kea/download.html) and maui-indexer
> (http://code.google.com/p/maui-indexer/).
> Thanks
> Vineet Yadav

Re: Mahout for Keyword Extraction

Posted by vineet yadav <vi...@gmail.com>.

Hi Joyce,
Mahout uses clustering algorithms to extract top terms or topics from
document sets. It basically offers three types of algorithm for keyword
extraction:
1) Collocation extraction:
https://cwiki.apache.org/confluence/display/MAHOUT/Collocations
2) Clustering algorithms: it supports clustering algorithms such as k-means,
fuzzy k-means, canopy, etc.
3) Latent Dirichlet Allocation:
https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation
Mahout uses simple unsupervised (clustering) algorithms for keyword
extraction, whereas I think OpenCalais uses supervised and deep semantic
approaches. I think you are looking for a supervised (classification)
algorithm for keyphrase extraction. I suggest looking at KEA
(http://www.nzdl.org/Kea/download.html) and maui-indexer
(http://code.google.com/p/maui-indexer/).
Thanks
Vineet Yadav