Posted to user@mahout.apache.org by Sebastian Schelter <ss...@googlemail.com> on 2013/04/03 17:28:38 UTC

Re: Integrating Mahout with existing nlp libraries

Thinking out loud here: it would be great to have a DocumentSimilarityJob
that is supplied a collection of documents, applies the necessary
preprocessing (tokenization, vectorization, etc.), and computes document
similarities.

Could be a nice starter task to add something like this.
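
Most of the pieces already exist as separate Mahout jobs, so a first cut
could simply chain them. A rough sketch of such a driver follows (the
class name DocumentSimilarityDriver and all paths and parameter values
are illustrative, and exact option names can differ between versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;
import org.apache.mahout.text.SequenceFilesFromDirectory;
import org.apache.mahout.utils.vectors.RowIdJob;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class DocumentSimilarityDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 1) raw text files -> SequenceFile of (docId, content)
    ToolRunner.run(conf, new SequenceFilesFromDirectory(), new String[] {
        "--input", "/docs/raw", "--output", "/docs/seq"});

    // 2) tokenize and vectorize with tf-idf weighting
    ToolRunner.run(conf, new SparseVectorsFromSequenceFiles(), new String[] {
        "--input", "/docs/seq", "--output", "/docs/vectors",
        "--weight", "tfidf", "--namedVector"});

    // 3) replace the Text keys with the int row ids RowSimilarityJob expects
    //    (also writes a docIndex mapping the ids back to document names)
    ToolRunner.run(conf, new RowIdJob(), new String[] {
        "--input", "/docs/vectors/tfidf-vectors", "--output", "/docs/matrix"});

    // 4) pairwise cosine similarities between the document vectors
    //    (older versions also want --numberOfColumns = dictionary size)
    ToolRunner.run(conf, new RowSimilarityJob(), new String[] {
        "--input", "/docs/matrix/matrix", "--output", "/docs/similarities",
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "100"});
  }
}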

On 03.04.2013 17:09, Suneel Marthi wrote:
> Akshay,
> 
> If you are trying to determine document similarity using MapReduce, Mahout's RowSimilarityJob may be useful here.
> 
> Have a look at the following thread:
> 
> http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
> 
> 
> I tried this on a corpus of 2 million web sites and had good results.
> 
> Let us know if this works for you.
> 
> 
> 
> ________________________________
>  From: akshay bhatt <ak...@gmail.com>
> To: user@mahout.apache.org 
> Sent: Wednesday, April 3, 2013 5:36 AM
> Subject: Integrating Mahout with existing nlp libraries 
>  
> I tried searching for this here and there, but could not find any good
> solution, so I thought of asking the NLP experts. I am developing a
> text-similarity application that needs to match thousands and thousands of
> documents (of around 1000 words each) against each other. For the NLP part,
> my best bet is NLTK (given its capabilities and the algorithm-friendliness
> of Python), but now that part-of-speech tagging alone is taking so much
> time, I believe NLTK may not be the best fit. Java or C won't hurt me, so
> any solution will work. Please note that I have already started migrating
> from MySQL to HBase in order to work more freely with such a large amount
> of data. But the question remains: how do I run the algorithms? Mahout may
> be a choice, but it is aimed at machine learning rather than dedicated NLP
> (though it may be good for speech recognition). What other options are
> available? In short, I need high-performance NLP (a step down from
> high-performance machine learning). (I am inclined a bit towards Mahout,
> with future usage in mind.)
> 
> (Already asked at
> http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
> 
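
PS: for anyone trying Suneel's RowSimilarityJob suggestion, its output is
a SequenceFile of (IntWritable row id, VectorWritable) pairs, one row per
document, whose non-zero entries hold the ids and scores of the most
similar documents. A small sketch for inspecting it (the SimilarityDump
class name and path are made up; the API is as of Mahout 0.7):

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimilarityDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(
        fs, new Path("/docs/similarities/part-r-00000"), conf);
    IntWritable row = new IntWritable();
    VectorWritable similarities = new VectorWritable();
    while (reader.next(row, similarities)) {
      // each non-zero element pairs another document's row id with a score;
      // the docIndex written by the rowid step maps ids back to names
      Iterator<Vector.Element> it = similarities.get().iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        System.out.printf("doc %d ~ doc %d : %.3f%n",
            row.get(), e.index(), e.get());
      }
    }
    reader.close();
  }
}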


Re: Integrating Mahout with existing nlp libraries

Posted by Suneel Marthi <su...@yahoo.com>.
+1

I would like to take the lead on making this happen.

Re: Integrating Mahout with existing nlp libraries

Posted by Ted Dunning <te...@gmail.com>.
This sounds like the best suggestion so far.

On Apr 3, 2013, at 8:45 AM, Julien Nioche wrote:


Re: Integrating Mahout with existing nlp libraries

Posted by Julien Nioche <li...@gmail.com>.
This is typically what Behemoth can be used for:
https://github.com/DigitalPebble/behemoth. It has a Mahout module to
generate vectors in the same format as SparseVectorsFromSequenceFiles.
Assuming that the document-similarity job itself can run on the same input
as the clustering, you'd be able to use it in combination with the other
Behemoth modules, e.g. import the documents, parse them with Tika, tokenize,
do some NLP with GATE or UIMA, find the similarities with Mahout, send to
SOLR, etc.
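
Roughly, the chain would look like this (driver class names from the
Behemoth modules; the arguments here are abbreviated and untested, so
treat this as a sketch and double-check against the wiki):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import com.digitalpebble.behemoth.gate.GATEDriver;
import com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth;
import com.digitalpebble.behemoth.tika.TikaDriver;
import com.digitalpebble.behemoth.util.CorpusGenerator;

public class BehemothPipeline {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // import raw documents into Behemoth's SequenceFile-based corpus format
    ToolRunner.run(conf, new CorpusGenerator(), new String[] {
        "-i", "/docs/raw", "-o", "/behemoth/corpus"});

    // extract text and metadata with Tika
    ToolRunner.run(conf, new TikaDriver(), new String[] {
        "-i", "/behemoth/corpus", "-o", "/behemoth/tika"});

    // annotate with a zipped GATE application (tokens, POS tags, entities)
    ToolRunner.run(conf, new GATEDriver(), new String[] {
        "/behemoth/tika", "/behemoth/gate", "/apps/ANNIE.zip"});

    // write Mahout vectors in the SparseVectorsFromSequenceFiles format,
    // ready for clustering or a document-similarity job
    ToolRunner.run(conf, new SparseVectorsFromBehemoth(), new String[] {
        "-i", "/behemoth/gate", "-o", "/behemoth/vectors"});
  }
}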

Julien
-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble