Posted to user@spark.apache.org by shankark <sh...@gmail.com> on 2014/03/12 19:10:24 UTC

NLP with Spark

(apologies if this was sent out multiple times before)

We are about to start a large-scale text-processing research project and
are debating between two alternatives for our cluster -- Spark and Hadoop.
I've researched possibilities of using NLTK with Hadoop and see that
there's some precedent (
http://blog.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python/).
I wanted to know how easy it might be to use NLTK with pyspark, or if
scalanlp is mature enough to be used with the Scala API for Spark/mllib.

Thanks!




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NLP-with-Spark-tp2612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NLP with Spark

Posted by Andrei <fa...@gmail.com>.

In my experience, the choice of NLP tools mostly depends on the concrete task.
For example, for named entity recognition (NER) there is a nice Java library
called GATE [1]. It lets you annotate your text with special marks
(e.g. part-of-speech tags, "time", "name", etc.) and write regex-like rules
to capture even very complicated patterns. On the other hand, the Stanford NLP
Parser [2] can extract sentence structure, a feature that is not available
in any other library known to me. And in the Python world there are NLTK,
NumPy, and scikit-learn, easy integration with TreeTagger [3], and a
super cool ecosystem for statistical text analysis. Each of these tools and
their combinations has its pros and cons, so the final choice really depends
on your specific needs and personal preferences.

As for Spark (and distributed computation in general), most NLP tasks can
be performed locally on the workers (e.g. you don't need a 1 TB dataset to
find the part-of-speech tags for a particular sentence; you need only that
sentence and maybe a little context). Some tasks, however, do require the
entire dataset at once. The most popular of them, such as k-means
clustering or collaborative filtering, are already implemented in MLlib.
But it's always worth checking for the specific algorithms you may need
before making a final decision.
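
A minimal sketch of the "local on workers" case, in plain Python. The tagger
here is a toy stand-in (the hypothetical toy_pos_tag, not NLTK), and the
partitions are simulated with ordinary lists, so nothing below is taken from
the thread itself. The point is only that each sentence is tagged without
looking at the rest of the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

Tagged = List[Tuple[str, str]]


def toy_pos_tag(sentence: str) -> Tagged:
    """Hypothetical stand-in for a real tagger such as nltk.pos_tag."""
    tagged = []
    for token in sentence.split():
        if token.endswith("ing"):
            tag = "VBG"
        elif token.endswith("s"):
            tag = "NNS"
        else:
            tag = "NN"
        tagged.append((token, tag))
    return tagged


def tag_partition(sentences: Iterable[str]) -> Iterator[Tagged]:
    """Tag every sentence in one partition; no other partition is consulted."""
    for sentence in sentences:
        yield toy_pos_tag(sentence)


# Simulate two workers, each holding its own slice of the corpus.
partitions = [
    ["Spark runs jobs", "parsing text"],
    ["workers tag sentences"],
]
results = [list(tag_partition(part)) for part in partitions]

# Assumption: with PySpark, the same function could be passed to
# RDD.mapPartitions, e.g. sentences_rdd.mapPartitions(tag_partition).
```

Tasks like clustering do not fit this shape, because each step needs
statistics from every partition, which is where MLlib comes in.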

Let me know if you need advice on specific NLP or ML tasks.

[1]: https://gate.ac.uk/
[2]: http://nlp.stanford.edu/software/lex-parser.shtml
[3]: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Best Regards,
Andrei


Re: NLP with Spark

Posted by Brian O'Neill <bo...@alumni.brown.edu>.

Please let us know how you make out. We have NLP requirements on the
horizon. I've used NLTK before, but never on Spark. I'd love to hear if
that works out for you.

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive  King of Prussia, PA  19406
M: 215.588.6024  @boneill42 <http://www.twitter.com/boneill42>   
healthmarketscience.com


Re: NLP with Spark

Posted by Mayur Rustagi <ma...@gmail.com>.

Would love to know if somebody has tried this. The only possible problem I can
foresee is non-serializable libraries; otherwise there is no reason it should
not work.
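
A rough sketch of what the serialization concern looks like in Python, and one
common workaround. PySpark ships functions and the objects they capture to the
workers by pickling them, so an object that cannot be pickled cannot be sent
from the driver. HeavyTagger and can_pickle are hypothetical names made up for
this illustration; the lock stands in for whatever native resource a real NLP
library might hold.

```python
import pickle
import threading


class HeavyTagger:
    """Hypothetical NLP object that holds a non-picklable resource."""

    def __init__(self):
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def tag(self, sentence):
        with self._lock:
            return [(token, "NN") for token in sentence.split()]


def can_pickle(obj) -> bool:
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


def tag_partition(sentences):
    """Build the tagger on the worker, once per partition, instead of shipping it."""
    tagger = HeavyTagger()
    for sentence in sentences:
        yield tagger.tag(sentence)


tagger_is_picklable = can_pickle(HeavyTagger())
tagged = list(tag_partition(["NLP with Spark"]))
```

Constructing the object inside the per-partition function means only the
function has to be shipped, and the cost of building the tagger is paid once
per partition rather than once per record.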

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


