Posted to user@spark.apache.org by tvas <th...@gmail.com> on 2014/11/24 16:46:04 UTC

Spark and Stanford CoreNLP

Hello,

I was wondering if anyone has gotten the Stanford CoreNLP Java library to
work with Spark.

My attempts to use the parser/annotator fail with task serialization
errors, since the class StanfordCoreNLP cannot be serialized.

I've tried the remedies of registering StanfordCoreNLP with Kryo, as well
as using chill.MeatLocker,
but these still produce serialization errors.
Marking the StanfordCoreNLP object as transient leads to a
NullPointerException instead.
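
For reference, here is roughly what the MeatLocker attempt looked like (a
sketch; the annotator list and the corpus RDD are illustrative):

import java.util.Properties
import com.twitter.chill.MeatLocker
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos")

// MeatLocker boxes the non-serializable pipeline so the closure can capture it.
val boxed = MeatLocker(new StanfordCoreNLP(props))

val processed = corpus.map { s =>
  val a = new Annotation(s)
  boxed.get.annotate(a) // still hit serialization errors with this approach
  a
}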

Has anybody managed to get this to work?

Regards,
Theodore





Re: Spark and Stanford CoreNLP

Posted by mathewvinoj <vi...@hotmail.com>.
Evan,

Could you please look into this post? The link is below. Any thoughts or
suggestions are really appreciated.

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-partition-issue-with-Stanford-NLP-td23048.html





Re: Spark and Stanford CoreNLP

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Chris,

Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive Spark, that might look like this:

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos") // pos requires tokenize and ssplit
val proc = new StanfordCoreNLP(props)
val data = sc.textFile("hdfs://some/distributed/corpus")

def processData(s: String): Annotation = {
  val a = new Annotation(s)
  proc.annotate(a)
  a // return the annotated document
}

val processedData = data.map(processData) // Note that this is executed lazily.

Under the covers, Spark takes the closure (processData), serializes it and
all objects/methods that it references (including proc), and ships the
serialized closure off to workers so that they can run it on their local
partitions of the corpus. The issue at hand is that since the
StanfordCoreNLP object isn't serializable, *this will fail at runtime.* Hence
the solutions suggested in this thread, which all come down to initializing
the processor on the worker side (preferably once), as in the sketch below.
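
A minimal sketch of that fix, building the pipeline inside mapPartitions so
it is constructed once per partition on the worker and never captured in
the closure (annotator list is illustrative):

val processedData = data.mapPartitions { it =>
  // Built on the worker: nothing non-serializable crosses the wire.
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos")
  val proc = new StanfordCoreNLP(props)
  it.map { s =>
    val a = new Annotation(s)
    proc.annotate(a)
    a
  }
}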

Your intuition about not wanting to serialize huge objects is sound. This
issue is not unique to CoreNLP - any Java library with non-serializable
objects will run into it.

HTH,
Evan



Re: Spark and Stanford CoreNLP

Posted by Christopher Manning <ma...@stanford.edu>.
I’m not (yet!) an active Spark user, but saw this thread on twitter … and am involved with Stanford CoreNLP.

Could someone explain how things would need to be structured to work better with Spark — since that would be a useful goal.

That is, while Stanford CoreNLP is not quite uniform (having been developed by various people for over a decade), the general approach has always been that models should be serializable but that processors should not be. This makes sense to me intuitively. It doesn't really make sense to serialize a processor, which often has large mutable data structures used for processing.

But does that not work well with Spark? Do processors need to be serializable, and then one needs to go through and make all the elements of the processor transient?

Or what?

Thanks!

Chris




Re: Spark and Stanford CoreNLP

Posted by Evan Sparks <ev...@gmail.com>.
If you only mark it as transient, then the object won't be serialized, and on the worker the field will be null. When the worker goes to use it, you get an NPE. 

Marking it lazy defers initialization to first use. If that use happens to be after serialization time (e.g. on the worker), then the worker will first check to see if it's initialized, and then initialize it if not. 

I think if you *do* reference the lazy val before serializing you will likely get an NPE. 
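
As a sketch of the pattern (class and field names are illustrative):

import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class NlpJob extends Serializable {
  // Marked transient, so it is not shipped with the task; being lazy,
  // each worker re-initializes it on first access after deserialization.
  @transient lazy val proc: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos")
    new StanfordCoreNLP(props)
  }
}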




Re: Spark and Stanford CoreNLP

Posted by Theodore Vasiloudis <th...@gmail.com>.
Great, Ian's approach seems to work fine.

Can anyone provide an explanation as to why this works, but marking the
CoreNLP object itself as transient does not?





Re: Spark and Stanford CoreNLP

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Neat hack! This is cute and actually seems to work. The fact that it works
is a little surprising and somewhat unintuitive.

On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell <ia...@ianoconnell.com> wrote:

>
> object MyCoreNLP {
>   @transient lazy val coreNLP = new StanfordCoreNLP(props) // props: your Properties config
> }
>
> and then refer to it from your map/reduce/mapPartitions and it should be
> fine (presuming it's thread safe); it will only be initialized once per
> classloader per JVM

Re: Spark and Stanford CoreNLP

Posted by "Evan R. Sparks" <ev...@gmail.com>.
This is probably not the right venue for general questions on CoreNLP - the
project website (http://nlp.stanford.edu/software/corenlp.shtml) provides
documentation and links to mailing lists and Stack Overflow topics.


Re: Spark and Stanford CoreNLP

Posted by Madabhattula Rajesh Kumar <mr...@gmail.com>.
Hello,

I'm new to Stanford CoreNLP. Could anyone share good training material and
examples (Java or Scala) on NLP?

Regards,
Rajesh


Re: Spark and Stanford CoreNLP

Posted by Ian O'Connell <ia...@ianoconnell.com>.
object MyCoreNLP {
  @transient lazy val coreNLP = new StanfordCoreNLP(props) // props: your Properties config
}

and then refer to it from your map/reduce/mapPartitions and it should be
fine (presuming it's thread safe); it will only be initialized once per
classloader per JVM
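
For example (a sketch, assuming data is an RDD[String] and the CoreNLP
imports shown earlier in the thread):

val annotated = data.map { s =>
  val a = new Annotation(s)
  MyCoreNLP.coreNLP.annotate(a) // initialized lazily on the worker, on first access
  a
}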


Re: Spark and Stanford CoreNLP

Posted by Evan Sparks <ev...@gmail.com>.
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization time, it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example.

As an aside, if you're using it from Scala, have a look at sistanlp, which provides a nicer, Scala-friendly interface to CoreNLP.

