Posted to user@spark.apache.org by Philip Ogren <ph...@oracle.com> on 2013/10/22 19:50:10 UTC

unable to serialize analytics pipeline

I have a text analytics pipeline that performs a sequence of steps (e.g. 
tokenization, part-of-speech tagging, etc.) on a line of text.  I have 
wrapped the whole pipeline up into a simple interface that allows me to 
call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass 
it a string, and get back some objects.  Now, I would like to do the 
same thing for items in a Spark RDD via a map transformation.  
Unfortunately, my pipeline is not serializable and so I get a 
NotSerializableException when I try this.  I played around with Kryo 
just now to see if that could help and I ended up with a "missing no-arg 
constructor" exception on a class I have no control over.  It seems the 
Spark framework expects that I should be able to serialize my pipeline 
when I can't (or at least don't think I can at first glance.)
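
For reference, the failing pattern looks roughly like this (TextPipeline 
is a stand-in for my actual non-serializable pipeline class):

    import org.apache.spark.SparkContext

    // Sketch only; TextPipeline is a placeholder for my real class.
    val sc = new SparkContext("local", "pipeline-test")
    val lines = sc.textFile("documents.txt")
    val pipeline = new TextPipeline()    // lives on the driver
    val annotated = lines.map(line => pipeline.process(line))
    annotated.count()  // the closure captures `pipeline`, so Spark tries
                       // to serialize it and throws NotSerializableException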

Is there a workaround for this scenario?  I am imagining a few possible 
solutions that seem a bit dubious to me, so I thought I would ask for 
direction before wandering about.  Perhaps a better understanding of 
serialization strategies might help me get the pipeline to serialize.  
Or perhaps there is a way to instantiate my pipeline on demand on the 
nodes through a factory call.

Any advice is appreciated.

Thanks,
Philip

Re: unable to serialize analytics pipeline

Posted by Philip Ogren <ph...@oracle.com>.
A simple workaround that seems to work (at least in localhost mode) is 
to mark my top-level pipeline object (inside my simple interface) as 
transient and add an initialize method.  In the method that calls the 
pipeline and returns the results, I simply call the initialize method if 
needed (i.e., if the pipeline object is null).  This seems reasonable to 
me.  I will try it on an actual cluster next....
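
In code, it looks roughly like this (TextPipeline and Annotation are 
stand-ins for my actual classes):

    class PipelineWrapper extends Serializable {
      // The pipeline itself is not serializable, so mark it transient
      // and let each JVM rebuild it on first use.
      @transient private var pipeline: TextPipeline = _

      private def initialize(): Unit = {
        pipeline = new TextPipeline()
      }

      def process(line: String): Seq[Annotation] = {
        // After deserialization on a worker, the transient field is null.
        if (pipeline == null) initialize()
        pipeline.process(line)
      }
    }

Only the wrapper gets shipped to the workers; the pipeline itself is 
rebuilt lazily wherever it is needed.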

Thanks,
Philip

Re: unable to serialize analytics pipeline

Posted by Mark Hamstra <ma...@clearstorydata.com>.
If you distribute the needed jar(s) to your Workers, you may well be able
to instantiate what you need using mapPartitions, mapPartitionsWithIndex,
mapWith, flatMapWith, etc.  Be careful, though, to tear down any
resources you allocate within each partition.
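
For example (TextPipeline is a stand-in for your own class; this assumes 
its jar is already on the workers' classpath):

    val annotated = rdd.mapPartitions { iter =>
      // Constructed on the worker, once per partition; nothing
      // non-serializable crosses the wire. Caution: iter.map is lazy,
      // so a try/finally here would tear resources down before the
      // iterator is actually consumed.
      val pipeline = new TextPipeline()
      iter.map(line => pipeline.process(line))
    }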


