Posted to user@spark.apache.org by Philip Ogren <ph...@oracle.com> on 2013/10/22 19:50:10 UTC
unable to serialize analytics pipeline
I have a text analytics pipeline that performs a sequence of steps (e.g.
tokenization, part-of-speech tagging, etc.) on a line of text. I have
wrapped the whole pipeline up into a simple interface that allows me to
call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass
it a string, and get back some objects. Now, I would like to do the
same thing for items in a Spark RDD via a map transformation.
Unfortunately, my pipeline is not serializable and so I get a
NotSerializableException when I try this. I played around with Kryo
just now to see if that could help and I ended up with a "missing no-arg
constructor" exception on a class I have no control over. It seems the
Spark framework expects that I should be able to serialize my pipeline
when I can't (or at least don't think I can at first glance).
Is there a workaround for this scenario? I am imagining a few possible
solutions that seem a bit dubious to me, so I thought I would ask for
direction before wandering about. Perhaps a better understanding of
serialization strategies might help me get the pipeline to serialize.
Or perhaps there is a way to instantiate my pipeline on demand on the
nodes through a factory call.
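Roughly, here is the shape of what I'm trying (class and method names
simplified for illustration; my real pipeline wraps third-party classes):

```scala
import org.apache.spark.rdd.RDD

// Stand-in for my actual non-serializable pipeline.
class Pipeline {
  def annotate(line: String): Seq[String] = ???  // tokens, POS tags, etc.
}

def annotateAll(lines: RDD[String]): RDD[Seq[String]] = {
  val pipeline = new Pipeline()  // lives only on the driver
  // The closure captures `pipeline`, so Spark tries to serialize it
  // to ship the task to the workers -> NotSerializableException.
  lines.map(line => pipeline.annotate(line))
}
```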
Any advice is appreciated.
Thanks,
Philip
Re: unable to serialize analytics pipeline
Posted by Philip Ogren <ph...@oracle.com>.
A simple workaround that seems to work (at least in local mode) is
to mark my top-level pipeline object (inside my simple interface) as
transient and add an initialize method. In the method that calls the
pipeline and returns the results, I simply call the initialize method if
needed (i.e. if the pipeline object is null). This seems reasonable to
me. I will try it on an actual cluster next....
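Sketched out, the workaround looks like this (Pipeline again standing in
for my actual classes):

```scala
class PipelineWrapper extends Serializable {
  // Marked transient so Spark never tries to serialize it; after
  // deserialization on a worker the field comes back as null.
  @transient private var pipeline: Pipeline = _

  private def initialize(): Unit = {
    pipeline = new Pipeline()  // heavy, non-serializable setup
  }

  def process(line: String): Seq[String] = {
    if (pipeline == null) initialize()  // lazily rebuild in each JVM
    pipeline.annotate(line)
  }
}
```

A `@transient lazy val` would give the same per-JVM re-initialization
without the explicit null check.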
Thanks,
Philip
Re: unable to serialize analytics pipeline
Posted by Mark Hamstra <ma...@clearstorydata.com>.
If you distribute the needed jar(s) to your Workers, you may well be able
to instantiate what you need using mapPartitions, mapPartitionsWithIndex,
mapWith, flatMapWith, etc. Be careful, though, to tear down any
resources you allocate within each partition.
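Something along these lines (Pipeline standing in for the actual class,
and assuming it exposes a close() method for teardown):

```scala
val annotated = rdd.mapPartitions { lines =>
  // Constructed on the worker inside the task, so it is never serialized.
  val pipeline = new Pipeline()
  try {
    // Force evaluation before teardown; the returned iterator would
    // otherwise be consumed after the finally block has already run.
    lines.map(pipeline.annotate).toVector.iterator
  } finally {
    pipeline.close()  // release per-partition resources
  }
}
```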