Posted to user@spark.apache.org by Matt Narrell <ma...@gmail.com> on 2014/09/02 18:45:28 UTC

Serialized 3rd party libs

Hello,

I'm using Spark Streaming to aggregate data from a Kafka topic in sliding windows.  Usually we want to persist this aggregated data to a MongoDB cluster, or republish it to a different Kafka topic.  When I include these 3rd party drivers, I usually get a NotSerializableException due to the parallelization of the job.  To sidestep this, I've used static class variables, which seem to help, i.e., I can run my jobs.
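
Roughly, the static-variable approach I'm describing looks like the sketch below (class, host, and stream names are simplified, and the aggregation itself is elided; windowedCounts stands for whatever DStream of (key, count) pairs the window produces):

import com.mongodb.{BasicDBObject, MongoClient}

// Holder object: the client lives in a static field, so it is created
// lazily once per executor JVM instead of being serialized with the closure.
object MongoHolder {
  lazy val client = new MongoClient("mongo-host", 27017)
  lazy val collection = client.getDB("metrics").getCollection("windows")
}

// The streaming closure only references the holder, so the non-serializable
// driver is never captured.
windowedCounts.foreachRDD { rdd =>
  rdd.foreach { case (key, count) =>
    MongoHolder.collection.insert(
      new BasicDBObject("key", key).append("count", count))
  }
}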

Is this the proper way to provide 3rd party libs to Spark jobs?  
Does having these drivers declared as static prohibit me from parallelizing my job?  
Is this even a proper way to design jobs?  

An alternative (I assume) would be to aggregate my data into HDFS and have another process (perhaps non-Spark?) consume it and republish/persist?

Thanks,
Matt
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Serialized 3rd party libs

Posted by Matt Narrell <ma...@gmail.com>.
Sean,

Thanks for pointing this out.  I'd have to experiment with the mapPartitions method, but you're right, this seems to address the issue directly.  I'm also connecting to ZooKeeper to retrieve SparkConf parameters.  I run into the same issue with my ZooKeeper driver; however, this happens before any Spark context is created, and I assume before the job is partitioned.
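
For what it's worth, the ZooKeeper part happens entirely on the driver, something along these lines (Curator is shown only as an example client, and the znode path and app name are made up):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Runs on the driver before any Spark context exists, so the ZooKeeper
// client is never part of a serialized task closure.
val zk = CuratorFrameworkFactory.newClient(
  "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()
val batchSeconds = new String(
  zk.getData.forPath("/config/streaming/batch.seconds"), "UTF-8").toInt
zk.close()

val conf = new SparkConf().setAppName("kafka-window-aggregator")
val ssc = new StreamingContext(conf, Seconds(batchSeconds))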

mn

On Sep 2, 2014, at 11:00 AM, Sean Owen <so...@cloudera.com> wrote:

> The problem is not using the drivers per se, but writing your
> functions in a way that ends up trying to serialize them.  You can't
> serialize them, and indeed don't want to.  Instead, your code needs to
> reopen connections and so forth when the function is instantiated on
> the remote worker.
> 
> Static variables are a crude way to do that, probably too crude in general.
> No, there's certainly no reason you can't access these things in Spark.
> 
> Since it answers exactly this point, I don't mind promoting today's blog post:
> http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
> ... which repeats Tobias's good formulation of how to deal with things
> like drivers in an efficient way that doesn't trip over serialization.
> 
> On Tue, Sep 2, 2014 at 5:45 PM, Matt Narrell <ma...@gmail.com> wrote:
>> Hello,
>> 
>> I'm using Spark Streaming to aggregate data from a Kafka topic in sliding windows.  Usually we want to persist this aggregated data to a MongoDB cluster, or republish it to a different Kafka topic.  When I include these 3rd party drivers, I usually get a NotSerializableException due to the parallelization of the job.  To sidestep this, I've used static class variables, which seem to help, i.e., I can run my jobs.
>> 
>> Is this the proper way to provide 3rd party libs to Spark jobs?
>> Does having these drivers declared as static prohibit me from parallelizing my job?
>> Is this even a proper way to design jobs?
>> 
>> An alternative (I assume) would be to aggregate my data into HDFS and have another process (perhaps non-Spark?) consume it and republish/persist?
>> 
>> Thanks,
>> Matt


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Serialized 3rd party libs

Posted by Sean Owen <so...@cloudera.com>.
The problem is not using the drivers per se, but writing your
functions in a way that ends up trying to serialize them.  You can't
serialize them, and indeed don't want to.  Instead, your code needs to
reopen connections and so forth when the function is instantiated on
the remote worker.

Static variables are a crude way to do that, probably too crude in general.
No, there's certainly no reason you can't access these things in Spark.

Since it answers exactly this point, I don't mind promoting today's blog post:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
... which repeats Tobias's good formulation of how to deal with things
like drivers in an efficient way that doesn't trip over serialization.
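
To make that concrete, the shape is roughly the following (MongoDB is just an example sink here; the same applies to a Kafka producer, and the stream, host, and field names are made up):

import com.mongodb.{BasicDBObject, MongoClient}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // This block runs on the executor, so the connection is opened there
    // and is never serialized as part of the closure.
    val client = new MongoClient("mongo-host", 27017)
    val coll = client.getDB("metrics").getCollection("windows")
    records.foreach { case (key, count) =>
      coll.insert(new BasicDBObject("key", key).append("count", count))
    }
    client.close()
  }
}

// One connection per partition per batch; a connection pool or a lazily
// initialized singleton on the executor amortizes this further.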

On Tue, Sep 2, 2014 at 5:45 PM, Matt Narrell <ma...@gmail.com> wrote:
> Hello,
>
> I'm using Spark Streaming to aggregate data from a Kafka topic in sliding windows.  Usually we want to persist this aggregated data to a MongoDB cluster, or republish it to a different Kafka topic.  When I include these 3rd party drivers, I usually get a NotSerializableException due to the parallelization of the job.  To sidestep this, I've used static class variables, which seem to help, i.e., I can run my jobs.
>
> Is this the proper way to provide 3rd party libs to Spark jobs?
> Does having these drivers declared as static prohibit me from parallelizing my job?
> Is this even a proper way to design jobs?
>
> An alternative (I assume) would be to aggregate my data into HDFS and have another process (perhaps non-Spark?) consume it and republish/persist?
>
> Thanks,
> Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org