You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Hao Ren <in...@gmail.com> on 2015/08/07 11:39:26 UTC

How to distribute non-serializable object in transform task or broadcast ?

Is there any workaround to distribute non-serializable object for RDD
transformation or broadcast variable ?

Say I have an object of class C which is not serializable. Class C is in a
jar package, I have no control on it. Now I need to distribute it either by
rdd transformation or by broadcast.

I tried to subclass the class C with Serializable interface. It works for
serialization, but deserialization does not work, since there are no
parameter-less constructor for the class C and deserialization is broken
with an invalid constructor exception.

I think it's a common use case. Any help is appreciated.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France

Re: How to distribute non-serializable object in transform task or broadcast ?

Posted by Han JU <ju...@gmail.com>.
If the object is something like an utility object (say a DB connection
handler), I often use:

   @transient lazy val someObj = MyFactory.getObj(...)

So basically `@transient` tell the closure cleaner don't serialize this,
and the `lazy val` allows it to be initiated on each executor upon its
first usage (since the class is in your jar so executor should be able to
instantiate it).

2015-08-07 17:20 GMT+02:00 Philip Weaver <ph...@gmail.com>:

> If the object cannot be serialized, then I don't think broadcast will make
> it magically serializable. You can't transfer data structures between nodes
> without serializing them somehow.
>
> On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal <su...@gmail.com> wrote:
>
>> Hi Hao,
>>
>> I think sc.broadcast will allow you to broadcast non-serializable
>> objects. According to the scaladocs the Broadcast class itself is
>> Serializable and it wraps your object, allowing you to get it from the
>> Broadcast object using value().
>>
>> Not 100% sure though since I haven't tried broadcasting custom objects
>> but maybe worth trying unless you have already and failed.
>>
>> -sujit
>>
>>
>> On Fri, Aug 7, 2015 at 2:39 AM, Hao Ren <in...@gmail.com> wrote:
>>
>>> Is there any workaround to distribute non-serializable object for RDD
>>> transformation or broadcast variable ?
>>>
>>> Say I have an object of class C which is not serializable. Class C is in
>>> a jar package, I have no control on it. Now I need to distribute it either
>>> by rdd transformation or by broadcast.
>>>
>>> I tried to subclass the class C with Serializable interface. It works
>>> for serialization, but deserialization does not work, since there are no
>>> parameter-less constructor for the class C and deserialization is broken
>>> with an invalid constructor exception.
>>>
>>> I think it's a common use case. Any help is appreciated.
>>>
>>> --
>>> Hao Ren
>>>
>>> Data Engineer @ leboncoin
>>>
>>> Paris, France
>>>
>>
>>
>


-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888

Re: How to distribute non-serializable object in transform task or broadcast ?

Posted by Philip Weaver <ph...@gmail.com>.
If the object cannot be serialized, then I don't think broadcast will make
it magically serializable. You can't transfer data structures between nodes
without serializing them somehow.

On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal <su...@gmail.com> wrote:

> Hi Hao,
>
> I think sc.broadcast will allow you to broadcast non-serializable objects.
> According to the scaladocs the Broadcast class itself is Serializable and
> it wraps your object, allowing you to get it from the Broadcast object
> using value().
>
> Not 100% sure though since I haven't tried broadcasting custom objects but
> maybe worth trying unless you have already and failed.
>
> -sujit
>
>
> On Fri, Aug 7, 2015 at 2:39 AM, Hao Ren <in...@gmail.com> wrote:
>
>> Is there any workaround to distribute non-serializable object for RDD
>> transformation or broadcast variable ?
>>
>> Say I have an object of class C which is not serializable. Class C is in
>> a jar package, I have no control on it. Now I need to distribute it either
>> by rdd transformation or by broadcast.
>>
>> I tried to subclass the class C with Serializable interface. It works for
>> serialization, but deserialization does not work, since there are no
>> parameter-less constructor for the class C and deserialization is broken
>> with an invalid constructor exception.
>>
>> I think it's a common use case. Any help is appreciated.
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>

Re: How to distribute non-serializable object in transform task or broadcast ?

Posted by Sujit Pal <su...@gmail.com>.
Hi Hao,

I think sc.broadcast will allow you to broadcast non-serializable objects.
According to the scaladocs the Broadcast class itself is Serializable and
it wraps your object, allowing you to get it from the Broadcast object
using value().

Not 100% sure though since I haven't tried broadcasting custom objects but
maybe worth trying unless you have already and failed.

-sujit


On Fri, Aug 7, 2015 at 2:39 AM, Hao Ren <in...@gmail.com> wrote:

> Is there any workaround to distribute non-serializable object for RDD
> transformation or broadcast variable ?
>
> Say I have an object of class C which is not serializable. Class C is in a
> jar package, I have no control on it. Now I need to distribute it either by
> rdd transformation or by broadcast.
>
> I tried to subclass the class C with Serializable interface. It works for
> serialization, but deserialization does not work, since there are no
> parameter-less constructor for the class C and deserialization is broken
> with an invalid constructor exception.
>
> I think it's a common use case. Any help is appreciated.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>

Re: How to distribute non-serializable object in transform task or broadcast ?

Posted by Eugene Morozov <fa...@list.ru>.
Would like to add smth, inlined.

On 07 Aug 2015, at 18:51, Eugene Morozov <fa...@list.ru> wrote:

> Hao, 
> 
> I’d say there are few possible ways to achieve that:
> 1. Use KryoSerializer.
>   The flaw of KryoSerializer is that current version (2.21) has an issue with internal state and it might not work for some objects. Spark get kryo dependency as transitive through chill and it’ll not be resolved quickly. Kryo doesn’t work for me (I have such an classes I have to transfer, but do not have their codebase).
> 
> 2. Wrap it into something you have control and make that something serializable.
>   The flaw is kind of obvious - it’s really hard to write serialization for complex objects.
> 
> 3. Tricky algo: don’t do anything that might end up as reshuffle.
>   That’s the way I took. The flow is that we have CSV file as input, parse it and create objects that we cannot serialize / deserialize, thus cannot transfer over the network. Currently we’ve workarounded it so that these objects processed only in those partitions where thye’ve been born. 

That means it’s not possible in debug to call collect() on our RDD (even though spark master is local), but there is always a way to get to know that’s inside.
The algo is pretty complex, we still do reshuffle and everything, but before we finally create those objects.

> 
> Hope, this helps.
> 
> On 07 Aug 2015, at 12:39, Hao Ren <in...@gmail.com> wrote:
> 
>> Is there any workaround to distribute non-serializable object for RDD transformation or broadcast variable ?
>> 
>> Say I have an object of class C which is not serializable. Class C is in a jar package, I have no control on it. Now I need to distribute it either by rdd transformation or by broadcast. 
>> 
>> I tried to subclass the class C with Serializable interface. It works for serialization, but deserialization does not work, since there are no parameter-less constructor for the class C and deserialization is broken with an invalid constructor exception.
>> 
>> I think it's a common use case. Any help is appreciated.
>> 
>> -- 
>> Hao Ren
>> 
>> Data Engineer @ leboncoin
>> 
>> Paris, France
> 
> Eugene Morozov
> fathersson@list.ru
> 
> 
> 
> 

Eugene Morozov
fathersson@list.ru





Re: How to distribute non-serializable object in transform task or broadcast ?

Posted by Eugene Morozov <fa...@list.ru>.
Hao, 

I’d say there are few possible ways to achieve that:
1. Use KryoSerializer.
  The flaw of KryoSerializer is that current version (2.21) has an issue with internal state and it might not work for some objects. Spark get kryo dependency as transitive through chill and it’ll not be resolved quickly. Kryo doesn’t work for me (I have such an classes I have to transfer, but do not have their codebase).

2. Wrap it into something you have control and make that something serializable.
  The flaw is kind of obvious - it’s really hard to write serialization for complex objects.

3. Tricky algo: don’t do anything that might end up as reshuffle.
  That’s the way I took. The flow is that we have CSV file as input, parse it and create objects that we cannot serialize / deserialize, thus cannot transfer over the network. Currently we’ve workarounded it so that these objects processed only in those partitions where thye’ve been born. 

Hope, this helps.

On 07 Aug 2015, at 12:39, Hao Ren <in...@gmail.com> wrote:

> Is there any workaround to distribute non-serializable object for RDD transformation or broadcast variable ?
> 
> Say I have an object of class C which is not serializable. Class C is in a jar package, I have no control on it. Now I need to distribute it either by rdd transformation or by broadcast. 
> 
> I tried to subclass the class C with Serializable interface. It works for serialization, but deserialization does not work, since there are no parameter-less constructor for the class C and deserialization is broken with an invalid constructor exception.
> 
> I think it's a common use case. Any help is appreciated.
> 
> -- 
> Hao Ren
> 
> Data Engineer @ leboncoin
> 
> Paris, France

Eugene Morozov
fathersson@list.ru