Posted to user@spark.apache.org by Pranay Tonpay <pr...@impetus.co.in> on 2013/11/13 04:43:32 UTC

is this possible in Spark? ( Serialization related)

When I use Spark Streaming for real-time analytics, I run into a limitation.

Scenario -
I have a third-party class and need to use some of its APIs.
I create the object once in the driver method, pass it to the map method, and call the object's functions inside the "call" method.
This works if the class is serializable. If it is not, I am forced to create the object inside the call method itself, which is expensive because the constructor is pretty heavy. Remember that I am doing real-time analytics, so this would be invoked very frequently (and since the class is part of a third-party jar, making it Serializable is not convenient and may not be possible at all).

I understand why Spark needs serializability, but is there a way around the above limitation (keeping serialization intact)? Storm handles this by providing a "prepare" function in a bolt, where I can create the object only once. If Spark has nothing like this, I think it would be a very useful enhancement to have (if possible).

Please let me know.

Thanks,
pranay

________________________________

NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.

Re: is this possible in Spark? ( Serialization related)

Posted by Ian O'Connell <ia...@ianoconnell.com>.
This could probably also be done by doing the initialization in a Scala
object, so that it happens once on each node when the object is first loaded there.
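
As a rough illustration of the suggestion above, here is a minimal sketch in plain Scala (no Spark involved). Expensive, ExpensiveHolder and SingletonSketch are hypothetical names; Expensive stands in for the third-party class. A Scala object is initialized on first access, once per JVM, so a function that refers to the holder rather than to an instance has nothing non-serializable to capture.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SingletonSketch {
  // Counts constructor calls, only so the demo can show how often we build.
  val constructed = new AtomicInteger(0)

  // Hypothetical stand-in for the heavy, non-serializable third-party class.
  class Expensive {
    constructed.incrementAndGet()
    def foo(x: Int): Int = x * 2
  }

  // The holder is initialized on first access, once per JVM.
  object ExpensiveHolder {
    lazy val instance: Expensive = new Expensive()
  }

  // The function body mentions only the holder, never an Expensive value
  // created elsewhere, so no instance has to travel with the function.
  def process(xs: Seq[Int]): Seq[Int] =
    xs.map(x => ExpensiveHolder.instance.foo(x))

  def main(args: Array[String]): Unit = {
    println(process(Seq(1, 2, 3)))
    println(process(Seq(4, 5)))
    println(s"constructor calls: ${constructed.get()}")
  }
}
```

In a real job the body of process would sit inside the function passed to map. One caveat worth checking before relying on this: tasks running in the same JVM would share the single instance, so the third-party class needs to be safe for concurrent use.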


On Tue, Nov 12, 2013 at 10:44 PM, Jason Lenderman <js...@gmail.com> wrote:

> You probably want to look at the mapPartitions method of RDD. The usage
> might look something like:
>
> data.mapPartitions { iter =>
>   val o = new Expensive()          // constructed once per partition
>   for (x <- iter) yield o.foo(x)   // mapPartitions must return an Iterator
> }
>
> Note that each split of data is processed using a single instance of
> Expensive.
>
> Another method of RDD that you want to be aware of is
> mapPartitionsWithIndex. This method enables you to use the index of the
> split in your transformation of the Iterator.

Re: is this possible in Spark? ( Serialization related)

Posted by Eugen Cepoi <ce...@gmail.com>.
You can try broadcasting it. To avoid the not-serializable problem I am
using Kryo; you can try the same.

Eugen
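
Broadcasting still requires some serializer (Kryo here) to be able to handle the class. A different workaround, which is not what the message above describes, is to ship a small Serializable wrapper whose heavy field is marked @transient lazy val, so the instance is never written out and is rebuilt on first use wherever the wrapper is deserialized. The sketch below shows this in plain Scala with Java serialization and no Spark; Expensive, ExpensiveWrapper and roundTrip are hypothetical names introduced only for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

object TransientWrapperSketch {
  // Hypothetical stand-in for the third-party class; note that it does
  // NOT extend Serializable.
  class Expensive {
    def foo(x: Int): Int = x + 1
  }

  // Serializable holder. The wrapped instance is transient, so it is skipped
  // when the wrapper is written, and lazy, so it is rebuilt on first access
  // after the wrapper has been read back.
  class ExpensiveWrapper extends Serializable {
    @transient lazy val instance: Expensive = new Expensive()
  }

  // Serializes and deserializes a value in memory, imitating what happens
  // when an object is shipped from the driver to a worker.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val wrapper = new ExpensiveWrapper()
    val shipped = roundTrip(wrapper)      // succeeds although Expensive is not Serializable
    println(shipped.instance.foo(41))     // instance is built here, on first use
  }
}
```

With this shape the constructor runs once per deserialized copy of the wrapper rather than once per record, which is close in spirit to the "prepare" hook mentioned in the original question.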


2013/11/20 Pranay Tonpay <pr...@impetus.co.in>

> Jason, I tried this using Java code and it still didn't work.
>
> Is there any workaround for this problem?
>
> Thx
> pranay

RE: is this possible in Spark? ( Serialization related)

Posted by Pranay Tonpay <pr...@impetus.co.in>.
Jason, I tried this using Java code and it still didn't work.

Is there any workaround for this problem?

Thx
pranay


Re: is this possible in Spark? ( Serialization related)

Posted by Jason Lenderman <js...@gmail.com>.
You probably want to look at the mapPartitions method of RDD. The usage
might look something like:

data.mapPartitions { iter =>
  val o = new Expensive()          // constructed once per partition
  for (x <- iter) yield o.foo(x)   // mapPartitions must return an Iterator
}

Note that each split of data is processed using a single instance of
Expensive.

Another method of RDD that you want to be aware of is
mapPartitionsWithIndex. This method enables you to use the index of the
split in your transformation of the Iterator.
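
To make the per-partition behaviour concrete without a cluster, the sketch below imitates it with plain Scala collections. It is only a simulation: partitions are modelled as a Seq of Seqs, and Expensive, perPartition, perPartitionWithIndex and run are hypothetical names. The two functions have the shape that would be passed to the per-partition methods described above, Iterator in and Iterator out, with the partition index as an extra argument in the second case.

```scala
import java.util.concurrent.atomic.AtomicInteger

object PerPartitionSketch {
  // Counts constructor calls so the demo can show one build per partition.
  val constructed = new AtomicInteger(0)

  // Hypothetical stand-in for the expensive third-party class.
  class Expensive {
    constructed.incrementAndGet()
    def foo(x: Int): String = s"item-$x"
  }

  // Takes the elements of one partition and returns the transformed elements.
  def perPartition(iter: Iterator[Int]): Iterator[String] = {
    val o = new Expensive()       // built once for the whole partition
    iter.map(x => o.foo(x))       // applied lazily to each element
  }

  // Same idea, with the partition index available to the transformation.
  def perPartitionWithIndex(index: Int, iter: Iterator[Int]): Iterator[String] = {
    val o = new Expensive()
    iter.map(x => s"p$index:" + o.foo(x))
  }

  // Simulation only: applies perPartition to each inner Seq in turn.
  def run(partitions: Seq[Seq[Int]]): Seq[String] =
    partitions.flatMap(p => perPartition(p.iterator).toList)

  def main(args: Array[String]): Unit = {
    println(run(Seq(Seq(1, 2, 3), Seq(4, 5))))
    println(s"constructor calls: ${constructed.get()}")
  }
}
```

Because the instance is created inside the function, it never has to be serialized; only the function itself is shipped. With Spark Streaming each batch is processed partition by partition, so the constructor still runs once per partition per batch, which may or may not be cheap enough for a very heavy constructor.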

