Posted to user@spark.apache.org by Michal Klos <mi...@gmail.com> on 2015/03/11 14:51:02 UTC

Define exception handling on lazy elements?

Hi Spark Community,

We would like to define exception handling behavior on RDD instantiation /
build. Since the RDD is lazily evaluated, it seems like we are forced to
put all exception handling in the first action call?

This is an example of something that would be nice:

def myRDD = Try {
  sc.textFile(...)
} match {
  case Success(rdd) => rdd
  case Failure(e) => // handle ...
}

myRDD.reduceByKey(...) // don't need to worry about that exception here

The reason is that we want to avoid copy-pasting exception handling
boilerplate on every first action. We would love to define this once
somewhere alongside the RDD build code and just re-use it.

Is there a best practice for this? Are we missing something here?

thanks,
Michal

Re: Define exception handling on lazy elements?

Posted by Michal Klos <mi...@gmail.com>.
Well, I'm thinking that this RDD would fail to build in a specific way,
different from the subsequent code (e.g. S3 access denied, or a timeout
connecting to a database).

So, for example: define the RDD failure handling on the RDD, and define the
action failure handling on the action? Does this make sense? Otherwise, on
that first action, we have to handle exceptions for all of the lazy
elements that preceded it, and that could be a lot of stuff.

If the RDD failure handling code is defined with the RDD, it just seems
cleaner because it's right next to its element. Not to mention, we believe
it would be easier to import it into multiple Spark jobs without a lot of
copy pasta.

m

On Wed, Mar 11, 2015 at 10:45 AM, Sean Owen <so...@cloudera.com> wrote:

> Hm, but you already only have to define it in one place, rather than
> on each transformation. I thought you wanted exception handling at
> each transformation?
>
> Or do you want it once for all actions? you can enclose all actions in
> a try-catch block, I suppose, to write exception handling code once.
> You can easily write a Scala construct that takes a function and logs
> exceptions it throws, and the function you pass can invoke an RDD
> action. So you can refactor that way too.
>
> On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos <mi...@gmail.com>
> wrote:
> > Is there a way to have the exception handling go lazily along with the
> > definition?
> >
> > e.g... we define it on the RDD but then our exception handling code gets
> > triggered on that first action... without us having to define it on the
> > first action? (e.g. that RDD code is boilerplate and we want to just
> have it
> > in many many projects)
> >
> > m
> >
> > On Wed, Mar 11, 2015 at 10:08 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Handling exceptions this way means handling errors on the driver side,
> >> which may or may not be what you want. You can also write functions
> >> with exception handling inside, which could make more sense in some
> >> cases (like, to ignore bad records or count them or something).
> >>
> >> If you want to handle errors at every step on the driver side, you
> >> have to force RDDs to materialize to see if they "work". You can do
> >> that with .count() or .take(1).length > 0. But to avoid recomputing
> >> the RDD then, it needs to be cached. So there is a big non-trivial
> >> overhead to approaching it this way.
> >>
> >> If you go this way, consider materializing only a few key RDDs in your
> >> flow, not every one.
> >>
> >> The most natural thing is indeed to handle exceptions where the action
> >> occurs.
> >>
> >>
> >> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos <mi...@gmail.com>
> >> wrote:
> >> > Hi Spark Community,
> >> >
> >> > We would like to define exception handling behavior on RDD
> instantiation
> >> > /
> >> > build. Since the RDD is lazily evaluated, it seems like we are forced
> to
> >> > put
> >> > all exception handling in the first action call?
> >> >
> >> > This is an example of something that would be nice:
> >> >
> >> > def myRDD = {
> >> > Try {
> >> > val rdd = sc.textFile(...)
> >> > } match {
> >> > Failure(e) => Handle ...
> >> > }
> >> > }
> >> >
> >> > myRDD.reduceByKey(...) //don't need to worry about that exception here
> >> >
> >> > The reason being that we want to try to avoid having to copy paste
> >> > exception
> >> > handling boilerplate on every first action. We would love to define
> this
> >> > once somewhere for the RDD build code and just re-use.
> >> >
> >> > Is there a best practice for this? Are we missing something here?
> >> >
> >> > thanks,
> >> > Michal
> >
> >
>

Re: Define exception handling on lazy elements?

Posted by Sean Owen <so...@cloudera.com>.
Hm, but you already only have to define it in one place, rather than
on each transformation. I thought you wanted exception handling at
each transformation?

Or do you want it once for all actions? You can enclose all actions in
a try-catch block, I suppose, to write exception handling code once.
You can easily write a Scala construct that takes a function and logs
exceptions it throws, and the function you pass can invoke an RDD
action. So you can refactor that way too.
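The construct described above, a generic wrapper that takes a function, logs any exception it throws, and rethrows, might be sketched in plain Scala like this (the name `withLogging` is illustrative, not a Spark API):

```scala
import scala.util.{Try, Success, Failure}

// Illustrative sketch: run any block (e.g. an RDD action),
// log a failure once, and rethrow it.
object ErrorHandling {
  def withLogging[T](label: String)(body: => T): T =
    Try(body) match {
      case Success(result) => result
      case Failure(e) =>
        // Swap println for a real logging framework as appropriate.
        println(s"'$label' failed: ${e.getMessage}")
        throw e
    }
}
```

The action call site then shrinks to something like `ErrorHandling.withLogging("counts") { rdd.reduceByKey(_ + _).collect() }`, so the handling code is written once and reused.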

On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos <mi...@gmail.com> wrote:
> Is there a way to have the exception handling go lazily along with the
> definition?
>
> e.g... we define it on the RDD but then our exception handling code gets
> triggered on that first action... without us having to define it on the
> first action? (e.g. that RDD code is boilerplate and we want to just have it
> in many many projects)
>
> m
>
> On Wed, Mar 11, 2015 at 10:08 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Handling exceptions this way means handling errors on the driver side,
>> which may or may not be what you want. You can also write functions
>> with exception handling inside, which could make more sense in some
>> cases (like, to ignore bad records or count them or something).
>>
>> If you want to handle errors at every step on the driver side, you
>> have to force RDDs to materialize to see if they "work". You can do
>> that with .count() or .take(1).length > 0. But to avoid recomputing
>> the RDD then, it needs to be cached. So there is a big non-trivial
>> overhead to approaching it this way.
>>
>> If you go this way, consider materializing only a few key RDDs in your
>> flow, not every one.
>>
>> The most natural thing is indeed to handle exceptions where the action
>> occurs.
>>
>>
>> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos <mi...@gmail.com>
>> wrote:
>> > Hi Spark Community,
>> >
>> > We would like to define exception handling behavior on RDD instantiation
>> > /
>> > build. Since the RDD is lazily evaluated, it seems like we are forced to
>> > put
>> > all exception handling in the first action call?
>> >
>> > This is an example of something that would be nice:
>> >
>> > def myRDD = {
>> > Try {
>> > val rdd = sc.textFile(...)
>> > } match {
>> > Failure(e) => Handle ...
>> > }
>> > }
>> >
>> > myRDD.reduceByKey(...) //don't need to worry about that exception here
>> >
>> > The reason being that we want to try to avoid having to copy paste
>> > exception
>> > handling boilerplate on every first action. We would love to define this
>> > once somewhere for the RDD build code and just re-use.
>> >
>> > Is there a best practice for this? Are we missing something here?
>> >
>> > thanks,
>> > Michal
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Define exception handling on lazy elements?

Posted by Michal Klos <mi...@gmail.com>.
Is there a way to have the exception handling go lazily along with the
definition?

E.g., we define it on the RDD, but then our exception handling code gets
triggered on that first action, without us having to define it on the
first action? (That RDD code is boilerplate and we want to just have it
in many, many projects.)

m
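One way to approximate "handling defined with the RDD", subject to the materialization cost Sean describes below, is a shared factory that pairs RDD construction with its failure handling, imported by every job. This is a sketch only; `buildInput` is an illustrative name, not a Spark API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Pairs RDD construction with its failure handling in one place,
// so each job reuses it instead of copy-pasting the try/catch.
def buildInput(sc: SparkContext, path: String): RDD[String] = {
  val rdd = sc.textFile(path)
  rdd.cache()                 // avoid recomputing after the probe below
  try rdd.take(1)             // force evaluation so build errors surface now
  catch {
    case e: Exception =>
      throw new RuntimeException(s"Could not build RDD from $path", e)
  }
  rdd
}
```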

On Wed, Mar 11, 2015 at 10:08 AM, Sean Owen <so...@cloudera.com> wrote:

> Handling exceptions this way means handling errors on the driver side,
> which may or may not be what you want. You can also write functions
> with exception handling inside, which could make more sense in some
> cases (like, to ignore bad records or count them or something).
>
> If you want to handle errors at every step on the driver side, you
> have to force RDDs to materialize to see if they "work". You can do
> that with .count() or .take(1).length > 0. But to avoid recomputing
> the RDD then, it needs to be cached. So there is a big non-trivial
> overhead to approaching it this way.
>
> If you go this way, consider materializing only a few key RDDs in your
> flow, not every one.
>
> The most natural thing is indeed to handle exceptions where the action
> occurs.
>
>
> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos <mi...@gmail.com>
> wrote:
> > Hi Spark Community,
> >
> > We would like to define exception handling behavior on RDD instantiation
> /
> > build. Since the RDD is lazily evaluated, it seems like we are forced to
> put
> > all exception handling in the first action call?
> >
> > This is an example of something that would be nice:
> >
> > def myRDD = {
> > Try {
> > val rdd = sc.textFile(...)
> > } match {
> > Failure(e) => Handle ...
> > }
> > }
> >
> > myRDD.reduceByKey(...) //don't need to worry about that exception here
> >
> > The reason being that we want to try to avoid having to copy paste
> exception
> > handling boilerplate on every first action. We would love to define this
> > once somewhere for the RDD build code and just re-use.
> >
> > Is there a best practice for this? Are we missing something here?
> >
> > thanks,
> > Michal
>

Re: Define exception handling on lazy elements?

Posted by Sean Owen <so...@cloudera.com>.
Handling exceptions this way means handling errors on the driver side,
which may or may not be what you want. You can also write functions
with exception handling inside, which could make more sense in some
cases (like, to ignore bad records or count them or something).

If you want to handle errors at every step on the driver side, you
have to force RDDs to materialize to see if they "work". You can do
that with .count() or .take(1).length > 0. But to avoid recomputing
the RDD then, it needs to be cached. So there is a big non-trivial
overhead to approaching it this way.

If you go this way, consider materializing only a few key RDDs in your
flow, not every one.

The most natural thing is indeed to handle exceptions where the action occurs.
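A minimal driver-side sketch of the materialize-and-cache check described above, assuming an existing `SparkContext` named `sc`; the S3 path and error message are illustrative:

```scala
val rdd = sc.textFile("s3n://bucket/path") // illustrative path
rdd.cache()        // avoid recomputing the RDD after the probe below
try {
  rdd.take(1)      // forces evaluation, so build errors surface here
} catch {
  case e: Exception =>
    // RDD-build-specific handling lives here, next to the definition.
    throw new RuntimeException("Failed to build input RDD", e)
}
```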


On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos <mi...@gmail.com> wrote:
> Hi Spark Community,
>
> We would like to define exception handling behavior on RDD instantiation /
> build. Since the RDD is lazily evaluated, it seems like we are forced to put
> all exception handling in the first action call?
>
> This is an example of something that would be nice:
>
> def myRDD = {
> Try {
> val rdd = sc.textFile(...)
> } match {
> Failure(e) => Handle ...
> }
> }
>
> myRDD.reduceByKey(...) //don't need to worry about that exception here
>
> The reason being that we want to try to avoid having to copy paste exception
> handling boilerplate on every first action. We would love to define this
> once somewhere for the RDD build code and just re-use.
>
> Is there a best practice for this? Are we missing something here?
>
> thanks,
> Michal
