Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/04/18 21:16:45 UTC

more uniform exception handling?

Josh's pull request <https://github.com/apache/spark/pull/12433> on RPC
exception handling got me thinking ...

In my experience, a few exception-related things have created a lot of
trouble for us in production debugging:

1. Some exception is thrown, but is caught by a try/catch that neither
logs nor rethrows it.
2. Some exception is thrown and is caught by a try/catch that does
rethrow, but without logging, so the original exception is masked (both
anti-patterns are sketched below).
3. Multiple exceptions are logged at different places close to each
other, but we don't know whether they are caused by the same problem or
not.


To mitigate some of the above, here's an idea ...

(1) Create a common root class (e.g. call it SparkException) for all
the exceptions used in Spark. We should make sure that every time we
catch an exception from a 3rd party library, we rethrow it as a
SparkException (a lot of places already do that). In SparkException's
constructor, log the exception and the stacktrace.

(2) SparkException has a monotonically increasing ID, and this ID appears
in the exception error message (say at the end).
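As a rough sketch (names illustrative only; the existing SparkException
would be retrofitted rather than rewritten), (1) and (2) together could
look something like:

    import java.util.concurrent.atomic.AtomicLong
    import org.slf4j.LoggerFactory

    object SparkException {
      private val log = LoggerFactory.getLogger(classOf[SparkException])
      private val ids = new AtomicLong(0L)
    }

    class SparkException(message: String, cause: Throwable = null)
        extends Exception(message, cause) {

      // (2) monotonically increasing ID, surfaced in the message.
      val id: Long = SparkException.ids.incrementAndGet()

      // (1) log at construction time, so that even if a try/catch
      // later swallows this exception, a trace is left in the logs.
      SparkException.log.error(s"SparkException id=$id: $message", this)

      override def getMessage: String = s"${super.getMessage} (id=$id)"
    }

Rethrowing a 3rd-party exception then stays a one-liner (thirdPartyCall
being whatever library call is guarded):

    try thirdPartyCall() catch {
      case e: Exception => throw new SparkException("call failed", e)
    }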


I think (1) will eliminate most of the cases in which an exception gets
swallowed. The main downside I can think of is that we might log an
exception multiple times. However, I'd argue exceptions should be rare,
and it is not that big of a deal to log them twice or three times. The
unique ID from (2) can help us correlate exceptions if they appear
multiple times.

Thoughts?

Re: more uniform exception handling?

Posted by Steve Loughran <st...@hortonworks.com>.

1. Unique IDs are a nice touch.
2. There are some exceptions that code really needs to match on, usually in the network layer; InterruptedException is another. It's dangerous to swallow them.
3. I've done work on other projects (Slider, with YARN-679 to get them into Hadoop) where exceptions can also declare an exit code. This means system exits can have different exit codes for different problems, and the exception-raising code gets to choose the code; a sketch follows this list. For extra fun, the set of exit codes attempts to lift numbers from HTTP errors, so "41" is Unauthed, from HTTP 401: https://slider.incubator.apache.org/docs/exitcodes.html
4. Once you have different exit codes, you can start writing tests for the scripts designed to trigger failures, asserting about the exit code as a way to assess the outcome.
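For what 3 and 4 could look like in Spark's terms, here is a rough sketch (all names hypothetical; Slider's real codes are at the link above):

    // An exception that carries its own process exit code, in the
    // spirit of the Slider / YARN-679 work.
    class ExitCodeException(val exitCode: Int, message: String,
        cause: Throwable = null) extends Exception(message, cause)

    object ExitCodes {
      val Unauthorized = 41  // lifted from HTTP 401
    }

    // In the launcher: turn the exception into the process status,
    // which shell scripts and their tests can then assert on.
    def runAndExit(body: => Unit): Unit =
      try body catch {
        case e: ExitCodeException =>
          System.err.println(e.getMessage)
          sys.exit(e.exitCode)
      }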

Something else to consider is what can be added atop the classic runtime exceptions to make them useful. Hadoop's NetUtils.wrapException() does this: it catches things coming up from the network stack and rethrows an exception of the same type (where possible), but now with source/dest hostnames and ports. That is incredibly useful. The exceptions also tack in wiki references explaining what the exceptions mean, in a desperate attempt to reduce the number of JIRAs complaining about services refusing connections. It's hard to tell how often that works; some people do now just paste in the stack trace without reading the wiki link. At least now there's somewhere to point them at when the issue is closed as invalid. [see: http://steveloughran.blogspot.co.uk/2011/09/note-on-distributed-computing.html]
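The shape of that trick is easy to sketch outside Hadoop (this is not NetUtils' actual signature, just the idea):

    import java.net.ConnectException

    // Rethrow as the same type, with the endpoint and a wiki reference
    // folded into the message; the original stays attached as the cause.
    def wrapConnect(host: String, port: Int,
        e: ConnectException): ConnectException = {
      val wrapped = new ConnectException(
        s"Connection to $host:$port failed: ${e.getMessage}; see " +
          "https://wiki.apache.org/hadoop/ConnectionRefused")
      wrapped.initCause(e)
      wrapped
    }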

I'm now considering what could be done at the Kerberos layer too, though there the problem is that the JVM exception is invariably a meaningless "Failure Unspecified at GSS API Level" plus text which varies across JVM vendors and versions. Maybe the wiki URL should just point to a page saying "nobody understands Kerberos, sorry".

Re: more uniform exception handling?

Posted by Sean Owen <so...@cloudera.com>.
We already have SparkException, indeed. The ID is an interesting idea;
simple to implement and might help disambiguate.

Does it solve a lot of problems of this form? If something is
squelching Exception or SparkException, the result will be the same. #2
is something we can sniff out with static analysis pretty easily, but
not so much #1. Ideally we'd just fix blocks like this, but I bet there
are lots of them.

I like the idea but for a different reason: it's probably best to
control the exceptions that propagate from the public API, since in
some cases they're a meaningful part of the API (see
https://issues.apache.org/jira/browse/SPARK-8393, which I'm hoping to
fix now).

And the catch there is: throwing checked exceptions from Scala code
in a way that Java code can catch requires annotating lots of methods.
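i.e. something like this on every method along the chain (a sketch; the class name is made up):

    import java.io.IOException

    class SomePublicApi {
      // Without @throws the compiled method carries no throws clause,
      // so javac rejects `catch (IOException e)` around a call to it
      // ("exception is never thrown in body of corresponding try").
      @throws[IOException]("if the underlying storage fails")
      def save(path: String): Unit = {
        // ...
      }
    }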

Re: more uniform exception handling?

Posted by Zhan Zhang <zz...@hortonworks.com>.
+1
Both of these would be very helpful in debugging.

Thanks.

Zhan Zhang



Re: more uniform exception handling?

Posted by Evan Chan <ve...@gmail.com>.
+1000.

Especially if the UI can help correlate exceptions, and we can reduce
some exceptions.

There are some exceptions that are very common in practice, such as
the nasty ClassNotFoundException, which most folks end up spending tons
of time debugging.

