Posted to dev@geode.apache.org by Kirk Lund <kl...@apache.org> on 2016/09/02 21:55:40 UTC

SystemFailure and VirtualMachineErrors

The Geode codebase currently includes the component SystemFailure which is
initiated by any instance of VirtualMachineError:

      } catch (VirtualMachineError e) {
        SystemFailure.initiateFailure(e);
        throw e;
      }

SystemFailure ultimately reacts by closing the DistributedSystem and
Cache (i.e., shutting down the server). The original reason was to close the
Cache in the event of an OutOfMemoryError to prevent Cache inconsistency
from one member to another.
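
As a rough illustration of the pattern above, here is a minimal,
self-contained sketch. FailureMonitor is a hypothetical stand-in for
SystemFailure, not Geode's actual class: it only records the first reported
error, whereas the real component goes on to close the Cache and
DistributedSystem. GuardedTask is likewise an invented name.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for Geode's SystemFailure (assumption): it only
// records the first failure instead of closing the Cache and DistributedSystem.
final class FailureMonitor {
  private static final AtomicReference<VirtualMachineError> FAILURE =
      new AtomicReference<>();

  static void initiateFailure(VirtualMachineError e) {
    FAILURE.compareAndSet(null, e); // keep only the first reported failure
  }

  static VirtualMachineError getFailure() {
    return FAILURE.get();
  }
}

// Hypothetical wrapper showing the catch/report/rethrow pattern.
final class GuardedTask {
  // Runs the task, reports any VirtualMachineError, then rethrows it.
  static void run(Runnable task) {
    try {
      task.run();
    } catch (VirtualMachineError e) {
      FailureMonitor.initiateFailure(e);
      throw e;
    }
  }
}
```

Note that the error is rethrown after it is reported, so the calling thread
still sees it; reporting does not swallow the failure.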

There are additional types of VirtualMachineError besides OutOfMemoryError.
Does it really make sense to initiate SystemFailure for all other types
including StackOverflowError?

GFSH starts all processes with a flag indicating that OutOfMemoryError
should result in shutdown. It specifies "-XX:OnOutOfMemoryError=taskkill /F
/PID %p" for HotSpot on Windows, "-XX:OnOutOfMemoryError=kill -KILL %p" for
HotSpot on all other platforms, "-Xcheck:memory" on J9
or "-XXexitOnOutOfMemory" on JRockit.
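
To make the mapping concrete, here is a small sketch of the selection logic.
The flag strings are the ones listed above; the OomeFlags class, the Jvm enum
and the flagFor method are hypothetical names, and how GFSH actually detects
the JVM vendor and operating system is not shown and is left to the caller.

```java
// Hypothetical helper (assumption): not the class GFSH uses.
final class OomeFlags {
  enum Jvm { HOTSPOT, J9, JROCKIT }

  // Returns the flag named in the message above for the given JVM and OS.
  static String flagFor(Jvm jvm, boolean windows) {
    switch (jvm) {
      case J9:
        return "-Xcheck:memory";
      case JROCKIT:
        return "-XXexitOnOutOfMemory";
      default: // HOTSPOT
        return windows
            ? "-XX:OnOutOfMemoryError=taskkill /F /PID %p"
            : "-XX:OnOutOfMemoryError=kill -KILL %p";
    }
  }
}
```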

Given that the above flag should terminate the process on OutOfMemoryError,
can we now remove SystemFailure from Geode? Opinions?

Thanks,
Kirk

Re: SystemFailure and VirtualMachineErrors

Posted by Kirk Lund <ki...@gmail.com>.
In this specific case it's a FunctionService thread. GFSH is executing a
management query function against a region value with circular references,
which causes TypedJson to recurse until the stack overflows.
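
For anyone unfamiliar with this kind of bug, here is a minimal sketch of one
common guard against it: track the objects currently being rendered by
identity and emit a marker when one is met again. This is not TypedJson's
code and not necessarily the fix I will end up with; CycleSafeRenderer and
the "<circular>" marker are made up for illustration, and only nested lists
are handled.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical renderer (assumption): illustrates cycle detection only.
final class CycleSafeRenderer {
  // Renders nested lists as text. A list that is already being rendered
  // higher up the call stack is replaced with a marker instead of being
  // recursed into again.
  static String render(Object value) {
    // Identity-based set: avoids calling hashCode() on a cyclic list.
    Set<Object> inProgress = Collections.newSetFromMap(new IdentityHashMap<>());
    StringBuilder out = new StringBuilder();
    render(value, inProgress, out);
    return out.toString();
  }

  private static void render(Object value, Set<Object> inProgress, StringBuilder out) {
    if (!(value instanceof List)) {
      out.append(value);
      return;
    }
    if (!inProgress.add(value)) {
      out.append("\"<circular>\""); // cycle detected: stop recursing
      return;
    }
    out.append('[');
    boolean first = true;
    for (Object element : (List<?>) value) {
      if (!first) {
        out.append(',');
      }
      first = false;
      render(element, inProgress, out);
    }
    out.append(']');
    inProgress.remove(value); // shared but non-cyclic references still render
  }
}
```

Removing the object from the set on the way out means a value that is merely
referenced twice is rendered twice; only a true cycle produces the marker.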

I'll focus on the recursion bug in TypedJson and leave SystemFailure alone
for now.

Thanks,
Kirk

On Wed, Sep 7, 2016 at 10:41 AM, Anthony Baker <ab...@pivotal.io> wrote:

> I’m a little concerned about trying to continue on when a thread has
> died. What if the thread was important? Can we prevent the
> StackOverflowError in the first place?
>
> Anthony

Re: SystemFailure and VirtualMachineErrors

Posted by Anthony Baker <ab...@pivotal.io>.
I’m a little concerned about trying to continue on when a thread has died. What if the thread was important? Can we prevent the StackOverflowError in the first place?

Anthony

> On Sep 7, 2016, at 10:00 AM, Kirk Lund <kl...@pivotal.io> wrote:
> 
> Here's the current behavior: the stack trace for a StackOverflowError is
> logged, the thread dies, and then Geode closes its Cache and
> DistributedSystem. If it's a Server process, then the process exits.
> 
> The proposal is for Geode to still log the stack trace and let the thread
> die. The change is that we will not close the Cache and DistributedSystem,
> so the Server process does not exit.
> 
> The only way we would hit this situation today that I know of involves
> writing a query that exposes a recursive JSON parsing bug in the TypedJson
> class. I would alter SystemFailure to not shut down for a StackOverflowError
> AND also fix the underlying bug in TypedJson which results in a
> StackOverflowError. The fix for TypedJson may involve removing it in favor
> of using Jackson for JSON parsing. I'll work on writing tests that expose
> both of these issues -- in the meantime please let me know if anyone has
> any feedback or opinions.
> 
> Thanks,
> Kirk


Re: SystemFailure and VirtualMachineErrors

Posted by Kirk Lund <kl...@pivotal.io>.
Here's the current behavior: the stack trace for a StackOverflowError is
logged, the thread dies, and then Geode closes its Cache and
DistributedSystem. If it's a Server process, then the process exits.

The proposal is for Geode to still log the stack trace and let the thread
die. The change is that we will not close the Cache and DistributedSystem,
so the Server process does not exit.
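
As background for the "thread dies, process lives" part, here is a small
sketch in plain Java, with no Geode code involved. It assumes a typical JVM
where unbounded recursion ends in a StackOverflowError; OverflowDemo and its
methods are hypothetical names. The worker thread is killed by the error
while the calling thread carries on.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class (assumption): not part of Geode.
final class OverflowDemo {
  private static int recurse(int depth) {
    return recurse(depth + 1) + 1; // never terminates normally
  }

  // Runs unbounded recursion on a separate thread and returns what killed it.
  static Throwable runAndCapture() {
    AtomicReference<Throwable> cause = new AtomicReference<>();
    Thread worker = new Thread(() -> recurse(0), "overflow-demo");
    // The handler runs on the dying thread once the error goes uncaught.
    worker.setUncaughtExceptionHandler((thread, error) -> cause.set(error));
    worker.start();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
    return cause.get();
  }
}
```

Whether it is safe to keep going after that depends on what the dead thread
was responsible for, which this sketch does not address.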

The only way we would hit this situation today that I know of involves
writing a query that exposes a recursive JSON parsing bug in the TypedJson
class. I would alter SystemFailure to not shut down for a StackOverflowError
AND also fix the underlying bug in TypedJson which results in a
StackOverflowError. The fix for TypedJson may involve removing it in favor
of using Jackson for JSON parsing. I'll work on writing tests that expose
both of these issues -- in the meantime please let me know if anyone has
any feedback or opinions.

Thanks,
Kirk


On Wed, Sep 7, 2016 at 9:44 AM, Kirk Lund <kl...@pivotal.io> wrote:

> I'd like to change SystemFailure and calling code to not shut down for a
> java.lang.StackOverflowError.
>
> The existing behavior would be unchanged for these VirtualMachineErrors:
>
> java.lang.InternalError
> java.lang.OutOfMemoryError
> java.lang.UnknownError
> java.util.zip.ZipError
>
> Thoughts or concerns?
>
> Thanks,
> Kirk

Re: SystemFailure and VirtualMachineErrors

Posted by Kirk Lund <kl...@pivotal.io>.
I'd like to change SystemFailure and calling code to not shut down for a
java.lang.StackOverflowError.

The existing behavior would be unchanged for these VirtualMachineErrors:

java.lang.InternalError
java.lang.OutOfMemoryError
java.lang.UnknownError
java.util.zip.ZipError
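
Roughly, the check I have in mind looks like the sketch below. FailurePolicy
and shouldInitiateFailure are hypothetical names, not existing Geode code,
and where such a check would live in SystemFailure or its callers is still
open.

```java
// Hypothetical policy class (assumption): illustrates the proposed rule only.
final class FailurePolicy {
  // Returns true when the error should still trigger shutdown under the
  // proposal: every VirtualMachineError except StackOverflowError.
  static boolean shouldInitiateFailure(Throwable t) {
    return t instanceof VirtualMachineError && !(t instanceof StackOverflowError);
  }
}
```

ZipError needs no special case here because it extends InternalError, which
is itself a VirtualMachineError.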

Thoughts or concerns?

Thanks,
Kirk


On Fri, Sep 2, 2016 at 2:55 PM, Kirk Lund <kl...@apache.org> wrote:

> The Geode codebase currently includes the component SystemFailure which is
> initiated by any instance of VirtualMachineError:
>
>       } catch (VirtualMachineError e) {
>         SystemFailure.initiateFailure(e);
>         throw e;
>       }
>
> SystemFailure ultimately reacts by closing the DistributedSystem and
> Cache (i.e., shutting down the server). The original reason was to close the
> Cache in the event of an OutOfMemoryError to prevent Cache inconsistency
> from one member to another.
>
> There are additional types of VirtualMachineError besides
> OutOfMemoryError. Does it really make sense to initiate SystemFailure for
> all other types including StackOverflowError?
>
> GFSH starts all processes with a flag indicating that OutOfMemoryError
> should result in shutdown. It specifies "-XX:OnOutOfMemoryError=taskkill
> /F /PID %p" for HotSpot on Windows, "-XX:OnOutOfMemoryError=kill -KILL %p"
> for HotSpot on all other platforms, "-Xcheck:memory" on J9
> or "-XXexitOnOutOfMemory" on JRockit.
>
> Given that the above flag should terminate the process on
> OutOfMemoryError, can we now remove SystemFailure from Geode? Opinions?
>
> Thanks,
> Kirk
>
>