Posted to dev@aurora.apache.org by David Pan <da...@gmail.com> on 2014/10/10 23:23:02 UTC

Health Check Disabler Discussion

Hi Aurora,

I am currently working on a feature that allows for health checks to be
disabled temporarily for a running instance of a job.  The code review can
be found at https://reviews.apache.org/r/26383/.  The idea is that the
presence of a special "snooze file" in the task's sandbox will trigger the
disabling of the health checks.

Currently, the code reviewers have split into two camps:
1. One set of reviewers believes that simplicity is key.  Disable the health
checks if the snooze file is present, and enable them otherwise.

2. The other set of reviewers believes that there should be a snooze
duration.  The timer starts when the snooze file is touched.  After the
snooze duration is exhausted, the snooze file should be deleted by the
health checker, and health checks resume.  This is useful if the process
that initially disabled the health checks dies unexpectedly, and is no
longer there to re-enable the health checks.
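
For readers comparing the two options, here is a minimal Python sketch of
each.  It is an illustration only, not the code under review: the file name
`.snooze`, the function names, and the `duration_secs` parameter are all
assumptions made for the example.

```python
import os
import time

# Hypothetical snooze file name; the real name is whatever the patch defines.
SNOOZE_FILE = ".snooze"


def snoozed_option_1(sandbox):
    """Option 1: health checks are disabled exactly while the file exists."""
    return os.path.exists(os.path.join(sandbox, SNOOZE_FILE))


def snoozed_option_2(sandbox, duration_secs, now=None):
    """Option 2: the snooze lapses duration_secs after the last touch.

    Once the duration is exhausted, the checker deletes the file and
    health checks resume.
    """
    path = os.path.join(sandbox, SNOOZE_FILE)
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No readable snooze file: health checks stay enabled.
        return False
    if now - mtime < duration_secs:
        return True
    try:
        os.remove(path)  # snooze expired, clean up the file
    except OSError:
        pass  # someone else already removed it
    return False
```

With option 1, a forgotten file keeps health checks off until someone
removes it.  With option 2, the file lapses on its own even if the process
that created it has died.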

I would like to invite anyone interested to voice your opinions and chime
in.

Thanks,

David Pan

Re: Health Check Disabler Discussion

Posted by Zameer Manji <zm...@twopensource.com>.
+1 #2. We don't surface disabling health checks anywhere to the user. I
think the system should err on the side of caution and get to the state
that it is advertising on the UI.

On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> +1 to the #1. Disabling health checks is like signing a waiver where
> all health check guarantees are off.



-- 
Zameer Manji

Re: Health Check Disabler Discussion

Posted by David Pan <da...@gmail.com>.
Sounds good to me.

On Mon, Oct 13, 2014 at 11:51 AM, Bill Farner <wf...@apache.org> wrote:

> We had a discussion about this in our weekly community meeting in IRC
> today, and after some debate there was unanimous agreement to avoid all
> time control but to use the presence of the snooze file only.  Below is the
> excerpt from the discussion.  If you disagree, feel free to continue the
> discussion here.  Otherwise, i suggest the patch be updated to remove all
> time control.
>
> ## Health check snooze ##
> > [Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature
> > move to a dev list discussion last week.  Does anybody believe we did not
> > achieve consensus on the approach?
> > [Mon Oct 13 18:22:00 2014] <wfarner>: https://reviews.apache.org/r/26383/
> > [Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795
> > [Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here:
> > http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vEVAFt_QN0iL5C8OQ7pqae-3Ge5NNH6vJg4uGg@mail.gmail.com%3E
> > [Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consensus yet
> > so please voice your opinion
> > [Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a
> > snooze file, then unlink after mtime + CONSTANT_TIMEOUT"
> > [Mon Oct 13 18:23:07 2014] <wickman>: is that not correct?
> > [Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding
> > as well
> > [Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file,
> > and the health checker is disabled as long as that file is there."
> > [Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being
> > too clever in our implementation here
> > [Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s
> > what we were coalescing on.
> > [Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor of
> > that approach is that it's not really a snooze -- it's a sleep, and could
> > be prone to somebody forgetting to turn it off
> > [Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99
> > times out of 100, people will be snoozing so they can get the state of a
> > wedged task
> > [Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when
> > they're done
> > [Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable
> > argument either way
> > [Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn
> > [Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a tool
> > will take to get information about the wedged state
> > [Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be
> > more appealing
> > [Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that case they could extend
> > the snooze by using touch -m?
> > [Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while
> > true; do touch .snooze; sleep 60; done;) &
> > [Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s
> > more likely (or more concerning): will someone forget to remove a snooze,
> > or forget to extend it
> > [Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file.
> > Avoiding FS mutation == Less complexity == less things to go wrong
> > [Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at
> > why you'd want to snooze in the first place
> > [Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic
> > information is lost forever
> > [Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is
> > something in a super weird state
> > [Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to
> > kill those things in the weird state when they're done
> > [Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent
> > snooze
> > [Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well
> > [Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on
> > this back to no time awareness at all?
> > [Mon Oct 13 18:28:23 2014] <kts>: +1
> > [Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1
> > [Mon Oct 13 18:28:40 2014] <zmanji>: +1
> > [Mon Oct 13 18:28:46 2014] <wfarner>: +1
>
>
>
> -=Bill

Re: Health Check Disabler Discussion

Posted by Bill Farner <wf...@apache.org>.
We had a discussion about this in our weekly community meeting in IRC
today, and after some debate there was unanimous agreement to avoid all
time control but to use the presence of the snooze file only.  Below is the
excerpt from the discussion.  If you disagree, feel free to continue the
discussion here.  Otherwise, i suggest the patch be updated to remove all
time control.

## Health check snooze ##
> [Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature
> move to a dev list discussion last week.  Does anybody believe we did not
> achieve consensus on the approach?
> [Mon Oct 13 18:22:00 2014] <wfarner>: https://reviews.apache.org/r/26383/
> [Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795
> [Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here:
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vEVAFt_QN0iL5C8OQ7pqae-3Ge5NNH6vJg4uGg@mail.gmail.com%3E
> [Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consensus yet
> so please voice your opinion
> [Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a
> snooze file, then unlink after mtime + CONSTANT_TIMEOUT"
> [Mon Oct 13 18:23:07 2014] <wickman>: is that not correct?
> [Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding
> as well
> [Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file,
> and the health checker is disabled as long as that file is there."
> [Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being
> too clever in our implementation here
> [Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s
> what we were coalescing on.
> [Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor of
> that approach is that it's not really a snooze -- it's a sleep, and could
> be prone to somebody forgetting to turn it off
> [Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99
> times out of 100, people will be snoozing so they can get the state of a
> wedged task
> [Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when
> they're done
> [Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable
> argument either way
> [Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn
> [Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a tool
> will take to get information about the wedged state
> [Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be
> more appealing
> [Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that case they could extend
> the snooze by using touch -m?
> [Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while
> true; do touch .snooze; sleep 60; done;) &
> [Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s
> more likely (or more concerning): will someone forget to remove a snooze,
> or forget to extend it
> [Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file.
> Avoiding FS mutation == Less complexity == less things to go wrong
> [Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at
> why you'd want to snooze in the first place
> [Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic
> information is lost forever
> [Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is
> something in a super weird state
> [Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to
> kill those things in the weird state when they're done
> [Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent
> snooze
> [Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well
> [Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on
> this back to no time awareness at all?
> [Mon Oct 13 18:28:23 2014] <kts>: +1
> [Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1
> [Mon Oct 13 18:28:40 2014] <zmanji>: +1
> [Mon Oct 13 18:28:46 2014] <wfarner>: +1
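
The shell loop wickman mentions in the excerpt above (touch .snooze every 60
seconds) has a straightforward Python equivalent.  This is only a sketch and
only matters under the timed-snooze option: the function name, the
`stop_event` parameter, and the path are assumptions for illustration.

```python
import os
import threading


def keep_snoozed(path, interval_secs, stop_event):
    """Refresh the snooze file's mtime until stop_event is set."""
    while True:
        # Create the file if it is missing, then bump its mtime to "now".
        with open(path, "a"):
            os.utime(path, None)
        # Sleep for the interval, but wake up early if asked to stop.
        if stop_event.wait(interval_secs):
            break


if __name__ == "__main__":
    stop = threading.Event()
    worker = threading.Thread(target=keep_snoozed, args=(".snooze", 60, stop))
    worker.start()
    # ... collect diagnostics from the wedged task here ...
    stop.set()
    worker.join()
```

With the presence-only behavior agreed on in the meeting, no such loop is
needed: the file is created once and removed (or the task killed) when the
investigation is done.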



-=Bill

On Fri, Oct 10, 2014 at 3:32 PM, Bill Farner <wf...@apache.org> wrote:

> Ignore my first response, i think gmail drafts are out to get me.
>
> -=Bill

Re: Health Check Disabler Discussion

Posted by Bill Farner <wf...@apache.org>.
Ignore my first response, i think gmail drafts are out to get me.

-=Bill

On Fri, Oct 10, 2014 at 3:30 PM, Bill Farner <wf...@apache.org> wrote:

> I'm cool with #2, specifically if we do not attempt to parse the file and
> use that to determine the auto-expire time.
>
>
> -=Bill

Re: Health Check Disabler Discussion

Posted by Bill Farner <wf...@apache.org>.
I'm cool with #2, specifically if we do not attempt to parse the file and
use that to determine the auto-expire time.


-=Bill

On Fri, Oct 10, 2014 at 2:48 PM, Joshua Cohen <jc...@twopensource.com>
wrote:

> I'm in camp #2, I don't feel that it adds a significant amount of
> complexity to the health check logic, and it provides a substantial
> safeguard against users accidentally shooting themselves in the foot by
> accidentally leaving a health check snoozed.

Re: Health Check Disabler Discussion

Posted by Bill Farner <wf...@apache.org>.
I'm generally in #1, but could land somewhere in between.  I think the idea
of using mtime came up, which i like more than parsing the snooze file and
giving full control.  I'd be fine with expiring this file at mtime +
SNOOZE_TIMEOUT (constant).  This fails closed, is relatively simple to
implement, and doesn't allow the user to snooze for an unreasonable amount
of time.

-=Bill

On Fri, Oct 10, 2014 at 2:48 PM, Joshua Cohen <jc...@twopensource.com>
wrote:

> I'm in camp #2, I don't feel that it adds a significant amount of
> complexity to the health check logic, and it provides a substantial
> safeguard against users accidentally shooting themselves in the foot by
> accidentally leaving a health check snoozed.

Re: Health Check Disabler Discussion

Posted by Joshua Cohen <jc...@twopensource.com>.
I'm in camp #2. I don't feel that it adds a significant amount of
complexity to the health check logic, and it provides a substantial
safeguard against users shooting themselves in the foot by
accidentally leaving a health check snoozed.

On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> +1 to the #1. Disabling health checks is like signing a waiver where
> all health check guarantees are off.

Re: Health Check Disabler Discussion

Posted by Maxim Khutornenko <ma...@apache.org>.
+1 to the #1. Disabling health checks is like signing a waiver where
all health check guarantees are off.

On Fri, Oct 10, 2014 at 2:23 PM, David Pan <da...@gmail.com> wrote:
> Hi Aurora,
>
> I am currently working on a feature that allows for health checks to be
> disabled temporarily for a running instance of a job.  The code review can
> be found at https://reviews.apache.org/r/26383/.  The idea is that the
> presence of a special "snooze file" in the task's sandbox will trigger the
> disabling of the health checks.
>
> Currently, the code reviewers have split off into two camps:
> 1. One set of reviewers believe that simplicity is key.  Disable the health
> checks if the snooze file is present, enable it otherwise.
>
> 2. The other set of reviewers believe that there should be a snooze
> duration.  The timer starts when the snooze file is touched.  After the
> snooze duration is exhausted, the snooze file should be deleted by the
> health checker, and health checks resume.  This is useful if the process
> that initially disabled the health checks dies unexpectedly, and is no
> longer there to re-enable the health checks.
>
> I would like to invite anyone interested to voice your opinions and chime
> in.
>
> Thanks,
>
> David Pan