Posted to dev@accumulo.apache.org by Christopher <ct...@apache.org> on 2016/01/25 01:45:52 UTC

Interesting bug report

I saw this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1300987

As far as I can tell, they are reporting normal, expected, and desired
behavior of Accumulo as a bug. But, is there something we can do upstream
to enable fast failures in the case of Accumulo not running to support
their use case?

Personally, I don't see how we can reliably detect within the client that
the cluster is down or up, vs. a normal temporary server outage/migration,
since there is no single point of authority for Accumulo to determine its
overall operating status if ZooKeeper is running and no other servers are.
Am I wrong?

Re: Interesting bug report

Posted by John Vines <vi...@apache.org>.

That sounds like great follow-on work (clients register ephemerally so the
master can tell clients to disconnect, etc.), but I think just having a
client that can get a better read on the state of the system is a
phenomenal starting point.

On Tue, Jan 26, 2016 at 11:52 AM Keith Turner <ke...@deenlo.com> wrote:

> On Mon, Jan 25, 2016 at 10:59 AM, John Vines <vi...@apache.org> wrote:
>
> > Of course, it's when I hit send that I realize that we could mitigate by
> > making the client aware of the master state, and if the system is shut
> > down
> >
>
> That's a good idea.  Should consider the use case when someone wants to shut
> Accumulo down and bring it back up immediately.  We could allow an admin to
> decide what they want clients to do when they shut down Accumulo (clients
> die, wait, anything else?).  This could be accomplished with supplemental
> information in ZK or other goal states.
>
>
> > (which was the case for that ticket), then it can fail quickly with a
> > descriptive message.

Re: Interesting bug report

Posted by Keith Turner <ke...@deenlo.com>.

On Mon, Jan 25, 2016 at 10:59 AM, John Vines <vi...@apache.org> wrote:

> Of course, it's when I hit send that I realize that we could mitigate by
> making the client aware of the master state, and if the system is shut down
>

That's a good idea.  We should consider the use case where someone wants to
shut Accumulo down and bring it back up immediately.  We could allow an admin
to decide what they want clients to do when they shut down Accumulo (clients
die, wait, anything else?).  This could be accomplished with supplemental
information in ZK or other goal states.


> (which was the case for that ticket), then it can fail quickly with a
> descriptive message.

Re: Interesting bug report

Posted by Keith Turner <ke...@deenlo.com>.

On Mon, Jan 25, 2016 at 12:14 PM, Josh Elser <jo...@gmail.com> wrote:

> I've long been waffling about the usefulness of our "infinite retry" logic.
> It's great for daemons. It sucks for humans.
>
> Maybe there's a story in addressing this via ClientConfiguration -- let
> the user tell us the policy they want to follow.



+1 for a configurable retry policy.  Curator has a configurable retry
policy.  It would be good to see how it works when designing something for
Accumulo.
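
To make the idea concrete, here is a rough sketch of what a pluggable retry
policy could look like. Every type below (RetryPolicy, RetryForever,
BoundedBackoff, Attempt, Retrier) is hypothetical and made up for
illustration; none of it is existing Accumulo or Curator API. It only shows
the general shape: the retry loop asks a policy object whether to keep going.

```java
// Hypothetical sketch only: none of these types exist in Accumulo or Curator.

/** Decides whether a failed client operation may be retried. */
interface RetryPolicy {
    /**
     * @param retryCount number of retries already made (0 before the first retry)
     * @param elapsedMs  milliseconds since the operation first started
     * @return delay in milliseconds before the next retry, or -1 to give up
     */
    long nextDelayMs(int retryCount, long elapsedMs);
}

/** Never gives up; waits a fixed delay between attempts (daemon-style behavior). */
final class RetryForever implements RetryPolicy {
    private final long delayMs;

    RetryForever(long delayMs) {
        this.delayMs = delayMs;
    }

    @Override
    public long nextDelayMs(int retryCount, long elapsedMs) {
        return delayMs;
    }
}

/** Exponential backoff that gives up after maxRetries or once maxElapsedMs has passed. */
final class BoundedBackoff implements RetryPolicy {
    private final long baseDelayMs;
    private final int maxRetries;
    private final long maxElapsedMs;

    BoundedBackoff(long baseDelayMs, int maxRetries, long maxElapsedMs) {
        this.baseDelayMs = baseDelayMs;
        this.maxRetries = maxRetries;
        this.maxElapsedMs = maxElapsedMs;
    }

    @Override
    public long nextDelayMs(int retryCount, long elapsedMs) {
        if (retryCount >= maxRetries || elapsedMs >= maxElapsedMs) {
            return -1; // budget exhausted: tell the caller to stop
        }
        // Double the delay on each retry; cap the shift to avoid overflow.
        return baseDelayMs << Math.min(retryCount, 20);
    }
}

/** An operation that may fail and be retried. */
interface Attempt<T> {
    T run() throws Exception;
}

final class Retrier {
    /** Runs op until it succeeds or the policy gives up, then rethrows the last failure. */
    static <T> T call(Attempt<T> op, RetryPolicy policy) throws Exception {
        long start = System.nanoTime();
        int retries = 0;
        while (true) {
            try {
                return op.run();
            } catch (Exception e) {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                long delay = policy.nextDelayMs(retries, elapsedMs);
                if (delay < 0) {
                    throw e; // policy says stop: surface the last failure
                }
                Thread.sleep(delay);
                retries++;
            }
        }
    }
}
```

A daemon could pass something like new RetryForever(1000) to keep today's
behavior, while an interactive tool could pass new BoundedBackoff(100, 5,
30000) and get an error back after a bounded wait.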



Re: Interesting bug report

Posted by Josh Elser <jo...@gmail.com>.

I've long been waffling about the usefulness of our "infinite retry"
logic. It's great for daemons. It sucks for humans.

Maybe there's a story in addressing this via ClientConfiguration -- let
the user tell us the policy they want to follow.

John Vines wrote:
> Of course, it's when I hit send that I realize that we could mitigate by
> making the client aware of the master state, and if the system is shut down
> (which was the case for that ticket), then it can fail quickly with a
> descriptive message.

Re: Interesting bug report

Posted by John Vines <vi...@apache.org>.

Of course, it's when I hit send that I realize that we could mitigate by
making the client aware of the master state, and if the system is shut down
(which was the case for that ticket), then it can fail quickly with a
descriptive message.
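
A rough sketch of that check, combined with the admin-chosen hint suggested
elsewhere in this thread, might look like the following. All names here
(GoalState, ShutdownHint, ClusterStateSource, FixedState,
ClusterShutDownException, FailFastCheck) are hypothetical and invented for
illustration. A real ClusterStateSource would presumably read its values from
ZooKeeper; that part is left out so the sketch stays self-contained.

```java
// Hypothetical sketch only: these types are illustrative, not Accumulo API.

/** Goal states an administrator might publish for the cluster. */
enum GoalState { NORMAL, CLEAN_STOP }

/** Admin-chosen hint for what clients should do while the cluster is stopped. */
enum ShutdownHint { FAIL_FAST, WAIT }

/** Read-only view of the published state; a real one might be backed by ZooKeeper. */
interface ClusterStateSource {
    GoalState goalState();

    ShutdownHint shutdownHint();
}

/** Trivial in-memory source, used here only to exercise the check. */
final class FixedState implements ClusterStateSource {
    private final GoalState goal;
    private final ShutdownHint hint;

    FixedState(GoalState goal, ShutdownHint hint) {
        this.goal = goal;
        this.hint = hint;
    }

    @Override
    public GoalState goalState() {
        return goal;
    }

    @Override
    public ShutdownHint shutdownHint() {
        return hint;
    }
}

/** Thrown when the cluster was deliberately stopped and clients were told not to wait. */
final class ClusterShutDownException extends RuntimeException {
    ClusterShutDownException(String message) {
        super(message);
    }
}

final class FailFastCheck {
    /**
     * Intended to be called from a retry loop before sleeping and retrying.
     * Returns normally if the client should keep retrying; throws if the
     * cluster was cleanly stopped and the admin asked clients to fail fast.
     */
    static void checkBeforeRetry(ClusterStateSource state) {
        if (state.goalState() == GoalState.CLEAN_STOP
                && state.shutdownHint() == ShutdownHint.FAIL_FAST) {
            throw new ClusterShutDownException(
                "Cluster goal state is CLEAN_STOP; not retrying");
        }
    }
}
```

With the WAIT hint, a client keeps retrying through a quick stop and restart.
With FAIL_FAST, it gets a descriptive error as soon as it sees the stopped
goal state. An unplanned outage leaves the goal state at NORMAL, so the usual
fault-tolerant retries still apply.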

On Mon, Jan 25, 2016 at 10:58 AM John Vines <vi...@apache.org> wrote:

> While we want to be fault tolerant, there's a point where we want to
> eventually fail. I know we have a couple never ending retry loops that need
> to be addressed (https://issues.apache.org/jira/browse/ACCUMULO-1268),
> but I'm unsure if queries suffer from this problem.
>
> Unfortunately, fault tolerance is a bit at odds with instant notification
> of system issues, since some of the fault tolerance is temporally oriented.
> And that ticket lacks context of it never failing out vs. failing out
> eventually (but too long for the user)

Re: Interesting bug report

Posted by John Vines <vi...@apache.org>.

While we want to be fault tolerant, there's a point where we want to
eventually fail. I know we have a couple of never-ending retry loops that
need to be addressed (https://issues.apache.org/jira/browse/ACCUMULO-1268),
but I'm unsure if queries suffer from this problem.

Unfortunately, fault tolerance is a bit at odds with instant notification
of system issues, since some of the fault tolerance is temporally oriented.
And that ticket lacks context on whether the client never fails out or
fails out eventually (but takes too long for the user).

On Sun, Jan 24, 2016 at 7:46 PM Christopher <ct...@apache.org> wrote:

> I saw this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1300987
>
> As far as I can tell, they are reporting normal, expected, and desired
> behavior of Accumulo as a bug. But, is there something we can do upstream
> to enable fast failures in the case of Accumulo not running to support
> their use case?
>
> Personally, I don't see how we can reliably detect within the client that
> the cluster is down or up, vs. a normal temporary server outage/migration,
> since there is no single point of authority for Accumulo to
> determine its overall operating status if ZooKeeper is running and no other
> servers are. Am I wrong?
>