
Posted to user@curator.apache.org by Henrik Nordvik <he...@gmail.com> on 2013/11/05 10:47:34 UTC

Switching from State suspended, to lost, to suspended

Hi,

I'm getting some strange behaviour when stopping zookeeper in one
environment that I can't reproduce locally.
The result is that the leader selector "quits" even though it is set as
auto-requeue. (I think that happens because the retry loop inside
LeaderSelector checks the interrupt-flag, which is set again even when I
cleared it).

I think it boils down to getting

2013-11-04 18:22:32,501 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: LOST
2013-11-04 18:22:32,501 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]
2013-11-04 18:22:32,503 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: SUSPENDED
2013-11-04 18:22:32,504 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]

... then I handle the interrupt in the leader thread.

Then I get this:
2013-11-04 18:22:36,465 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: LOST
2013-11-04 18:22:36,465 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager      - State change: SUSPENDED
2013-11-04 18:22:36,465 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - StateChanged: LOST
2013-11-04 18:22:36,465 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]
2013-11-04 18:22:36,466 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - StateChanged: SUSPENDED
2013-11-04 18:22:36,466 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener        - Interrupting thread Thread[LeaderSelector-0,5,main]


Full log is here: https://gist.github.com/zerd/7316258

The code follows the old leader selector example pretty well:

    @Override
    public void takeLeadership(CuratorFramework curatorFramework) throws Exception {
        ourThread = Thread.currentThread();
        logger.debug(format("(%s) Got leadership", ourThread));
        try {
            waitForAndPerformWork();
        } catch (InterruptedException e) {
            logger.debug(format("(%s) Interrupted ", ourThread), e);
        } finally {
            logger.debug(format("(%s) No longer leader", ourThread));
        }
    }

    @Override
    public void stateChanged(CuratorFramework curatorFramework, ConnectionState newState) {
        logger.debug("StateChanged: " + newState);

        if ((newState == ConnectionState.LOST) || (newState == ConnectionState.SUSPENDED)) {
            if (ourThread != null) {
                logger.debug("Interrupting thread " + ourThread);
                ourThread.interrupt();
            } else {
                logger.debug("Thread is null");
            }
        }
    }

Is it supposed to go back and forth from lost to suspended?
My goal is to get it to resume trying to get the leadership when zookeeper
comes back. Do I have to requeue it manually when this happens?
Would upgrading to latest curator with CancelLeadershipException fix this?

Thank you very much for your time.

--
Henrik Nordvik
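
If the hypothesis above is right (a second interrupt arrives for the same leadership term after the first one was already handled), one possible mitigation is to send at most one interrupt per term. The following is only a sketch under that assumption, not Curator's own mechanism: ConnState is a hypothetical stand-in for Curator's ConnectionState so that the code compiles without Curator, and OneShotInterrupter and its method names are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for Curator's ConnectionState, so the sketch compiles without Curator. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Interrupts the leader thread at most once per leadership term. */
final class OneShotInterrupter {
    // Non-null only while a leadership term is running and has not been interrupted yet.
    private final AtomicReference<Thread> leaderThread = new AtomicReference<>();

    /** Call at the top of takeLeadership(). */
    void enterLeadership() {
        leaderThread.set(Thread.currentThread());
    }

    /** Call in the finally block of takeLeadership(). */
    void exitLeadership() {
        leaderThread.set(null);
    }

    /** Call from stateChanged(); returns true if an interrupt was actually sent. */
    boolean onStateChanged(ConnState newState) {
        if (newState != ConnState.LOST && newState != ConnState.SUSPENDED) {
            return false;
        }
        // getAndSet(null) hands the thread to exactly one caller, so LOST
        // followed by SUSPENDED produces a single interrupt instead of two.
        Thread t = leaderThread.getAndSet(null);
        if (t == null) {
            return false;
        }
        t.interrupt();
        return true;
    }
}
```

A window remains in which the listener has already fetched the thread but delivers the interrupt only after takeLeadership() has returned, so this narrows the problem rather than removing it.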

Re: Switching from State suspended, to lost, to suspended

Posted by Robert Hodges <ro...@continuent.com>.

Hi, 

I have been looking at the same problem as Henrik.  Just to be clear, the problem is the following:  a process wants to make state updates that are only safe to do while it has the leader role.  

If this is correctly stated, there are three cases that are interesting.  

a.) Ensuring within the process that you have the leadership role when you start. 
b.) Ensuring that the process does not give up leadership while such updates are proceeding. 
c.) Handling the case where the process loses leadership during the operation, leading to a late update 

I was planning on handling cases a and b using a shared lock within each process that can become leader.  To perform updates, threads need to acquire the shared lock.  This is only granted if the process has leadership to begin with.  To give up leadership you need to acquire the lock exclusively, which means the leader callback must wait for the shared locks to be released before returning to Curator. 

Case c is the hard one.  One option is to put a callback on the lock so that clients holding it will receive an interrupt.  However, there's still a race condition hiding under there as Arie points out, so this is only a partial solution--in fact it's really identical to checking the flags as described below.  

This could be largely cured if Curator had semantics such that it would not try to select a new leader before ensuring that the old leader had actually processed the interrupt and properly exited.  

What are the Curator leader selection semantics in this case?  If Curator does not do something like what I described it's almost trivially easy to get overlapping leaders. 

Cheers, Robert Hodges

p.s., If there's interest in the lock approach I would be happy to prepare a patch so it can be added to Curator.
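
The shared/exclusive lock idea for cases a and b could look roughly like the sketch below. It is an illustration built on java.util.concurrent only, not a proposed Curator patch; the class name LeadershipGate and its methods are hypothetical. It tracks leadership with a purely local flag, so it does nothing for case c.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Updates hold the shared (read) lock; giving up leadership needs the exclusive (write) lock. */
final class LeadershipGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean leader; // guarded by lock

    /** Call when the leader callback starts. */
    void becomeLeader() {
        lock.writeLock().lock();
        try {
            leader = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Call before the leader callback returns; blocks until in-flight updates have finished. */
    void relinquish() {
        lock.writeLock().lock();
        try {
            leader = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Runs the update under the shared lock if this process is leader; otherwise returns false. */
    boolean updateIfLeader(Runnable update) {
        lock.readLock().lock();
        try {
            if (!leader) {
                return false;
            }
            update.run();
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because relinquish() waits for the write lock, the leader callback cannot return while an update is still running. The update itself must not call becomeLeader() or relinquish(), since ReentrantReadWriteLock does not allow upgrading a read lock to a write lock and the call would block forever.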

On Nov 14, 2013, at 8:11 AM PST, Arie Zilberstein wrote:

> Henrik,
> 
> You should be able to transactionally test for leadership and update a state variable in ZooKeeper.
> This is something that I requested a few weeks ago in a thread named "Atomically setting a node's data while having leadership", and I hope will be implemented. Personally I think it is a must-have capability.
> 
> In your scenario, however, since you must update a database, there is a race condition that cannot be readily resolved (without some kind of distributed transactions). You can test for leadership and then update the DB, but there is no guarantee that the leadership is still yours by the end of your DB update call.
> 
> Thanks,
> Arie 

Re: Switching from State suspended, to lost, to suspended

Posted by Arie Zilberstein <az...@salesforce.com>.

Henrik,

You should be able to transactionally test for leadership and update a
state variable in ZooKeeper.
This is something that I requested a few weeks ago in a thread named
"Atomically setting a node's data while having leadership", and I hope will
be implemented. Personally I think it is a must-have capability.

In your scenario, however, since you must update a database, there is a
race condition that cannot be readily resolved (without some kind of
distributed transactions). You can test for leadership and then update the
DB, but there is no guarantee that the leadership is still yours by the end
of your DB update call.

Thanks,
Arie


On Wed, Nov 13, 2013 at 4:02 PM, Henrik Nordvik <he...@gmail.com> wrote:

> I've upgraded to curator 2.3.0.
> LeaderSelector still uses thread interrupting for signaling to the thread
> running takeLeadership() to stop, right?
> Inside my takeLeadership I do some database operations, and before
> committing I'm checking if I was interrupted, and roll back if I was.
> However, some code in between clears the interrupt flag (e.g. logback does
> this), so I'm committing even though I lost/suspended the connection.
>
> I need some other criteria to decide if I can commit or not. hasLeadership
> only checks a local flag, which is always true inside takeLeadership().
> Do I have another flag I can check?
>
>
> --
> Henrik Nordvik

Re: Switching from State suspended, to lost, to suspended

Posted by Henrik Nordvik <he...@gmail.com>.

I've upgraded to curator 2.3.0.
LeaderSelector still uses thread interrupting for signaling to the thread
running takeLeadership() to stop, right?
Inside my takeLeadership I do some database operations, and before
committing I'm checking if I was interrupted, and roll back if I was.
However, some code in between clears the interrupt flag (e.g. logback does
this), so I'm committing even though I lost/suspended the connection.

I need some other criteria to decide if I can commit or not. hasLeadership
only checks a local flag, which is always true inside takeLeadership().
Do I have another flag I can check?


--
Henrik Nordvik
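
One candidate for such a criterion is a flag owned by the application rather than by the thread, which library code in between cannot clear. The sketch below assumes a fresh object is created at the start of each takeLeadership() call and that the connection-state listener calls revoke() on SUSPENDED or LOST; LeadershipTerm and its methods are hypothetical names, not part of Curator.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Sticky "leadership revoked" marker that survives anything clearing the interrupt flag. */
final class LeadershipTerm {
    private final AtomicBoolean revoked = new AtomicBoolean(false);

    /** Called from the connection-state listener on SUSPENDED or LOST. */
    void revoke() {
        revoked.set(true);
    }

    boolean isRevoked() {
        return revoked.get();
    }

    /**
     * Runs the work, then commits only if the term was not revoked in the meantime.
     * Returns true if commit ran, false if rollback ran.
     */
    boolean runThenCommitOrRollback(Runnable work, Runnable commit, Runnable rollback) {
        work.run();
        if (isRevoked()) {
            rollback.run();
            return false;
        }
        commit.run();
        return true;
    }
}
```

This only replaces the unreliable interrupt check. A revocation can still arrive between the isRevoked() check and the end of the commit, so the check-then-commit race remains.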


On Tue, Nov 5, 2013 at 5:21 PM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:

> This sounds like a variation of
> https://issues.apache.org/jira/browse/CURATOR-54 - The next release of
> Curator (later this week) provides a more robust way of canceling
> leadership that doesn’t require thread interruption.
>
> -Jordan

Re: Switching from State suspended, to lost, to suspended

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.

This sounds like a variation of https://issues.apache.org/jira/browse/CURATOR-54 - The next release of Curator (later this week) provides a more robust way of canceling leadership that doesn’t require thread interruption.

-Jordan
