Posted to dev@zookeeper.apache.org by "Thawan Kooburat (JIRA)" <ji...@apache.org> on 2012/06/11 23:37:43 UTC

[jira] [Created] (ZOOKEEPER-1484) Missing znode found in the follower

Thawan Kooburat created ZOOKEEPER-1484:
------------------------------------------

             Summary: Missing znode found in the follower
                 Key: ZOOKEEPER-1484
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1484
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.4.3
            Reporter: Thawan Kooburat
            Assignee: Thawan Kooburat
            Priority: Critical


We noticed that one of the followers failed to restart due to a missing parent node:

{noformat}
2012-05-29 15:44:41,037 [myid:9] - INFO [main:FileSnap@83] - Reading snapshot /var/facebook/zeus-server/data/global-ropt.0/version-2/snapshot.3d001f19c9
2012-05-29 15:44:43,300 [myid:9] - ERROR [main:FileTxnSnapLog@220] - Parent /phpunittest/1862297546 missing for /phpunittest/1862297546/dir1
2012-05-29 15:44:43,302 [myid:9] - ERROR [main:QuorumPeer@488] - Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /phpunittest/1862297546
{noformat}

We believe the root cause is a bug in the follower sync-up logic. Because of a race condition, the follower may miss some proposals. The log below shows that the follower sees a commit message for a proposal it has not seen before:
{noformat}
2012-05-15 15:11:27,449 [myid:13] - WARN [QuorumPeer[myid=13]/0.0.0.0:2182:Learner@378] - Got zxid 0x3c00282dc9 expected 0x3c00282dca
{noformat}
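
To make the two zxids in that warning easier to compare, here is a minimal sketch that splits a zxid into its epoch (high 32 bits) and counter (low 32 bits). The class name ZxidParts and its methods are hypothetical helpers written for this illustration, not part of the ZooKeeper code base.

```java
// Hypothetical helper for illustration only; not a ZooKeeper class.
final class ZxidParts {
    private ZxidParts() {}

    // The high 32 bits of a zxid hold the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // The low 32 bits hold the per-epoch transaction counter.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    static String describe(long zxid) {
        return String.format("epoch=0x%x counter=0x%x", epoch(zxid), counter(zxid));
    }

    public static void main(String[] args) {
        // Values copied from the Learner warning above.
        long got = 0x3c00282dc9L;
        long expected = 0x3c00282dcaL;
        System.out.println("got:      " + describe(got));
        System.out.println("expected: " + describe(expected));
        System.out.println("same epoch: " + (epoch(got) == epoch(expected)));
        System.out.println("counter delta: " + (counter(expected) - counter(got)));
    }
}
```

Both values fall in epoch 0x3c and their counters differ by one, so the mismatch is between adjacent transactions within a single epoch.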

I can reproduce this by running FollowerResyncConcurrencyTest repeatedly until a failure occurs. I suspect the root cause is how we handle toBeApplied and outstandingProposals in the leader:

1. An in-flight proposal is removed from outstandingProposals before it is added to toBeApplied. Most of the problems I have seen so far seem to be caused by this gap.
2. startForwarding() iterates through outstandingProposals without locking PrepRequestProcessor properly, so it can miss an in-flight proposal.
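
The gap described in point 1 can be sketched with a simplified, deterministic model. This is not the Leader code: the class ProposalHandoffSketch and all of its methods are hypothetical, and only the two collection names come from the description above. It shows that a scan taken between the removal and the add sees the proposal in neither collection, while a handoff done under the same lock as the scan does not expose that state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Simplified, hypothetical model of the leader-side handoff; not ZooKeeper code.
final class ProposalHandoffSketch {
    private final ConcurrentMap<Long, String> outstandingProposals = new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<Long> toBeApplied = new ConcurrentLinkedQueue<>();

    void propose(long zxid, String payload) {
        outstandingProposals.put(zxid, payload);
    }

    // Two-step handoff, deliberately unsynchronized: callers can be observed between the steps.
    void removeFromOutstanding(long zxid) {
        outstandingProposals.remove(zxid);
    }

    void addToBeApplied(long zxid) {
        toBeApplied.add(zxid);
    }

    // One possible remedy: move the proposal while holding the lock the scan also takes.
    synchronized void commitAtomically(long zxid) {
        toBeApplied.add(zxid);
        outstandingProposals.remove(zxid);
    }

    // Stand-in for the scan a resyncing follower depends on: everything in either collection.
    synchronized List<Long> snapshotForNewFollower() {
        List<Long> seen = new ArrayList<>(toBeApplied);
        seen.addAll(outstandingProposals.keySet());
        Collections.sort(seen);
        return seen;
    }

    public static void main(String[] args) {
        ProposalHandoffSketch leader = new ProposalHandoffSketch();
        leader.propose(1L, "create /a");
        leader.propose(2L, "create /a/b");

        leader.removeFromOutstanding(1L);
        // Proposal 1 is in neither collection at this point, so the scan misses it.
        System.out.println("during gap: " + leader.snapshotForNewFollower());

        leader.addToBeApplied(1L);
        System.out.println("after gap:  " + leader.snapshotForNewFollower());

        // With the atomic variant, no scan can run between the two updates.
        leader.commitAtomically(2L);
        System.out.println("atomic:     " + leader.snapshotForNewFollower());
    }
}
```

In this model a follower that syncs during the gap would receive the child create without its parent, which is the kind of ordering that leads to a "Parent ... missing" error on replay.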



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (ZOOKEEPER-1484) Missing znode found in the follower

Posted by "Thawan Kooburat (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thawan Kooburat resolved ZOOKEEPER-1484.
----------------------------------------

      Resolution: Invalid
    Release Note: Trunk seems to be OK. We found that our own changes to increase concurrency on the leader caused the issue.
    

        

[jira] [Commented] (ZOOKEEPER-1484) Missing znode found in the follower

Posted by "Thawan Kooburat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293116#comment-13293116 ] 

Thawan Kooburat commented on ZOOKEEPER-1484:
--------------------------------------------

I just noticed that the logs are from different machines. So the actual root cause has not been found yet, but I think the issue I pointed out is still a legitimate problem.
                