You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/06/20 01:33:44 UTC

[GitHub] [pulsar] sijie opened a new pull request #7319: Fix producer stucks on creating ledger timeout

sijie opened a new pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319


   *Motivation*
   
   The `ledgerCreated` flag is passed as ctx to the createLedger callback.
   The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
   when timeout happens. It will trigger the following race condition:
   
   a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
   b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
   so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
   pending add requests.
   
   This race condition only happens when creating ledger times out.
   
   ```
       public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
           if (log.isDebugEnabled()) {
               log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
           }
   
           if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
               return;
           }
   ```
   
   *Modification*
   
   The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
   the `ledgerCreated` context.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on a change in pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
sijie commented on a change in pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319#discussion_r443102544



##########
File path: managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
##########
@@ -3177,8 +3177,7 @@ protected void asyncCreateLedger(BookKeeper bookKeeper, ManagedLedgerConfig conf
                 digestType, config.getPassword(), cb, ledgerCreated, finalMetadata);
         scheduledExecutor.schedule(() -> {
             if (!ledgerCreated.get()) {
-                ledgerCreated.set(true);
-                cb.createComplete(BKException.Code.TimeoutException, null, null);
+                cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated);

Review comment:
       The callback is synchronized at ManagedLedgerImpl. So there is no race condition here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat merged pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
merlimat merged pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on a change in pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
merlimat commented on a change in pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319#discussion_r443100931



##########
File path: managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
##########
@@ -3177,8 +3177,7 @@ protected void asyncCreateLedger(BookKeeper bookKeeper, ManagedLedgerConfig conf
                 digestType, config.getPassword(), cb, ledgerCreated, finalMetadata);
         scheduledExecutor.schedule(() -> {
             if (!ledgerCreated.get()) {
-                ledgerCreated.set(true);
-                cb.createComplete(BKException.Code.TimeoutException, null, null);
+                cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated);

Review comment:
       Wouldn't there be a race condition here (also in existing code) if the creation completes at the same time that the timeout is triggered?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on a change in pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
sijie commented on a change in pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319#discussion_r443334325



##########
File path: managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
##########
@@ -3177,8 +3177,7 @@ protected void asyncCreateLedger(BookKeeper bookKeeper, ManagedLedgerConfig conf
                 digestType, config.getPassword(), cb, ledgerCreated, finalMetadata);
         scheduledExecutor.schedule(() -> {
             if (!ledgerCreated.get()) {
-                ledgerCreated.set(true);
-                cb.createComplete(BKException.Code.TimeoutException, null, null);
+                cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated);

Review comment:
       I see. I missed the createNewMetadataLedger operation. I changed it to use CAS.
   
   For canceling the timeout tasks, I will have a follow-up issue to do that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on a change in pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
merlimat commented on a change in pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319#discussion_r443136519



##########
File path: managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java
##########
@@ -3177,8 +3177,7 @@ protected void asyncCreateLedger(BookKeeper bookKeeper, ManagedLedgerConfig conf
                 digestType, config.getPassword(), cb, ledgerCreated, finalMetadata);
         scheduledExecutor.schedule(() -> {
             if (!ledgerCreated.get()) {
-                ledgerCreated.set(true);
-                cb.createComplete(BKException.Code.TimeoutException, null, null);
+                cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated);

Review comment:
       It's not always the case. eg: In `createNewMetadataLedger()` when we create a new cursor ledger, we pass a lambda and there's no mutex. The logic in `checkAndCompleteLedgerOpTask()` seems race prone.
   
   I think we should probably use compareAndSet() to manipulate that `ledgerCreated` flag, which also might be better represented by an enum with multiple states. 
   
   Finally (and this is unrelated to the bug here), we should try to cancel the timeout task, once the create operation succeeds or fails before it times out. Otherwise we can get into a state with many timer tasks pending.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai commented on pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
jiazhai commented on pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319#issuecomment-646943710


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #7319: Fix producer stucks on creating ledger timeout

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #7319:
URL: https://github.com/apache/pulsar/pull/7319#issuecomment-677636483


   Hi, 
   We are facing this issue in chaos monkey type of testing of Pulsar, with Pulsar version 2.5.2 . 
   It's great that this issue is already fixed. Will this fix be available in Pulsar 2.6.1 ? How about Pulsar 2.5.3 ? 
   Thank you! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org