You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/01/05 09:37:59 UTC

[GitHub] [bookkeeper] lordcheng10 opened a new pull request #2973: checkAllLedgers in Auditor supports read throttle

lordcheng10 opened a new pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973


   ### Motivation
   When checkAllLedgers is scheduled periodically, because it will try to read almost all entry data, it may cause the bookkeeper to time out and cause the entry to be incorrectly marked with markLedgerUnderreplicatedAsync.
   In our cluster, the execution cycle of checkAllLedgers is 1 week. Then we found that a large number of ledger will be marked markLedgerUnderreplicatedAsync each time it is executed. Analyzing the log found that there are some reading bookkeeper timeouts:
   ![image](https://user-images.githubusercontent.com/19296967/148194737-142e38ab-119e-4466-a7f8-75f9ce9e1d2b.png)
   
   ![image](https://user-images.githubusercontent.com/19296967/148195160-3cab0783-5f4d-4c79-842d-4fa056ba6507.png)
   
   Due to too many read requests, the cluster pressure is too high, and the latency of pulsar's write time continues to soar until the recovery is completed.:
   ![image](https://user-images.githubusercontent.com/19296967/148195503-277cec00-0a08-45a1-9228-4c1a46dc2fd9.png)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1007562120






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 edited a comment on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 edited a comment on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008221599


   > But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts? Looks good otherwise.
   
   percentageOfLedgerFragmentToBeVerified is not very useful. This parameter is for a Fragment, but in fact, most Fragments have only one entry, but at least the first and last entry will be checked. 
   percentageOfLedgerFragmentToBeVerified configures the default configuration we use, and the default configuration is 0.
   Every time when check all ledger is executed, a large number of ledgers will be marked as under replica, obviously this is a wrong judgment。@pkumar-singh 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1006215303


   OK  I  will fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on a change in pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on a change in pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#discussion_r779247723



##########
File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java
##########
@@ -51,6 +51,14 @@
     public final BookieClient bookieClient;
     public final BookieWatcher bookieWatcher;
 
+    private static int averageEntrySize;
+
+    private static final int INITIAL_AVERAGE_ENTRY_SIZE = 1024;

Review comment:
       OK  I will fix




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
dlg99 commented on a change in pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#discussion_r779216068



##########
File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java
##########
@@ -51,6 +51,14 @@
     public final BookieClient bookieClient;
     public final BookieWatcher bookieWatcher;
 
+    private static int averageEntrySize;
+
+    private static final int INITIAL_AVERAGE_ENTRY_SIZE = 1024;

Review comment:
       consider simplifying by using number of entries in flight (using Semaphore, releasing when processed) instead of guessing avg sizes and rate limiting. Or rate limit by number of entries.
   This also removes need in synchronization.
   
   One ledger may have entries of 512K, next one of the 1K, third one is mixed. 
   I don't see how tracking avg size significantly helps in this case, especially if the backpressure is not enabled.  
   

##########
File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java
##########
@@ -76,6 +84,11 @@
         @Override
         public void readEntryComplete(int rc, long ledgerId, long entryId,
                 ByteBuf buffer, Object ctx) {
+            if (readThrottle != null && buffer != null) {
+                int readSize = buffer.readableBytes();
+                averageEntrySize = (int) (averageEntrySize * AVERAGE_ENTRY_SIZE_RATIO

Review comment:
       async code, concurrent reads, plus updates of non-volatile field/later reads from non-volatile field. 
   You'll need to add synchronization around use of this field.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] pkumar-singh commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
pkumar-singh commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008108825


   But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism  for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout is not considered a failure.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] pkumar-singh edited a comment on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
pkumar-singh edited a comment on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008108825


   But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism  for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts? Looks good otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] pkumar-singh commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
pkumar-singh commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1009249269


   > > But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts? Looks good otherwise.
   > 
   > I agree that the timeout problem should be solved through retry, but at the same time I think the rate should also be limited to prevent checking all ledger from putting too much pressure on the cluster! @pkumar-singh
   
   Sure. It may not be sufficient but its accurate regardless. So OK from my end.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] pkumar-singh merged pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
pkumar-singh merged pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 removed a comment on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 removed a comment on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1006215303


   OK  I  will fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008222213


   > But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts? Looks good otherwise.
   
   I agree that the timeout problem should be solved through retry, but at the same time I think the rate should also be limited to prevent checking all ledger from putting too much pressure on the cluster! @pkumar-singh 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1007562120


   @michaeljmarshall PTAL,thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 edited a comment on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 edited a comment on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008221599


   > But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts? Looks good otherwise.
   
   percentageOfLedgerFragmentToBeVerified is not very useful. This parameter is for a Fragment, but in fact, most Fragments have only one entry, but at least the first and last entry will be checked. @pkumar-singh 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1005564198


   rerun failure checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on a change in pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on a change in pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#discussion_r779358378



##########
File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java
##########
@@ -51,6 +51,14 @@
     public final BookieClient bookieClient;
     public final BookieWatcher bookieWatcher;
 
+    private static int averageEntrySize;
+
+    private static final int INITIAL_AVERAGE_ENTRY_SIZE = 1024;

Review comment:
       I will rate limit by number of entries




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on a change in pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on a change in pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#discussion_r780046855



##########
File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java
##########
@@ -51,6 +51,14 @@
     public final BookieClient bookieClient;
     public final BookieWatcher bookieWatcher;
 
+    private static int averageEntrySize;
+
+    private static final int INITIAL_AVERAGE_ENTRY_SIZE = 1024;

Review comment:
       In accordance with your suggestions, I made changes。
   Please review again, thank you!@dlg99 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1007909974


   @eolivelli @nicoloboschi PTAL,thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008859726


   ping


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1009623601


   ping


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1005547447


   rerun failure checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on a change in pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on a change in pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#discussion_r779247634



##########
File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java
##########
@@ -76,6 +84,11 @@
         @Override
         public void readEntryComplete(int rc, long ledgerId, long entryId,
                 ByteBuf buffer, Object ctx) {
+            if (readThrottle != null && buffer != null) {
+                int readSize = buffer.readableBytes();
+                averageEntrySize = (int) (averageEntrySize * AVERAGE_ENTRY_SIZE_RATIO

Review comment:
       OK  I will  fix




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] pkumar-singh edited a comment on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
pkumar-singh edited a comment on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008108825


   But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism  for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] lordcheng10 commented on pull request #2973: checkAllLedgers in Auditor supports read throttle

Posted by GitBox <gi...@apache.org>.
lordcheng10 commented on pull request #2973:
URL: https://github.com/apache/bookkeeper/pull/2973#issuecomment-1008221599


   > But I am still not sure , how it addresses the root cause. I understand that throttling with semaphore sort of reduces the pressure on the bookie. And there is a another mechanism for that too, that is percentageOfLedgerFragmentToBeVerified(Slight misnomer here, it actually checks percentage of entries in the ledger fragments). I understand throttling will reduce timeout from bookie. But timeout can still happen and will happen. My question is why not address this issue that occasional timeout should not be considered a failure, or may be should be retried? Thoughts? Looks good otherwise.
   
   percentageOfLedgerFragmentToBeVerified only works for a small number of entries!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org