You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/08/17 17:19:21 UTC

[GitHub] [bookkeeper] sursingh opened a new pull request #2764: Ensure progress while restoring from checkpoint.

sursingh opened a new pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764


   ### Motivation
   
   Sometimes the checkpoint restoration never completes. From the logs it looks
   like it is stuck waiting for entries in distributedlog. But all the entries
   have already been read. This looks like a race condition that causes the
   reader to believe that there are more entries.
   
   Although this can happen for any storage container, it is catastrophic when it
   happens for root storage container. In that case all api calls start to fail.
   
   ### Changes
   
   This happens very rarely and is difficult to debug. Here we are adding a
   protection around this deadlock. We will monitor the input stream to see we
   are getting new data. If new data is not received within specified amount of
   time, we will fail the restore process. This will fail the starting of the
   storage container. After failure we will restart the storage container again.
   
   I ran into this issue during my testing and was able confirm that we are able
   to recover from this.
   
   ```
   2021-08-12T11:47:39.839+0000    org.apache.distributedlog.ReadAheadEntryReader  ReadAhead for 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004:<default> is caught up and no log segments to read now
   
   2021-08-12T11:47:39.834+0000    org.apache.distributedlog.logsegment.PerStreamLogSegmentCache   000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
   2021-08-12T11:47:39.834+0000    org.apache.distributedlog.ReadAheadEntryReader  Starting the readahead entry reader for 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004:<default> : number of segments: 1, top 10 segments = [[LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]]
   2021-08-12T11:47:39.834+0000    org.apache.distributedlog.logsegment.PerStreamLogSegmentCache   000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
   
   2021-08-12T11:47:29.516+0000    org.apache.bookkeeper.stream.storage.impl.sc.ZkStorageContainerManager  Failed to start storage container (0)
   2021-08-12T11:47:29.513+0000    org.apache.bookkeeper.stream.storage.impl.sc.StorageContainerRegistryImpl       De-registered StorageContainer ('0') when failed to start
   
   2021-08-12T11:47:29.511+0000    org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore   Timeout waiting for checkpoint restore: Checkpoint{ID='cea4e286-5224-43ac-b6e0-7321d607f98e', createdAt: 1628793488360 files: "CURRENT"\nfiles: "MANIFEST-000004"\nfiles: "OPTIONS-000008"\ntxid: "\\000\\000\\000\\000\\000\\000\\000\\000"\ncreated_at: 1628793488360\nfileInfos {\n  name: "CURRENT"\n  checksum: "0861415cada612ea5834d56e2cf1055d3e63979b69eb71d32ae9ae394d8306cd"\n}\nfileInfos {\n  name: "MANIFEST-000004"\n  checksum: "7e4ff7d93941139236f69906d42a9e387dec565966b0017b6e8b16650c6c8be5"\n}\nfileInfos {\n  name: "OPTIONS-000008"\n  checksum: "41c8a054439d7798f88992a7a2dcb96f189cce0ed7ddb4c71af3594edc6a7a14"\n}\n}
   
   2021-08-12T11:47:29.489+0000    org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.CheckpointFile   Timeout waiting for copy: 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 last-read 0 current-read 0 runtime(ms) 300000
   
   2021-08-12T11:42:29.467+0000    org.apache.distributedlog.ReadAheadEntryReader  Starting the readahead entry reader for 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004:<default> : number of segments: 1, top 10 segments = [[LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]]
   2021-08-12T11:42:29.466+0000    org.apache.distributedlog.logsegment.PerStreamLogSegmentCache   000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
   2021-08-12T11:42:29.466+0000    org.apache.distributedlog.logsegment.PerStreamLogSegmentCache   000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
   
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] hsaputra merged pull request #2764: [Stream storage] Ensure progress while restoring from checkpoint.

Posted by GitBox <gi...@apache.org>.
hsaputra merged pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] sursingh commented on pull request #2764: Ensure progress while restoring from checkpoint.

Posted by GitBox <gi...@apache.org>.
sursingh commented on pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#issuecomment-900550890


   rerun failure checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2764: Ensure progress while restoring from checkpoint.

Posted by GitBox <gi...@apache.org>.
ivankelly commented on a change in pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#discussion_r691097245



##########
File path: stream/statelib/src/main/java/org/apache/bookkeeper/statelib/impl/rocksdb/checkpoint/CheckpointInfo.java
##########
@@ -128,17 +131,26 @@ public int compareTo(CheckpointInfo o) {
         return this.getCreatedAt().compareTo(o.getCreatedAt());
     }
 
-    public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task) throws StateStoreException, IOException {
+    public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task)
+        throws StateStoreException, IOException, TimeoutException {
+
         task.restore(id, metadata);
         updateCurrent(dbPath);
         log.info("Successfully restore checkpoint {} to {}", id, getCheckpointPath(dbPath));
         return metadata;
     }
+    public CheckpointMetadata restore(String dbName, File dbPath, CheckpointStore store)
+        throws StateStoreException, TimeoutException {
+
+        return restore(dbName, dbPath, store, Duration.ofMillis(5));

Review comment:
       5ms is a very small amount of  "idle" time. Gc could easily trigger this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2764: Ensure progress while restoring from checkpoint.

Posted by GitBox <gi...@apache.org>.
ivankelly commented on a change in pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#discussion_r691097245



##########
File path: stream/statelib/src/main/java/org/apache/bookkeeper/statelib/impl/rocksdb/checkpoint/CheckpointInfo.java
##########
@@ -128,17 +131,26 @@ public int compareTo(CheckpointInfo o) {
         return this.getCreatedAt().compareTo(o.getCreatedAt());
     }
 
-    public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task) throws StateStoreException, IOException {
+    public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task)
+        throws StateStoreException, IOException, TimeoutException {
+
         task.restore(id, metadata);
         updateCurrent(dbPath);
         log.info("Successfully restore checkpoint {} to {}", id, getCheckpointPath(dbPath));
         return metadata;
     }
+    public CheckpointMetadata restore(String dbName, File dbPath, CheckpointStore store)
+        throws StateStoreException, TimeoutException {
+
+        return restore(dbName, dbPath, store, Duration.ofMillis(5));

Review comment:
       5ms is a very small amount of  "idle" time. Gc could easily trigger this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] sursingh commented on pull request #2764: Ensure progress while restoring from checkpoint.

Posted by GitBox <gi...@apache.org>.
sursingh commented on pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#issuecomment-900550890


   rerun failure checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org