You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/08/17 17:19:21 UTC
[GitHub] [bookkeeper] sursingh opened a new pull request #2764: Ensure progress while restoring from checkpoint.
sursingh opened a new pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764
### Motivation
Sometimes the checkpoint restoration never completes. From the logs it looks
like it is stuck waiting for entries in distributedlog. But all the entries
have already been read. This looks like a race condition that causes the
reader to believe that there are more entries.
Although this can happen for any storage container, it is catastrophic when it
happens for root storage container. In that case all api calls start to fail.
### Changes
This happens very rarely and is difficult to debug. Here we are adding a
protection around this deadlock. We will monitor the input stream to see we
are getting new data. If new data is not received within specified amount of
time, we will fail the restore process. This will fail the starting of the
storage container. After failure we will restart the storage container again.
I ran into this issue during my testing and was able confirm that we are able
to recover from this.
```
2021-08-12T11:47:39.839+0000 org.apache.distributedlog.ReadAheadEntryReader ReadAhead for 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004:<default> is caught up and no log segments to read now
2021-08-12T11:47:39.834+0000 org.apache.distributedlog.logsegment.PerStreamLogSegmentCache 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
2021-08-12T11:47:39.834+0000 org.apache.distributedlog.ReadAheadEntryReader Starting the readahead entry reader for 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004:<default> : number of segments: 1, top 10 segments = [[LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]]
2021-08-12T11:47:39.834+0000 org.apache.distributedlog.logsegment.PerStreamLogSegmentCache 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
2021-08-12T11:47:29.516+0000 org.apache.bookkeeper.stream.storage.impl.sc.ZkStorageContainerManager Failed to start storage container (0)
2021-08-12T11:47:29.513+0000 org.apache.bookkeeper.stream.storage.impl.sc.StorageContainerRegistryImpl De-registered StorageContainer ('0') when failed to start
2021-08-12T11:47:29.511+0000 org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore Timeout waiting for checkpoint restore: Checkpoint{ID='cea4e286-5224-43ac-b6e0-7321d607f98e', createdAt: 1628793488360 files: "CURRENT"\nfiles: "MANIFEST-000004"\nfiles: "OPTIONS-000008"\ntxid: "\\000\\000\\000\\000\\000\\000\\000\\000"\ncreated_at: 1628793488360\nfileInfos {\n name: "CURRENT"\n checksum: "0861415cada612ea5834d56e2cf1055d3e63979b69eb71d32ae9ae394d8306cd"\n}\nfileInfos {\n name: "MANIFEST-000004"\n checksum: "7e4ff7d93941139236f69906d42a9e387dec565966b0017b6e8b16650c6c8be5"\n}\nfileInfos {\n name: "OPTIONS-000008"\n checksum: "41c8a054439d7798f88992a7a2dcb96f189cce0ed7ddb4c71af3594edc6a7a14"\n}\n}
2021-08-12T11:47:29.489+0000 org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.CheckpointFile Timeout waiting for copy: 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 last-read 0 current-read 0 runtime(ms) 300000
2021-08-12T11:42:29.467+0000 org.apache.distributedlog.ReadAheadEntryReader Starting the readahead entry reader for 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004:<default> : number of segments: 1, top 10 segments = [[LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]]
2021-08-12T11:42:29.466+0000 org.apache.distributedlog.logsegment.PerStreamLogSegmentCache 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
2021-08-12T11:42:29.466+0000 org.apache.distributedlog.logsegment.PerStreamLogSegmentCache 000000000000000000/000000000000000000/000000000000000000/checkpoints/cea4e286-5224-43ac-b6e0-7321d607f98e/MANIFEST-000004 added log segment (logrecs_000000000000000001 : [LogSegmentId:2534, firstTxId:99, lastTxId:99, version:VERSION_V5_SEQUENCE_ID, completionTime:1628793488256, recordCount:1, regionId:0, status:0, logSegmentSequenceNumber:1, lastEntryId:0, lastSlotId:0, inprogress:false, minActiveDLSN:DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0}, startSequenceId:0]) to cache.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] hsaputra merged pull request #2764: [Stream storage] Ensure progress while restoring from checkpoint.
Posted by GitBox <gi...@apache.org>.
hsaputra merged pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] sursingh commented on pull request #2764: Ensure progress while restoring from checkpoint.
Posted by GitBox <gi...@apache.org>.
sursingh commented on pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#issuecomment-900550890
rerun failure checks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2764: Ensure progress while restoring from checkpoint.
Posted by GitBox <gi...@apache.org>.
ivankelly commented on a change in pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#discussion_r691097245
##########
File path: stream/statelib/src/main/java/org/apache/bookkeeper/statelib/impl/rocksdb/checkpoint/CheckpointInfo.java
##########
@@ -128,17 +131,26 @@ public int compareTo(CheckpointInfo o) {
return this.getCreatedAt().compareTo(o.getCreatedAt());
}
- public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task) throws StateStoreException, IOException {
+ public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task)
+ throws StateStoreException, IOException, TimeoutException {
+
task.restore(id, metadata);
updateCurrent(dbPath);
log.info("Successfully restore checkpoint {} to {}", id, getCheckpointPath(dbPath));
return metadata;
}
+ public CheckpointMetadata restore(String dbName, File dbPath, CheckpointStore store)
+ throws StateStoreException, TimeoutException {
+
+ return restore(dbName, dbPath, store, Duration.ofMillis(5));
Review comment:
5ms is a very small amount of "idle" time. Gc could easily trigger this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2764: Ensure progress while restoring from checkpoint.
Posted by GitBox <gi...@apache.org>.
ivankelly commented on a change in pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#discussion_r691097245
##########
File path: stream/statelib/src/main/java/org/apache/bookkeeper/statelib/impl/rocksdb/checkpoint/CheckpointInfo.java
##########
@@ -128,17 +131,26 @@ public int compareTo(CheckpointInfo o) {
return this.getCreatedAt().compareTo(o.getCreatedAt());
}
- public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task) throws StateStoreException, IOException {
+ public CheckpointMetadata restore(File dbPath, RocksdbRestoreTask task)
+ throws StateStoreException, IOException, TimeoutException {
+
task.restore(id, metadata);
updateCurrent(dbPath);
log.info("Successfully restore checkpoint {} to {}", id, getCheckpointPath(dbPath));
return metadata;
}
+ public CheckpointMetadata restore(String dbName, File dbPath, CheckpointStore store)
+ throws StateStoreException, TimeoutException {
+
+ return restore(dbName, dbPath, store, Duration.ofMillis(5));
Review comment:
5ms is a very small amount of "idle" time. Gc could easily trigger this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] sursingh commented on pull request #2764: Ensure progress while restoring from checkpoint.
Posted by GitBox <gi...@apache.org>.
sursingh commented on pull request #2764:
URL: https://github.com/apache/bookkeeper/pull/2764#issuecomment-900550890
rerun failure checks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org