You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/06/03 18:34:43 UTC

[GitHub] [bookkeeper] RaulGracia opened a new issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

RaulGracia opened a new issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728


   **BUG REPORT**
   
   ***Describe the bug***
   
   _Summary_: We use `entryLogPerLedgerEnabled:"true"` option in environments that have SSDs/NVMes to exploit drive parallelism and it provides significant improvements (as suggested in documentation). However, under high throughput workloads (1GBps, 6 Bookies, 3 replicas), our experiments showed us that there is a consistent problem with this option in which, at some point, makes the ledger volume to run out of storage space. Concretely, at some point one entry log file does not get rotated (i.e., `flushRotatedLogs()` method does not get invoked in `EntryLogManagerForEntryLogPerLedger`) while the system continues writing data and generating more entry log files. When this happens, and given that the removal of unnecessary (truncated) entry logs is sequential by entry log id, the GC Thread cannot continue progressing deleting entry logs. This eventually leads to the ledger volume to fill up. This problem was first detected on Bookkeeper 4.11.1 but we also reproduced it in Bookkeep
 er 4.14.0.
   
   Next, we expose a more detailed explanation with real logs of the issue. Everything works well for some time (minutes or hours, depending on the load) until we detect some ledger activity like this:
   
   ```
   (ENTRY LOG ID 1261)
   λ grep "Created new entry log file\|Synced" bookkeeper-server-0.log | grep "1261"
   2021-05-09 14:35:25,364 - INFO  - [pool-4-thread-1:EntryLoggerAllocator@181] - Created new entry log file /bk/ledgers/current/4ed.log for logId 1261.
   2021-05-09 14:37:32,800 - INFO  - [SyncThread-14-1:EntryLogManagerForEntryLogPerLedger@693] - Synced entry logger 1261 to disk.
   
   (ENTRY LOG ID 1262)
   λ grep "Created new entry log file\|Synced" bookkeeper-server-0.log | grep "1262"
   2021-05-09 14:35:26,152 - INFO  - [pool-4-thread-1:EntryLoggerAllocator@181] - Created new entry log file /bk/ledgers/current/4ee.log for logId 1262.
   
   (ENTRY LOG ID 1263)
   λ grep "Created new entry log file\|Synced" bookkeeper-server-0.log | grep "1263"
   2021-05-09 14:35:26,419 - INFO  - [pool-4-thread-1:EntryLoggerAllocator@181] - Created new entry log file /bk/ledgers/current/4ef.log for logId 1263.
   2021-05-09 14:36:33,031 - INFO  - [SyncThread-14-1:EntryLogManagerForEntryLogPerLedger@693] - Synced entry logger 1263 to disk.
   ```
   
   So, as we can see, there is no `Synced entry logger XXX` for entry log id `1262`, but the subsequent entry logs are rotated. The problem is that it seems that one entry log file not being "rotated" impacts GC for all the subsequent files. That is, if we look at the code that calculates the `entryLogger.getLeastUnflushedLogId()`, is as follows: https://github.com/apache/bookkeeper/blob/8dc108b862db7aa11c557695ea4beffcf21cece4/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/EntryLogger.java#L1240
   
   ```
   synchronized void flushRotatedEntryLog(Long entryLogId) {                
       entryLogsStatusMap.replace(entryLogId, true);            
       while ((!entryLogsStatusMap.isEmpty()) &&(entryLogsStatusMap.get(entryLogsStatusMap.firstKey()))) {                
           long leastFlushedLogId = entryLogsStatusMap.firstKey();                
           entryLogsStatusMap.remove(leastFlushedLogId);                
           leastUnflushedLogId = leastFlushedLogId + 1;            
       }        
   }
   ```
   
   As you can see, `flushRotatedEntryLog` is only going to make progress assuming that each an every of the entry log files have been rotated. If just one of them is not rotated, then it stops updating `leastUnflushedLogId`, which makes the GC not deleting entry logs beyond that point: https://github.com/apache/bookkeeper/blob/25264f997ef22d9f456a9b88832dfa6634927b60/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/GarbageCollectorThread.java#L589
   
   For that entry log `1262`, we see this message multiple times along the experiment:
   
   ```
   2021-05-09 14:35:31,942 - DEBUG - [SyncThread-14-1:EntryLogManagerBase@129] - Flush and sync current entry logger 1262
   ...
   2021-05-09 15:07:31,831 - DEBUG - [SyncThread-14-1:EntryLogManagerBase@129] - Flush and sync current entry logger 1262
   ```
   
   Again, there is no sign of the message `Synced entry logger XXXX` to disk for that entry log file. This explains why the GC stopped, which makes sense given the current logic. But we don't know why the rotation of that entry log `1262` did not complete. Another question is whether GC for any subsequent entry log file should be stuck just because one of them is making slower progress in getting rotated.
   
   Note that this problem only appears when `entryLogPerLedgerEnabled:"true"`, so disabling that option solves the problem but at the cost of not being able to exploit the parallelism of SSDs/NVMe drives.
   
   ***To Reproduce***
   
   Inducing a high load (+500MBps) on a Bookie with `entryLogPerLedgerEnabled:"true"`, we normally get this error.
   
   ***Expected behavior***
   
   A Bookie with `entryLogPerLedgerEnabled:"true"` should be able to continue garbage collecting entry log files even though one of the previous entry log files has not been yet rotated.
   
   ***Screenshots***
   In this screenshot you can see the behavior related to this issue. As you can see, the Disk Usage of the ledger volume for few Bookies continues to grow until they get in read-only mode, whereas others are still handling properly GC activity:
   
   ![image (3)](https://user-images.githubusercontent.com/717112/120694470-892f6080-c4aa-11eb-8a55-bee4895abf16.png)
   
   
   ***Additional context***
   n/a


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] dlg99 commented on issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

Posted by GitBox <gi...@apache.org>.

dlg99 commented on issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728#issuecomment-859173284


   @fpj ELPL gets tricky if you don't know size/number of ledgers. I.e. in multi tenants cluster tenants can end up with too many small ledgers (or many tenants/topics in pulsar with ledger rotation policy on the schedule) + large disk.
   
   I don't remember all the context from the time when ELPL was implemented, I think the following factors should be considered:
   * File system doesn't always handle too many files gracefully. 
   * Then there are more entry logs open for write -> more fsyncs/parallel fsyncs (suboptimal on rotational disks, sometimes on SSDs)
   * potentially too many file handles
   * checkpoint on flush interval and perf considerations when journal(s) is co-located with entry logs
   
   maybe something else.
   
   I do agree that reduced write amplification (no compaction) helps (helped in the env it was developed for - large ledgers of predictable sizes, multiple SSDs, etc).
   
   If you are looking for the lessons learned with ELPL, I'd suggest reaching out to @reddycharan and @jvrao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] OlegKashtanov commented on issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

Posted by GitBox <gi...@apache.org>.

OlegKashtanov commented on issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728#issuecomment-862169887


   I reproduced that issue using Bookkeeper benchmark tool.
   The following deployment and test steps:
   #### Configuration
   - Kubernetes cluster
   - 3 ZKs
   - 6 BKs. 1 NVME drive per BK (both for Ledger and Journal).
   - 1 BK benchmark tool
   #### Deployment steps
   - Deploy ZK operator:
   ```
   git clone https://github.com/pravega/zookeeper-operator.git
   helm install zk zookeeper-operator/charts/zookeeper-operator --set image.tag=0.2.8
   ```
   - Deploy bookkeeper operator
   ```
   kubectl create -f https://raw.githubusercontent.com/pravega/bookkeeper-operator/master/deploy/certificate.yaml
   git clone https://github.com/pravega/bookkeeper-operator.git
   helm install bk bookkeeper-operator/charts/bookkeeper-operator --set image.tag=0.1.3 --set testmode.enabled=true
   ```
   - Apply the following configmap:
   ```
   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: bk-config-map
   data:
     # Configuration values can be set as key-value properties
     PRAVEGA_CLUSTER_NAME: "pravega"
     WAIT_FOR: zookeeper-client:2181
   ```
   - Build Bookie image using docker file https://github.com/pravega/pravega/blob/master/docker/bookkeeper/Dockerfile
   - Create Bookie pods using the below manifest. Don't forget to indicate Bookie's image and change Storage classes:
   ```
   apiVersion: "bookkeeper.pravega.io/v1alpha1"
   kind: "BookkeeperCluster"
   metadata:
     name: "bookkeeper"
   spec:
     version: <BK_VERSION>
     zookeeperUri: zookeeper-client:2181
     blockOwnerDeletion: false
     envVars: bk-config-map
     image:
       imageSpec:
         repository: "<BK_REPOSITORY>"
         pullPolicy: IfNotPresent
    
     replicas: 6
     resources:
       requests:
         memory: 4Gi
         cpu: 2000m
       limits:
         memory: 20Gi
         cpu: 8000m
    
     storage:
       ledgerVolumeClaimTemplate:
         accessModes: [ "ReadWriteOnce" ]
         storageClassName: "localfile-bk-ledger"
         resources:
           requests:
             storage: 700Gi
    
       journalVolumeClaimTemplate:
         accessModes: [ "ReadWriteOnce" ]
         storageClassName: "localfile-bk-journal"
         resources:
           requests:
             storage: 700Gi
    
       indexVolumeClaimTemplate:
         accessModes: [ "ReadWriteOnce" ]
         storageClassName: "ocs-storagecluster-cephfs"
         resources:
           requests:
             storage: 10Gi
    
     autoRecovery: true
     options:
       useHostNameAsBookieID: "true"
       enableStatistics: "true"
       statsProviderClass: "org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider"
       isForceGCAllowWhenNoSpace: "true"
       minorCompactionThreshold: "0.4"
       minorCompactionInterval: "1800"
       majorCompactionThreshold: "0.8"
       majorCompactionInterval: "43200"
       numAddWorkerThreads: "8"
       entryLogPerLedgerEnabled: "true"
       flushInterval: "60000"
       gcWaitTime: "600000"
       readBufferSizeBytes: "4096"
       writeBufferSizeBytes: "524288"
       ledgerStorageClass: "org.apache.bookkeeper.bookie.InterleavedLedgerStorage"
       journalDirectories: "/bk/journal/j0,/bk/journal/j1,/bk/journal/j2,/bk/journal/j3,/bk/journal/j4,/bk/journal/j5,/bk/journal/j6,/bk/journal/j7"
       ledgerDirectories: "/bk/ledgers/"
       flushEntrylogBytes: "0"
    
     jvmOptions:
             memoryOpts: ["-Xms2g","-Xmx12g","-XX:MaxDirectMemorySize=8g","-XX:+IgnoreUnrecognizedVMOptions","-XX:+UseContainerSupport","-XX:+UnlockDiagnosticVMOptions"]
   ```
   - Build Bookkeeper benchmark image using docker file:
   ```
   FROM ubuntu:18.04
   RUN apt-get update
   RUN echo deb http://ppa.launchpad.net/openjdk-r/ppa/ubuntu bionic main >| /etc/apt/sources.list.d/openjdk-r-ubuntu-ppa-bionic.list
   RUN apt-get -y install openjdk-8-jdk
   RUN apt-get -y install git
   RUN apt-get -y install maven
   RUN apt-get -y install vim
   RUN cd /opt; git clone https://github.com/apache/bookkeeper.git --branch release-4.11.1
   WORKDIR /opt/bookkeeper
   RUN mvn -pl .,bookkeeper-benchmark clean install -Dmaven.test.skip=true
   WORKDIR /root
   USER root
   RUN cd /root
   CMD ["/bin/sh", "-c", "while true; do sleep 1; done"]
   ```
   - Deploy Bookkeeper benchmark pod using the below manifest. Don't forget to indicate Image name:
   ```
   apiVersion: apps/v1
   kind: StatefulSet
   metadata:
     name: benchmark-bookkeeper
   spec:
     # modify replicas according to your case
     replicas: 1
     selector:
       matchLabels:
         app: benchmark-bookkeeper
     serviceName: "benchmark-bookkeeper"
     template:
       metadata:
         labels:
           app: benchmark-bookkeeper
       spec:
         containers:
         - name: benchmark-bookkeeper
           image: bk-bmrk:4.11.1
           imagePullPolicy: IfNotPresent
           resources:
             limits:
               cpu: "16"
               memory: "20Gi"
             requests:
               cpu: "2"
               memory: "4Gi"
           env:
           - name: BENCHMARK_EXTRA_OPTS
             value: -Dorg.apache.bookkeeper.conf.readsystemproperties=true -DzkLedgersRootPath=/pravega/pravega/bookkeeper/ledgers -Xmx7g -XX:MaxDirectMemorySize=8g
             #value: -Dorg.apache.bookkeeper.conf.readsystemproperties=true -DzkLedgersRootPath=/pravega/prvg-pravega/bookkeeper/ledgers
           - name: BENCHMARK_EXTRA_CLASSPATH
             value: /opt/bookkeeper/bookkeeper-benchmark/target/bookkeeper-benchmark-4.11.1.jar
   ```
   - Run Bookkeeper benchmark:
   ```
   kubectl exec -i benchmark-bookkeeper-0 -- /opt/bookkeeper/bookkeeper-benchmark/bin/benchmark writes -zookeeper zookeeper-client:2181 -quorum 3 -ensemble 3 -ackQuorum 3 -entrysize 300000 -ledgers 300 -throttle 10 -skipwarmup true -time 3600
   ```
   - Monitor in any way the Ledger's drive capacity utilization. You will see the permanent increasing of used space. Here the screenshot from Prometheus monitoring (Ledger's drives capacity utilization):
   ![ledgers-drives-util](https://user-images.githubusercontent.com/8101276/122186698-f5d93080-ce96-11eb-9e3c-25559b3e17f1.png)
   - When Ledger's capacity utilization is 95%, you'll see the following events in Bookkeeper logs:
   `2021-06-11 08:10:19,837 - WARN  - [BookieWriteThreadPool-OrderedExecutor-7-0:WriteEntryProcessorV3@73] - BookieServer is running as readonly mode, so rejecting the request from the client!`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] fpj commented on issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

Posted by GitBox <gi...@apache.org>.

fpj commented on issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728#issuecomment-858510104


   @dlg99 any particular why you and others don't use this option? We are observing much higher performance with `entryLogPerLedgerEnabled:"true"` in the hardware we are testing with. I'm curious to know why this is not popular given the performance benefit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] eolivelli commented on issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

Posted by GitBox <gi...@apache.org>.

eolivelli commented on issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728#issuecomment-858569126


   @diegosalvi @nicoloboschi are you enabling this option in production for BlobIt/EmailSuccess ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] dlg99 commented on issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

Posted by GitBox <gi...@apache.org>.

dlg99 commented on issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728#issuecomment-856158555


   @reddycharan any suggestions here? Do you still use ELPL?
   
   @RaulGracia I don't remember all details of the ELPL/compaction and work that happened afterwards, I'd start with checking:
   major/minor compaction intervals, and [flushInterval](https://github.com/apache/bookkeeper/blob/16e8ba772bb5cf4c7546fb559bd9d455d4e42625/conf/bk_server.conf#L430). I am not sure if ledgerStorageClass affects this, but I think SortedLedgerStorage and InterleavedLedgerStorage should work. 
   In case you are trying to repro it in unit tests, I am curious if ledgers with TTL < flushInterval are the reason for this somehow..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] eolivelli commented on issue #2728: Entry Log GC may get blocked when using entryLogPerLedgerEnabled option

Posted by GitBox <gi...@apache.org>.

eolivelli commented on issue #2728:
URL: https://github.com/apache/bookkeeper/issues/2728#issuecomment-858568965


   @fpj I believe that the problem is that no one knows about it :-) 
   
   In general in BK we have the problem that there is no advertising about new features
   
   We are doing some internal testing at DataStax for this option, especially for the fact that with this option it is easier that when you delete a ledger it is deleted from disk more promptly (see BP-45).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org