You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/02/23 06:24:07 UTC

[GitHub] [bookkeeper] michaeljmarshall opened a new issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

michaeljmarshall opened a new issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619


   @frankjkelly opened this issue https://github.com/apache/pulsar/issues/9634. Copying it here, as it is a bookkeeper issue.
   
   Here is the original content of the issue:
   
   > **Describe the bug**
   > The prometheus metric `bookie_journal_JOURNAL_QUEUE_SIZE` exposed by bookie is negative. Shouldn't queue lengths be positive
   > 
   > **To Reproduce**
   > 
   > 1. Launch Pulsar on Kubernetes
   > 2. Inspect metrics endpoint of Bookkeeper
   > 
   > ```
   > $ curl http://platform-pulsar-bookkeeper:8000/metrics | grep QUEUE_SIZE
   > # TYPE bookie_journal_JOURNAL_CB_QUEUE_SIZE counter
   > bookie_journal_JOURNAL_CB_QUEUE_SIZE 57253790
   > # TYPE bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE counter
   > bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE 0
   > # TYPE bookie_journal_JOURNAL_QUEUE_SIZE counter
   > bookie_journal_JOURNAL_QUEUE_SIZE -57253790
   > ```
   > 
   > **Expected behavior**
   > Queue lengths are zero or above
   > 
   > **Screenshots**
   > ![image](https://user-images.githubusercontent.com/62910985/108514728-34b4fc00-7292-11eb-9038-ad435fc9f12c.png)
   > 
   > **Additional context**
   > 
   > * Pulsar `2.6.1`
   > * Kafkaesque Helm Chart `1.0.24`
   > * Probably has not bearing but `exposeTopicLevelMetricsInPrometheus` is set to `false` as we have a lot of topics


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] michaeljmarshall edited a comment on issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
michaeljmarshall edited a comment on issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619#issuecomment-783949669


   In digging into this issue, I noticed a couple different irregularities.
   
   It looks like there might be an issue for which argument is passed into the `QueueEntry` constructor here. The final argument is `journalCbQueueSize`, but this method `journalStats.getJournalQueueSize()` is called to get the `Counter` that gets passed into constructor:
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L867-L872
   
   I'm not yet sure if the parameter is named incorrectly or if the argument is incorrectly passed. Notice that the constructor creates a runnable instance that has the following `run` definition:
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L328-L336
   
   That `run` method is the only place that I can see the metric being decremented, which could lead to a negative metric. I'm guessing we must be calling that run method multiple times (we know we're seeing a negative metric), and I found two places where we likely are. Even if the argument being passed in to the method above is wrong, it looks like there is still a bug in the way `dec` is being called. Here are the two references (or at least part of their stacks):
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L1031-L1038
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L1060-L1076
   
   I plan to continue digging into this more tomorrow night, but I wanted to post my findings thus far.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] michaeljmarshall edited a comment on issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
michaeljmarshall edited a comment on issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619#issuecomment-784364213


   In my haste to post a comment last night, I think I might have misunderstood the `Journal`'s code. What I posted above regarding the two places where we execute the runnable, I missed the line where we set the flushed value to `null` `(toFlush.set(i, null);`). I read more of this code this morning and am getting a better understanding, as is likely still an issue, but it might be different from my initial comment.
   
   I looked at my deployment of bookkeeper, and I only see nonnegative values for `bookie_journal_JOURNAL_QUEUE_SIZE` and `bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE`. Both of those appear to go up and down (like a queue's size should), and don't go below 0. Further, `bookie_journal_JOURNAL_CB_QUEUE_SIZE` appears to be growing unbounded, which is likely because of the use of the wrong parameter when creating the `QueueEntry`. That should be fixed. I'll continue looking into these metrics later tonight and post an update then.
   
   Edit: I'm running pulsar `2.7.0`, which runs with bookkeeper `4.12.0`.
   
   @frankjkelly - can you post your bookkeeper configuration here? Or at least let me know how you deployed pulsar (if it was with the helm chart, what version and values)? Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] michaeljmarshall commented on issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619#issuecomment-784755700


   It looks like this issue has already been solved https://github.com/apache/bookkeeper/pull/2290. The user was running pulsar `2.6.1`, which pulled in bookkeeper `4.10.0`. Upgrading to pulsar `2.6.3` will resolve the bug for the negative `bookie_journal_JOURNAL_QUEUE_SIZE` metric.
   
   Given that this bug is solved, my hypothesizing about other possible issues regarding executables being run multiple times is no longer valid (it is no longer true that the current code is producing a negative metric).
   
   I do, however, wonder about the naming of some of the metrics. I think I'll open a separate issue instead to dig into the naming though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] michaeljmarshall commented on issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619#issuecomment-784364213


   In my haste to post a comment last night, I think I might have misunderstood the `Journal`'s code. What I posted above regarding the two places where we execute the runnable, I missed the line where we set the flushed value to `null` `(toFlush.set(i, null);`). I read more of this code this morning and am getting a better understanding, as is likely still an issue, but it might be different from my initial comment.
   
   I looked at my deployment of bookkeeper, and I only see nonnegative values for `bookie_journal_JOURNAL_QUEUE_SIZE` and `bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE`. Both of those appear to go up and down (like a queue's size should), and don't go below 0. Further, `bookie_journal_JOURNAL_CB_QUEUE_SIZE` appears to be growing unbounded, which is likely because of the use of the wrong parameter when creating the `QueueEntry`. That should be fixed. I'll continue looking into these metrics later tonight and post an update then.
   
   @frankjkelly - can you post your bookkeeper configuration here? Or at least let me know how you deployed pulsar (if it was with the helm chart, what version and values)? Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] eolivelli commented on issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619#issuecomment-783993518


   @michaeljmarshall your findings are very interesting.
   We should fix it.
   Please go ahead
   thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] michaeljmarshall closed issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
michaeljmarshall closed issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] michaeljmarshall commented on issue #2619: bookie_journal_JOURNAL_QUEUE_SIZE Prometheus metric is negative

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on issue #2619:
URL: https://github.com/apache/bookkeeper/issues/2619#issuecomment-783949669


   In digging into this issue, I noticed a couple different irregularities.
   
   It looks like there might be an issue for which argument is passed into the `QueueEntry` constructor here. The final argument is `journalCbQueueSize`, but this method `journalStats.getJournalQueueSize()` is called to get the `Counter` that gets passed into constructor:
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L867-L872
   
   I'm not yet sure if the parameter is named incorrectly or if the argument is correctly passed. Notice that the constructor creates a runnable instance that has the following `run` definition:
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L328-L336
   
   That `run` method is the only place that I can see the metric being decremented, which would lead to a negative metric. That means we must be calling that run method multiple times (we know we're seeing a negative metric), and I found two places where we likely are. Even if the argument being passed in to the method above is wrong, it looks like there is still a bug in the way `dec` is getting called. Here are the two references (or at least part of their stacks):
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L1031-L1038
   
   https://github.com/apache/bookkeeper/blob/26120e0436d0076bf7990b5cea1e41a3e287af04/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java#L1060-L1076
   
   I plan to continue digging into this more tomorrow night, but I wanted to post my findings thus far.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org