You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/07/09 14:12:42 UTC

[GitHub] [pulsar] sbourkeostk opened a new issue #7489: Every function/connector exception results in a new metric

sbourkeostk opened a new issue #7489:
URL: https://github.com/apache/pulsar/issues/7489


   **Describe the bug**
   When an exception occurs in a Pulsar function or connector the timestamp of the exception is included as a metric label and the gauge value is set to 1.0. This has several drawbacks:
   1. It cause the brokers' metrics scrapes to continuously grow - we witnessed individual broker metrics scrapes of ~ 100 MB
   2. In Prometheus it results in a new time series (of length 1) for **every** exception thrown - in extreme cases this will cause Prometheus to run out of memory and fail - we experienced such a failure.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Get a Pulsar function to throw multiple exceptions
   2. Scrape the metrics for the broker involved
   3. Look for pulsar_function_user_exception metrics
   4. A gauge will exist for each exception thrown (see "screenshot" below)
   
   **Expected behavior**
   The timestamp not included in the metric, and the metric incremented for each exception thrown.
   
   **Screenshots**
   ```
   # HELP pulsar_function_user_exception Exception from user code.
   # TYPE pulsar_function_user_exception gauge
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301177959",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301238033",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301266054",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301278993",} 1.0
   pulsar_function_user_exception{tenant="public",namespace="public/default",name="Thrower",instance_id="0",cluster="standalone",fqfn="public/default/Thrower",error="Exception message",ts="1594301272472",} 1.0
   ```
   
   **Desktop (please complete the following information):**
    OS: (NA)
   
   **Additional context**
   We experienced failures in our production system due to this issue. A single node of a distributed database being down caused function exceptions - we were prepared for this and it did not cause problem. However the resulting metrics caused our Prometheus to fail and required our brokers to be restarted to clear their metrics data.
   While this example is for a Pulsar function, the same is true for connectors.
   
   https://github.com/apache/pulsar/blob/beb9e3be60513bdfbd0e412a68747b97714af1d7/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L246
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sbourkeostk commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
sbourkeostk commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656705729


   I don't think there's a problem in the Prometheus client, it's Pulsar that's creating the new metrics by setting a new label. Prometheus is clear in it's advice about not overusing labels:
   https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
   I believe the solution is simple - don't set the "ts" label and instead increment the metric.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost edited a comment on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost edited a comment on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656182881


   It looks like the current design is leaking the metrics when exceptions are thrown. 
   @sijie @jerrypeng Based on feedback from the Prometheus maintainers, the way we're currently using the metrics for Java functions goes against their design recommendations. So, when I wrote the Prometheus implementation for the pulsar-function-go module, I implemented their recommended design instead. (It uses a stateless approach instead of the stateful approach that currently exists in our Java function Prometheus implementation.)
   I need to confirm if the new approach resolves these issues or not. 
   
   We may need to do some refactoring to fix this issue, but the implementation that I used for Go functions may serve as a guide:
   https://github.com/apache/pulsar/blob/master/pulsar-function-go/pf/stats.go


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656182881


   It looks like the current design is leaking the metrics when exceptions are thrown. 
   @sijie @jerrypeng Based on feedback from the Prometheus maintainers, the way we're currently using the metrics goes against their design recommendations. So, when I wrote the Prometheus implementation for the pulsar-function-go module, I implemented their recommended design instead, which avoids these issues. (It uses a stateless approach instead of the stateful approach that currently exists in our Java function Prometheus implementation.)
   
   We may need to do some refactoring to fix this issue, but the implementation that I used for Go functions may serve as a guide:
   https://github.com/apache/pulsar/blob/master/pulsar-function-go/pf/stats.go


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost removed a comment on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost removed a comment on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656871407


   @matthewfollegot 
   Since we're using the Prometheus Counter's internal state to keep track of our exceptions, every time an exception is thrown, we're still adding the exception to the Counter, aren't we?
   e.g. https://github.com/apache/pulsar/blob/02bf9a0b770e53f5a6f3810e9602ccc9a4c05050/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L293
   
   If I'm not mistaken, even if we omit the timestamp, wouldn't we still need to check if that exception has been thrown recently and filter new writes (unless we change how we're tracking that information) to prevent the new exception from perpetuating the same issue?
   If that's true and we need to filter new writes, then unless we periodically reset the exceptions (which we might already do), we could end up with new exceptions being suppressed inadvertently because they match old exceptions. There's a difference between an exception that occurs once a day and an exception that occurs with every method call, so we'd need a way to distinguish between them. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost edited a comment on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost edited a comment on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656871611


   @merlimat  
   Since we're using the Prometheus Counter's internal state to keep track of our exceptions in our Java implementation, every time an exception is thrown, we're still adding the exception to the Counter, aren't we?
   e.g. https://github.com/apache/pulsar/blob/02bf9a0b770e53f5a6f3810e9602ccc9a4c05050/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L293
   
   If I'm not mistaken, even if we omit the timestamp, wouldn't we still need to check if that exception has been thrown recently and filter new writes (unless we change how we're tracking that information) to prevent the new exception from perpetuating the same issue?
   If that's true and we need to filter new writes, then unless we periodically reset the exceptions (which we might already do), we could end up with new exceptions being suppressed inadvertently because they match old exceptions. There's a difference between an exception that occurs once a day and an exception that occurs with every method call, so we'd need a way to distinguish between them. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] vzhikserg commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
vzhikserg commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-658851306


   @devinbost could you please review the go implementation in the #7539 pull request?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost edited a comment on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost edited a comment on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656182881


   It looks like the current design is leaking the metrics when exceptions are thrown. 
   @sijie @jerrypeng Based on feedback from the Prometheus maintainers, the way we're currently using the metrics for Java functions goes against their design recommendations. So, when I wrote the Prometheus implementation for the pulsar-function-go module, I implemented their recommended design instead, which avoids these issues. (It uses a stateless approach instead of the stateful approach that currently exists in our Java function Prometheus implementation.)
   
   We may need to do some refactoring to fix this issue, but the implementation that I used for Go functions may serve as a guide:
   https://github.com/apache/pulsar/blob/master/pulsar-function-go/pf/stats.go


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-659728513


   @vzhikserg Looks great! We might want to take a look at the Python implementation too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656862813


   > but if it will change the end-user data contract
   
   This is not changing the data contract. This is just a metric to help debuggabiity. 
   
   I think the timestamp was there to display the "exceptions in the last X hours". If it doesn't work with Prometheus, we need to get rid of the label. There are surely other ways to do the query without having that label, since the timeseries have timestamps.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] merlimat closed issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
merlimat closed issue #7489:
URL: https://github.com/apache/pulsar/issues/7489


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656861153


   @sijie I'm open to contributing a fix for this, but if it will change the end-user data contract, we'd need to vet it with the community. I'll need to investigate the matter further. 
   
   @sbourkeostk Thanks for the feedback. I'll take a closer look at that information. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656871611


   @merlimat  
   Since we're using the Prometheus Counter's internal state to keep track of our exceptions, every time an exception is thrown, we're still adding the exception to the Counter, aren't we?
   e.g. https://github.com/apache/pulsar/blob/02bf9a0b770e53f5a6f3810e9602ccc9a4c05050/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L293
   
   If I'm not mistaken, even if we omit the timestamp, wouldn't we still need to check if that exception has been thrown recently and filter new writes (unless we change how we're tracking that information) to prevent the new exception from perpetuating the same issue?
   If that's true and we need to filter new writes, then unless we periodically reset the exceptions (which we might already do), we could end up with new exceptions being suppressed inadvertently because they match old exceptions. There's a difference between an exception that occurs once a day and an exception that occurs with every method call, so we'd need a way to distinguish between them. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656873339


   It looks like we might be able to just remove the timestamp for the Go functions implementation, but is it that simple on the Java side? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] devinbost commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
devinbost commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656871407


   @matthewfollegot 
   Since we're using the Prometheus Counter's internal state to keep track of our exceptions, every time an exception is thrown, we're still adding the exception to the Counter, aren't we?
   e.g. https://github.com/apache/pulsar/blob/02bf9a0b770e53f5a6f3810e9602ccc9a4c05050/pulsar-functions/instance/src/main/java/org/apache/pulsar/functions/instance/stats/FunctionStatsManager.java#L293
   
   If I'm not mistaken, even if we omit the timestamp, wouldn't we still need to check if that exception has been thrown recently and filter new writes (unless we change how we're tracking that information) to prevent the new exception from perpetuating the same issue?
   If that's true and we need to filter new writes, then unless we periodically reset the exceptions (which we might already do), we could end up with new exceptions being suppressed inadvertently because they match old exceptions. There's a difference between an exception that occurs once a day and an exception that occurs with every method call, so we'd need a way to distinguish between them. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on issue #7489: Every function/connector exception results in a new metric

Posted by GitBox <gi...@apache.org>.
sijie commented on issue #7489:
URL: https://github.com/apache/pulsar/issues/7489#issuecomment-656437080


   @devinbost Are you interested in contributing a fix for this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org