Posted to user@flink.apache.org by jelmer <jk...@gmail.com> on 2018/06/14 21:34:03 UTC

TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

Hi

We are using flink-metrics-prometheus for reporting on Apache Flink 1.4.2.

I am looking into an issue where it seems that, in some cases, the metrics
registered by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
(flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not unregistered
from Prometheus when a job restarts.

Eventually this seems to cause a java.lang.NoClassDefFoundError:
org/apache/kafka/common/metrics/stats/Rate$1 when a new version of the job
is deployed, because the jar file
in /tmp/blobStore-foo/job_bar/blob_p-baz-qux is removed on deployment of the
new job, but the URL classloader still points to it and can no longer load
stats/Rate$1 (a synthetic class generated by the Java compiler because of a
switch on an enum).
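
To make the Rate$1 part concrete, here is a minimal example of the kind of
synthetic class javac emits for a switch on an enum (hypothetical names,
nothing to do with the Kafka code itself):

// Compiling this emits a synthetic SwitchOnEnum$1.class (next to
// SwitchOnEnum.class and SwitchOnEnum$Unit.class); it holds the int[]
// lookup table for the switch and is only loaded the first time the switch
// runs. If the jar backing the classloader is gone by then, that load fails
// with NoClassDefFoundError.
public class SwitchOnEnum {

    enum Unit { SECONDS, MINUTES }

    static long toMillis(Unit unit, long value) {
        switch (unit) {
            case SECONDS:
                return value * 1000L;
            case MINUTES:
                return value * 60_000L;
            default:
                throw new IllegalArgumentException("unknown unit: " + unit);
        }
    }
}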

Has anybody come across this issue? Has it possibly been fixed in 1.5?
Can somebody give me any pointers as to where to look to tackle this?

The attached screenshot shows the classloader that cannot be garbage
collected, along with its GC root.

Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

Posted by jelmer <jk...@gmail.com>.
Thanks for taking a look. I took the liberty of creating a pull request for
this.

https://github.com/apache/flink/pull/6211

It would be great if you guys could take a look at it and see if it makes
sense. I tried it out on our servers and it seems to do the job



Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

Posted by Chesnay Schepler <ch...@apache.org>.
Great work on debugging this, you're exactly right.

The children we add to the collector have to be removed individually 
when a metric is unregistered.

If the collector is an io.prometheus.client.Gauge we can use the
#remove() method. For histograms we will have to modify our
HistogramSummaryProxy class to allow removing individual histograms.
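
For the Gauge case, removing a single child via the plain simpleclient API
looks roughly like this (a standalone sketch with example metric and label
names, not the actual reporter change):

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;

public class GaugeChildRemoval {
    public static void main(String[] args) {
        Gauge gauge = Gauge.build()
                .name("flink_taskmanager_job_task_buffers_inPoolUsage")
                .help("Example gauge shared by several metrics.")
                .labelNames("job_id", "task_id")
                .register(CollectorRegistry.defaultRegistry);

        // each registered metric becomes a child identified by its label values
        gauge.labels("jobA", "0").set(0.5);
        gauge.labels("jobB", "0").set(0.7);

        // when job B's metric is unregistered only its child should go away;
        // the collector itself must stay because job A still uses it
        gauge.remove("jobB", "0");
    }
}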

I've filed FLINK-9665 <https://issues.apache.org/jira/browse/FLINK-9665>.



Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

Posted by jelmer <jk...@gmail.com>.
Hi Chesnay, sorry for the late reply. I did not have time to look into this
sooner

I did what you suggested and added some logging to the PrometheusReporter,
like this:

https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a

And deployed a custom build of the reporter to our test environment.

I managed to reproduce the issue like this:

1. Deploy job A: it lands on worker 1.
2. Deploy job B: it also lands on worker 1; take note of its job id.
3. Redeploy job B by canceling it with a savepoint and deploying it again
from that savepoint: it lands on worker 3.
4. Execute curl -s http://localhost:9249/metrics | grep "job id from step
2" on worker 1. The metrics are still exposed even though the job was
canceled.

I attached a piece of the log to the email. What I notice is that the two
jobs register metrics with the same scoped metric name, in this case
flink_taskmanager_job_task_buffers_inputQueueLength.

The Prometheus exporter seems to use reference counting for the metrics,
and a metric is only removed when the count reaches 0. Canceling job B
lowers the counter by 5, but because job A is still deployed the count
never reaches 0, so the metric never gets unregistered.
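
Schematically, the pattern seems to be something like this (a sketch with
made-up names, not the actual reporter code):

import io.prometheus.client.Collector;
import io.prometheus.client.CollectorRegistry;

import java.util.HashMap;
import java.util.Map;

class RefCountedCollectors {

    private final Map<String, Collector> collectorsByName = new HashMap<>();
    private final Map<String, Integer> refCountsByName = new HashMap<>();

    synchronized void onMetricAdded(String scopedName, Collector collector) {
        int count = refCountsByName.merge(scopedName, 1, Integer::sum);
        if (count == 1) {
            // first metric with this scoped name: register a fresh collector
            collectorsByName.put(scopedName, collector);
            CollectorRegistry.defaultRegistry.register(collector);
        }
        // for count > 1 a child with this metric's label values is added to
        // the existing collector, but that child is never remembered for
        // later removal
    }

    synchronized void onMetricRemoved(String scopedName) {
        int count = refCountsByName.merge(scopedName, -1, Integer::sum);
        if (count == 0) {
            // only when the last metric with this name goes away is anything
            // actually removed from the registry
            CollectorRegistry.defaultRegistry.unregister(collectorsByName.remove(scopedName));
            refCountsByName.remove(scopedName);
        }
        // otherwise the canceled job's child (its label values) keeps being
        // exposed, which is the lingering-metrics behaviour described above
    }
}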

Canceling job A will remove the lingering metrics from the old job B

It seems to me that this is a bug, and that the children that are added
in notifyOfAddedMetric
<https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163>
should be removed in notifyOfRemovedMetric.

Can you confirm this?


--Jelmer




Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

Posted by Chesnay Schepler <ch...@apache.org>.
I remember that another user reported something similar, but he wasn't 
using the PrometheusReporter; see 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html

We couldn't find the cause, but my suspicion was FLINK-8946, which will 
be fixed in 1.4.3.
You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see 
whether this actually caused it.

Alternatively, if you can reproduce this it would be immensely helpful 
if you could modify the PrometheusReporter and log all notifications 
about added or removed metrics.
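
Schematically, something along these lines (a rough sketch, assuming the
notification methods can simply be overridden; the class name and log
messages are made up):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.prometheus.PrometheusReporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingPrometheusReporter extends PrometheusReporter {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingPrometheusReporter.class);

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        // log the fully scoped name before delegating to the normal behaviour
        LOG.info("Added metric '{}' as '{}'", metricName, group.getMetricIdentifier(metricName));
        super.notifyOfAddedMetric(metric, metricName, group);
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        LOG.info("Removed metric '{}' as '{}'", metricName, group.getMetricIdentifier(metricName));
        super.notifyOfRemovedMetric(metric, metricName, group);
    }
}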



Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

Posted by Till Rohrmann <tr...@apache.org>.
Hi,

this sounds very strange. I just tried it out locally with a standard
metric, and the Prometheus metrics seem to be unregistered after the job has
reached a terminal state. Thus, it looks as if the standard metrics are
properly removed from `CollectorRegistry.defaultRegistry`. Could you check
whether the log files contain anything suspicious about a failed metric
deregistration, a la `There was a problem unregistering metric`?
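
If it helps, one quick way to see what is still registered in the default
registry is to dump it (plain simpleclient API; the reporter's /metrics
endpoint exposes the same information):

import io.prometheus.client.Collector;
import io.prometheus.client.CollectorRegistry;

import java.util.Collections;

public class DumpDefaultRegistry {
    public static void main(String[] args) {
        // print every metric family and how many samples (label combinations) it has
        for (Collector.MetricFamilySamples family
                : Collections.list(CollectorRegistry.defaultRegistry.metricFamilySamples())) {
            System.out.println(family.name + " -> " + family.samples.size() + " samples");
        }
    }
}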

I've also pulled in Chesnay who knows more about the metric reporters.

Cheers,
Till
