Posted to user@flink.apache.org by Nikolas Davis <nd...@newrelic.com> on 2018/05/30 23:49:12 UTC

JVM metrics disappearing after job crash, restart

Howdy,

We are seeing our task manager JVM metrics disappear over time. This last
time we correlated it to our job crashing and restarting. I wasn't able to
grab the failing exception to share. Any thoughts?

We track metrics through the MetricReporter interface. As far as I can tell
this more or less only affects the JVM metrics; most, if not all, other
metrics continue reporting fine as the job is automatically restarted.

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Posted by Chesnay Schepler <ch...@apache.org>.
The config looks OK to me. On the Flink side I cannot find an
explanation why only *some* metrics disappear.

The only explanation I could come up with at the moment is that
FLINK-8946 is triggered: all metrics are (officially) unregistered, but
the reporter fails to remove some of them (i.e. all job-related ones).
Due to FLINK-8946 no new metrics would be registered after the
JobManager restart, but the old metrics continue to be reported.

To verify this I would add logging statements to the
notifyOfAddedMetric / notifyOfRemovedMetric methods, to check whether
Flink attempts to unregister all metrics or only some.
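
For reference, a minimal sketch of such a logging probe (assuming the
standard org.apache.flink.metrics.reporter.MetricReporter interface of
Flink 1.x; the class name is made up and a real reporter would also
forward the metrics somewhere):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: logs every registration and unregistration so the
// two sets can be diffed after a JobManager restart.
public class LoggingProbeReporter implements MetricReporter {

    private static final Logger LOG =
            LoggerFactory.getLogger(LoggingProbeReporter.class);

    @Override
    public void open(MetricConfig config) {}

    @Override
    public void close() {}

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        // Log the fully qualified identifier of every metric Flink registers.
        LOG.info("ADDED   {}", group.getMetricIdentifier(metricName));
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        // Log every metric Flink unregisters; anything that shows up here
        // but never re-appears as ADDED after the restart is suspicious.
        LOG.info("REMOVED {}", group.getMetricIdentifier(metricName));
    }
}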

On 05.06.2018 02:02, Nikolas Davis wrote:
> Fabian,
>
> It does look like it may be related. I'll add a comment. After digging 
> a bit more I found that the crash and lack of metrics were 
> precipitated by the JobManager instance crashing and cycling, which 
> caused the job to restart.
>
>
> Chesnay,
>
> I didn't see anything interesting in our logs. Our reporter config is 
> fairly straightforward (I think):
>
> metrics.reporter.nr.class: com.newrelic.flink.NewRelicReporter
> metrics.reporter.nr.interval: 60 SECONDS
> metrics.reporters: nr
>
> Nik Davis
> Software Engineer
> New Relic
>
> On Mon, Jun 4, 2018 at 1:56 AM, Chesnay Schepler <chesnay@apache.org> wrote:
>
>     Can you show us the metrics-related configuration parameters in
>     flink-conf.yaml?
>
>     Please also check the logs for any warnings from the MetricGroup
>     and MetricRegistry classes.
>
>
>     On 04.06.2018 10:44, Fabian Hueske wrote:
>>     Hi Nik,
>>
>>     Can you have a look at this JIRA ticket [1] and check if it is
>>     related to the problems you are facing?
>>     If so, would you mind leaving a comment there?
>>
>>     Thank you,
>>     Fabian
>>
>>     [1] https://issues.apache.org/jira/browse/FLINK-8946
>>
>>     2018-05-31 4:41 GMT+02:00 Nikolas Davis <ndavis@newrelic.com>:
>>
>>         We keep track of metrics by using the value of
>>         MetricGroup::getMetricIdentifier, which returns the fully
>>         qualified metric name. The query that we use to monitor
>>         metrics filters for metrics IDs that
>>         match '%Status.JVM.Memory%'. As long as the new metrics come
>>         online via the MetricReporter interface then I think the
>>         chart would be continuous; we would just see the old JVM
>>         memory metrics cycle into new metrics.
>>
>>         Nik Davis
>>         Software Engineer
>>         New Relic
>>
>>         On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <ajayt@yelp.com> wrote:
>>
>>             How are your metrics dimensionalized/named? Task managers
>>             often have UIDs generated for them. The task id dimension
>>             will change on restart. If you name your metric based on
>>             this 'task_id' there would be a discontinuity with the
>>             old metric.
>>
>>             On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <ndavis@newrelic.com> wrote:
>>
>>                 Howdy,
>>
>>                 We are seeing our task manager JVM metrics disappear
>>                 over time. This last time we correlated it to our job
>>                 crashing and restarting. I wasn't able to grab the
>>                 failing exception to share. Any thoughts?
>>
>>                 We track metrics through the MetricReporter
>>                 interface. As far as I can tell this more or less
>>                 only affects the JVM metrics. I.e. most / all other
>>                 metrics continue reporting fine as the job is
>>                 automatically restarted.
>>
>>                 Nik Davis
>>                 Software Engineer
>>                 New Relic
>>
>>
>>
>>
>
>


Re: JVM metrics disappearing after job crash, restart

Posted by Nikolas Davis <nd...@newrelic.com>.
Fabian,

It does look like it may be related. I'll add a comment. After digging a
bit more I found that the crash and lack of metrics were precipitated by the
JobManager instance crashing and cycling, which caused the job to restart.


Chesnay,

I didn't see anything interesting in our logs. Our reporter config is
fairly straightforward (I think):

metrics.reporter.nr.class: com.newrelic.flink.NewRelicReporter
metrics.reporter.nr.interval: 60 SECONDS
metrics.reporters: nr
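
For illustration, a reporter configured with an interval like this would
typically implement Flink's Scheduled interface in addition to
MetricReporter. A minimal skeleton, assuming the 1.x reporter APIs
(AbstractReporter, Scheduled); the class name is made up and this is not
the actual com.newrelic.flink.NewRelicReporter:

import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.reporter.AbstractReporter;
import org.apache.flink.metrics.reporter.Scheduled;

public class ExampleScheduledReporter extends AbstractReporter implements Scheduled {

    @Override
    public void open(MetricConfig config) {
        // Reporter-specific setup (endpoints, credentials) would be read here.
    }

    @Override
    public void close() {}

    @Override
    public String filterCharacters(String input) {
        // AbstractReporter requires a character filter; no escaping in this sketch.
        return input;
    }

    @Override
    public void report() {
        // Invoked periodically (every 60 SECONDS with the config above).
        // AbstractReporter tracks registered metrics in its protected
        // gauges/counters/histograms/meters maps, keyed by metric instance.
    }
}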

Nik Davis
Software Engineer
New Relic

On Mon, Jun 4, 2018 at 1:56 AM, Chesnay Schepler <ch...@apache.org> wrote:

> Can you show us the metrics-related configuration parameters in
> flink-conf.yaml?
>
> Please also check the logs for any warnings from the MetricGroup and MetricRegistry
> classes.
>
>
> On 04.06.2018 10:44, Fabian Hueske wrote:
>
> Hi Nik,
>
> Can you have a look at this JIRA ticket [1] and check if it is related to
> the problems you are facing?
> If so, would you mind leaving a comment there?
>
> Thank you,
> Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-8946
>
> 2018-05-31 4:41 GMT+02:00 Nikolas Davis <nd...@newrelic.com>:
>
>> We keep track of metrics by using the value of
>> MetricGroup::getMetricIdentifier, which returns the fully qualified
>> metric name. The query that we use to monitor metrics filters for metrics
>> IDs that match '%Status.JVM.Memory%'. As long as the new metrics come
>> online via the MetricReporter interface then I think the chart would be
>> continuous; we would just see the old JVM memory metrics cycle into new
>> metrics.
>>
>> Nik Davis
>> Software Engineer
>> New Relic
>>
>> On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <aj...@yelp.com> wrote:
>>
>>> How are your metrics dimensionalized/named? Task managers often have
>>> UIDs generated for them. The task id dimension will change on restart. If
>>> you name your metric based on this 'task_id' there would be a discontinuity
>>> with the old metric.
>>>
>>> On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <nd...@newrelic.com>
>>> wrote:
>>>
>>>> Howdy,
>>>>
>>>> We are seeing our task manager JVM metrics disappear over time. This
>>>> last time we correlated it to our job crashing and restarting. I wasn't
>>>> able to grab the failing exception to share. Any thoughts?
>>>>
>>>> We track metrics through the MetricReporter interface. As far as I can
>>>> tell this more or less only affects the JVM metrics. I.e. most / all other
>>>> metrics continue reporting fine as the job is automatically restarted.
>>>>
>>>> Nik Davis
>>>> Software Engineer
>>>> New Relic
>>>>
>>>
>>>
>>
>
>

Re: JVM metrics disappearing after job crash, restart

Posted by Chesnay Schepler <ch...@apache.org>.
Can you show us the metrics-related configuration parameters in 
flink-conf.yaml?

Please also check the logs for any warnings from the MetricGroup and 
MetricRegistry classes.

On 04.06.2018 10:44, Fabian Hueske wrote:
> Hi Nik,
>
> Can you have a look at this JIRA ticket [1] and check if it is related 
> to the problems you are facing?
> If so, would you mind leaving a comment there?
>
> Thank you,
> Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-8946
>
> 2018-05-31 4:41 GMT+02:00 Nikolas Davis <ndavis@newrelic.com>:
>
>     We keep track of metrics by using the value of
>     MetricGroup::getMetricIdentifier, which returns the fully
>     qualified metric name. The query that we use to monitor metrics
>     filters for metrics IDs that match '%Status.JVM.Memory%'. As long
>     as the new metrics come online via the MetricReporter interface
>     then I think the chart would be continuous; we would just see the
>     old JVM memory metrics cycle into new metrics.
>
>     Nik Davis
>     Software Engineer
>     New Relic
>
>     On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <ajayt@yelp.com> wrote:
>
>         How are your metrics dimensionalized/named? Task managers
>         often have UIDs generated for them. The task id dimension will
>         change on restart. If you name your metric based on this
>         'task_id' there would be a discontinuity with the old metric.
>
>         On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <ndavis@newrelic.com> wrote:
>
>             Howdy,
>
>             We are seeing our task manager JVM metrics disappear over
>             time. This last time we correlated it to our job crashing
>             and restarting. I wasn't able to grab the failing
>             exception to share. Any thoughts?
>
>             We track metrics through the MetricReporter interface. As
>             far as I can tell this more or less only affects the JVM
>             metrics. I.e. most / all other metrics continue reporting
>             fine as the job is automatically restarted.
>
>             Nik Davis
>             Software Engineer
>             New Relic
>
>
>
>


Re: JVM metrics disappearing after job crash, restart

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Nik,

Can you have a look at this JIRA ticket [1] and check if it is related to
the problems you are facing?
If so, would you mind leaving a comment there?

Thank you,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-8946

2018-05-31 4:41 GMT+02:00 Nikolas Davis <nd...@newrelic.com>:

> We keep track of metrics by using the value of MetricGroup::getMetricIdentifier,
> which returns the fully qualified metric name. The query that we use to
> monitor metrics filters for metrics IDs that match '%Status.JVM.Memory%'.
> As long as the new metrics come online via the MetricReporter interface
> then I think the chart would be continuous; we would just see the old JVM
> memory metrics cycle into new metrics.
>
> Nik Davis
> Software Engineer
> New Relic
>
> On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <aj...@yelp.com> wrote:
>
>> How are your metrics dimensionalized/named? Task managers often have UIDs
>> generated for them. The task id dimension will change on restart. If you
>> name your metric based on this 'task_id' there would be a discontinuity
>> with the old metric.
>>
>> On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <nd...@newrelic.com>
>> wrote:
>>
>>> Howdy,
>>>
>>> We are seeing our task manager JVM metrics disappear over time. This
>>> last time we correlated it to our job crashing and restarting. I wasn't
>>> able to grab the failing exception to share. Any thoughts?
>>>
>>> We track metrics through the MetricReporter interface. As far as I can
>>> tell this more or less only affects the JVM metrics. I.e. most / all other
>>> metrics continue reporting fine as the job is automatically restarted.
>>>
>>> Nik Davis
>>> Software Engineer
>>> New Relic
>>>
>>
>>
>

Re: JVM metrics disappearing after job crash, restart

Posted by Nikolas Davis <nd...@newrelic.com>.
We keep track of metrics by using the value of
MetricGroup::getMetricIdentifier, which returns the fully qualified metric
name. The query that we use to monitor metrics filters for metrics IDs that
match '%Status.JVM.Memory%'. As long as the new metrics come online via the
MetricReporter interface then I think the chart would be continuous; we
would just see the old JVM memory metrics cycle into new metrics.
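
As a toy illustration of that filter (the identifiers below are made up;
"tm-abc123" stands in for the TaskManager ID segment):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JvmMetricFilterDemo {

    public static void main(String[] args) {
        // Hypothetical fully qualified names as returned by
        // MetricGroup::getMetricIdentifier.
        List<String> identifiers = Arrays.asList(
                "host-1.taskmanager.tm-abc123.Status.JVM.Memory.Heap.Used",
                "host-1.taskmanager.tm-abc123.myJob.MyOperator.0.numRecordsIn");

        // Rough equivalent of the SQL LIKE '%Status.JVM.Memory%' filter.
        List<String> jvmMemory = identifiers.stream()
                .filter(id -> id.contains("Status.JVM.Memory"))
                .collect(Collectors.toList());

        // Prints only the JVM memory metric.
        System.out.println(jvmMemory);
    }
}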

Nik Davis
Software Engineer
New Relic

On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <aj...@yelp.com> wrote:

> How are your metrics dimensionalized/named? Task managers often have UIDs
> generated for them. The task id dimension will change on restart. If you
> name your metric based on this 'task_id' there would be a discontinuity
> with the old metric.
>
> On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <nd...@newrelic.com>
> wrote:
>
>> Howdy,
>>
>> We are seeing our task manager JVM metrics disappear over time. This last
>> time we correlated it to our job crashing and restarting. I wasn't able to
>> grab the failing exception to share. Any thoughts?
>>
>> We track metrics through the MetricReporter interface. As far as I can
>> tell this more or less only affects the JVM metrics. I.e. most / all other
>> metrics continue reporting fine as the job is automatically restarted.
>>
>> Nik Davis
>> Software Engineer
>> New Relic
>>
>
>

Re: JVM metrics disappearing after job crash, restart

Posted by Ajay Tripathy <aj...@yelp.com>.
How are your metrics dimensionalized/named? Task managers often have UIDs
generated for them. The task id dimension will change on restart. If you
name your metric based on this 'task_id' there would be a discontinuity
with the old metric.
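
For context, the fully qualified identifier is assembled from Flink's
metric scope formats, which include the TaskManager ID by default (the
values below are the documented 1.x defaults for flink-conf.yaml, shown
here for illustration):

metrics.scope.tm: <host>.taskmanager.<tm_id>
metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>

The TaskManager-level Status.JVM.* metrics fall under metrics.scope.tm,
so an identifier such as host-1.taskmanager.<tm_id>.Status.JVM.Memory.Heap.Used
typically only changes when the TaskManager itself (and with it <tm_id>)
is replaced, not on a plain job restart.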

On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <nd...@newrelic.com> wrote:

> Howdy,
>
> We are seeing our task manager JVM metrics disappear over time. This last
> time we correlated it to our job crashing and restarting. I wasn't able to
> grab the failing exception to share. Any thoughts?
>
> We track metrics through the MetricReporter interface. As far as I can
> tell this more or less only affects the JVM metrics. I.e. most / all other
> metrics continue reporting fine as the job is automatically restarted.
>
> Nik Davis
> Software Engineer
> New Relic
>