Posted to user@flink.apache.org by Kelly Smith <ke...@zillowgroup.com> on 2019/11/21 18:47:48 UTC

Metrics for Task States

I’ve been running Flink in production on EMR (YARN) for some time and have found the metrics system to be quite useful, but there is one specific scenario for which I’m missing a signal:

  *   A job has been submitted, but YARN does not have enough resources to support it

Observed:

  *   Job is in RUNNING state
  *   All of the tasks for the job are in the (I believe) DEPLOYING state

Is there a way to access these as metrics, so that I can monitor the number of tasks in each state for a given job (image below)? The metric I’m currently using is the number of running jobs, but it misses this “unhealthy” scenario. I realize that I could use application-level metrics (record counts, etc.) as a proxy for this, but I’m working on providing a streaming platform and need all of my monitoring to be application-agnostic.
[cid:image001.png@01D5A059.19DB3EB0]

I can’t find anything on it in the documentation.

Thanks,
Kelly

Re: Metrics for Task States

Posted by Kelly Smith <ke...@zillowgroup.com>.

Thanks Caizhi, that was what I was afraid of. Thanks for the information on the REST API 😊

It seems like the right solution would be to add it as a first-class feature for Flink so I will add a feature request. I may end up using the REST API as a workaround in the short-term - probably with a side-car container once we move to Kubernetes.

Kelly

Re: Metrics for Task States

Posted by Caizhi Weng <ts...@gmail.com>.

Hi Kelly,

As far as I know, Flink currently does not have metrics for monitoring the
number of tasks in each state. See
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html
for the complete list of metrics. (`taskSlotsAvailable` seems to be the most
closely related metric in that list.)

But Flink has a REST API which can provide the states of all the tasks
(http://hostname:port/overview). It returns a JSON string containing
all the metrics you want. You could write your own tool to monitor
this API.
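
A monitoring tool of that kind could start from something like the sketch below. It rests on assumptions that are not confirmed in this thread: that the per-job summary is served at `/jobs/overview`, and that each job entry carries a `jid` and a `tasks` object holding one count per task state plus a `total`. The function names are hypothetical. Check the REST API documentation for your Flink version before relying on these paths and field names.

```python
import json
from urllib.request import urlopen


def fetch_jobs_overview(base_url, timeout=5.0):
    """Fetch and decode the job summary JSON from the JobManager REST API."""
    # Assumption: the per-job summary is served at <base_url>/jobs/overview.
    url = base_url.rstrip("/") + "/jobs/overview"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def task_counts_by_job(overview):
    """Map each job id to a dict of {task state: number of tasks}."""
    counts = {}
    for job in overview.get("jobs", []):
        # Assumption: "tasks" holds per-state counts plus a "total" entry.
        tasks = dict(job.get("tasks", {}))
        tasks.pop("total", None)  # keep only the per-state counts
        counts[job["jid"]] = tasks
    return counts
```

Calling `task_counts_by_job(fetch_jobs_overview("http://hostname:port"))` on a schedule and forwarding the counts to a metrics backend would give an application-agnostic signal without changes to Flink itself.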

If you really want metrics that describe the number of tasks in
each state, you can open a JIRA ticket at
https://issues.apache.org/jira/projects/FLINK/issues/

Thank you

Re: Metrics for Task States

Posted by Kelly Smith <ke...@zillowgroup.com>.

With EMR/YARN, the cluster is definitely running in session mode. It exists independently of any job and continues running after the job exits.

Whether or not this is a bug in Flink, is it possible to get access to the metrics I'm asking about? Those would be useful even if this behavior is fixed.

Re: Metrics for Task States

Posted by Piper Piper <pi...@gmail.com>.

I am trying to understand why this problem occurs (i.e. why Flink did
not reject the job when it required more slots than were available).

You mention Flink in production on EMR (YARN): does this mean Flink was
being run in Job mode or Session mode?

Thank you,

Piper

Re: Metrics for Task States

Posted by Piper Piper <pi...@gmail.com>.

Thank you, Kelly!

Re: Metrics for Task States

Posted by Kelly Smith <ke...@zillowgroup.com>.

Hi Piper,

The repro is pretty simple:

  *   Submit a job with parallelism set higher than YARN has resources to support

What this ends up looking like in the Flink UI is this:
[cid:image001.png@01D5A06C.6E16D580]

The job is in a “RUNNING” state, but all of the tasks are in the “SCHEDULED” state. The `jobmanager.numRunningJobs` metric that Flink emits by default will increase by 1, but none of the tasks actually start running on any TaskManager (TM).

[cid:image002.png@01D5A06C.6E16D580]

What I’m looking for is a way to detect when I am in this state using Flink metrics (ideally the count of tasks in each state for better observability).
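
Written as a check, the condition being described might look like the following sketch. It assumes the job state and the per-state task counts have already been obtained from some source; the function name and the lowercase dictionary keys are illustrative assumptions, not existing Flink metric names.

```python
def is_stalled(job_state, task_counts):
    """Return True when a job reports RUNNING but none of its tasks run.

    task_counts is assumed to map lowercase task-state names
    (e.g. "scheduled", "running") to the number of tasks in that state.
    """
    total = sum(task_counts.values())
    running = task_counts.get("running", 0)
    # A job with no tasks at all is not treated as stalled.
    return job_state == "RUNNING" and total > 0 and running == 0
```

A job passes through this condition briefly during normal startup, so an alert built on it would likely need a grace period before firing.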

Does that make sense?

Thanks,
Kelly

Re: Metrics for Task States

Posted by Piper Piper <pi...@gmail.com>.

Hello Kelly,

I thought that the Flink scheduler only starts a job if all requested
containers/TMs are available and allotted to that job.

How can I reproduce your issue on Flink with YARN?

Thank you,

Piper

