Posted to user@flink.apache.org by Andrew Otto <ot...@wikimedia.org> on 2023/05/22 18:47:37 UTC

Flink Kubernetes Operator lifecycle state count metrics question

Hello!

I'm building some Grafana + Prometheus dashboards for
flink-kubernetes-operator.  Reading the metrics docs
<https://stackoverflow.com/a/61795256>, I see that Prometheus has
per-namespace gauge metrics for the current count of resources in each
lifecycle state.

Using kubectl, I can see that I have one FlinkDeployment in my namespace:

# kubectl -n stream-enrichment-poc get flinkdeployments
NAME             JOB STATUS   LIFECYCLE STATE
flink-app-main   RUNNING      STABLE

But Prometheus is reporting that I have 2 FlinkDeployments in the STABLE
state.

# curl -s <pod_ip>:<prom_port> | grep flink_k8soperator_namespace_Lifecycle_State_STABLE_Count
flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 2.0

I'm not sure why I see 2.0 reported.
flink_k8soperator_namespace_JmDeploymentStatus_READY_Count reports only one
FlinkDeployment.

# curl <pod_ip>:<prom_port>/metrics | grep flink_k8soperator_namespace_JmDeploymentStatus_READY_Count
flink_k8soperator_namespace_JmDeploymentStatus_READY_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 1.0

Is it possible that flink_k8soperator_namespace_Lifecycle_State_STABLE_Count
is being reported as an incrementing counter instead of a gauge?
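
For comparing these values across namespaces, here is a minimal Python
sketch that parses scrape output into a mapping keyed by the resourcens
label. It assumes each sample sits on a single line (metric name, label set,
value) and that label values contain no "}" or escaped quotes. The function
name parse_namespace_counts is hypothetical; the sample text is taken from
the output in this message.

```python
import re

# One sample line: metric name, label set in braces, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# One key="value" pair inside the label set (a trailing comma is tolerated).
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_namespace_counts(text, metric):
    """Return {resourcens: value} for every sample of `metric` in `text`."""
    counts = {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        namespace = labels.get("resourcens")
        if namespace is None:
            continue  # sample carries no resource namespace label
        counts[namespace] = float(match.group("value"))
    return counts


SAMPLE = '''
flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 2.0
'''

METRIC = "flink_k8soperator_namespace_Lifecycle_State_STABLE_Count"
print(parse_namespace_counts(SAMPLE, METRIC))
# → {'stream_enrichment_poc': 2.0}
```

The resulting mapping can then be compared by hand against what
kubectl get flinkdeployments shows in each namespace.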

Thanks
-Andrew Otto
 Wikimedia Foundation

Re: Flink Kubernetes Operator lifecycle state count metrics question

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Andrew!

I think you are completely right, this is a bug. The per-namespace metrics
do not seem to filter by namespace; each one shows the aggregated global
count instead.

I opened a ticket:
https://issues.apache.org/jira/browse/FLINK-32164

Thanks for reporting this!
Gyula
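
To illustrate the kind of bug described in this reply, here is a small
Python sketch. It is not the operator's actual code: the resource tuples and
both function names are hypothetical. The first function assigns the global
STABLE count to every namespace, which reproduces the symptom in this
thread; the second counts within each namespace.

```python
# Hypothetical resources as (namespace, lifecycle state) pairs,
# mirroring the two FlinkDeployments mentioned in this thread.
RESOURCES = [
    ("stream_enrichment_poc", "STABLE"),
    ("rdf_streaming_updater", "STABLE"),
]


def stable_count_unfiltered(resources):
    """Buggy variant: every namespace is assigned the global STABLE count."""
    total = sum(1 for _, state in resources if state == "STABLE")
    return {namespace: total for namespace, _ in resources}


def stable_count_per_namespace(resources):
    """Intended variant: count STABLE resources within each namespace."""
    counts = {namespace: 0 for namespace, _ in resources}
    for namespace, state in resources:
        if state == "STABLE":
            counts[namespace] += 1
    return counts


print(stable_count_unfiltered(RESOURCES))
# → {'stream_enrichment_poc': 2, 'rdf_streaming_updater': 2}
print(stable_count_per_namespace(RESOURCES))
# → {'stream_enrichment_poc': 1, 'rdf_streaming_updater': 1}
```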


Re: Flink Kubernetes Operator lifecycle state count metrics question

Posted by Andrew Otto <ot...@wikimedia.org>.
Also!  I do have 2 FlinkDeployments deployed with this operator, but they
are in different namespaces, and each of the per-namespace metrics reports
2 FlinkDeployments, even though kubectl shows only one in each namespace.

Actually...we just tried to deploy a change (enabling some checkpointing)
that caused one of our FlinkDeployments to fail.  Now the STABLE_Count for
each of the two namespaces reports 1.

# curl -s <pod_ip>:<prom_port> | grep flink_k8soperator_namespace_Lifecycle_State_STABLE_Count
flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="stream_enrichment_poc",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 1.0
flink_k8soperator_namespace_Lifecycle_State_STABLE_Count{resourcetype="FlinkDeployment",resourcens="rdf_streaming_updater",name="flink_kubernetes_operator",host="flink_kubernetes_operator_86b888d6b6_gbrt4",namespace="flink_operator",} 1.0

It looks like this metric may be reporting a global count rather than a
per-namespace one.
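
A quick way to test that hypothesis against the numbers in this message is
sketched below in Python. The function name looks_like_global_count is
hypothetical. The "actual" values stand for what kubectl shows per
namespace; for the case after the failed deploy, which namespace lost its
STABLE deployment is an assumption made only for illustration.

```python
def looks_like_global_count(reported, actual):
    """True if every namespace reports the total across all namespaces."""
    total = sum(actual.values())
    return all(value == total for value in reported.values())


# Before the failed deploy: one STABLE FlinkDeployment per namespace,
# while each per-namespace metric reported 2.
before = looks_like_global_count(
    reported={"stream_enrichment_poc": 2.0, "rdf_streaming_updater": 2.0},
    actual={"stream_enrichment_poc": 1, "rdf_streaming_updater": 1},
)

# After the failed deploy: only one deployment is still STABLE.
# Which namespace it lives in is assumed here, not stated in the thread.
after = looks_like_global_count(
    reported={"stream_enrichment_poc": 1.0, "rdf_streaming_updater": 1.0},
    actual={"stream_enrichment_poc": 1, "rdf_streaming_updater": 0},
)

print(before, after)
# → True True
```

Note that the check cannot tell the two behaviours apart when only one
namespace has resources, since the global and per-namespace counts then
coincide.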




Re: Flink Kubernetes Operator lifecycle state count metrics question

Posted by Andrew Otto <ot...@wikimedia.org>.
Oh, FWIW, I do have operator HA enabled with 2 replicas running, but in my
examples there, I am curl-ing the leader flink operator pod.


