Posted to issues@flink.apache.org by "Hangxiang Yu (Jira)" <ji...@apache.org> on 2022/04/11 02:10:00 UTC

[jira] [Commented] (FLINK-25470) Add/Expose/Differentiate metrics of checkpoint size between changelog size vs materialization size

    [ https://issues.apache.org/jira/browse/FLINK-25470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520267#comment-17520267 ] 

Hangxiang Yu commented on FLINK-25470:
--------------------------------------

I think we may not need to expose these changelog metrics in the Flink UI as a first step, but we do need to expose them through the REST API so that the complete metrics can be viewed in visualization tools, e.g. Grafana. Having metrics for the different parts is useful for checking whether Changelog works well across different jobs. How to expose them in the Flink UI deserves further discussion.



After FLINK-25557, IIUC, we have two metrics:
 # checkpointed size. For Changelog, it refers to the incremental size of the non-materialization part.
 # full size. For Changelog, it refers to the full size of all parts, i.e. materialization plus non-materialization.


In my opinion, we may need to expose:
 # the incremental size of the materialization part (positive if updated by a materialization, zero otherwise).
 # the full size of the materialization part.
 # the full size of the non-materialization part (it could also be inferred from the full size and the full size of the materialization part).
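
As a minimal sketch of the inference mentioned in the last item: the class and field names below are hypothetical and are not existing Flink API. It only assumes that the reported full size is the sum of the materialization and non-materialization parts.

```java
// Hypothetical holder for the sizes discussed above; not part of Flink.
final class ChangelogSizeBreakdown {
    private final long fullSize;             // full size of all parts, in bytes
    private final long materializedFullSize; // full size of the materialization part, in bytes

    ChangelogSizeBreakdown(long fullSize, long materializedFullSize) {
        if (materializedFullSize < 0 || fullSize < materializedFullSize) {
            throw new IllegalArgumentException(
                    "expected 0 <= materializedFullSize <= fullSize");
        }
        this.fullSize = fullSize;
        this.materializedFullSize = materializedFullSize;
    }

    // Assumption: full size = materialization part + non-materialization part.
    long nonMaterializedFullSize() {
        return fullSize - materializedFullSize;
    }
}
```

Exposing the third metric directly would still save every consumer from doing this subtraction.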


According to these metrics, we could roughly infer:
 # the restore time, from the full sizes of the materialization part and the non-materialization part.
 # when a checkpoint includes a new materialization, from the incremental/full size of the materialization part.
 # the cleanup efficiency of the non-materialization part, by comparing its full size (the real size) with the actual size in the DFS.
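
A rough sketch of how a monitoring tool could derive these three signals from the proposed metrics. All names here are hypothetical (none of them exist in Flink), and the "restore size" is only a proxy for restore time, not a measurement of it.

```java
// Hypothetical helper; the inputs are the proposed metrics, all in bytes.
final class ChangelogMetricInference {
    private ChangelogMetricInference() {}

    // 1. Proxy for restore time: total bytes that would have to be read on restore.
    static long restoreSizeBytes(long materializedFullSize, long nonMaterializedFullSize) {
        return materializedFullSize + nonMaterializedFullSize;
    }

    // 2. A checkpoint includes a new materialization if the materialization
    //    part reports a positive incremental size (zero otherwise).
    static boolean includesNewMaterialization(long materializedIncrementalSize) {
        return materializedIncrementalSize > 0;
    }

    // 3. Bytes still present on the DFS that the checkpoint no longer accounts for.
    //    A value that keeps growing would suggest that cleanup is lagging behind.
    static long uncleanedBytes(long nonMaterializedFullSize, long actualDfsSize) {
        return Math.max(0L, actualDfsSize - nonMaterializedFullSize);
    }
}
```

For example, with a non-materialization full size of 300 bytes and 450 bytes actually present in the DFS, `uncleanedBytes` reports 150 bytes awaiting cleanup.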


I also think "How much Data Size increases/exploding" is already answered by the current "full size".

I think the other metrics [~ym] mentioned are covered by the above.



BTW, I also wonder whether we need to expose "async duration of materialization part".

The current "async duration" refers to the async duration of the incremental checkpoint of the non-materialization part.

If we expose "async duration of materialization part", we could see whether the materialization part affects the job.

[~ym] [~roman] WDYT?

> Add/Expose/Differentiate metrics of checkpoint size between changelog size vs materialization size
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25470
>                 URL: https://issues.apache.org/jira/browse/FLINK-25470
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Metrics, Runtime / State Backends
>            Reporter: Yuan Mei
>            Priority: Major
>             Fix For: 1.16.0
>
>         Attachments: Screen Shot 2021-12-29 at 1.09.48 PM.png
>
>
> FLINK-25557  only resolves part of the problems. 
> Eventually, we should answer questions:
>  * How much Data Size increases/exploding
>  * When a checkpoint includes a new Materialization
>  * Materialization size
>  * changelog sizes from the last complete checkpoint (that can roughly infer restore time)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)