Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/12/15 01:08:46 UTC

[jira] [Commented] (FLINK-3131) Expose checkpoint metrics

    [ https://issues.apache.org/jira/browse/FLINK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057016#comment-15057016 ] 

ASF GitHub Bot commented on FLINK-3131:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/1453

    [FLINK-3131] Expose checkpoint metrics

    - Adds `long getStateSize()` to `StateHandle` and `KvStateSnapshot`. All implementations except test classes and `LazyDbKvState` implement it. `LazyDbKvState` could implement it correctly, but it currently serializes its state lazily, so the state size is not yet known when the state handle is created (it is currently reported as 0).
    
    - Adds simple statistics tracking to the checkpoint coordinator. This does not use the accumulators, because I wanted more fine-grained control. I think we can expand the system-internal accumulators to accommodate these use cases better. It is also possible to retrofit this onto the accumulators, if you want to.
    - Adds the following web runtime monitor handlers:
      * `/jobs/:jobid/checkpoints` for job-level statistics on completed checkpoints, including the history
      * `/jobs/:jobid/vertices/:vertexid/checkpoints` for per-operator statistics, including subtasks
    
    - Adds the web frontend HTML/JavaScript (screenshots below)
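    
    To make the first bullet concrete, here is a minimal sketch of how a size-reporting handle could behave. Only the method name `getStateSize()` and the "unknown size is reported as 0" behaviour come from this description; `SizedHandle`, `ByteArrayHandle`, `LazyHandle`, and `totalSize` are hypothetical names for illustration, not the actual Flink types.
    
    ```java
    import java.util.Arrays;
    import java.util.List;
    
    public class StateSizeSketch {
    
        /** Hypothetical stand-in for a state handle that can report its size. */
        public interface SizedHandle {
            /** Size of the state in bytes, or 0 if it is not known yet. */
            long getStateSize();
        }
    
        /** Eagerly serialized state: the size is known when the handle is created. */
        public static final class ByteArrayHandle implements SizedHandle {
            private final byte[] serialized;
    
            public ByteArrayHandle(byte[] serialized) {
                this.serialized = serialized;
            }
    
            @Override
            public long getStateSize() {
                return serialized.length;
            }
        }
    
        /** Lazily serialized state: size unknown at handle creation, so 0 is reported. */
        public static final class LazyHandle implements SizedHandle {
            @Override
            public long getStateSize() {
                return 0L;
            }
        }
    
        /** Sums the reported sizes, for example over all subtasks of one checkpoint. */
        public static long totalSize(List<? extends SizedHandle> handles) {
            long sum = 0L;
            for (SizedHandle handle : handles) {
                sum += handle.getStateSize();
            }
            return sum;
        }
    
        public static void main(String[] args) {
            List<SizedHandle> handles = Arrays.<SizedHandle>asList(
                    new ByteArrayHandle(new byte[128]),
                    new ByteArrayHandle(new byte[64]),
                    new LazyHandle());
            System.out.println(totalSize(handles)); // prints 192
        }
    }
    ```
    
    Note that in this sketch a lazily serialized handle silently contributes 0, so an aggregated size is a lower bound whenever such handles are involved.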
    
    This feature can be disabled via `jobmanager.web.checkpoints.disable`. I think this is good practice, because the statistics tracking is attached to one of the most critical parts of the system.
    
    The maximum history size (see screenshot) for job level statistics can be configured via `jobmanager.web.checkpoints.history`. Current default is 10. Maybe a little too high?
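    
    As a rough illustration of what a bounded job-level history means, the sketch below keeps only the most recent N completed checkpoints and evicts the oldest entry when a new one arrives. It is not the code from this pull request: `CheckpointHistorySketch`, `CompletedStats`, and `onCompleted` are hypothetical names, and the eviction policy is an assumption. Only the history size of 10 and the tracked values (checkpoint count, duration, state size) come from this thread.
    
    ```java
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    
    public class CheckpointHistorySketch {
    
        /** Hypothetical record for one completed checkpoint. */
        public static final class CompletedStats {
            public final long checkpointId;
            public final long durationMillis;
            public final long stateSizeBytes;
    
            public CompletedStats(long checkpointId, long durationMillis, long stateSizeBytes) {
                this.checkpointId = checkpointId;
                this.durationMillis = durationMillis;
                this.stateSizeBytes = stateSizeBytes;
            }
        }
    
        private final int maxHistorySize;
        private final Deque<CompletedStats> history = new ArrayDeque<>();
        private long totalCount; // all completed checkpoints, including evicted ones
    
        public CheckpointHistorySketch(int maxHistorySize) {
            if (maxHistorySize < 0) {
                throw new IllegalArgumentException("History size must be >= 0");
            }
            this.maxHistorySize = maxHistorySize;
        }
    
        /** Records a completed checkpoint and evicts the oldest entry if the history is full. */
        public synchronized void onCompleted(CompletedStats stats) {
            totalCount++;
            if (maxHistorySize == 0) {
                return; // only the count is tracked
            }
            if (history.size() == maxHistorySize) {
                history.removeFirst();
            }
            history.addLast(stats);
        }
    
        public synchronized long getCount() {
            return totalCount;
        }
    
        /** Returns a copy of the history, oldest entry first. */
        public synchronized List<CompletedStats> getHistory() {
            return new ArrayList<>(history);
        }
    
        public static void main(String[] args) {
            CheckpointHistorySketch tracker = new CheckpointHistorySketch(10);
            for (long id = 1; id <= 12; id++) {
                tracker.onCompleted(new CompletedStats(id, 50 + id, 1024 * id));
            }
            List<CompletedStats> kept = tracker.getHistory();
            // prints "12 completed, 10 kept, oldest kept = 3"
            System.out.println(tracker.getCount() + " completed, " + kept.size()
                    + " kept, oldest kept = " + kept.get(0).checkpointId);
        }
    }
    ```
    
    With a structure like this, memory use grows with the history size times the number of tracked values per checkpoint, which is one way to judge whether a default of 10 is too high.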
    
    ---
    
    - **Checkpoints Tab** (Overview and Operators): 
    ![screen shot 2015-12-15 at 00 45 41](https://cloud.githubusercontent.com/assets/1756620/11797953/e4f17c84-a2c6-11e5-86b1-040a4e1bff12.png)
    - **History** (configurable):
     ![screen shot 2015-12-15 at 00 45 51](https://cloud.githubusercontent.com/assets/1756620/11797957/f2f87940-a2c6-11e5-82ce-5c5fcf8b1ca1.png)
    - **Subtasks**: 
    ![screen shot 2015-12-15 at 00 46 08](https://cloud.githubusercontent.com/assets/1756620/11797963/0d105fd2-a2c7-11e5-9a90-458bd0b7fdc4.png)
    - **Terminated job**: 
    ![screen shot 2015-12-15 at 00 46 44](https://cloud.githubusercontent.com/assets/1756620/11797969/1b0999a0-a2c7-11e5-9826-723f12e997d9.png)
    
    Jobs without checkpoints just show `No checkpoints` currently.
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3131-checkpoint_metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1453
    
----
commit aa12f3c7bb6ac43b91d5926087d7c181958c95cb
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-12-14T18:40:10Z

    [FLINK-3131] [contrib, runtime, streaming-java] Add long getStateSize() to StateHandle and KvStateSnapshot
    
    In order to report the state sizes, we need to expose them. All currently
    available state backends know the state size. Only the LazyDbKvState does
    not expose it at the moment, because it serializes the data lazily. This can be
    changed in a follow-up fix.

commit 2dae2a8ee98ca08cba4925f15110f1d9de2c1831
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-12-14T19:12:59Z

    [FLINK-3131] [core, runtime] Add checkpoint statistics tracker
    
    Adds a simple tracker of checkpoint statistics.

commit 53feb2a1a008f08218d05b91af4853ad18574fa2
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-12-14T19:13:59Z

    [FLINK-3131] [runtime-web] Add checkpoint statistics handlers

commit 47f89d5d24ae2fb6c314205531d696b985acb508
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-12-14T19:48:03Z

    [FLINK-3131] [runtime-web] Add checkpoint statistics to web frontend

----


> Expose checkpoint metrics
> -------------------------
>
>                 Key: FLINK-3131
>                 URL: https://issues.apache.org/jira/browse/FLINK-3131
>             Project: Flink
>          Issue Type: Improvement
>          Components: Webfrontend
>    Affects Versions: 0.10.1
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> Metrics about checkpoints are only accessible via the job manager logs and only show information about the completed checkpoints.
> The checkpointing metrics should be exposed in the web frontend, including:
> - number
> - duration
> - state size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)