You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "David Anderson (Jira)" <ji...@apache.org> on 2019/09/23 14:57:00 UTC

[jira] [Reopened] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

     [ https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Anderson reopened FLINK-12576:
------------------------------------

To reproduce,


git clone --branch backpressure-with-2-TMs https://github.com/alpinegizmo/flink-playgrounds.git
cd flink-playgrounds/operations-playground
docker-compose build
docker-compose up -d

You will find a job with these 5 operators

(1) kafka -> (2) timestamps / watermarks -> (3) keyBy + backpressure map -> (4) keyBy + window -> (5) kafka

where #3, the backpressure map, causes severe backpressure every other minute. The job is running with parallelism of 2 throughout; up until the first keyBy all the traffic is on the subtasks with 0 as their index. 

In this backpressure-with-2-TMs branch there are two TMs each with one slot. You will observe that all of the output metrics for the 0-index watermarking subtask rise to 1 during the even-numbered minutes, and fall to 0 during the odd numbered minutes, as expected. 

If I run this with one TM with 2 slots, all of the input metrics for the backpressure operator are always zero. 

In this backpressure-with-2-TMs branch there are 2 single-slot TMs, and here the input metrics for subtask 0 of the backpressure operator are always 0, but the input metrics for subtask 1 of that operator rise and fall every minute, as they should. Thus my conclusion that the local input metrics are still broken.



> inputQueueLength metric does not work for LocalInputChannels
> ------------------------------------------------------------
>
>                 Key: FLINK-12576
>                 URL: https://issues.apache.org/jira/browse/FLINK-12576
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics, Runtime / Network
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Piotr Nowojski
>            Assignee: Aitozi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently {{inputQueueLength}} ignores LocalInputChannels ({{SingleInputGate#getNumberOfQueuedBuffers}}). This can can cause mistakes when looking for causes of back pressure (If task is back pressuring whole Flink job, but there is a data skew and only local input channels are being used).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)