Posted to issues@flink.apache.org by "Aitozi (Jira)" <ji...@apache.org> on 2022/03/22 12:15:00 UTC

[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test

    [ https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510448#comment-17510448 ] 

Aitozi edited comment on FLINK-25480 at 3/22/22, 12:14 PM:
-----------------------------------------------------------

FYI, I encountered the same problem with 1.14.4 when running the tests in a container. I tested in a 16-core, 32 GB container, and the {{mvn verify}} command eventually exited with code 137. In the meantime, I opened another screen and ran {{vsar --cpu --mem -l}} to monitor memory usage, but I still could not catch the memory spike. I hope this is helpful to you. I'm curious about it because it is stopping me from building a stable CI pipeline.
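
For context on the exit code: by shell convention a status above 128 means the process was killed by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL), which is the signal the Linux OOM killer sends. Below is a minimal Python sketch, assuming a Linux host, that runs a command, decodes its exit status, and reports peak memory. The names {{describe_exit_code}}, {{run_and_measure}} and the script name {{measure.py}} are hypothetical and not part of Flink's tooling.

```python
import resource
import signal
import subprocess
import sys


def describe_exit_code(code):
    """Translate a shell-style exit status into a readable description."""
    if code > 128:
        # Shells report "killed by signal N" as 128 + N, so 137 means signal 9.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


def run_and_measure(cmd):
    """Run cmd; return (shell-style exit code, peak RSS of waited-for children)."""
    proc = subprocess.run(cmd)
    # subprocess reports death by signal N as returncode -N; convert to 128 + N.
    code = 128 - proc.returncode if proc.returncode < 0 else proc.returncode
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is in kilobytes on Linux (bytes on macOS).
    return code, usage.ru_maxrss


if __name__ == "__main__":
    # Hypothetical usage: python measure.py mvn verify
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    code, peak_kb = run_and_measure(cmd)
    print(describe_exit_code(code), f"(peak child RSS: {peak_kb} KB)")
```

Note that {{RUSAGE_CHILDREN}} reports the resident set size of the largest waited-for descendant, not the sum over the whole process tree, and it does not see memory used inside separately started Docker containers. It is therefore only a rough first check, not a replacement for per-container monitoring.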


was (Author: aitozi):
FYI, I encountered the same problem with 1.14.4 when running the tests in a container. I tested in a 16-core, 32 GB container, and the {{mvn verify}} command eventually exited with code 137. Then I opened another screen to run {{vsar --cpu --mem -l}}, but I still could not catch the memory spike. I'm curious about it.

> Create dashboard/monitoring to see resource usage per E2E test
> --------------------------------------------------------------
>
>                 Key: FLINK-25480
>                 URL: https://issues.apache.org/jira/browse/FLINK-25480
>             Project: Flink
>          Issue Type: Improvement
>          Components: Test Infrastructure
>    Affects Versions: 1.15.0, 1.13.6, 1.14.3
>            Reporter: Martijn Visser
>            Priority: Critical
>              Labels: test-stability
>
> Over the past couple of weeks, we've encountered multiple problems with tests failing due to out-of-memory errors and/or exit code 137. These are happening both on Alibaba CI machines and on Azure hosted agents. For the Alibaba CI machines, we've mitigated the problem by reducing the number of workers per CI machine from 7 to 5. These workers can spin up multiple Docker containers, especially with Testcontainers getting used more and more.
> If we can get insight into the resource usage per end-to-end test, it will also help in debugging test infrastructure problems more quickly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)