Posted to issues@beam.apache.org by "Danny McCormick (Jira)" <ji...@apache.org> on 2022/04/04 14:28:00 UTC
[jira] [Commented] (BEAM-12815) Flink Go XVR tests fail on TestXLang_Multi: Insufficient number of network buffers

    [ https://issues.apache.org/jira/browse/BEAM-12815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516859#comment-17516859 ] 

Danny McCormick commented on BEAM-12815:
----------------------------------------

I'm going to exclude this test on Flink again and stop looking into it for now, since it is lower priority than some other work (PR: [https://github.com/apache/beam/pull/17263]). Here are my findings so far:

The issue (and, broadly, the fix) is known: Flink doesn't have enough memory to allocate the network buffers needed to run our tests. This can be fixed by updating the Flink configuration we use when spinning up Flink so that it allocates more memory for network buffers. I did this for precommits in [https://github.com/apache/beam/pull/17067], but I didn't realize at the time that it had no effect on postcommits, which start Flink through a different mechanism: the Flink shadow jar task. That task does take an argument for the Flink configuration directory ([https://github.com/apache/beam/blob/5a622286db56535592e99b380308443bfeebf6c2/runners/flink/job-server/flink_job_server.gradle#L116]), but setting it from a Groovy postcommit file doesn't take effect, because the cached shadow jar isn't guaranteed to have been built with that argument. I tried this unsuccessfully in [https://github.com/apache/beam/pull/17227].
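
As a rough illustration of the numbers involved, the sketch below parses the error message quoted in the issue description, works out how much network memory the reported buffer pool corresponds to, and renders override lines for two of the configuration keys the error itself names. The helper names and the doubled/quadrupled values are hypothetical and are not taken from any of the PRs above. It also assumes that raising 'taskmanager.memory.network.min' is enough to enlarge the pool, which depends on how the fraction and max settings interact in a given setup.

```python
import re

# Error text copied from the issue description (timestamps removed).
ERROR = (
    "java.io.IOException: Insufficient number of network buffers: "
    "required 17, but only 16 available. The total number of network "
    "buffers is currently set to 2048 of 32768 bytes each."
)


def parse_buffer_error(message):
    """Extract the buffer counts and buffer size reported in the error."""
    pattern = (
        r"required (\d+), but only (\d+) available\. "
        r"The total number of network buffers is currently set to "
        r"(\d+) of (\d+) bytes each"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("not an 'insufficient network buffers' message")
    required, available, total, size = (int(g) for g in match.groups())
    return {
        "required": required,
        "available": available,
        "total_buffers": total,
        "buffer_bytes": size,
    }


def network_memory_mib(info):
    """Total size of the reported buffer pool, in MiB."""
    return info["total_buffers"] * info["buffer_bytes"] // (1024 * 1024)


def render_overrides(min_mb, max_mb):
    """Render flink-conf.yaml lines for keys named in the error message.

    The values passed in are hypothetical; pick them for your environment.
    """
    return (
        f"taskmanager.memory.network.min: {min_mb}mb\n"
        f"taskmanager.memory.network.max: {max_mb}mb\n"
    )


info = parse_buffer_error(ERROR)
current_mib = network_memory_mib(info)  # 2048 buffers * 32768 bytes
print(f"current network memory: {current_mib} MiB")
# Illustrative only: double the minimum and allow up to four times as much.
print(render_overrides(2 * current_mib, 4 * current_mib))
```

The reported pool works out to 64 MiB, and the test was short by a single buffer, so even a modest increase would likely be enough. The hard part described above is getting the postcommit Flink instance to actually pick up the changed configuration.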

 

The options I see for fixing this are:

1) Figure out a way to rebuild Flink with the configuration option set every time we run that specific set of postcommit tests.

2) Set the configuration option on all Flink builds going forward. This would also affect the debugging environment people use for Flink (and some people may spin Flink up this way for real workloads, though that's not a recommended path).

Neither option is terrible, but neither is ideal. For the moment I'm not going to keep digging, since I don't love either option and I think the effort would be better spent on other work.

> Flink Go XVR tests fail on TestXLang_Multi: Insufficient number of network buffers
> ----------------------------------------------------------------------------------
>
>                 Key: BEAM-12815
>                 URL: https://issues.apache.org/jira/browse/BEAM-12815
>             Project: Beam
>          Issue Type: Bug
>          Components: cross-language, sdk-go, test-failures
>            Reporter: Daniel Oliveira
>            Assignee: Danny McCormick
>            Priority: P3
>             Fix For: Not applicable
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When running the cross-language test suites () Flink fails on TestXLang_Multi with the following error:
> {noformat}
> 19:29:14 2021/08/27 02:29:14  (): java.io.IOException: Insufficient number of network buffers: required 17, but only 16 available. The total number of network buffers is currently set to 2048 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max'.
> 19:29:14 2021/08/27 02:29:14 Job state: FAILED
> 19:29:14 --- FAIL: TestXLang_Multi (6.26s){noformat}
> This doesn't seem to be a parallelism problem (go test is run with "-p 1" as expected) and is only happening on this specific test.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)