You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2019/12/13 17:29:00 UTC

[jira] [Resolved] (FLINK-15013) Flink (on YARN) sometimes needs too many slots

     [ https://issues.apache.org/jira/browse/FLINK-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann resolved FLINK-15013.
-----------------------------------
    Resolution: Fixed

Fixed via

master:
632586f27487836d3b336aa90b1754f150359250
9493c3dbaab3d171ad6a478ec1e5280048143900

1.10.0:
e3171032252c9cb63f4c4698c81a5a6114fb24a6
7f4dc2cf32178babfb812832b7fd43bfd7fd12ed

> Flink (on YARN) sometimes needs too many slots
> ----------------------------------------------
>
>                 Key: FLINK-15013
>                 URL: https://issues.apache.org/jira/browse/FLINK-15013
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Aljoscha Krettek
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>         Attachments: DualInputWordCount.jar
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> *THIS IS DIFFERENT FROM FLINK-15007, even though the text looks almost the same.*
> This was discovered while debugging FLINK-14834. In some cases a Flink needs needs more slots to execute than expected. You can see this in some of the logs attached to FLINK-14834.
> You can reproduce this using [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
> {code}
> and check the logs afterwards you will sometimes see three "Requesting new slot..." statements and sometimes you will see four.
> This is the {{git bisect}} log that identifies the first faulty commit ([https://github.com/apache/flink/commit/2ab8b61f2f22f1a1ce7f92cd6b8dd32d2c0c227d|https://github.com/apache/flink/commit/2ab8b61f2f22f1a1ce7f92cd6b8dd32d2c0c227d]):
> {code:java}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out slots creation from the TaskSlotTable constructor
> git bisect bad 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [7a61c582c7213f123e10de4fd11a13d96425fd77] [hotfix] Fix wrong Java doc comment of BroadcastStateBootstrapFunction.Context
> git bisect bad 7a61c582c7213f123e10de4fd11a13d96425fd77
> # good: [edeec8d7420185d1c49b2739827bd921d2c2d485] [hotfix][runtime] Replace all occurrences of letter to mail to unify wording of variables and documentation.
> git bisect good edeec8d7420185d1c49b2739827bd921d2c2d485
> # bad: [1b4ebce86b71d56f44185f1cb83d9a3b51de13df] [FLINK-14262][table-planner-blink] support referencing function with fully/partially qualified names in blink
> git bisect bad 1b4ebce86b71d56f44185f1cb83d9a3b51de13df
> # good: [25a3d9138cd5e39fc786315682586b75d8ac86ea] [hotfix] Move TaskManagerSlot to o.a.f.runtime.resourcemanager.slotmanager
> git bisect good 25a3d9138cd5e39fc786315682586b75d8ac86ea
> # good: [362d7670593adc2e4b20650c8854398727d8102b] [FLINK-12122] Calculate TaskExecutorUtilization when listing available slots
> git bisect good 362d7670593adc2e4b20650c8854398727d8102b
> # bad: [7e8218515baf630e668348a68ff051dfa49c90c3] [FLINK-13969][Checkpointing] Do not allow trigger new checkpoitn after stop the coordinator
> git bisect bad 7e8218515baf630e668348a68ff051dfa49c90c3
> # bad: [269e7f007e855c2bdedf8bad64ef13f516a608a6] [FLINK-12122] Choose SlotSelectionStrategy based on ClusterOptions#EVENLY_SPREAD_OUT_SLOTS_STRATEGY
> git bisect bad 269e7f007e855c2bdedf8bad64ef13f516a608a6
> # bad: [2ab8b61f2f22f1a1ce7f92cd6b8dd32d2c0c227d] [FLINK-12122] Add EvenlySpreadOutLocationPreferenceSlotSelectionStrategy
> git bisect bad 2ab8b61f2f22f1a1ce7f92cd6b8dd32d2c0c227d
> # first bad commit: [2ab8b61f2f22f1a1ce7f92cd6b8dd32d2c0c227d] [FLINK-12122] Add EvenlySpreadOutLocationPreferenceSlotSelectionStrategy
> {code}
> I'm using the streaming WordCount example that I modified to have two "inputs", similar to how the WordCount example is used in the YARN/kerberos/Docker test. Instead of using the input once we use it like this:
> {code}
> text = env.readTextFile(params.get("input")).union(env.readTextFile(params.get("input")));
> {code}
> to create two inputs from the same path. A jar is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)