Posted to dev@flink.apache.org by Stephan Ewen <se...@apache.org> on 2019/12/19 10:02:58 UTC

[DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Hi all!

I looked into using Flink with the new memory configurations and played
around a bit.
All in all, I think we made a really good step there in removing implicit
assumptions and rather making the configuration of memory sizes more
explicit. My experience was mostly very positive.

There were a few issues I stumbled across that were a bit unintuitive,
where we could think about some alternatives.

==========
Main points
==========

  - "taskmanager.memory.size" (old main config option) is replaced by
"taskmanager.memory.total-process.size" which has a different meaning in
standalone setups. The old option did not subtract metaspace and other
overhead, while the new option does. That means that with the default
config, standalone clusters get quite a bit less memory (independent of
managed memory going off heap).

    I am wondering if we could interpret "taskmanager.memory.size" as the
deprecated key for "taskmanager.memory.total-flink.size". That would be in
line with the old mechanism (assuming managed memory is set to off heap).

    The effect would be that the container size on Yarn/Mesos increases,
because from "taskmanager.memory.total-flink.size", we need to add overhead
and metaspace to reach the total process size, rather than cutting off
memory. But if we want, we could even adjust for that in the active
resource manager, getting full backwards compatibility on that part.

    Curious to hear more thoughts there.
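
    To make the difference concrete, here is roughly how the keys would
relate under that interpretation (all sizes below are made up, and the
metaspace / overhead numbers are only placeholders):

        # legacy key, interpreted as the deprecated key for total-flink.size
        taskmanager.memory.size: 1024m

        # equivalent new-style configuration
        taskmanager.memory.total-flink.size: 1024m

        # what an active resource manager would then request:
        #   total-process.size = total-flink.size + metaspace + JVM overhead
        #   e.g. 1024m + 96m + 192m = 1312m  (hence the larger container)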

  - The Mini Cluster tries to imitate the exact ratio of memory pools of a
standalone setup. I get the idea behind that, but I am wondering if it is
the right approach here.

    For example: I started a JVM with a relatively large heap (10 GB) as a
test. With the current logic, the system tries to reserve an additional
6 GB for managed memory, which is more memory than is left.
When you see the error that no memory could be allocated, you need to
understand the magic of how this is derived.

    I am trying to think about this from the perspective of using "Flink as
a Library", which the MiniCluster is close to.
    When starting Flink out of a running process, we cannot assume that we
are the only users of that process and that we can mold the process to our
demands. I think a fixed value for managed memory and network memory would
feel more natural in such a setup than a mechanism that is tailored towards
exclusive use of the process.
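
    As a minimal sketch of what that could look like when embedding the
MiniCluster (the config keys and sizes below are only an illustration of
the options discussed in this thread, not a recommendation):

        import org.apache.flink.configuration.Configuration;
        import org.apache.flink.runtime.minicluster.MiniCluster;
        import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

        public class FixedMemoryMiniCluster {
            public static void main(String[] args) throws Exception {
                Configuration config = new Configuration();

                // Fixed sizes instead of values derived from the surrounding JVM.
                // Key names follow the options discussed in this thread
                // ("shuffle.*" before the proposed rename); sizes are illustrative.
                config.setString("taskmanager.memory.managed.size", "128m");
                config.setString("taskmanager.memory.network.min", "64m");
                config.setString("taskmanager.memory.network.max", "64m");

                MiniClusterConfiguration miniClusterConfig =
                        new MiniClusterConfiguration.Builder()
                                .setConfiguration(config)
                                .setNumTaskManagers(1)
                                .setNumSlotsPerTaskManager(1)
                                .build();

                MiniCluster miniCluster = new MiniCluster(miniClusterConfig);
                try {
                    miniCluster.start();
                    // submit test jobs against the MiniCluster here
                } finally {
                    miniCluster.close();
                }
            }
        }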


===========================
Config key names and descriptions
===========================

  - Some off-heap memory goes into direct memory, some does not. This
confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
is counted into MaxDirectMemory while
"taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
we should rename the keys to reflect that. There is no one "off heap"
memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
and "taskmanager.memory.framework.direct: XXXmb" instead?

  - What do you think about renaming "taskmanager.memory.total-flink.size"
to "taskmanager.memory.flink.size" and
"taskmanager.memory.total-process.size" to
"taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
think these keys may be a bit less clumsy (dropping the "total-") without
loss of expressiveness.

  - The network memory keys are now called "taskmanager.memory.shuffle.*".
To my knowledge, shuffle is usually understood as a redistribution (random,
or maybe by hash of key). As an example, there are many discussions about
"shuffle join versus broadcast join", where "shuffle" is the synonym for
"re-partitioning". We use that memory however for all network operations,
like forward pipes, broadcasts, receiver-side buffering on checkpoints,
etc. I find the name "*.shuffle.*" confusing, and I am wondering whether
users would find it confusing as well. So I am throwing in the suggestion
to call the options
"taskmanager.memory.network.*".

  - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
that they also account for I/O direct memory, but the parameter is not
counted into the MaxDirectMemory limit.


==================
Some minor comments
==================

  - We can make the new configuration options strongly typed ConfigOptions.
For example, directly use MemorySize-typed options. That makes validation
automatic and saves us from breaking the options later.
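
    As a sketch of what that could look like with the typed ConfigOptions
builder (the concrete option and description below are just an example,
not a proposal for an actual option definition):

        import org.apache.flink.configuration.ConfigOption;
        import org.apache.flink.configuration.ConfigOptions;
        import org.apache.flink.configuration.MemorySize;

        public class MemoryOptionsExample {
            // A MemorySize-typed option: parsing and validation of the value
            // happen in the configuration framework, not at every call site.
            public static final ConfigOption<MemorySize> TASK_OFF_HEAP_MEMORY =
                    ConfigOptions.key("taskmanager.memory.task.off-heap.size")
                            .memoryType()
                            .defaultValue(MemorySize.parse("0m"))
                            .withDescription(
                                    "Task off-heap memory size for the TaskExecutors.");
        }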


Best,
Stephan

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
Thanks all for the discussion.

> We agreed to have process.size in the default settings with the explanation
> of flink.size alternative in the comment.
> This way we keep -tm as a shortcut to process.size only and any
> inconsistencies fail fast as if configured in yaml.
>

The conclusion sounds good to me. I'll update the JIRA description and PR
for FLINK-15530.

+1 to resolve this discussion thread. We can keep the default values
discussion in the "tuning FLIP-49 default configuration values" ML thread.
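
As a concrete illustration of the agreed -tm behaviour (the flag syntax,
key names, and sizes here are only an example):

    bin/yarn-session.sh -tm 4096
      -> sets taskmanager.memory.process.size to 4096m, just as if it were
         configured in flink-conf.yaml
      -> a taskmanager.memory.flink.size (or other fine-grained options) left
         in the configuration that does not add up with that process size
         fails fast during the memory configuration validation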


Thank you~

Xintong Song



On Wed, Jan 15, 2020 at 5:54 AM Andrey Zagrebin <az...@apache.org>
wrote:

> Hi all,
>
> Stephan, Till and me had another offline discussion today. Here is the
> outcome of our brainstorm.
>
> We agreed to have process.size in the default settings with the explanation
> of flink.size alternative in the comment.
> This way we keep -tm as a shortcut to process.size only and any
> inconsistencies fail fast as if configured in yaml.
> I will also follow-up on the thread "[Discuss] Tuning FLIP-49 configuration
> default values" with a bit more details.
>
> If no further objections, we can conclude this last point in this
> discussion.
>
> Best,
> Andrey
>
> On Tue, Jan 14, 2020 at 4:28 PM Stephan Ewen <se...@apache.org> wrote:
>
> > I think that problem exists anyways and is independent of the "-tm"
> option.
> >
> > You can have a combination of `task.heap.size` and `managed.size` and
> > `framework.heap.size` that conflicts with `flink.size`. In that case, we
> > get an exception during the startup (incompatible memory configuration)?
> > That is the price for having these "shortcut" options (`flink.size` and
> > `process.size`). But it is a fair price, because the shortcuts are very
> > valuable to most users.
> >
> > What the "-tm" setting adds is an easy way to get the shortcut setting
> > added, even if it was not in the config. So where a config worked in
> > standalone, it can now fail in a container setup when "-tm" is used.
> > But that is expected, I guess, when you start manually tuning different
> > memory types and end up defining an inconsistent combination. And it
> > never conflicts with the default setup, only with manually tuned regions.
> >
> > But this example shows why we need to have log output for the logic that
> > validates and configures the memory settings, so that users understand
> what
> > is happening.
> >
> > Best,
> > Stephan
> >
> >
> > On Tue, Jan 14, 2020 at 2:54 PM Till Rohrmann <tr...@apache.org>
> > wrote:
> >
> > > Clearing the `flink.size` option and setting `process.size` could
> indeed
> > be
> > > a solution. The thing I'm wondering is what would happen if the user
> has
> > > configured `task.heap.size` and `managed.size` instead of `flink.size`?
> > > Would we also ignore these settings? If not, then we risk running into
> > > the situation that TaskExecutorResourceUtils fails because the memory
> > > does not add up. I guess the thing which bugs me a bit is the special
> > > casing, which could easily lead to inconsistent behaviour if we don't
> > > cover all cases.
> > > The consequence of using slightly different concepts (`flink.size`,
> > > `process.size`) in standalone vs. container/Yarn/Mesos mode in order to
> > > keep the status quo is an increased maintenance overhead which we
> should
> > be
> > > aware of.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Jan 14, 2020 at 3:59 AM Xintong Song <to...@gmail.com>
> > > wrote:
> > >
> > > > True, even if we have "process.size" rather than "flink.size" in the
> > > > default config file, users can still have "flink.size" in their custom
> > > > config file.
> > > > If we consider "-tm" as a shortcut for users to override the TM memory
> > > > size, then it makes less sense to require users to remove "flink.size"
> > > > from their config file whenever they want to use "-tm".
> > > >
> > > > I'm convinced that ignoring "flink.size" might not be a bad idea.
> > > > And I think we should also update the description of "-tm" (in
> > > > "FlinkYarnSessionCli") to explicitly mention that it would overwrite
> > > > "flink.size" and "process.size".
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Tue, Jan 14, 2020 at 2:24 AM Stephan Ewen <se...@apache.org>
> wrote:
> > > >
> > > > > Would be good to hear the thoughts of some more Yarn users, though.
> > > > >
> > > > > On Mon, Jan 13, 2020 at 7:23 PM Stephan Ewen <se...@apache.org>
> > wrote:
> > > > >
> > > > > > I think we need an interpretation of "-tm" regardless of what is
> in
> > > the
> > > > > > default configuration, because we can always have a modified
> > > > > configuration
> > > > > > and then a user passes the "-tm" flag.
> > > > > >
> > > > > > I kind of like the first proposal: Interpret "-tm" as "override
> > > memory
> > > > > > size config and set the Yarn TM container size." It would mean
> > > exactly
> > > > > > ignoring "taskmanager.memory.flink.size" and using the "-tm"
> value
> > > as "
> > > > > > "taskmanager.memory.process.size".
> > > > > > That does not sound too bad to me.
> > > > > >
> > > > > > Best,
> > > > > > Stephan
> > > > > >
> > > > > >
> > > > > > On Mon, Jan 13, 2020 at 5:35 PM Andrey Zagrebin <
> > > azagrebin@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> While working on changing process memory to Flink memory in
> > default
> > > > > >> configuration, Xintong encountered a problem.
> > > > > >> When the -tm option is used to rewrite the container memory, basically
> > > > > >> the process memory, it can collide with the default Flink memory.
> > > > > >> For legacy users it should not be a problem as we adjusted the
> > > legacy
> > > > > heap
> > > > > >> size option to be interpreted differently for standalone and
> > > container
> > > > > >> modes.
> > > > > >>
> > > > > >> One solution could be to say in -tm docs that we rewrite both
> > > options
> > > > > >> under
> > > > > >> the hood: process and Flink memory, basically unset Flink memory
> > > from
> > > > > yaml
> > > > > >> config.
> > > > > >> The downside is that this adds more magic.
> > > > > >>
> > > > > >> Alternatively, we can keep process memory in default config and,
> > as
> > > > > >> mentioned before, increase it to maintain the user experience by
> > > > > matching
> > > > > >> the previous default setting for heap (now Flink in standalone)
> > > size.
> > > > > >> The Flink memory can be mentioned in process memory comment as a
> > > > simpler
> > > > > >> alternative which does not require accounting for JVM overhead.
> > > > > >> The downside is again more confusion while trying out Flink and
> > > tuning
> > > > > >> memory at the same time.
> > > > > >> On the other hand, if memory already needs to be tuned it should
> > > > > >> quite quickly lead to the necessity of understanding the memory
> > > model
> > > > in
> > > > > >> Flink.
> > > > > >>
> > > > > >> Best,
> > > > > >> Andrey
> > > > > >>
> > > > > >> On Thu, Jan 9, 2020 at 12:27 PM Stephan Ewen <se...@apache.org>
> > > > wrote:
> > > > > >>
> > > > > >> > Great! Thanks, guys, for the continued effort on this topic!
> > > > > >> >
> > > > > >> > On Thu, Jan 9, 2020 at 5:27 AM Xintong Song <
> > > tonysong820@gmail.com>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Thanks all for the discussion. I believe we have reached consensus
> > > > > >> > > on all the open questions discussed in this thread.
> > > > > >> > >
> > > > > >> > > Since Andrey already created a JIRA ticket for renaming the shuffle
> > > > > >> > > memory config keys to "taskmanager.memory.network.*", I'll create a
> > > > > >> > > ticket for the other topic that puts flink.size in flink-conf.yaml.
> > > > > >> > >
> > > > > >> > > Thank you~
> > > > > >> > >
> > > > > >> > > Xintong Song
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Wed, Jan 8, 2020 at 9:39 PM Andrey Zagrebin <
> > > > > azagrebin@apache.org>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > It also looks to me that we should only swap network and memory
> > > > > >> > > > in the option names: 'taskmanager.memory.network.*'.
> > > > > >> > > > There is no strong consensus towards using the new 'shuffle'
> > > > > >> > > > naming, so we can just rename it to 'taskmanager.memory.network.*'
> > > > > >> > > > as the 'shuffle' naming has never been released.
> > > > > >> > > > When we have other shuffle services and start advertising this
> > > > > >> > > > concept more in Flink, we could revisit the whole naming for this
> > > > > >> > > > concept.
> > > > > >> > > > https://jira.apache.org/jira/browse/FLINK-15517
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Stephan Ewen <se...@apache.org>.
About the name for the key "taskmanager.memory.network.*" or
"taskmanager.memory.shuffle.*":
This is a nice-to-have, but not super critical. If there ends up being no
consensus, I can withdraw my concern.

On Wed, Jan 8, 2020 at 4:01 AM Xintong Song <to...@gmail.com> wrote:

> Thanks for the explanations @Stephan, and feedback @Jingsong @Andrey
> @Till. I'm also ok with flink.size in flink-conf.yaml.
>
> And if I understand correctly, we have also reached consensus on having the
> configuration key 'taskmanager.memory.network.*' (with deprecated key
> 'taskmanager.network.memory.*')?
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Jan 7, 2020 at 7:10 PM Till Rohrmann <tr...@apache.org> wrote:
>
> > I'm also slightly in favour of the flink.size instead of process.size. If
> > Xintong is ok with this, then we should update the flink-conf.yaml in the
> > repo.
> >
> > Cheers,
> > Till
> >
> > On Mon, Jan 6, 2020 at 4:09 PM Andrey Zagrebin <az...@apache.org>
> > wrote:
> >
> > > Thank you for more explanation Stephan and feedback Jingsong,
> > >
> > > I see the point now.
> > > ‘taskmanager.memory.flink.size’ is indeed simpler to understand for the
> > > newcomers in the default config because it includes fewer types of
> > > memory to consider.
> > > Most of the time beginners will think about heap size and maybe state
> > size
> > > to tweak in the try-out jobs.
> > > ‘taskmanager.memory.flink.size’ is better scoped for those types of
> > memory.
> > >
> > > The process memory forces one to understand the total memory consumption
> > > which is more important for the next steps and will probably require
> > > reading the docs in more depth anyways.
> > > I agree if we decide for ‘flink.size’, it is worth mentioning a pointer
> > to
> > > ‘process.size' in its comment as an alternative to it.
> > >
> > > I am ok with ‘flink.size’ in the default config.
> > >
> > > Best,
> > > Andrey
> > >
> > > On Mon, Dec 30, 2019 at 5:13 AM Jingsong Li <ji...@gmail.com>
> > > wrote:
> > >
> > > > Thank you for your wonderful discussion.
> > > >
> > > > +1 for set "taskmanager.memory.flink.size" in the default config.
> > > > Maybe we can write and explain "taskmanager.memory.process.size" in
> the
> > > > comments.
> > > >
> > > > Most of the time,
> > > > - "trying out" users is less in-depth users and use standalone mode.
> > > > - Production users use active setups.
> > > >
> > > > Default config it very important to "trying out".
> > > > In "trying out", for a novice Java user of Flink, he want to
> configure
> > as
> > > > much memory as my standalone process should be. In fact, it's hard
> for
> > > him
> > > > to realize the JVM overhead. And there is no need that must let him
> > know.
> > > >
> > > > > I don't think it is realistic that users set the process memory to
> > full
> > > > machine memory. There is a lot on the machine as well in most cases.
> > > >
> > > > +1, we often run Flink in a less clean environment, such as an
> > > > environment that also has HDFS. Whether we are testing or in production,
> > > > we will not use all the memory of the machine, e.g. always leaving some
> > > > memory for PageCache.
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > > On Sat, Dec 28, 2019 at 7:02 PM Stephan Ewen <se...@apache.org>
> wrote:
> > > >
> > > > > "taskmanager.memory.flink.size" in the default config has a few
> > > > advantages.
> > > > >
> > > > >  - The value in the default config needs to be suitable for "trying
> > > out"
> > > > > Flink, for a good "getting started" experience.
> > > > >
> > > > >   - For trying out Flink, standalone is the most common entry point
> > > > (except
> > > > > running in IDE).
> > > > >
> > > > >   - In standalone setup, from total process memory, we subtract
> > quite a
> > > > bit
> > > > > before we arrive at the usable memory. We also subtract managed
> > memory
> > > > from
> > > > > the heap now. I fear we might end up at a heap that becomes so
> small
> > > that
> > > > > it makes for a bad "getting started" experience.
> > > > >
> > > > >   - I don't think it is realistic that users set the process memory
> > to
> > > > full
> > > > > machine memory. There is a lot on the machine as well in most
> cases.
> > > > >
> > > > >   - In the JVM world, users are used to configuring the heap size
> and
> > > > know
> > > > > that there is additional memory overhead. The
> > > > > "taskmanager.memory.flink.size" option fits well with that mindset.
> > > > >
> > > > >   - Once you start to think about the total process memory of Yarn
> > > > > containers, you are already past the getting-started phase and on the
> > > > > tuning phase.
> > > > >
> > > > >
> > > > > On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <az...@apache.org>
> > > > wrote:
> > > > >
> > > > > > Thanks for the summary, Xintong! It makes sense to me.
> > > > > >
> > > > > > How about putting "taskmanager.memory.flink.size" in the
> > > configuration?
> > > > > > > Then new downloaded Flink behaves similar to the previous
> > > Standalone
> > > > > > setups.
> > > > > > > If someone upgrades the binaries, but re-uses their old
> > > > configuration,
> > > > > > > then they get the compatibility as discussed previously.
> > > > > > > We used that approach previously with the fine-grained failover
> > > > > recovery.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > I'm trying to understand why "taskmanager.memory.flink.size"
> > rather
> > > > > than
> > > > > > > "taskmanager.memory.process.size" in the default
> flink-conf.yaml.
> > > Or
> > > > > put
> > > > > > it
> > > > > > > another way, why do we want the new downloaded Flink behaves
> > > similar
> > > > to
> > > > > > > previous Standalone setups rather than previous active mode
> > setups?
> > > > Is
> > > > > > > there any special reason that I overlooked, which makes
> backwards
> > > > > > > compatibility for standalone setups more important than for
> > active
> > > > > > setups?
> > > > > > > IMO, "taskmanager.memory.process.size" is easier for the new
> > > comers.
> > > > > For
> > > > > > > standalone setups, users can simply configure it to their
> > machines'
> > > > > > > available memory size, without needing to worry about leaving
> > > enough
> > > > > > space
> > > > > > > for JVM overhead / metaspace. For containerized setups, it's
> > more
> > > > > > > predictable how much memory the containers / Flink could use,
> > which
> > > > is
> > > > > > more
> > > > > > > friendly for users to manage their resource quota.
> > > > > > > Therefore, unless there is anything I overlooked, I'm in favor
> of
> > > > > putting
> > > > > > > "taskmanager.memory.process.size" rather than
> > > > > > > "taskmanager.memory.flink.size" in the default configuration.
> > > > > >
> > > > > >
> > > > > > I agree that having "taskmanager.memory.process.size" in default
> > > config
> > > > > > should be easier to understand and tweak for the new users
> because
> > it
> > > > is
> > > > > > just what they are ready to spend for Flink application.
> > > > > > The problem is when users upgrade Flink and use the new default
> > > > > > configuration then the behaviour can change:
> > > > > > - either if we put process memory then Flink memory shrinks and
> the
> > > new
> > > > > > default option contradicts their previous understanding
> > > > > > - or if we put Flink memory then larger container is requested.
> > > > > > The shrinking of memory sounds more implicit and worse. The
> > increased
> > > > > > container request will just fail in the worst case and the memory
> > > setup
> > > > > can
> > > > > > be revisited.
> > > > > > We could increase the default "taskmanager.memory.process.size"
> to
> > > > better
> > > > > > align it with the previous default setup for standalone
> > > > > > but this would not remove the possible confusion problem for the
> > old
> > > > > users,
> > > > > > on the other hand the option is new and we can add a comment how
> to
> > > > > migrate
> > > > > > from the old one.
> > > > > >
> > > > > > All in all, now I also tend to have
> > "taskmanager.memory.process.size"
> > > > in
> > > > > > the default config unless there are more reasons for having less
> > > > > confusion
> > > > > > for the old standalone users.
> > > > > >
> > > > > > Best,
> > > > > > Andrey
> > > > > >
> > > > > > On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <
> > tonysong820@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > It seems we have already reached consensus on most of the issues.
> > > Thanks
> > > > > > > everyone for the good discussion.
> > > > > > >
> > > > > > > While there are still open questions under discussion, I'd like
> > to
> > > > > > > summarize the discussion so far, and list the action items that
> > we
> > > > > > already
> > > > > > > have reached consensus on. In this way, we can already start working on
> > > these
> > > > > > items
> > > > > > > while discussing the remaining open questions. Please correct
> me
> > if
> > > > the
> > > > > > > summary does not reflect your argument.
> > > > > > >
> > > > > > > *Action Items that we have already reached consensus on:*
> > > > > > >
> > > > > > >    - Map "taskmanager.heap.size" to
> > "taskmanager.memory.flink.size"
> > > > for
> > > > > > >    standalone setups, and "taskmanager.memory.process.size" for
> > > > active
> > > > > > setups.
> > > > > > >    (should be mentioned in release notes)
> > > > > > >    - If not explicitly configured, MiniCluster should have
> fixed
> > > > > default
> > > > > > >    network and managed memory sizes. (should be mentioned in
> > docs &
> > > > > > release
> > > > > > >    notes)
> > > > > > >    - Change the memory config options' type from String to
> > > MemorySize
> > > > > > >    - Change the config option key
> > > > "taskmanager.memory.total-flink.size"
> > > > > > >    to "taskmanager.memory.flink.size"
> > > > > > >    - Change the config option key
> > > > > "taskmanager.memory.total-process.size"
> > > > > > >    to "taskmanager.memory.process.size"
> > > > > > >    - Update descriptions for
> > > > > "taskmanager.memory.framework.off-heap.size"
> > > > > > >    and "taskmanager.memory.task.off-heap.size" to explicitly
> > state
> > > > that
> > > > > > >       - Both direct and native memory are accounted
> > > > > > >       - Will be fully counted into MaxDirectMemorySize
> > > > > > >    - Update descriptions for
> > > > > > >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to
> remove
> > > > "I/O
> > > > > > direct
> > > > > > >    memory" and explicitly state that it's not counted into
> > > > > > MaxDirectMemorySize
> > > > > > >    - Print MemorySize with proper unit. (non-blocker for 1.10)
> > > > > > >
> > > > > > >
> > > > > > > *Questions that are still open:*
> > > > > > >
> > > > > > >    - Which config option do we put in the default
> > flink-conf.yaml?
> > > > > > >       - "taskmanager.memory.flink.size"
> > > > > > >       - "taskmanager.memory.process.size"
> > > > > > >       - Deprecated "taskmanager.heap.size"
> > > > > > >    - What are the proper keys for network / shuffle memory
> > > > > > >       - "taskmanager.memory.shuffle.*"
> > > > > > >       - "taskmanager.memory.network.*"
> > > > > > >       - "taskmanager.network.memory.*"
> > > > > > >
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
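The first action item in the summary above (mapping the legacy "taskmanager.heap.size") could end up looking roughly like the sketch below; the helper itself is hypothetical, the real mapping would live in the startup scripts and the active resource managers:

    import org.apache.flink.configuration.Configuration;

    final class LegacyMemoryOptionMapper {
        // Sketch only. Standalone: the legacy value never included metaspace /
        // JVM overhead, so it maps to the Flink memory. Active (Yarn / Mesos):
        // the legacy value was the full container size, so it maps to the
        // process memory.
        static void mapLegacyHeapSize(Configuration conf, boolean activeSetup) {
            final String legacyKey = "taskmanager.heap.size";
            final String newKey = activeSetup
                    ? "taskmanager.memory.process.size"
                    : "taskmanager.memory.flink.size";
            if (conf.containsKey(legacyKey) && !conf.containsKey(newKey)) {
                conf.setString(newKey, conf.getString(legacyKey, null));
            }
        }
    }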
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <
> > > tonysong820@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> How about putting "taskmanager.memory.flink.size" in the
> > > > > configuration?
> > > > > > >>> Then new downloaded Flink behaves similar to the previous
> > > > Standalone
> > > > > > setups.
> > > > > > >>> If someone upgrades the binaries, but re-uses their old
> > > > > configuration,
> > > > > > >>> then they get the compatibility as discussed previously.
> > > > > > >>> We used that approach previously with the fine-grained
> failover
> > > > > > recovery.
> > > > > > >>
> > > > > > >>
> > > > > > >> I'm trying to understand why "taskmanager.memory.flink.size"
> > > rather
> > > > > than
> > > > > > >> "taskmanager.memory.process.size" in the default
> > flink-conf.yaml.
> > > Or
> > > > > > put it
> > > > > > >> another way, why do we want the new downloaded Flink behaves
> > > similar
> > > > > to
> > > > > > >> previous Standalone setups rather than previous active mode
> > > setups?
> > > > Is
> > > > > > >> there any special reason that I overlooked, which makes
> > backwards
> > > > > > >> compatibility for standalone setups more important than for
> > active
> > > > > > setups?
> > > > > > >>
> > > > > > >> IMO, "taskmanager.memory.process.size" is easier for the new
> > > comers.
> > > > > For
> > > > > > >> standalone setups, users can simply configure it to their
> > > machines'
> > > > > > >> available memory size, without needing to worry about leaving
> > > enough
> > > > > > space
> > > > > > >> for JVM overhead / metaspace. For containerized setups, it's
> > more
> > > > > > >> predictable how much memory the containers / Flink could use,
> > > which
> > > > is
> > > > > > more
> > > > > > >> friendly for users to manage their resource quota.
> > > > > > >>
> > > > > > >> Therefore, unless there is anything I overlooked, I'm in favor
> > of
> > > > > > putting
> > > > > > >> "taskmanager.memory.process.size" rather than
> > > > > > >> "taskmanager.memory.flink.size" in the default configuration.
> > > > > > >>
> > > > > > >>
> > > > > > >> Thank you~
> > > > > > >>
> > > > > > >> Xintong Song
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <
> > > > azagrebin@apache.org
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> How about putting "taskmanager.memory.flink.size" in the
> > > > > configuration?
> > > > > > >>>> Then new downloaded Flink behaves similar to the previous
> > > > Standalone
> > > > > > setups.
> > > > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > > > configuration,
> > > > > > >>>> then they get the compatibility as discussed previously.
> > > > > > >>>> We used that approach previously with the fine-grained
> > failover
> > > > > > >>>> recovery.
> > > > > > >>>
> > > > > > >>> +1, this sounds like a good compromise.
> > > > > > >>>
> > > > > > >>> +1 to not have more options for off-heap memory, that would
> get
> > > > > > >>>> confusing fast. We can keep the name "off-heap" for both
> task
> > > and
> > > > > > >>>> framework, as long as they mean the same thing: native plus
> > > > direct,
> > > > > > and
> > > > > > >>>> fully counted into MaxDirectMemory. I would suggest to
> update
> > > the
> > > > > > config
> > > > > > >>>> descriptions then to reflect that.
> > > > > > >>>>
> > > > > > >>> True, this should be explained in the config descriptions.
> > > > > > >>>
> > > > > > >>> looks good to me
> > > > > > >>>
> > > > > > >>> From a user's perspective I believe
> > "taskmanager.memory.network"
> > > > > would
> > > > > > >>>> be easier to understand as not everyone knows exactly what
> the
> > > > > shuffle
> > > > > > >>>> service is. I see the point that it would be a bit imprecise
> > as
> > > we
> > > > > > can have
> > > > > > >>>> different shuffle implementations but I would go with the
> ease
> > > of
> > > > > > >>>> use/understanding here. Moreover, I think that we won't have
> > > many
> > > > > > different
> > > > > > >>>> shuffle service implementations in the foreseeable future.
> > > > > > >>>
> > > > > > >>> I agree that if we cannot find any other convincing names for
> > the
> > > > > > >>> options, we should keep what we already have and change it if
> > the
> > > > > > >>> alternative is convincing enough.
> > > > > > >>> The question is also whether we still want to rename it
> because
> > > it
> > > > > was
> > > > > > >>> "taskmanager.network.*memory*.*" in 1.9 but
> > > > > > "taskmanager.*memory*.network.*"
> > > > > > >>> is more aligned with other new memory option names.
> > > > > > >>> Or we can just 'un'-deprecate
> "taskmanager.network.*memory*.*".
> > > > > > >>>
> > > > > > >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <
> sewen@apache.org
> > >
> > > > > wrote:
> > > > > > >>>
> > > > > > >>>> How about putting "taskmanager.memory.flink.size" in the
> > > > > > configuration?
> > > > > > >>>> Then new downloaded Flink behaves similar to the previous
> > > > Standalone
> > > > > > setups.
> > > > > > >>>>
> > > > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > > > configuration,
> > > > > > >>>> then they get the compatibility as discussed previously.
> > > > > > >>>> We used that approach previously with the fine-grained
> > failover
> > > > > > >>>> recovery.
> > > > > > >>>>
> > > > > > >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <
> > > > tonysong820@gmail.com
> > > > > >
> > > > > > >>>> wrote:
> > > > > > >>>>
> > > > > > >>>>> +1 to not have more options for off-heap memory, that would
> > get
> > > > > > >>>>>> confusing fast. We can keep the name "off-heap" for both
> > task
> > > > and
> > > > > > >>>>>> framework, as long as they mean the same thing: native
> plus
> > > > > direct,
> > > > > > and
> > > > > > >>>>>> fully counted into MaxDirectMemory. I would suggest to
> > update
> > > > the
> > > > > > config
> > > > > > >>>>>> descriptions then to reflect that.
> > > > > > >>>>>>
> > > > > > >>>>> True, this should be explained in the config descriptions.
> > > > > > >>>>>
> > > > > > >>>>> Which configuration option will be set in Flink's default
> > > > > > >>>>>> flink-conf.yaml? If we want to maintain the existing
> > behaviour
> > > > it
> > > > > > would
> > > > > > >>>>>> have to be the then deprecated taskmanager.heap.size
> config
> > > > > option.
> > > > > > If we
> > > > > > >>>>>> are ok with Yarn requesting slightly larger containers,
> then
> > > it
> > > > > > could also
> > > > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > > > >>>>>>
> > > > > > >>>>> Good point. Currently, we have
> > > > > > >>>>> "taskmanager.memory.total-process.size". In order to
> preserve
> > > the
> > > > > > previous
> > > > > > >>>>> behavior, we need to have "taskmanager.heap.size" so it can
> > be
> > > > > > mapped to
> > > > > > >>>>> different new options in standalone / active setups.
> > > > > > >>>>> I think we can have the deprecated "taskmanager.heap.size"
> in
> > > the
> > > > > > >>>>> default flink-conf.yaml, and also have the
> > > > > > >>>>> new "taskmanager.memory.total-process.size" in a commented
> > > line.
> > > > We
> > > > > > can
> > > > > > >>>>> explain how the deprecated config option behaves
> differently
> > in
> > > > the
> > > > > > >>>>> comments, so that user can switch to the new config options
> > if
> > > > they
> > > > > > want to.
> > > > > > >>>>>
> > > > > > >>>>> Thank you~
> > > > > > >>>>>
> > > > > > >>>>> Xintong Song
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <
> > > > > trohrmann@apache.org>
> > > > > > >>>>> wrote:
> > > > > > >>>>>
> > > > > > >>>>>> Thanks for the feedback and good discussion everyone. I
> left
> > > > some
> > > > > > >>>>>> comments inline.
> > > > > > >>>>>>
> > > > > > >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <
> > > sewen@apache.org>
> > > > > > >>>>>> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> +1 to not have more options for off-heap memory, that
> would
> > > get
> > > > > > >>>>>>> confusing fast. We can keep the name "off-heap" for both
> > task
> > > > and
> > > > > > >>>>>>> framework, as long as they mean the same thing: native
> plus
> > > > > > direct, and
> > > > > > >>>>>>> fully counted into MaxDirectMemory. I would suggest to
> > update
> > > > the
> > > > > > config
> > > > > > >>>>>>> descriptions then to reflect that.
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> > > > > > tonysong820@gmail.com>
> > > > > > >>>>>>> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> Regarding the framework/task direct/native memory
> > options, I
> > > > > tend
> > > > > > >>>>>>>> to think it differently. I'm in favor of keep the
> > > > > > "*.off-heap.size" for the
> > > > > > >>>>>>>> config option keys.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>    - It's not necessary IMO to expose the difference
> > > concepts
> > > > of
> > > > > > >>>>>>>>    direct / native memory to the users.
> > > > > > >>>>>>>>    - I would avoid introducing more options for native
> > > memory
> > > > if
> > > > > > >>>>>>>>    possible. Taking fine grained resource management and
> > > > dynamic
> > > > > > slot
> > > > > > >>>>>>>>    allocation into consideration, that also means
> > introduce
> > > > more
> > > > > > fields into
> > > > > > >>>>>>>>    ResourceSpec / ResourceProfile.
> > > > > > >>>>>>>>    - My gut feeling is that having a relative loose
> > > > > > >>>>>>>>    MaxDirectMemory should not be a big problem.
> > > > > > >>>>>>>>    - In most cases, the task / framework off-heap memory
> > > > should
> > > > > be
> > > > > > >>>>>>>>       mainly (if not all) direct memory, so the
> difference
> > > > > > between derived
> > > > > > >>>>>>>>       MaxDirectMemory and the ideal direct memory limit
> > > should
> > > > > > not be too much.
> > > > > > >>>>>>>>       - We do not have a good way to know the exact size
> > > > needed
> > > > > > >>>>>>>>       for jvm overhead / metaspace and framework / task
> > > > off-heap
> > > > > > memory, thus
> > > > > > >>>>>>>>       having to conservatively reserve slightly more
> > memory
> > > > then
> > > > > > what actually
> > > > > > >>>>>>>>       needed. Such reserved but not used memory can
> cover
> > > for
> > > > > the
> > > > > > small
> > > > > > >>>>>>>>       MaxDirectMemory error.
> > > > > > >>>>>>>>       -
> > > > > > >>>>>>>>       - MaxDirectMemory is not the only way to trigger
> > full
> > > > gc.
> > > > > We
> > > > > > >>>>>>>>       still heap activities that can also trigger the
> gc.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Regarding the memory type config options, I've looked
> into
> > > the
> > > > > > >>>>>>>> latest ConfigOptions changes. I think it shouldn't be
> too
> > > > > > complicated to
> > > > > > >>>>>>>> change the config options to use memory type, and I can
> > > handle
> > > > > it
> > > > > > maybe
> > > > > > >>>>>>>> during your vacations.
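For reference, a strongly typed option with the new builder would look roughly like the sketch below (the key and description are illustrative, not the final definitions):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    // Sketch only: a MemorySize-typed option, so values like "512m" are
    // parsed and validated by the configuration layer instead of by hand.
    public final class MemoryOptionsSketch {
        public static final ConfigOption<MemorySize> TASK_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.task.heap.size")
                        .memoryType()
                        .noDefaultValue()
                        .withDescription("Task heap memory size for TaskExecutors.");
    }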
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Also agree with improving MemorySize logging and
> parsing.
> > > This
> > > > > > >>>>>>>> should not be a blocker that has to be done in 1.10. I
> > would
> > > > say
> > > > > > we finish
> > > > > > >>>>>>>> other works (testability, documentation and those
> > discussed
> > > in
> > > > > > this thread)
> > > > > > >>>>>>>> first, and get to this only if we have time.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Thank you~
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Xintong Song
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> > > > > > >>>>>>>> azagrebin.apache@gmail.com> wrote:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>> Hi Stephan and Xintong,
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - "taskmanager.memory.size" (old main config option)
> is
> > > > > > replaced
> > > > > > >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> > > > > > different meaning in
> > > > > > >>>>>>>>>> standalone setups. The old option did not subtract
> > > metaspace
> > > > > > and other
> > > > > > >>>>>>>>>> overhead, while the new option does. That means that
> > with
> > > > the
> > > > > > default
> > > > > > >>>>>>>>>> config, standalone clusters get quite a bit less
> memory.
> > > > > > (independent of
> > > > > > >>>>>>>>>> managed memory going off heap).
> > > > > > >>>>>>>>>>     I am wondering if we could interpret
> > > > > > >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be
> in
> > > line
> > > > > > with the old
> > > > > > >>>>>>>>>> mechanism (assuming managed memory is set to off
> heap).
> > > > > > >>>>>>>>>>     The effect would be that the container size on
> > > > Yarn/Mesos
> > > > > > >>>>>>>>>> increases, because from
> > > > "taskmanager.memory.total-flink.size",
> > > > > > we need to
> > > > > > >>>>>>>>>> add overhead and metaspace to reach the total process
> > > size,
> > > > > > rather than
> > > > > > >>>>>>>>>> cutting off memory. But if we want, we could even
> adjust
> > > for
> > > > > > that in the
> > > > > > >>>>>>>>>> active resource manager, getting full backwards
> > > > compatibility
> > > > > > on that part.
> > > > > > >>>>>>>>>>     Curious to hear more thoughts there.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I think the problem here is that the
> > > > > > >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > > > standalone setups
> > > > > > >>>>>>>>> and active yarn / mesos setups, and such different
> > > > calculation
> > > > > > logics and
> > > > > > >>>>>>>>> behaviors are exactly what we want to avoid with
> FLIP-49.
> > > > > > Therefore, I'm
> > > > > > >>>>>>>>> not in favor of treating
> > > > "taskmanager.memory.total-flink.size"
> > > > > > differently
> > > > > > >>>>>>>>> for standalone and active setups.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I think what we really want is probably
> > > > > > >>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > > > options
> > > > > > in
> > > > > > >>>>>>>>> different setups. How about we mark
> > "taskmanager.heap.size"
> > > > as
> > > > > > deprecated
> > > > > > >>>>>>>>> key for neither of
> > "taskmanager.memory.total-process.size"
> > > > and
> > > > > > >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we
> parse
> > it
> > > > (if
> > > > > > explicitly
> > > > > > >>>>>>>>> configured) in startup scripts / active resource
> > managers,
> > > > and
> > > > > > set the
> > > > > > >>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> > > scripts
> > > > > and
> > > > > > >>>>>>>>> "taskmanager.memory.total-process.size" in active
> > resource
> > > > > > managers (if the
> > > > > > >>>>>>>>> new config options are not configured). We can provide
> > util
> > > > > > methods in
> > > > > > >>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep
> > all
> > > > the
> > > > > > >>>>>>>>> configuration logics at one place.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I agree that the problem is that the legacy option
> > > > > > >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> > > > > > standalone/container.
> > > > > > >>>>>>>>> We had it initially falling back to
> > > > > > 'taskmanager.memory.total-flink.size’
> > > > > > >>>>>>>>> but I changed that to align it with container cut-off.
> > Now
> > > I
> > > > > see
> > > > > > it changes
> > > > > > >>>>>>>>> standalone setup then.
> > > > > > >>>>>>>>> +1 for supporting its backwards compatibility
> differently
> > > for
> > > > > > >>>>>>>>> standalone/container setups.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>> Which configuration option will be set in Flink's default
> > > > > > >>>>>> flink-conf.yaml? If we want to maintain the existing
> > behaviour
> > > > it
> > > > > > would
> > > > > > >>>>>> have to be the then deprecated taskmanager.heap.size
> config
> > > > > option.
> > > > > > If we
> > > > > > >>>>>> are ok with Yarn requesting slightly larger containers,
> then
> > > it
> > > > > > could also
> > > > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > > > >>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> > > pools
> > > > > as
> > > > > > a
> > > > > > >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > > > wondering if it is
> > > > > > >>>>>>>>>> the right approach here.
> > > > > > >>>>>>>>>>     For example: I started a relatively large JVM
> (large
> > > > heap
> > > > > > >>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > > system
> > > > > > tries to
> > > > > > >>>>>>>>>> reserve an additional 6GB for managed memory which is
> > more
> > > > > than
> > > > > > there is
> > > > > > >>>>>>>>>> memory left. When you see the error that no memory
> could
> > > be
> > > > > > allocated, you
> > > > > > >>>>>>>>>> need to understand the magic of how this is derived.
> > > > > > >>>>>>>>>>     I am trying to think about this from the
> perspective
> > > of
> > > > > > using
> > > > > > >>>>>>>>>> "Flink as a Library", which the MiniCluster is close
> to.
> > > > > > >>>>>>>>>>     When starting Flink out of a running process, we
> > > cannot
> > > > > > >>>>>>>>>> assume that we are the only users of that process and
> > that
> > > > we
> > > > > > can mold the
> > > > > > >>>>>>>>>> process to our demands. I think a fix value for
> managed
> > > > memory
> > > > > > and network
> > > > > > >>>>>>>>>> memory would feel more natural in such a setup than a
> > > > > mechanism
> > > > > > that is
> > > > > > >>>>>>>>>> tailored towards exclusive use of the process.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> also +1 for that, if user has not specified any main
> > > options
> > > > to
> > > > > > >>>>>>>>> derive memory. We should also log this fixing of
> managed
> > /
> > > > > > shuffle memory.
> > > > > > >>>>>>>>> And just noticed, we could also sanity check framework
> > and
> > > if
> > > > > > >>>>>>>>> explicitly configured task heap against available JVM
> > heap,
> > > > and
> > > > > > at least
> > > > > > >>>>>>>>> log inconsistencies.
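In config terms, the fixed-value behaviour for embedded / MiniCluster use would amount to something like the following (sizes are illustrative; the managed-memory key is assumed from FLIP-49, and the network keys follow the naming still under discussion):

    taskmanager.memory.managed.size: 128m
    taskmanager.memory.network.min: 64m
    taskmanager.memory.network.max: 64m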
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - Some off-heap memory goes into direct memory, some
> > does
> > > > > not.
> > > > > > >>>>>>>>>> This confused me a bit. For example
> > > > > > >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is
> counted
> > > into
> > > > > > >>>>>>>>>> MaxDirectMemory while
> > > > "taskmanager.memory.task.off-heap.size"
> > > > > > is counted as
> > > > > > >>>>>>>>>> native memory. Maybe we should rename the keys to
> > reflect
> > > > > that.
> > > > > > There is no
> > > > > > >>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > > > >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > > > >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is
> also
> > > > > > >>>>>>>>> accounted in the max direct memory size limit. The
> > > confusion
> > > > > > probably comes
> > > > > > >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > > > explicitly mentioned
> > > > > > >>>>>>>>> that in its description while
> > > > > > "taskmanager.memory.task.off-heap.size"
> > > > > > >>>>>>>>> didn't. That's actually because the framework off-heap
> > > memory
> > > > > is
> > > > > > introduced
> > > > > > >>>>>>>>> later in a separate commit. We should fix that.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> For framework / task off-heap memory, we do not
> > > differentiate
> > > > > > >>>>>>>>> direct / native memory usage. That means the configured
> > > value
> > > > > for
> > > > > > these two
> > > > > > >>>>>>>>> options could be a mixture of direct / native memory.
> > Since
> > > > we
> > > > > > do not know
> > > > > > >>>>>>>>> the portion of direct memory out of the configured
> value,
> > > we
> > > > > have
> > > > > > >>>>>>>>> to conservatively account it all into the max direct
> > memory
> > > > > size
> > > > > > limit.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total
> > size
> > > > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > > > MaxDirectMemory?
> > > > > > It is a
> > > > > > >>>>>>>>> difficult parameter, and the main reason to set it was
> > (if
> > > I
> > > > > > recall
> > > > > > >>>>>>>>> correctly) to trigger GC based on direct memory
> > allocation
> > > > (to
> > > > > > free heap
> > > > > > >>>>>>>>> structures that then in turn release direct memory). If
> > the
> > > > > > limit is
> > > > > > >>>>>>>>> anyways too high (because we also count native memory
> in
> > > > there)
> > > > > > such that
> > > > > > >>>>>>>>> we can exceed the total process (container) memory, why
> > do
> > > we
> > > > > > set it then?*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I always also thought about it as providing more safety
> > net
> > > > for
> > > > > > >>>>>>>>> direct allocations but GC thing looks more important.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1 for fixing docs for
> > > > 'taskmanager.memory.task.off-heap.size’
> > > > > > and
> > > > > > >>>>>>>>> renaming to ‘direct' as this is what really happens
> > > > > > >>>>>>>>> if we want to support direct limit more exact than now.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I also think that it is hard to separate direct /
> native
> > > > memory
> > > > > > >>>>>>>>> unless we introduce even more options.
> > > > > > >>>>>>>>> If user wants to keep the direct limit tight to a
> certain
> > > > value
> > > > > > >>>>>>>>> but also use native memory outside of it,
> > > > > > >>>>>>>>> she would have to increase something else, like JVM
> > > overhead
> > > > to
> > > > > > >>>>>>>>> account for it and there is no other better way.
> > > > > > >>>>>>>>> Having more options to account for the native memory
> > > outside
> > > > of
> > > > > > >>>>>>>>> direct limit complicates things but can be introduced
> > later
> > > > if
> > > > > > needed.
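As a worked illustration of that conservative accounting (numbers are made up, and the network buffers are left out for brevity):

      framework off-heap   128m   (direct and/or native, fully counted)
    + task off-heap        256m   (direct and/or native, fully counted)
    ----------------------------------------------------------------
    = -XX:MaxDirectMemorySize of at least 384m

so the flag is an upper bound that can be looser than the actual direct usage.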
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - What do you think about renaming
> > > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> > > > > > "taskmanager.memory.flink.size"
> > > > > > >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> > > > > > >>>>>>>>>> "taskmanager.memory.process.size" (or
> > > > > > "taskmanager.memory.jvm.total"). I
> > > > > > >>>>>>>>>> think these keys may be a bit less clumsy (dropping
> the
> > > > > > "total-") without
> > > > > > >>>>>>>>>> loss of expressiveness.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1 on this.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1 as well. Also an option:
> > > > > > >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> > > > > > >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> > > > > > >>>>>>>>> although it can be also mentioned in docs that we mean
> > JVM
> > > > > > process.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>> I'd be in favour of Stephan's proposal for the config keys
> > as
> > > > > > shorter
> > > > > > >>>>>> is usually better and they are still descriptive enough.
> > > Between
> > > > > > >>>>>> "taskmanager.memory.process.size" and
> > > > > > "taskmanager.memory.jvm.total" I
> > > > > > >>>>>> would slightly favour the first variant.
> > > > > > >>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>>>   - The network memory keys are now called
> > > > > > >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge,
> shuffle
> > > is
> > > > > > usually
> > > > > > >>>>>>>>>> understood as a redistribution (random, or maybe by
> hash
> > > of
> > > > > > key). As an
> > > > > > >>>>>>>>>> example, there are many discussions about "shuffle
> join
> > > > versus
> > > > > > broadcast
> > > > > > >>>>>>>>>> join", where "shuffle" is the synonym for
> > > "re-partitioning".
> > > > > We
> > > > > > use that
> > > > > > >>>>>>>>>> memory however for all network operations, like
> forward
> > > > pipes,
> > > > > > broadcasts,
> > > > > > >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find
> the
> > > name
> > > > > > "*.shuffle.*"
> > > > > > >>>>>>>>>> confusing, I am wondering if users would find that as
> > > well.
> > > > So
> > > > > > throwing in
> > > > > > >>>>>>>>>> the suggestion to call the options
> > > > > > "taskmanager.memory.network.*".
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +0 on this one. I'm ok with
> > "taskmanager.memory.network.*".
> > > > On
> > > > > > the
> > > > > > >>>>>>>>> other hand, one can also argue that this part of memory
> > is
> > > > used
> > > > > > by
> > > > > > >>>>>>>>> ShuffleEnvironment, and the key
> > > > "taskmanager.memory.shuffle.*"
> > > > > > points more
> > > > > > >>>>>>>>> directly to the shuffle service components.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may
> be
> > a
> > > > bit
> > > > > > >>>>>>>>> incorrect, because it is doing not only shuffles as
> well.
> > > The
> > > > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name
> is
> > > not
> > > > > too
> > > > > > critical.
> > > > > > >>>>>>>>> This isn't super high priority for me, but if we want
> to
> > > > adjust
> > > > > > it, better
> > > > > > >>>>>>>>> earlier than later.*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> This is also a bit controversial topic for me. Indeed,
> we
> > > > have
> > > > > > >>>>>>>>> always used ’network’ for this concept of task data
> > > shuffling
> > > > > > over the
> > > > > > >>>>>>>>> network and this can confuse existing users.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> On the other hand for the new users and in a long term,
> > > > > ’network’
> > > > > > >>>>>>>>> can delude into a conclusion that all network memory is
> > > > managed
> > > > > > by this
> > > > > > >>>>>>>>> option.
> > > > > > >>>>>>>>> Also other types of shuffle might not directly deal
> with
> > > > > network
> > > > > > >>>>>>>>> at all.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> By calling it shuffle, we were somewhat biased by
> > > > understanding
> > > > > > it
> > > > > > >>>>>>>>> in term of map/reduce. This is rather an inter-task
> data
> > > > > > exchange.
> > > > > > >>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’
> > or
> > > > > > >>>>>>>>>
> > > ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>  From a user's perspective I believe
> > > > "taskmanager.memory.network"
> > > > > > >>>>>> would be easier to understand as not everyone knows
> exactly
> > > what
> > > > > the
> > > > > > >>>>>> shuffle service is. I see the point that it would be a bit
> > > > > > imprecise as we
> > > > > > >>>>>> can have different shuffle implementations but I would go
> > with
> > > > the
> > > > > > ease of
> > > > > > >>>>>> use/understanding here. Moreover, I think that we won't
> have
> > > > many
> > > > > > different
> > > > > > >>>>>> shuffle service implementations in the foreseeable future.
> > > > > > >>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>>>   - The descriptions for the
> > > > > "taskmanager.memory.jvm-overhead.*"
> > > > > > >>>>>>>>>> keys say that it also accounts for I/O direct memory,
> > but
> > > > the
> > > > > > parameter is
> > > > > > >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> True. Since we already have framework off-heap memory
> > > > accounted
> > > > > > >>>>>>>>> for ad hoc direct memory usages, accounting all of
> > > > jvm-overhead
> > > > > > also into
> > > > > > >>>>>>>>> max direct memory limit seems not necessary. I would
> > > suggest
> > > > to
> > > > > > remove "I/O
> > > > > > >>>>>>>>> direct memory" from the description, and explicitly
> > mention
> > > > > that
> > > > > > this
> > > > > > >>>>>>>>> option does not account for direct memory and will not
> be
> > > > > > accounted into
> > > > > > >>>>>>>>> max direct memory limit. WDYT?
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - Can make the new ConfigOptions strongly typed with
> > the
> > > > new
> > > > > > >>>>>>>>>> configuration options. For example, directly use
> > > MemorySize
> > > > > > typed options.
> > > > > > >>>>>>>>>> That makes validation automatic and saves us from
> > breaking
> > > > the
> > > > > > options
> > > > > > >>>>>>>>>> later.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> +1
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I would appreciate it.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Also small side note, we extensively use MemorySize in
> > logs
> > > > now
> > > > > > >>>>>>>>> but it might be not always readable as its string
> > > > > representation
> > > > > > is only in
> > > > > > >>>>>>>>> bytes atm
> > > > > > >>>>>>>>> and does not reduce it to kb/mb/etc in case of big
> bytes
> > > > value.
> > > > > > We
> > > > > > >>>>>>>>> could have at least some .prettyPrint function to use
> in
> > > > logs.
> > > > > > >>>>>>>>> And .fromMegabytes/etc factory methods would improve
> code
> > > > > > >>>>>>>>> readability instead of .parse(int + “m”).
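A minimal sketch of what such helpers could do (a standalone class here for illustration; the names and their placement on MemorySize are only suggestions):

    // Sketch only: suggested pretty-printing and factory helpers.
    public final class MemorySizeSketch {

        private static final String[] UNITS = {"bytes", "kb", "mb", "gb", "tb"};

        // Renders a byte count with the largest unit that divides it evenly,
        // e.g. 268435456 -> "256 mb".
        public static String prettyPrint(long bytes) {
            long value = bytes;
            int unit = 0;
            while (unit < UNITS.length - 1 && value >= 1024 && value % 1024 == 0) {
                value /= 1024;
                unit++;
            }
            return value + " " + UNITS[unit];
        }

        // Factory alternative to parsing strings like "128m".
        public static long megabytesToBytes(long megabytes) {
            return megabytes << 20;
        }
    }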
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Best,
> > > > > > >>>>>>>>> Andrey
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <
> sewen@apache.org
> > >
> > > > > wrote:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Hi Xintong!
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Please find my answers inline:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>>   - "taskmanager.memory.size" (old main config option)
> > is
> > > > > > >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size"
> > which
> > > > has
> > > > > > a different
> > > > > > >>>>>>>>>>> meaning in standalone setups. The old option did not
> > > > subtract
> > > > > > metaspace and
> > > > > > >>>>>>>>>>> other overhead, while the new option does. That means
> > > that
> > > > > > with the default
> > > > > > >>>>>>>>>>> config, standalone clusters get quite a bit less
> > memory.
> > > > > > (independent of
> > > > > > >>>>>>>>>>> managed memory going off heap).
> > > > > > >>>>>>>>>>>     I am wondering if we could interpret
> > > > > > >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > > > >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be
> in
> > > > line
> > > > > > with the old
> > > > > > >>>>>>>>>>> mechanism (assuming managed memory is set to off
> heap).
> > > > > > >>>>>>>>>>>     The effect would be that the container size on
> > > > Yarn/Mesos
> > > > > > >>>>>>>>>>> increases, because from
> > > > > "taskmanager.memory.total-flink.size",
> > > > > > we need to
> > > > > > >>>>>>>>>>> add overhead and metaspace to reach the total process
> > > size,
> > > > > > rather than
> > > > > > >>>>>>>>>>> cutting off memory. But if we want, we could even
> > adjust
> > > > for
> > > > > > that in the
> > > > > > >>>>>>>>>>> active resource manager, getting full backwards
> > > > compatibility
> > > > > > on that part.
> > > > > > >>>>>>>>>>>     Curious to hear more thoughts there.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> Yes*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>> I think the problem here is that the
> > > > > > >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > > > standalone setups
> > > > > > >>>>>>>>>> and active yarn / mesos setups, and such different
> > > > calculation
> > > > > > logics and
> > > > > > >>>>>>>>>> behaviors are exactly what we want to avoid with
> > FLIP-49.
> > > > > > Therefore, I'm
> > > > > > >>>>>>>>>> not in favor of treating
> > > > "taskmanager.memory.total-flink.size"
> > > > > > differently
> > > > > > >>>>>>>>>> for standalone and active setups.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I think what we really want is probably
> > > > > > >>>>>>>>>> mapping "taskmanager.heap.size" to different new
> config
> > > > > options
> > > > > > in
> > > > > > >>>>>>>>>> different setups. How about we mark
> > > "taskmanager.heap.size"
> > > > as
> > > > > > deprecated
> > > > > > >>>>>>>>>> key for neither of
> > "taskmanager.memory.total-process.size"
> > > > and
> > > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we
> parse
> > > it
> > > > > (if
> > > > > > explicitly
> > > > > > >>>>>>>>>> configured) in startup scripts / active resource
> > managers,
> > > > and
> > > > > > set the
> > > > > > >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> > > > scripts
> > > > > > and
> > > > > > >>>>>>>>>> "taskmanager.memory.total-process.size" in active
> > resource
> > > > > > managers (if the
> > > > > > >>>>>>>>>> new config options are not configured). We can provide
> > > util
> > > > > > methods in
> > > > > > >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to
> keep
> > > all
> > > > > the
> > > > > > >>>>>>>>>> configuration logics at one place.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> > > > > > >>>>>>>>> description was not very clear), so +1 for that
> > mechanism*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> > > pools
> > > > > as
> > > > > > a
> > > > > > >>>>>>>>>>> standalone setup. I get the idea behind that, but I
> am
> > > > > > wondering if it is
> > > > > > >>>>>>>>>>> the right approach here.
> > > > > > >>>>>>>>>>>     For example: I started a relatively large JVM
> > (large
> > > > heap
> > > > > > >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > > > system
> > > > > > tries to
> > > > > > >>>>>>>>>>> reserve an additional 6GB for managed memory which is
> > > more
> > > > > > than there is
> > > > > > >>>>>>>>>>> memory left. When you see the error that no memory
> > could
> > > be
> > > > > > allocated, you
> > > > > > >>>>>>>>>>> need to understand the magic of how this is derived.
> > > > > > >>>>>>>>>>>     I am trying to think about this from the
> > perspective
> > > of
> > > > > > >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is
> > > close
> > > > > to.
> > > > > > >>>>>>>>>>>     When starting Flink out of a running process, we
> > > cannot
> > > > > > >>>>>>>>>>> assume that we are the only users of that process and
> > > that
> > > > we
> > > > > > can mold the
> > > > > > >>>>>>>>>>> process to our demands. I think a fix value for
> managed
> > > > > memory
> > > > > > and network
> > > > > > >>>>>>>>>>> memory would feel more natural in such a setup than a
> > > > > > mechanism that is
> > > > > > >>>>>>>>>>> tailored towards exclusive use of the process.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> +1 on having fixed values for managed / shuffle
> memory.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> Cool, let's also see what Andrey and Till think
> > here.*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>>   - Some off-heap memory goes into direct memory, some
> > > does
> > > > > not.
> > > > > > >>>>>>>>>>> This confused me a bit. For example
> > > > > > >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is
> counted
> > > > into
> > > > > > >>>>>>>>>>> MaxDirectMemory while
> > > > "taskmanager.memory.task.off-heap.size"
> > > > > > is counted as
> > > > > > >>>>>>>>>>> native memory. Maybe we should rename the keys to
> > reflect
> > > > > > that. There is no
> > > > > > >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > > > >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > > > >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is
> > also
> > > > > > >>>>>>>>>> accounted in the max direct memory size limit. The
> > > confusion
> > > > > > probably comes
> > > > > > >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > > > explicitly mentioned
> > > > > > >>>>>>>>>> that in its description while
> > > > > > "taskmanager.memory.task.off-heap.size"
> > > > > > >>>>>>>>>> didn't. That's actually because the framework off-heap
> > > > memory
> > > > > > is introduced
> > > > > > >>>>>>>>>> later in a separate commit. We should fix that.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> For framework / task off-heap memory, we do not
> > > > differentiate
> > > > > > >>>>>>>>>> direct / native memory usage. That means the configured
> > > value
> > > > > > for these two
> > > > > > >>>>>>>>>> options could be a mixture of direct / native memory.
> > > Since
> > > > we
> > > > > > do not know
> > > > > > >>>>>>>>>> the portion of direct memory out of the configured
> > value,
> > > we
> > > > > > have
> > > > > > >>>>>>>>>> to conservatively account it all into the max direct
> > > memory
> > > > > > size limit.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total
> > size
> > > > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > > > MaxDirectMemory?
> > > > > > It is a
> > > > > > >>>>>>>>> difficult parameter, and the main reason to set it was
> > (if
> > > I
> > > > > > recall
> > > > > > >>>>>>>>> correctly) to trigger GC based on direct memory
> > allocation
> > > > (to
> > > > > > free heap
> > > > > > >>>>>>>>> structures that then in turn release direct memory). If
> > the
> > > > > > limit is
> > > > > > >>>>>>>>> anyways too high (because we also count native memory
> in
> > > > there)
> > > > > > such that
> > > > > > >>>>>>>>> we can exceed the total process (container) memory, why
> > do
> > > we
> > > > > > set it then?*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>   - The network memory keys are now called
> > > > > > >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge,
> > shuffle
> > > is
> > > > > > usually
> > > > > > >>>>>>>>>>> understood as a redistribution (random, or maybe by
> > hash
> > > of
> > > > > > key). As an
> > > > > > >>>>>>>>>>> example, there are many discussions about "shuffle
> join
> > > > > versus
> > > > > > broadcast
> > > > > > >>>>>>>>>>> join", where "shuffle" is the synonym for
> > > > "re-partitioning".
> > > > > > We use that
> > > > > > >>>>>>>>>>> memory however for all network operations, like
> forward
> > > > > pipes,
> > > > > > broadcasts,
> > > > > > >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find
> the
> > > > name
> > > > > > "*.shuffle.*"
> > > > > > >>>>>>>>>>> confusing, I am wondering if users would find that as
> > > well.
> > > > > So
> > > > > > throwing in
> > > > > > >>>>>>>>>>> the suggestion to call the options
> > > > > > "taskmanager.memory.network.*".
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> +0 on this one. I'm ok with
> > > "taskmanager.memory.network.*".
> > > > On
> > > > > > >>>>>>>>>> the other hand, one can also argue that this part of
> > > memory
> > > > is
> > > > > > used by
> > > > > > >>>>>>>>>> ShuffleEnvironment, and the key
> > > > "taskmanager.memory.shuffle.*"
> > > > > > points more
> > > > > > >>>>>>>>>> directly to the shuffle service components.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may
> be
> > a
> > > > bit
> > > > > > >>>>>>>>> incorrect, because it is doing not only shuffles as
> well.
> > > The
> > > > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name
> is
> > > not
> > > > > too
> > > > > > critical.
> > > > > > >>>>>>>>> This isn't super high priority for me, but if we want
> to
> > > > adjust
> > > > > > it, better
> > > > > > >>>>>>>>> earlier than later.*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>    - The descriptions for the
> > > > > "taskmanager.memory.jvm-overhead.*"
> > > > > > >>>>>>>>> keys say that it also accounts for I/O direct memory,
> but
> > > the
> > > > > > parameter is
> > > > > > >>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> True. Since we already have framework off-heap memory
> > > > > accounted
> > > > > > >>>>>>>>>> for ad hoc direct memory usages, accounting all of
> > > > > jvm-overhead
> > > > > > also into
> > > > > > >>>>>>>>>> max direct memory limit seems not necessary. I would
> > > suggest
> > > > > to
> > > > > > remove "I/O
> > > > > > >>>>>>>>>> direct memory" from the description, and explicitly
> > > mention
> > > > > > that this
> > > > > > >>>>>>>>>> option does not account for direct memory and will not
> > be
> > > > > > accounted into
> > > > > > >>>>>>>>>> max direct memory limit. WDYT?
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> Sounds good. *
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with
> > the
> > > > new
> > > > > > >>>>>>>>>>> configuration options. For example, directly use
> > > MemorySize
> > > > > > typed options.
> > > > > > >>>>>>>>>>> That makes validation automatic and saves us from
> > > breaking
> > > > > the
> > > > > > options
> > > > > > >>>>>>>>>>> later.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> +1. Wasn't aware of the new memory type config
> options.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Best,
> > > > > > >>>>>>>>> Stephan
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > > >
> > >
> >
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
Thanks for the explanations @Stephan, and feedback @Jingsong @Andrey
@Till. I'm also ok with flink.size in flink-conf.yaml.

And if I understand correctly, we have also reached consensus on having the
configuration key 'taskmanager.memory.network.*' (with deprecated key
'taskmanager.network.memory.*')?


Thank you~

Xintong Song



On Tue, Jan 7, 2020 at 7:10 PM Till Rohrmann <tr...@apache.org> wrote:

> I'm also slightly in favour of the flink.size instead of process.size. If
> Xintong is ok with this, then we should update the flink-conf.yaml in the
> repo.
>
> Cheers,
> Till
>
> On Mon, Jan 6, 2020 at 4:09 PM Andrey Zagrebin <az...@apache.org>
> wrote:
>
> > Thank you for more explanation Stephan and feedback Jingsong,
> >
> > I see the point now.
> > ‘taskmanager.memory.flink.size’ is indeed simpler to understand for the
> > newcomers in the default config because it includes fewer types of memory
> to
> > consider.
> > Most of the time beginners will think about heap size and maybe state
> size
> > to tweak in the try-out jobs.
> > ‘taskmanager.memory.flink.size’ is better scoped for those types of
> memory.
> >
> > The process memory forces one to understand the total memory consumption
> > which is more important for the next steps and will probably require
> > reading the docs in more depth anyways.
> > I agree if we decide for ‘flink.size’, it is worth mentioning a pointer
> to
> > ‘process.size' in its comment as an alternative to it.
> >
> > I am ok with ‘flink.size’ in the default config.
> >
> > Best,
> > Andrey
> >
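For the comment in the default config, the relation between the two options could be spelled out roughly like this (component sizes are purely illustrative):

    taskmanager.memory.process.size = taskmanager.memory.flink.size
                                      + JVM metaspace + JVM overhead
    e.g. 1728m = 1280m + 256m + 192m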
> > On Mon, Dec 30, 2019 at 5:13 AM Jingsong Li <ji...@gmail.com>
> > wrote:
> >
> > > Thank you for your wonderful discussion.
> > >
> > > +1 for setting "taskmanager.memory.flink.size" in the default config.
> > > Maybe we can write and explain "taskmanager.memory.process.size" in the
> > > comments.
> > >
> > > Most of the time,
> > > - "trying out" users are less in-depth users and use standalone mode.
> > > - Production users use active setups.
> > >
> > > The default config is very important for "trying out".
> > > When "trying out", a novice Java user of Flink wants to configure as
> > > much memory as the standalone process should use. In fact, it's hard
> > > for him to realize the JVM overhead, and there is no need to make him
> > > aware of it.
> > >
> > > > I don't think it is realistic that users set the process memory to
> full
> > > machine memory. There is a lot on the machine as well in most cases.
> > >
> > > +1, We often run Flink in a less clean environment, such as the
> > environment
> > > have HDFS. Whether we are testing or producing, we will not use all the
> > > memory of the machine, such as always leaving some memory for
> PageCache.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Sat, Dec 28, 2019 at 7:02 PM Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > "taskmanager.memory.flink.size" in the default config has a few
> > > advantages.
> > > >
> > > >  - The value in the default config needs to be suitable for "trying
> > out"
> > > > Flink, for a good "getting started" experience.
> > > >
> > > >   - For trying out Flink, standalone is the most common entry point
> > > (except
> > > > running in IDE).
> > > >
> > > >   - In standalone setup, from total process memory, we subtract
> quite a
> > > bit
> > > > before we arrive at the usable memory. We also subtract managed
> memory
> > > from
> > > > the heap now. I fear we might end up at a heap that becomes so small
> > that
> > > > it makes for a bad "getting started" experience.
> > > >
> > > >   - I don't think it is realistic that users set the process memory
> to
> > > full
> > > > machine memory. There is a lot on the machine as well in most cases.
> > > >
> > > >   - In the JVM world, users are used to configuring the heap size and
> > > know
> > > > that there is additional memory overhead. The
> > > > "taskmanager.memory.flink.size" option fits well with that mindset.
> > > >
> > > >   - Once you start to think about the total process memory of Yarn
> > > > containers, you are already past the getting-started phase and on the
> > > > tuning phase.
> > > >
> > > >
> > > > On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <az...@apache.org>
> > > wrote:
> > > >
> > > > > Thanks for the summary, Xintong! It makes sense to me.
> > > > >
> > > > > How about putting "taskmanager.memory.flink.size" in the
> > configuration?
> > > > > > Then new downloaded Flink behaves similar to the previous
> > Standalone
> > > > > setups.
> > > > > > If someone upgrades the binaries, but re-uses their old
> > > configuration,
> > > > > > then they get the compatibility as discussed previously.
> > > > > > We used that approach previously with the fine-grained failover
> > > > recovery.
> > > > >
> > > > >
> > > > >
> > > > > > I'm trying to understand why "taskmanager.memory.flink.size"
> rather
> > > > than
> > > > > > "taskmanager.memory.process.size" in the default flink-conf.yaml.
> > Or
> > > > put
> > > > > it
> > > > > > another way, why do we want the new downloaded Flink behaves
> > similar
> > > to
> > > > > > previous Standalone setups rather than previous active mode
> setups?
> > > Is
> > > > > > there any special reason that I overlooked, which makes backwards
> > > > > > compatibility for standalone setups more important than for
> active
> > > > > setups?
> > > > > > IMO, "taskmanager.memory.process.size" is easier for the new
> > comers.
> > > > For
> > > > > > standalone setups, users can simply configure it to their
> machines'
> > > > > > available memory size, without needing to worry about leaving
> > enough
> > > > > space
> > > > > > for JVM overhead / metaspace. For containerized setups, it's
> more
> > > > > > predictable how much memory the containers / Flink could use,
> which
> > > is
> > > > > more
> > > > > > friendly for users to manage their resource quota.
> > > > > > Therefore, unless there is anything I overlooked, I'm in favor of
> > > > putting
> > > > > > "taskmanager.memory.process.size" rather than
> > > > > > "taskmanager.memory.flink.size" in the default configuration.
> > > > >
> > > > >
> > > > > I agree that having "taskmanager.memory.process.size" in default
> > config
> > > > > should be easier to understand and tweak for the new users because
> it
> > > is
> > > > > just what they are ready to spend for Flink application.
> > > > > The problem is when users upgrade Flink and use the new default
> > > > > configuration then the behaviour can change:
> > > > > - either if we put process memory then Flink memory shrinks and the
> > new
> > > > > default option contradicts their previous understanding
> > > > > - or if we put Flink memory then larger container is requested.
> > > > > The shrinking of memory sounds more implicit and worse. The
> increased
> > > > > container request will just fail in the worst case and the memory
> > setup
> > > > can
> > > > > be revisited.
> > > > > We could increase the default "taskmanager.memory.process.size" to
> > > better
> > > > > align it with the previous default setup for standalone
> > > > > but this would not remove the possible confusion problem for the
> old
> > > > users,
> > > > > on the other hand the option is new and we can add a comment how to
> > > > migrate
> > > > > from the old one.
> > > > >
> > > > > All in all, now I also tend to have
> "taskmanager.memory.process.size"
> > > in
> > > > > the default config unless there are more reasons for having less
> > > > confusion
> > > > > for the old standalone users.
> > > > >
> > > > > Best,
> > > > > Andrey
> > > > >
> > > > > On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <
> tonysong820@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > It seems we have already reached consensus on most of the issues.
> > Thanks
> > > > > > everyone for the good discussion.
> > > > > >
> > > > > > While there are still open questions under discussion, I'd like
> to
> > > > > > summarize the discussion so far, and list the action items that
> we
> > > > > already
> > > > > > reached consensus on. In this way, we can already start working on
> > these
> > > > > items
> > > > > > while discussing the remaining open questions. Please correct me
> if
> > > the
> > > > > > summary does not reflect your argument.
> > > > > >
> > > > > > *Action Items that we have already reached consensus on:*
> > > > > >
> > > > > >    - Map "taskmanager.heap.size" to
> "taskmanager.memory.flink.size"
> > > for
> > > > > >    standalone setups, and "taskmanager.memory.process.size" for
> > > active
> > > > > setups.
> > > > > >    (should be mentioned in release notes)
> > > > > >    - If not explicitly configured, MiniCluster should have fixed
> > > > default
> > > > > >    network and managed memory sizes. (should be mentioned in
> docs &
> > > > > release
> > > > > >    notes)
> > > > > >    - Change the memory config options' type from String to
> > MemorySize
> > > > > >    - Change the config option key
> > > "taskmanager.memory.total-flink.size"
> > > > > >    to "taskmanager.memory.flink.size"
> > > > > >    - Change the config option key
> > > > "taskmanager.memory.total-process.size"
> > > > > >    to "taskmanager.memory.process.size"
> > > > > >    - Update descriptions for
> > > > "taskmanager.memory.framework.off-heap.size"
> > > > > >    and "taskmanager.memory.task.off-heap.size" to explicitly
> state
> > > that
> > > > > >       - Both direct and native memory are accounted
> > > > > >       - Will be fully counted into MaxDirectMemorySize
> > > > > >    - Update descriptions for
> > > > > >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove
> > > "I/O
> > > > > direct
> > > > > >    memory" and explicitly state that it's not counted into
> > > > > MaxDirectMemorySize
> > > > > >    - Print MemorySize with proper unit. (non-blocker for 1.10)
> > > > > >
> > > > > >
> > > > > > *Questions that are still open:*
> > > > > >
> > > > > >    - Which config option do we put in the default
> flink-conf.yaml?
> > > > > >       - "taskmanager.memory.flink.size"
> > > > > >       - "taskmanager.memory.process.size"
> > > > > >       - Deprecated "taskmanager.heap.size"
> > > > > >    - What are the proper keys for network / shuffle memory
> > > > > >       - "taskmanager.memory.shuffle.*"
> > > > > >       - "taskmanager.memory.network.*"
> > > > > >       - "taskmanager.network.memory.*"
> > > > > >
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <
> > tonysong820@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> How about putting "taskmanager.memory.flink.size" in the
> > > > configuration?
> > > > > >>> Then new downloaded Flink behaves similar to the previous
> > > Standalone
> > > > > setups.
> > > > > >>> If someone upgrades the binaries, but re-uses their old
> > > > configuration,
> > > > > >>> then they get the compatibility as discussed previously.
> > > > > >>> We used that approach previously with the fine-grained failover
> > > > > recovery.
> > > > > >>
> > > > > >>
> > > > > >> I'm trying to understand why "taskmanager.memory.flink.size"
> > rather
> > > > than
> > > > > >> "taskmanager.memory.process.size" in the default
> flink-conf.yaml.
> > Or
> > > > > put it
> > > > > >> another way, why do we want the new downloaded Flink to behave
> > similarly
> > > > to
> > > > > >> previous Standalone setups rather than previous active mode
> > setups?
> > > Is
> > > > > >> there any special reason that I overlooked, which makes
> backwards
> > > > > >> compatibility for standalone setups more important than for
> active
> > > > > setups?
> > > > > >>
> > > > > >> IMO, "taskmanager.memory.process.size" is easier for the new
> > comers.
> > > > For
> > > > > >> standalone setups, users can simply configure it to their
> > machines'
> > > > > >> available memory size, without needing to worry about leaving
> > enough
> > > > > space
> > > > > >> for JVM overhead / metaspace. For containerized setups, it's
> more
> > > > > >> predictable how much memory the containers / Flink could use,
> > which
> > > is
> > > > > more
> > > > > >> friendly for users to manage their resource quota.
> > > > > >>
> > > > > >> Therefore, unless there is anything I overlooked, I'm in favor
> of
> > > > > putting
> > > > > >> "taskmanager.memory.process.size" rather than
> > > > > >> "taskmanager.memory.flink.size" in the default configuration.
> > > > > >>
> > > > > >>
> > > > > >> Thank you~
> > > > > >>
> > > > > >> Xintong Song
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <
> > > azagrebin@apache.org
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>> How about putting "taskmanager.memory.flink.size" in the
> > > > configuration?
> > > > > >>>> Then new downloaded Flink behaves similar to the previous
> > > Standalone
> > > > > setups.
> > > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > > configuration,
> > > > > >>>> then they get the compatibility as discussed previously.
> > > > > >>>> We used that approach previously with the fine-grained
> failover
> > > > > >>>> recovery.
> > > > > >>>
> > > > > >>> +1, this sounds like a good compromise.
> > > > > >>>
> > > > > >>> +1 to not have more options for off-heap memory, that would get
> > > > > >>>> confusing fast. We can keep the name "off-heap" for both task
> > and
> > > > > >>>> framework, as long as they mean the same thing: native plus
> > > direct,
> > > > > and
> > > > > >>>> fully counted into MaxDirectMemory. I would suggest to update
> > the
> > > > > config
> > > > > >>>> descriptions then to reflect that.
> > > > > >>>>
> > > > > >>> True, this should be explained in the config descriptions.
> > > > > >>>
> > > > > >>> looks good to me
> > > > > >>>
> > > > > >>> From a user's perspective I believe
> "taskmanager.memory.network"
> > > > would
> > > > > >>>> be easier to understand as not everyone knows exactly what the
> > > > shuffle
> > > > > >>>> service is. I see the point that it would be a bit imprecise
> as
> > we
> > > > > can have
> > > > > >>>> different shuffle implementations but I would go with the ease
> > of
> > > > > >>>> use/understanding here. Moreover, I think that we won't have
> > many
> > > > > different
> > > > > >>>> shuffle service implementations in the foreseeable future.
> > > > > >>>
> > > > > >>> I agree that if we cannot find any other convincing names for
> the
> > > > > >>> options, we should keep what we already have and change it if
> the
> > > > > >>> alternative is convincing enough.
> > > > > >>> The question is also whether we still want to rename it because
> > it
> > > > was
> > > > > >>> "taskmanager.network.*memory*.*" in 1.9 but
> > > > > "taskmanager.*memory*.network.*"
> > > > > >>> is more aligned with other new memory option names.
> > > > > >>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
> > > > > >>>
> > > > > >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <sewen@apache.org
> >
> > > > wrote:
> > > > > >>>
> > > > > >>>> How about putting "taskmanager.memory.flink.size" in the
> > > > > configuration?
> > > > > >>>> Then new downloaded Flink behaves similar to the previous
> > > Standalone
> > > > > setups.
> > > > > >>>>
> > > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > > configuration,
> > > > > >>>> then they get the compatibility as discussed previously.
> > > > > >>>> We used that approach previously with the fine-grained
> failover
> > > > > >>>> recovery.
> > > > > >>>>
> > > > > >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <
> > > tonysong820@gmail.com
> > > > >
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> +1 to not have more options for off-heap memory, that would
> get
> > > > > >>>>>> confusing fast. We can keep the name "off-heap" for both
> task
> > > and
> > > > > >>>>>> framework, as long as they mean the same thing: native plus
> > > > direct,
> > > > > and
> > > > > >>>>>> fully counted into MaxDirectMemory. I would suggest to
> update
> > > the
> > > > > config
> > > > > >>>>>> descriptions then to reflect that.
> > > > > >>>>>>
> > > > > >>>>> True, this should be explained in the config descriptions.
> > > > > >>>>>
> > > > > >>>>> Which configuration option will be set in Flink's default
> > > > > >>>>>> flink-conf.yaml? If we want to maintain the existing
> behaviour
> > > it
> > > > > would
> > > > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > > > option.
> > > > > If we
> > > > > >>>>>> are ok with Yarn requesting slightly larger containers, then
> > it
> > > > > could also
> > > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > > >>>>>>
> > > > > >>>>> Good point. Currently, we have
> > > > > >>>>> "taskmanager.memory.total-process.size". In order to preserve
> > the
> > > > > previous
> > > > > >>>>> behavior, we need to have "taskmanager.heap.size" so it can
> be
> > > > > mapped to
> > > > > >>>>> different new options in standalone / active setups.
> > > > > >>>>> I think we can have the deprecated "taskmanager.heap.size" in
> > the
> > > > > >>>>> default flink-conf.yaml, and also have the
> > > > > >>>>> new "taskmanager.memory.total-process.size" in a commented
> > line.
> > > We
> > > > > can
> > > > > >>>>> explain how the deprecated config option behaves differently
> in
> > > the
> > > > > >>>>> comments, so that users can switch to the new config options
> if
> > > they
> > > > > want to.
> > > > > >>>>>
> > > > > >>>>> Thank you~
> > > > > >>>>>
> > > > > >>>>> Xintong Song
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <
> > > > trohrmann@apache.org>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Thanks for the feedback and good discussion everyone. I left
> > > some
> > > > > >>>>>> comments inline.
> > > > > >>>>>>
> > > > > >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <
> > sewen@apache.org>
> > > > > >>>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> +1 to not have more options for off-heap memory, that would
> > get
> > > > > >>>>>>> confusing fast. We can keep the name "off-heap" for both
> task
> > > and
> > > > > >>>>>>> framework, as long as they mean the same thing: native plus
> > > > > direct, and
> > > > > >>>>>>> fully counted into MaxDirectMemory. I would suggest to
> update
> > > the
> > > > > config
> > > > > >>>>>>> descriptions then to reflect that.
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> > > > > tonysong820@gmail.com>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Regarding the framework/task direct/native memory
> options, I
> > > > tend
> > > > > >>>>>>>> to think about it differently. I'm in favor of keeping the
> > > > > "*.off-heap.size" for the
> > > > > >>>>>>>> config option keys.
> > > > > >>>>>>>>
> > > > > >>>>>>>>    - It's not necessary IMO to expose the difference
> > concepts
> > > of
> > > > > >>>>>>>>    direct / native memory to the users.
> > > > > >>>>>>>>    - I would avoid introducing more options for native
> > memory
> > > if
> > > > > >>>>>>>>    possible. Taking fine grained resource management and
> > > dynamic
> > > > > slot
> > > > > >>>>>>>>    allocation into consideration, that also means
> introduce
> > > more
> > > > > fields into
> > > > > >>>>>>>>    ResourceSpec / ResourceProfile.
> > > > > >>>>>>>>    - My gut feeling is that having a relative loose
> > > > > >>>>>>>>    MaxDirectMemory should not be a big problem.
> > > > > >>>>>>>>    - In most cases, the task / framework off-heap memory
> > > should
> > > > be
> > > > > >>>>>>>>       mainly (if not all) direct memory, so the difference
> > > > > between derived
> > > > > >>>>>>>>       MaxDirectMemory and the ideal direct memory limit
> > should
> > > > > not be too much.
> > > > > >>>>>>>>       - We do not have a good way to know the exact size
> > > needed
> > > > > >>>>>>>>       for jvm overhead / metaspace and framework / task
> > > off-heap
> > > > > memory, thus
> > > > > >>>>>>>>       having to conservatively reserve slightly more
> memory
> > > than
> > > > > what actually
> > > > > >>>>>>>>       needed. Such reserved but not used memory can cover
> > for
> > > > the
> > > > > small
> > > > > >>>>>>>>       MaxDirectMemory error.
> > > > > >>>>>>>>       -
> > > > > >>>>>>>>       - MaxDirectMemory is not the only way to trigger
> full
> > > gc.
> > > > We
> > > > > >>>>>>>>       still have heap activities that can also trigger the gc.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Regarding the memory type config options, I've looked into
> > the
> > > > > >>>>>>>> latest ConfigOptions changes. I think it shouldn't be too
> > > > > complicated to
> > > > > >>>>>>>> change the config options to use memory type, and I can
> > handle
> > > > it
> > > > > maybe
> > > > > >>>>>>>> during your vacations.
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> Also agree with improving MemorySize logging and parsing.
> > This
> > > > > >>>>>>>> should not be a blocker that has to be done in 1.10. I
> would
> > > say
> > > > > we finish
> > > > > >>>>>>>> other works (testability, documentation and those
> discussed
> > in
> > > > > this thread)
> > > > > >>>>>>>> first, and get to this only if we have time.
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thank you~
> > > > > >>>>>>>>
> > > > > >>>>>>>> Xintong Song
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> > > > > >>>>>>>> azagrebin.apache@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Hi Stephan and Xintong,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > > > > replaced
> > > > > >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> > > > > different meaning in
> > > > > >>>>>>>>>> standalone setups. The old option did not subtract
> > metaspace
> > > > > and other
> > > > > >>>>>>>>>> overhead, while the new option does. That means that
> with
> > > the
> > > > > default
> > > > > >>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > > > > (independent of
> > > > > >>>>>>>>>> managed memory going off heap).
> > > > > >>>>>>>>>>     I am wondering if we could interpret
> > > > > >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> > line
> > > > > with the old
> > > > > >>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > > > >>>>>>>>>>     The effect would be that the container size on
> > > Yarn/Mesos
> > > > > >>>>>>>>>> increases, because from
> > > "taskmanager.memory.total-flink.size",
> > > > > we need to
> > > > > >>>>>>>>>> add overhead and metaspace to reach the total process
> > size,
> > > > > rather than
> > > > > >>>>>>>>>> cutting off memory. But if we want, we could even adjust
> > for
> > > > > that in the
> > > > > >>>>>>>>>> active resource manager, getting full backwards
> > > compatibility
> > > > > on that part.
> > > > > >>>>>>>>>>     Curious to hear more thoughts there.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think the problem here is that the
> > > > > >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > > standalone setups
> > > > > >>>>>>>>> and active yarn / mesos setups, and such different
> > > calculation
> > > > > logics and
> > > > > >>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > > > > Therefore, I'm
> > > > > >>>>>>>>> not in favor of treating
> > > "taskmanager.memory.total-flink.size"
> > > > > differently
> > > > > >>>>>>>>> for standalone and active setups.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think what we really want is probably
> > > > > >>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > > options
> > > > > in
> > > > > >>>>>>>>> different setups. How about we mark
> "taskmanager.heap.size"
> > > as
> > > > > deprecated
> > > > > >>>>>>>>> key for neither of
> "taskmanager.memory.total-process.size"
> > > and
> > > > > >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse
> it
> > > (if
> > > > > explicitly
> > > > > >>>>>>>>> configured) in startup scripts / active resource
> managers,
> > > and
> > > > > set the
> > > > > >>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> > scripts
> > > > and
> > > > > >>>>>>>>> "taskmanager.memory.total-process.size" in active
> resource
> > > > > managers (if the
> > > > > >>>>>>>>> new config options are not configured). We can provide
> util
> > > > > methods in
> > > > > >>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep
> all
> > > the
> > > > > >>>>>>>>> configuration logics at one place.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I agree that the problem is that the legacy option
> > > > > >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> > > > > standalone/container.
> > > > > >>>>>>>>> We had it initially falling back to
> > > > > 'taskmanager.memory.total-flink.size’
> > > > > >>>>>>>>> but I changed that to align it with container cut-off.
> Now
> > I
> > > > see
> > > > > it changes
> > > > > >>>>>>>>> standalone setup then.
> > > > > >>>>>>>>> +1 for supporting its backwards compatibility differently
> > for
> > > > > >>>>>>>>> standalone/container setups.
> > > > > >>>>>>>>>
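[To illustrate the mapping proposed above, a minimal sketch, assuming a hypothetical helper; the method name and its placement (e.g. in TaskExecutorResourceUtils) are illustrative, not the actual Flink API:]

    import org.apache.flink.configuration.Configuration;

    public final class LegacyHeapSizeMapping {

        // Keys as proposed in this thread; the legacy key is the pre-FLIP-49 one.
        private static final String LEGACY_KEY = "taskmanager.heap.size";
        private static final String FLINK_SIZE = "taskmanager.memory.flink.size";
        private static final String PROCESS_SIZE = "taskmanager.memory.process.size";

        /** Maps the deprecated key onto a different new key per deployment mode. */
        public static void mapLegacyHeapSize(Configuration conf, boolean activeSetup) {
            if (!conf.containsKey(LEGACY_KEY)) {
                return; // nothing to map
            }
            String legacyValue = conf.getString(LEGACY_KEY, null);
            // Standalone scripts map to Flink memory, active RMs to process memory.
            String newKey = activeSetup ? PROCESS_SIZE : FLINK_SIZE;
            // Only fall back if the user did not already configure the new option.
            if (!conf.containsKey(newKey)) {
                conf.setString(newKey, legacyValue);
            }
        }
    }

[End of sketch.]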
> > > > > >>>>>>>>
> > > > > >>>>>> Which configuration option will be set in Flink's default
> > > > > >>>>>> flink-conf.yaml? If we want to maintain the existing
> behaviour
> > > it
> > > > > would
> > > > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > > > option.
> > > > > If we
> > > > > >>>>>> are ok with Yarn requesting slightly larger containers, then
> > it
> > > > > could also
> > > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> > pools
> > > > as
> > > > > a
> > > > > >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > > wondering if it is
> > > > > >>>>>>>>>> the right approach here.
> > > > > >>>>>>>>>>     For example: I started a relatively large JVM (large
> > > heap
> > > > > >>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > system
> > > > > tries to
> > > > > >>>>>>>>>> reserve an additional 6GB for managed memory which is
> more
> > > > than
> > > > > there is
> > > > > >>>>>>>>>> memory left. When you see the error that no memory could
> > be
> > > > > allocated, you
> > > > > >>>>>>>>>> need to understand the magic of how this is derived.
> > > > > >>>>>>>>>>     I am trying to think about this from the perspective
> > of
> > > > > using
> > > > > >>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
> > > > > >>>>>>>>>>     When starting Flink out of a running process, we
> > cannot
> > > > > >>>>>>>>>> assume that we are the only users of that process and
> that
> > > we
> > > > > can mold the
> > > > > >>>>>>>>>> process to our demands. I think a fix value for managed
> > > memory
> > > > > and network
> > > > > >>>>>>>>>> memory would feel more natural in such a setup than a
> > > > mechanism
> > > > > that is
> > > > > >>>>>>>>>> tailored towards exclusive use of the process.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> also +1 for that, if user has not specified any main
> > options
> > > to
> > > > > >>>>>>>>> derive memory. We should also log this fixing of managed
> /
> > > > > shuffle memory.
> > > > > >>>>>>>>> And just noticed, we could also sanity check framework
> and
> > if
> > > > > >>>>>>>>> explicitly configured task heap against available JVM
> heap,
> > > and
> > > > > at least
> > > > > >>>>>>>>> log inconsistencies.
> > > > > >>>>>>>>>
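[A rough sketch of what fixed MiniCluster defaults could look like; the concrete keys and values are assumptions, since the final option names and sizes are still being discussed in this thread:]

    import org.apache.flink.configuration.Configuration;

    public final class MiniClusterMemoryDefaults {

        /** Returns a configuration with fixed managed / network memory sizes,
         *  instead of deriving them as fractions of the embedding JVM's memory. */
        public static Configuration withFixedMemorySizes() {
            Configuration conf = new Configuration();
            conf.setString("taskmanager.memory.managed.size", "128m"); // assumed key and value
            conf.setString("taskmanager.memory.network.min", "64m");   // assumed key and value
            conf.setString("taskmanager.memory.network.max", "64m");   // assumed key and value
            return conf;
        }
    }

[The resulting configuration would then be handed to the MiniCluster, and the fixed sizes logged as suggested above.]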
> > > > > >>>>>>>>>   - Some off-heap memory goes into direct memory, some
> does
> > > > not.
> > > > > >>>>>>>>>> This confused me a bit. For example
> > > > > >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> > into
> > > > > >>>>>>>>>> MaxDirectMemory while
> > > "taskmanager.memory.task.off-heap.size"
> > > > > is counted as
> > > > > >>>>>>>>>> native memory. Maybe we should rename the keys to
> reflect
> > > > that.
> > > > > There is no
> > > > > >>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > > >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > > >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > > > > >>>>>>>>> accounted in the max direct memory size limit. The
> > confusion
> > > > > probably comes
> > > > > >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > > explicitly mentioned
> > > > > >>>>>>>>> that in its description while
> > > > > "taskmanager.memory.task.off-heap.size"
> > > > > >>>>>>>>> didn't. That's actually because the framework off-heap
> > memory
> > > > is
> > > > > introduced
> > > > > >>>>>>>>> later in a separate commit. We should fix that.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> For framework / task off-heap memory, we do not
> > differentiate
> > > > > >>>>>>>>> direct / native memory usage. That means the configured
> > value
> > > > for
> > > > > these two
> > > > > >>>>>>>>> options could be a mixture of direct / native memory.
> Since
> > > we
> > > > > do not know
> > > > > >>>>>>>>> the portion of direct memory out of the configured value,
> > we
> > > > have
> > > > > >>>>>>>>> to conservatively account it all into the max direct
> memory
> > > > size
> > > > > limit.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total
> size
> > > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > > MaxDirectMemory?
> > > > > It is a
> > > > > >>>>>>>>> difficult parameter, and the main reason to set it was
> (if
> > I
> > > > > recall
> > > > > >>>>>>>>> correctly) to trigger GC based on direct memory
> allocation
> > > (to
> > > > > free heap
> > > > > >>>>>>>>> structures that then in turn release direct memory). If
> the
> > > > > limit is
> > > > > >>>>>>>>> anyways too high (because we also count native memory in
> > > there)
> > > > > such that
> > > > > >>>>>>>>> we can exceed the total process (container) memory, why
> do
> > we
> > > > > set it then?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I always also thought about it as providing more safety
> net
> > > for
> > > > > >>>>>>>>> direct allocations, but the GC aspect looks more important.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 for fixing docs for
> > > 'taskmanager.memory.task.off-heap.size’
> > > > > and
> > > > > >>>>>>>>> renaming to ‘direct' as this is what really happens
> > > > > >>>>>>>>> if we want to support the direct limit more exactly than now.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I also think that it is hard to separate direct / native
> > > memory
> > > > > >>>>>>>>> unless we introduce even more options.
> > > > > >>>>>>>>> If user wants to keep the direct limit tight to a certain
> > > value
> > > > > >>>>>>>>> but also use native memory outside of it,
> > > > > >>>>>>>>> she would have to increase something else, like JVM
> > overhead
> > > to
> > > > > >>>>>>>>> account for it and there is no other better way.
> > > > > >>>>>>>>> Having more options to account for the native memory
> > outside
> > > of
> > > > > >>>>>>>>> direct limit complicates things but can be introduced
> later
> > > if
> > > > > needed.
> > > > > >>>>>>>>>
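[As a rough sketch of the conservative derivation discussed above, assuming, as this thread suggests, that framework off-heap, task off-heap and the network/shuffle memory are all counted into the JVM flag; this reflects my reading of the discussion, not a definitive formula:]

    import org.apache.flink.configuration.MemorySize;

    public final class MaxDirectMemoryDerivation {

        /** Conservative -XX:MaxDirectMemorySize: all memory that *may* be direct
         *  is counted, even if parts of it end up being native allocations. */
        public static String maxDirectMemoryJvmArg(
                MemorySize frameworkOffHeap, MemorySize taskOffHeap, MemorySize networkMemory) {
            long maxDirectBytes = frameworkOffHeap.getBytes()
                    + taskOffHeap.getBytes()
                    + networkMemory.getBytes();
            return "-XX:MaxDirectMemorySize=" + maxDirectBytes;
        }
    }

[End of sketch.]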
> > > > > >>>>>>>>>   - What do you think about renaming
> > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> > > > > "taskmanager.memory.flink.size"
> > > > > >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> > > > > >>>>>>>>>> "taskmanager.memory.process.size" (or
> > > > > "taskmanager.memory.jvm.total"). I
> > > > > >>>>>>>>>> think these keys may be a bit less clumsy (dropping the
> > > > > "total-") without
> > > > > >>>>>>>>>> loss of expressiveness.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 on this.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1 as well. Also an option:
> > > > > >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> > > > > >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> > > > > >>>>>>>>> although it can be also mentioned in docs that we mean
> JVM
> > > > > process.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>> I'd be in favour of Stephan's proposal for the config keys
> as
> > > > > shorter
> > > > > >>>>>> is usually better and they are still descriptive enough.
> > Between
> > > > > >>>>>> "taskmanager.memory.process.size" and
> > > > > "taskmanager.memory.jvm.total" I
> > > > > >>>>>> would slightly favour the first variant.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>>>   - The network memory keys are now called
> > > > > >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle
> > is
> > > > > usually
> > > > > >>>>>>>>>> understood as a redistribution (random, or maybe by hash
> > of
> > > > > key). As an
> > > > > >>>>>>>>>> example, there are many discussions about "shuffle join
> > > versus
> > > > > broadcast
> > > > > >>>>>>>>>> join", where "shuffle" is the synonym for
> > "re-partitioning".
> > > > We
> > > > > use that
> > > > > >>>>>>>>>> memory however for all network operations, like forward
> > > pipes,
> > > > > broadcasts,
> > > > > >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> > name
> > > > > "*.shuffle.*"
> > > > > >>>>>>>>>> confusing, I am wondering if users would find that as
> > well.
> > > So
> > > > > throwing in
> > > > > >>>>>>>>>> the suggestion to call the options
> > > > > "taskmanager.memory.network.*".
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +0 on this one. I'm ok with
> "taskmanager.memory.network.*".
> > > On
> > > > > the
> > > > > >>>>>>>>> other hand, one can also argue that this part of memory
> is
> > > used
> > > > > by
> > > > > >>>>>>>>> ShuffleEnvironment, and the key
> > > "taskmanager.memory.shuffle.*"
> > > > > points more
> > > > > >>>>>>>>> directly to the shuffle service components.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be
> a
> > > bit
> > > > > >>>>>>>>> incorrect, because it does more than just shuffles.
> > The
> > > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is
> > not
> > > > too
> > > > > critical.
> > > > > >>>>>>>>> This isn't super high priority for me, but if we want to
> > > adjust
> > > > > it, better
> > > > > >>>>>>>>> earlier than later.*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> This is also a bit controversial topic for me. Indeed, we
> > > have
> > > > > >>>>>>>>> always used ’network’ for this concept of task data
> > shuffling
> > > > > over the
> > > > > >>>>>>>>> network and this can confuse existing users.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On the other hand for the new users and in a long term,
> > > > ’network’
> > > > > >>>>>>>>> can mislead users into concluding that all network memory is
> > > managed
> > > > > by this
> > > > > >>>>>>>>> option.
> > > > > >>>>>>>>> Also other types of shuffle might not directly deal with
> > > > network
> > > > > >>>>>>>>> at all.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> By calling it shuffle, we were somewhat biased by
> > > understanding
> > > > > it
> > > > > >>>>>>>>> in term of map/reduce. This is rather an inter-task data
> > > > > exchange.
> > > > > >>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’
> or
> > > > > >>>>>>>>>
> > ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>  From a user's perspective I believe
> > > "taskmanager.memory.network"
> > > > > >>>>>> would be easier to understand as not everyone knows exactly
> > what
> > > > the
> > > > > >>>>>> shuffle service is. I see the point that it would be a bit
> > > > > imprecise as we
> > > > > >>>>>> can have different shuffle implementations but I would go
> with
> > > the
> > > > > ease of
> > > > > >>>>>> use/understanding here. Moreover, I think that we won't have
> > > many
> > > > > different
> > > > > >>>>>> shuffle service implementations in the foreseeable future.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>>>   - The descriptions for the
> > > > "taskmanager.memory.jvm-overhead.*"
> > > > > >>>>>>>>>> keys say that it also accounts for I/O direct memory,
> but
> > > the
> > > > > parameter is
> > > > > >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> True. Since we already have framework off-heap memory
> > > accounted
> > > > > >>>>>>>>> for ad hoc direct memory usages, accounting all of
> > > jvm-overhead
> > > > > also into
> > > > > >>>>>>>>> max direct memory limit seems not necessary. I would
> > suggest
> > > to
> > > > > remove "I/O
> > > > > >>>>>>>>> direct memory" from the description, and explicitly
> mention
> > > > that
> > > > > this
> > > > > >>>>>>>>> option does not account for direct memory and will not be
> > > > > accounted into
> > > > > >>>>>>>>> max direct memory limit. WDYT?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - Can make the new ConfigOptions strongly typed with
> the
> > > new
> > > > > >>>>>>>>>> configuration options. For example, directly use
> > MemorySize
> > > > > typed options.
> > > > > >>>>>>>>>> That makes validation automatic and saves us from
> breaking
> > > the
> > > > > options
> > > > > >>>>>>>>>> later.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> +1
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I would appreciate it.
> > > > > >>>>>>>>>
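[For reference, a memory-typed option could be declared roughly like the following sketch; it assumes the new typed ConfigOptions builder exposes a memoryType() step, and the key, default and description shown here are placeholders only:]

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public final class MemoryOptionsSketch {

        // Strongly typed option: parsing and validation happen in the config layer,
        // instead of passing raw size strings around.
        public static final ConfigOption<MemorySize> TASK_OFF_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.task.off-heap.size")
                        .memoryType()
                        .defaultValue(MemorySize.parse("0m"))
                        .withDescription(
                                "Task off-heap memory; covers both direct and native"
                                    + " allocations and is fully counted into MaxDirectMemorySize.");
    }

[End of sketch.]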
> > > > > >>>>>>>>> Also small side note, we extensively use MemorySize in
> logs
> > > now
> > > > > >>>>>>>>> but it might be not always readable as its string
> > > > representation
> > > > > is only in
> > > > > >>>>>>>>> bytes atm
> > > > > >>>>>>>>> and does not reduce it to kb/mb/etc in case of big bytes
> > > value.
> > > > > We
> > > > > >>>>>>>>> could have at least some .prettyPrint function to use in
> > > logs.
> > > > > >>>>>>>>> And .fromMegabytes/etc factory methods would improve code
> > > > > >>>>>>>>> readability instead of .parse(int + “m”).
> > > > > >>>>>>>>>
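[A tiny sketch of what such a pretty-print helper might look like; purely illustrative and not the existing MemorySize API:]

    public final class MemorySizePrettyPrint {

        private static final String[] UNITS = {"bytes", "kb", "mb", "gb", "tb"};

        /** Renders a byte count with the largest unit that divides it evenly. */
        public static String prettyPrint(long bytes) {
            int unit = 0;
            long value = bytes;
            while (unit < UNITS.length - 1 && value != 0 && value % 1024 == 0) {
                value /= 1024;
                unit++;
            }
            return value + " " + UNITS[unit];
        }
    }

[For example, prettyPrint(268435456) would yield "256 mb" instead of a raw byte count in the logs.]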
> > > > > >>>>>>>>> Best,
> > > > > >>>>>>>>> Andrey
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <sewen@apache.org
> >
> > > > wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Hi Xintong!
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Please find my answers inline:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>   - "taskmanager.memory.size" (old main config option)
> is
> > > > > >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size"
> which
> > > has
> > > > > a different
> > > > > >>>>>>>>>>> meaning in standalone setups. The old option did not
> > > subtract
> > > > > metaspace and
> > > > > >>>>>>>>>>> other overhead, while the new option does. That means
> > that
> > > > > with the default
> > > > > >>>>>>>>>>> config, standalone clusters get quite a bit less
> memory.
> > > > > (independent of
> > > > > >>>>>>>>>>> managed memory going off heap).
> > > > > >>>>>>>>>>>     I am wondering if we could interpret
> > > > > >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > > >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> > > line
> > > > > with the old
> > > > > >>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > > > >>>>>>>>>>>     The effect would be that the container size on
> > > Yarn/Mesos
> > > > > >>>>>>>>>>> increases, because from
> > > > "taskmanager.memory.total-flink.size",
> > > > > we need to
> > > > > >>>>>>>>>>> add overhead and metaspace to reach the total process
> > size,
> > > > > rather than
> > > > > >>>>>>>>>>> cutting off memory. But if we want, we could even
> adjust
> > > for
> > > > > that in the
> > > > > >>>>>>>>>>> active resource manager, getting full backwards
> > > compatibility
> > > > > on that part.
> > > > > >>>>>>>>>>>     Curious to hear more thoughts there.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Yes*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> I think the problem here is that the
> > > > > >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > > standalone setups
> > > > > >>>>>>>>>> and active yarn / mesos setups, and such different
> > > calculation
> > > > > logics and
> > > > > >>>>>>>>>> behaviors are exactly what we want to avoid with
> FLIP-49.
> > > > > Therefore, I'm
> > > > > >>>>>>>>>> not in favor of treating
> > > "taskmanager.memory.total-flink.size"
> > > > > differently
> > > > > >>>>>>>>>> for standalone and active setups.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I think what we really want is probably
> > > > > >>>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > > > options
> > > > > in
> > > > > >>>>>>>>>> different setups. How about we mark
> > "taskmanager.heap.size"
> > > as
> > > > > deprecated
> > > > > >>>>>>>>>> key for neither of
> "taskmanager.memory.total-process.size"
> > > and
> > > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse
> > it
> > > > (if
> > > > > explicitly
> > > > > >>>>>>>>>> configured) in startup scripts / active resource
> managers,
> > > and
> > > > > set the
> > > > > >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> > > scripts
> > > > > and
> > > > > >>>>>>>>>> "taskmanager.memory.total-process.size" in active
> resource
> > > > > managers (if the
> > > > > >>>>>>>>>> new config options are not configured). We can provide
> > util
> > > > > methods in
> > > > > >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep
> > all
> > > > the
> > > > > >>>>>>>>>> configuration logics at one place.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> > > > > >>>>>>>>> description was not very clear), so +1 for that
> mechanism*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> > pools
> > > > as
> > > > > a
> > > > > >>>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > > wondering if it is
> > > > > >>>>>>>>>>> the right approach here.
> > > > > >>>>>>>>>>>     For example: I started a relatively large JVM
> (large
> > > heap
> > > > > >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > > system
> > > > > tries to
> > > > > >>>>>>>>>>> reserve an additional 6GB for managed memory which is
> > more
> > > > > than there is
> > > > > >>>>>>>>>>> memory left. When you see the error that no memory
> could
> > be
> > > > > allocated, you
> > > > > >>>>>>>>>>> need to understand the magic of how this is derived.
> > > > > >>>>>>>>>>>     I am trying to think about this from the
> perspective
> > of
> > > > > >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is
> > close
> > > > to.
> > > > > >>>>>>>>>>>     When starting Flink out of a running process, we
> > cannot
> > > > > >>>>>>>>>>> assume that we are the only users of that process and
> > that
> > > we
> > > > > can mold the
> > > > > >>>>>>>>>>> process to our demands. I think a fix value for managed
> > > > memory
> > > > > and network
> > > > > >>>>>>>>>>> memory would feel more natural in such a setup than a
> > > > > mechanism that is
> > > > > >>>>>>>>>>> tailored towards exclusive use of the process.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Cool, let's also see what Andrey and Till think
> here.*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>   - Some off-heap memory goes into direct memory, some
> > does
> > > > not.
> > > > > >>>>>>>>>>> This confused me a bit. For example
> > > > > >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> > > into
> > > > > >>>>>>>>>>> MaxDirectMemory while
> > > "taskmanager.memory.task.off-heap.size"
> > > > > is counted as
> > > > > >>>>>>>>>>> native memory. Maybe we should rename the keys to
> reflect
> > > > > that. There is no
> > > > > >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > > >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > > >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is
> also
> > > > > >>>>>>>>>> accounted in the max direct memory size limit. The
> > confusion
> > > > > probably comes
> > > > > >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > > explicitly mentioned
> > > > > >>>>>>>>>> that in its description while
> > > > > "taskmanager.memory.task.off-heap.size"
> > > > > >>>>>>>>>> didn't. That's actually because the framework off-heap
> > > memory
> > > > > is introduced
> > > > > >>>>>>>>>> later in a separate commit. We should fix that.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> For framework / task off-heap memory, we do not
> > > differentiate
> > > > > >>>>>>>>>> direct / native memory usage. That means the configure
> > value
> > > > > for these two
> > > > > >>>>>>>>>> options could be a mixture of direct / native memory.
> > Since
> > > we
> > > > > do not know
> > > > > >>>>>>>>>> the portion of direct memory out of the configured
> value,
> > we
> > > > > have
> > > > > >>>>>>>>>> to conservatively account it all into the max direct
> > memory
> > > > > size limit.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total
> size
> > > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > > MaxDirectMemory?
> > > > > It is a
> > > > > >>>>>>>>> difficult parameter, and the main reason to set it was
> (if
> > I
> > > > > recall
> > > > > >>>>>>>>> correctly) to trigger GC based on direct memory
> allocation
> > > (to
> > > > > free heap
> > > > > >>>>>>>>> structures that then in turn release direct memory). If
> the
> > > > > limit is
> > > > > >>>>>>>>> anyways too high (because we also count native memory in
> > > there)
> > > > > such that
> > > > > >>>>>>>>> we can exceed the total process (container) memory, why
> do
> > we
> > > > > set it then?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   - The network memory keys are now called
> > > > > >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge,
> shuffle
> > is
> > > > > usually
> > > > > >>>>>>>>>>> understood as a redistribution (random, or maybe by
> hash
> > of
> > > > > key). As an
> > > > > >>>>>>>>>>> example, there are many discussions about "shuffle join
> > > > versus
> > > > > broadcast
> > > > > >>>>>>>>>>> join", where "shuffle" is the synonym for
> > > "re-partitioning".
> > > > > We use that
> > > > > >>>>>>>>>>> memory however for all network operations, like forward
> > > > pipes,
> > > > > broadcasts,
> > > > > >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> > > name
> > > > > "*.shuffle.*"
> > > > > >>>>>>>>>>> confusing, I am wondering if users would find that as
> > well.
> > > > So
> > > > > throwing in
> > > > > >>>>>>>>>>> the suggestion to call the options
> > > > > "taskmanager.memory.network.*".
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +0 on this one. I'm ok with
> > "taskmanager.memory.network.*".
> > > On
> > > > > >>>>>>>>>> the other hand, one can also argue that this part of
> > memory
> > > is
> > > > > used by
> > > > > >>>>>>>>>> ShuffleEnvironment, and the key
> > > "taskmanager.memory.shuffle.*"
> > > > > points more
> > > > > >>>>>>>>>> directly to the shuffle service components.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be
> a
> > > bit
> > > > > >>>>>>>>> incorrect, because it is doing not only shuffles as well.
> > The
> > > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is
> > not
> > > > too
> > > > > critical.
> > > > > >>>>>>>>> This isn't super high priority for me, but if we want to
> > > adjust
> > > > > it, better
> > > > > >>>>>>>>> earlier than later.*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>    - The descriptions for the
> > > > "taskmanager.memory.jvm-overhead.*"
> > > > > >>>>>>>>> keys say that it also accounts for I/O direct memory, but
> > the
> > > > > parameter is
> > > > > >>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> True. Since we already have framework off-heap memory
> > > > accounted
> > > > > >>>>>>>>>> for ad hoc direct memory usages, accounting all of
> > > > jvm-overhead
> > > > > also into
> > > > > >>>>>>>>>> max direct memory limit seems not necessary. I would
> > suggest
> > > > to
> > > > > remove "I/O
> > > > > >>>>>>>>>> direct memory" from the description, and explicitly
> > mention
> > > > > that this
> > > > > >>>>>>>>>> option does not account for direct memory and will not
> be
> > > > > accounted into
> > > > > >>>>>>>>>> max direct memory limit. WDYT?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Sounds good. *
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with
> the
> > > new
> > > > > >>>>>>>>>>> configuration options. For example, directly use
> > MemorySize
> > > > > typed options.
> > > > > >>>>>>>>>>> That makes validation automatic and saves us from
> > breaking
> > > > the
> > > > > options
> > > > > >>>>>>>>>>> later.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Best,
> > > > > >>>>>>>>> Stephan
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Till Rohrmann <tr...@apache.org>.
I'm also slightly in favour of the flink.size instead of process.size. If
Xintong is ok with this, then we should update the flink-conf.yaml in the
repo.

Cheers,
Till
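
[To make the outcome concrete, the default flink-conf.yaml entry could then look roughly like the following sketch; the values are illustrative only, the actual default size is still to be decided:]

    # Total Flink memory (everything except JVM metaspace and overhead).
    # Illustrative value only.
    taskmanager.memory.flink.size: 1280m

    # Alternative: configure the full JVM process size instead (mainly relevant
    # for containerized setups). Only one of the two should be set.
    # taskmanager.memory.process.size: 1568m

[End of sketch.]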

On Mon, Jan 6, 2020 at 4:09 PM Andrey Zagrebin <az...@apache.org> wrote:

> Thank you for more explanation Stephan and feedback Jingsong,
>
> I see the point now.
> ‘taskmanager.memory.flink.size’ is indeed simpler to understand for the
> newcomers in the default config because it includes fewer types of memory to
> consider.
> Most of the time beginners will think about heap size and maybe state size
> to tweak in the try-out jobs.
> ‘taskmanager.memory.flink.size’ is better scoped for those types of memory.
>
> The process memory forces users to understand the total memory consumption
> which is more important for the next steps and will probably require
> reading the docs in more depth anyways.
> I agree if we decide for ‘flink.size’, it is worth mentioning a pointer to
> ‘process.size' in its comment as an alternative to it.
>
> I am ok with ‘flink.size’ in the default config.
>
> Best,
> Andrey
>
> On Mon, Dec 30, 2019 at 5:13 AM Jingsong Li <ji...@gmail.com>
> wrote:
>
> > Thank you for your wonderful discussion.
> >
> > +1 for set "taskmanager.memory.flink.size" in the default config.
> > Maybe we can write and explain "taskmanager.memory.process.size" in the
> > comments.
> >
> > Most of the time,
> > - "trying out" users is less in-depth users and use standalone mode.
> > - Production users use active setups.
> >
> > The default config is very important for "trying out".
> > When trying out, a novice Java user of Flink wants to configure how much
> > memory his standalone process should get. In fact, it's hard for him
> > to realize the JVM overhead, and there is no need to make him aware of it.
> >
> > > I don't think it is realistic that users set the process memory to full
> > machine memory. There is a lot on the machine as well in most cases.
> >
> > +1, We often run Flink in a less clean environment, such as an
> environment
> > that has HDFS. Whether we are testing or producing, we will not use all the
> > memory of the machine, such as always leaving some memory for PageCache.
> >
> > Best,
> > Jingsong Lee
> >
> > On Sat, Dec 28, 2019 at 7:02 PM Stephan Ewen <se...@apache.org> wrote:
> >
> > > "taskmanager.memory.flink.size" in the default config has a few
> > advantages.
> > >
> > >  - The value in the default config needs to be suitable for "trying
> out"
> > > Flink, for a good "getting started" experience.
> > >
> > >   - For trying out Flink, standalone is the most common entry point
> > (except
> > > running in IDE).
> > >
> > >   - In standalone setup, from total process memory, we subtract quite a
> > bit
> > > before we arrive at the usable memory. We also subtract managed memory
> > from
> > > the heap now. I fear we might end up at a heap that becomes so small
> that
> > > it makes for a bad "getting started" experience.
> > >
> > >   - I don't think it is realistic that users set the process memory to
> > full
> > > machine memory. There is a lot on the machine as well in most cases.
> > >
> > >   - In the JVM world, users are used to configuring the heap size and
> > know
> > > that there is additional memory overhead. The
> > > "taskmanager.memory.flink.size" option fits well with that mindset.
> > >
> > >   - Once you start to think about the total process memory of Yarn
> > > containers, you are already past the getting-started phase and on the
> > > tuning phase.
> > >
> > >
> > > On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <az...@apache.org>
> > wrote:
> > >
> > > > Thanks for the summary, Xintong! It makes sense to me.
> > > >
> > > > How about putting "taskmanager.memory.flink.size" in the
> configuration?
> > > > > Then new downloaded Flink behaves similar to the previous
> Standalone
> > > > setups.
> > > > > If someone upgrades the binaries, but re-uses their old
> > configuration,
> > > > > then they get the compatibility as discussed previously.
> > > > > We used that approach previously with the fine-grained failover
> > > recovery.
> > > >
> > > >
> > > >
> > > > > I'm trying to understand why "taskmanager.memory.flink.size" rather
> > > than
> > > > > "taskmanager.memory.process.size" in the default flink-conf.yaml.
> Or
> > > put
> > > > it
> > > > > another way, why do we want the new downloaded Flink behaves
> similar
> > to
> > > > > previous Standalone setups rather than previous active mode setups?
> > Is
> > > > > there any special reason that I overlooked, which makes backwards
> > > > > compatibility for standalone setups more important than for active
> > > > setups?
> > > > > IMO, "taskmanager.memory.process.size" is easier for the new
> comers.
> > > For
> > > > > standalone setups, users can simply configure it to their machines'
> > > > > available memory size, without needing to worry about leaving
> enough
> > > > space
> > > > > for JVM overehead / metaspace. For containerized setups, it's more
> > > > > predictable how many memory the containers / Flink could use, which
> > is
> > > > more
> > > > > friendly for users to manage their resource quota.
> > > > > Therefore, unless there is anything I overlooked, I'm in favor of
> > > putting
> > > > > "taskmanager.memory.process.size" rather than
> > > > > "taskmanager.memory.flink.size" in the default configuration.
> > > >
> > > >
> > > > I agree that having "taskmanager.memory.process.size" in default
> config
> > > > should be easier to understand and tweak for the new users because it
> > is
> > > > just what they are ready to spend for Flink application.
> > > > The problem is when users upgrade Flink and use the new default
> > > > configuration then the behaviour can change:
> > > > - either if we put process memory then Flink memory shrinks and the
> new
> > > > default option contradicts their previous understanding
> > > > - or if we put Flink memory then larger container is requested.
> > > > The shrinking of memory sounds more implicit and worse. The increased
> > > > container request will just fail in the worst case and the memory
> setup
> > > can
> > > > be revisited.
> > > > We could increase the default "taskmanager.memory.process.size" to
> > better
> > > > align it with the previous default setup for standalone
> > > > but this would not remove the possible confusion problem for the old
> > > users,
> > > > on the other hand the option is new and we can add a comment how to
> > > migrate
> > > > from the old one.
> > > >
> > > > All in all, now I also tend to have "taskmanager.memory.process.size"
> > in
> > > > the default config unless there are more reasons for having less
> > > confusion
> > > > for the old standalone users.
> > > >
> > > > Best,
> > > > Andrey
> > > >
> > > > On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <to...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > It seems we have already reached consensus on most of the issues.
> Thanks
> > > > > everyone for the good discussion.
> > > > >
> > > > > While there are still open questions under discussion, I'd like to
> > > > > summarize the discussion so far, and list the action items that we
> > > > already
> > > > > reached consensus on. In this way, we can already start working on
> these
> > > > items
> > > > > while discussing the remaining open questions. Please correct me if
> > the
> > > > > summary does not reflect your argument.
> > > > >
> > > > > *Action Items that we have already reached consensus on:*
> > > > >
> > > > >    - Map "taskmanager.heap.size" to "taskmanager.memory.flink.size"
> > for
> > > > >    standalone setups, and "taskmanager.memory.process.size" for
> > active
> > > > setups.
> > > > >    (should be mentioned in release notes)
> > > > >    - If not explicitly configured, MiniCluster should have fixed
> > > default
> > > > >    network and managed memory sizes. (should be mentioned in docs &
> > > > release
> > > > >    notes)
> > > > >    - Change the memory config options' type from String to
> MemorySize
> > > > >    - Change the config option key
> > "taskmanager.memory.total-flink.size"
> > > > >    to "taskmanager.memory.flink.size"
> > > > >    - Change the config option key
> > > "taskmanager.memory.total-process.size"
> > > > >    to "taskmanager.memory.process.size"
> > > > >    - Update descriptions for
> > > "taskmanager.memory.framework.off-heap.size"
> > > > >    and "taskmanager.memory.task.off-heap.size" to explicitly state
> > that
> > > > >       - Both direct and native memory are accounted
> > > > >       - Will be fully counted into MaxDirectMemorySize
> > > > >    - Update descriptions for
> > > > >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove
> > "I/O
> > > > direct
> > > > >    memory" and explicitly state that it's not counted into
> > > > MaxDirectMemorySize
> > > > >    - Print MemorySize with proper unit. (non-blocker for 1.10)
> > > > >
> > > > >
> > > > > *Questions that are still open:*
> > > > >
> > > > >    - Which config option do we put in the default flink-conf.yaml?
> > > > >       - "taskmanager.memory.flink.size"
> > > > >       - "taskmanager.memory.process.size"
> > > > >       - Deprecated "taskmanager.heap.size"
> > > > >    - What are the proper keys for network / shuffle memory
> > > > >       - "taskmanager.memory.shuffle.*"
> > > > >       - "taskmanager.memory.network.*"
> > > > >       - "taskmanager.network.memory.*"
> > > > >
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <
> tonysong820@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> How about putting "taskmanager.memory.flink.size" in the
> > > configuration?
> > > > >>> Then new downloaded Flink behaves similar to the previous
> > Standalone
> > > > setups.
> > > > >>> If someone upgrades the binaries, but re-uses their old
> > > configuration,
> > > > >>> then they get the compatibility as discussed previously.
> > > > >>> We used that approach previously with the fine-grained failover
> > > > recovery.
> > > > >>
> > > > >>
> > > > >> I'm trying to understand why "taskmanager.memory.flink.size"
> rather
> > > than
> > > > >> "taskmanager.memory.process.size" in the default flink-conf.yaml.
> Or
> > > > put it
> > > > >> another way, why do we want the new downloaded Flink behaves
> similar
> > > to
> > > > >> previous Standalone setups rather than previous active mode
> setups?
> > Is
> > > > >> there any special reason that I overlooked, which makes backwards
> > > > >> compatibility for standalone setups more important than for active
> > > > setups?
> > > > >>
> > > > >> IMO, "taskmanager.memory.process.size" is easier for the new
> comers.
> > > For
> > > > >> standalone setups, users can simply configure it to their
> machines'
> > > > >> available memory size, without needing to worry about leaving
> enough
> > > > space
> > > > >> for JVM overehead / metaspace. For containerized setups, it's more
> > > > >> predictable how many memory the containers / Flink could use,
> which
> > is
> > > > more
> > > > >> friendly for users to manage their resource quota.
> > > > >>
> > > > >> Therefore, unless there is anything I overlooked, I'm in favor of
> > > > putting
> > > > >> "taskmanager.memory.process.size" rather than
> > > > >> "taskmanager.memory.flink.size" in the default configuration.
> > > > >>
> > > > >>
> > > > >> Thank you~
> > > > >>
> > > > >> Xintong Song
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <
> > azagrebin@apache.org
> > > >
> > > > >> wrote:
> > > > >>
> > > > >>> How about putting "taskmanager.memory.flink.size" in the
> > > configuration?
> > > > >>>> Then new downloaded Flink behaves similar to the previous
> > Standalone
> > > > setups.
> > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > configuration,
> > > > >>>> then they get the compatibility as discussed previously.
> > > > >>>> We used that approach previously with the fine-grained failover
> > > > >>>> recovery.
> > > > >>>
> > > > >>> +1, this sounds like a good compromise.
> > > > >>>
> > > > >>> +1 to not have more options for off-heap memory, that would get
> > > > >>>> confusing fast. We can keep the name "off-heap" for both task
> and
> > > > >>>> framework, as long as they mean the same thing: native plus
> > direct,
> > > > and
> > > > >>>> fully counted into MaxDirectMemory. I would suggest to update
> the
> > > > config
> > > > >>>> descriptions then to reflect that.
> > > > >>>>
> > > > >>> True, this should be explained in the config descriptions.
> > > > >>>
> > > > >>> looks good to me
> > > > >>>
> > > > >>> From a user's perspective I believe "taskmanager.memory.network"
> > > would
> > > > >>>> be easier to understand as not everyone knows exactly what the
> > > shuffle
> > > > >>>> service is. I see the point that it would be a bit imprecise as
> we
> > > > can have
> > > > >>>> different shuffle implementations but I would go with the ease
> of
> > > > >>>> use/understanding here. Moreover, I think that we won't have
> many
> > > > different
> > > > >>>> shuffle service implementations in the foreseeable future.
> > > > >>>
> > > > >>> I agree that if we cannot find any other convincing names for the
> > > > >>> options, we should keep what we already have and change it if the
> > > > >>> alternative is convincing enough.
> > > > >>> The question is also whether we still want to rename it because
> it
> > > was
> > > > >>> "taskmanager.network.*memory*.*" in 1.9 but
> > > > "taskmanager.*memory*.network.*"
> > > > >>> is more aligned with other new memory option names.
> > > > >>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
> > > > >>>
> > > > >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org>
> > > wrote:
> > > > >>>
> > > > >>>> How about putting "taskmanager.memory.flink.size" in the
> > > > configuration?
> > > > >>>> Then new downloaded Flink behaves similar to the previous
> > Standalone
> > > > setups.
> > > > >>>>
> > > > >>>> If someone upgrades the binaries, but re-uses their old
> > > configuration,
> > > > >>>> then they get the compatibility as discussed previously.
> > > > >>>> We used that approach previously with the fine-grained failover
> > > > >>>> recovery.
> > > > >>>>
> > > > >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <
> > tonysong820@gmail.com
> > > >
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> +1 to not have more options for off-heap memory, that would get
> > > > >>>>>> confusing fast. We can keep the name "off-heap" for both task
> > and
> > > > >>>>>> framework, as long as they mean the same thing: native plus
> > > direct,
> > > > and
> > > > >>>>>> fully counted into MaxDirectMemory. I would suggest to update
> > the
> > > > config
> > > > >>>>>> descriptions then to reflect that.
> > > > >>>>>>
> > > > >>>>> True, this should be explained in the config descriptions.
> > > > >>>>>
> > > > >>>>> Which configuration option will be set in Flink's default
> > > > >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour
> > it
> > > > would
> > > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > > option.
> > > > If we
> > > > >>>>>> are ok with Yarn requesting slightly larger containers, then
> it
> > > > could also
> > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > >>>>>>
> > > > >>>>> Good point. Currently, we have
> > > > >>>>> "taskmanager.memory.total-process.size". In order to preserve
> the
> > > > previous
> > > > >>>>> behavior, we need to have "taskmanager.heap.size" so it can be
> > > > mapped to
> > > > >>>>> different new options in standalone / active setups.
> > > > >>>>> I think we can have the deprecated "taskmanager.heap.size" in
> the
> > > > >>>>> default flink-conf.yaml, and also have the
> > > > >>>>> new "taskmanager.memory.total-process.size" in a commented
> line.
> > We
> > > > can
> > > > >>>>> explain how the deprecated config option behaves differently in
> > the
> > > > >>>>> comments, so that user can switch to the new config options if
> > they
> > > > want to.
> > > > >>>>>
> > > > >>>>> Thank you~
> > > > >>>>>
> > > > >>>>> Xintong Song
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <
> > > trohrmann@apache.org>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Thanks for the feedback and good discussion everyone. I left
> > some
> > > > >>>>>> comments inline.
> > > > >>>>>>
> > > > >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <
> sewen@apache.org>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> +1 to not have more options for off-heap memory, that would
> get
> > > > >>>>>>> confusing fast. We can keep the name "off-heap" for both task
> > and
> > > > >>>>>>> framework, as long as they mean the same thing: native plus
> > > > direct, and
> > > > >>>>>>> fully counted into MaxDirectMemory. I would suggest to update
> > the
> > > > config
> > > > >>>>>>> descriptions then to reflect that.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> > > > tonysong820@gmail.com>
> > > > >>>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Regarding the framework/task direct/native memory options, I
> > > tend
> > > > >>>>>>>> to think about it differently. I'm in favor of keeping the
> > > > "*.off-heap.size" for the
> > > > >>>>>>>> config option keys.
> > > > >>>>>>>>
> > > > >>>>>>>>    - It's not necessary IMO to expose the difference
> concepts
> > of
> > > > >>>>>>>>    direct / native memory to the users.
> > > > >>>>>>>>    - I would avoid introducing more options for native
> memory
> > if
> > > > >>>>>>>>    possible. Taking fine grained resource management and
> > dynamic
> > > > slot
> > > > >>>>>>>>    allocation into consideration, that also means introduce
> > more
> > > > fields into
> > > > >>>>>>>>    ResourceSpec / ResourceProfile.
> > > > >>>>>>>>    - My gut feeling is that having a relatively loose
> > > > >>>>>>>>    MaxDirectMemory should not be a big problem.
> > > > >>>>>>>>    - In most cases, the task / framework off-heap memory
> > should
> > > be
> > > > >>>>>>>>       mainly (if not all) direct memory, so the difference
> > > > between derived
> > > > >>>>>>>>       MaxDirectMemory and the ideal direct memory limit
> should
> > > > not be too much.
> > > > >>>>>>>>       - We do not have a good way to know the exact size
> > needed
> > > > >>>>>>>>       for jvm overhead / metaspace and framework / task
> > off-heap
> > > > memory, thus
> > > > >>>>>>>>       having to conservatively reserve slightly more memory
> > than
> > > > what actually
> > > > >>>>>>>>       needed. Such reserved but not used memory can cover
> for
> > > the
> > > > small
> > > > >>>>>>>>       MaxDirectMemory error.
> > > > >>>>>>>>       -
> > > > >>>>>>>>       - MaxDirectMemory is not the only way to trigger full
> > gc.
> > > We
> > > > >>>>>>>>       still have heap activities that can also trigger the gc.
> > > > >>>>>>>>
> > > > >>>>>>>> Regarding the memory type config options, I've looked into
> the
> > > > >>>>>>>> latest ConfigOptions changes. I think it shouldn't be too
> > > > complicated to
> > > > >>>>>>>> change the config options to use memory type, and I can
> handle
> > > it
> > > > maybe
> > > > >>>>>>>> during your vacations.
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Also agree with improving MemorySize logging and parsing.
> This
> > > > >>>>>>>> should not be a blocker that has to be done in 1.10. I would
> > say
> > > > we finish
> > > > >>>>>>>> other works (testability, documentation and those discussed
> in
> > > > this thread)
> > > > >>>>>>>> first, and get to this only if we have time.
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Thank you~
> > > > >>>>>>>>
> > > > >>>>>>>> Xintong Song
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> > > > >>>>>>>> azagrebin.apache@gmail.com> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi Stephan and Xintong,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > > > replaced
> > > > >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> > > > different meaning in
> > > > >>>>>>>>>> standalone setups. The old option did not subtract
> metaspace
> > > > and other
> > > > >>>>>>>>>> overhead, while the new option does. That means that with
> > the
> > > > default
> > > > >>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > > > (independent of
> > > > >>>>>>>>>> managed memory going off heap).
> > > > >>>>>>>>>>     I am wondering if we could interpret
> > > > >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> line
> > > > with the old
> > > > >>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > > >>>>>>>>>>     The effect would be that the container size on
> > Yarn/Mesos
> > > > >>>>>>>>>> increases, because from
> > "taskmanager.memory.total-flink.size",
> > > > we need to
> > > > >>>>>>>>>> add overhead and metaspace to reach the total process
> size,
> > > > rather than
> > > > >>>>>>>>>> cutting off memory. But if we want, we could even adjust
> for
> > > > that in the
> > > > >>>>>>>>>> active resource manager, getting full backwards
> > compatibility
> > > > on that part.
> > > > >>>>>>>>>>     Curious to hear more thoughts there.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > >>>>>>>>>
> > > > >>>>>>>>> I think the problem here is that the
> > > > >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > standalone setups
> > > > >>>>>>>>> and active yarn / mesos setups, and such different
> > calculation
> > > > logics and
> > > > >>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > > > Therefore, I'm
> > > > >>>>>>>>> not in favor of treating
> > "taskmanager.memory.total-flink.size"
> > > > differently
> > > > >>>>>>>>> for standalone and active setups.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I think what we really want is probably
> > > > >>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > options
> > > > in
> > > > >>>>>>>>> different setups. How about we mark "taskmanager.heap.size"
> > as
> > > > deprecated
> > > > >>>>>>>>> key for neither of "taskmanager.memory.total-process.size"
> > and
> > > > >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it
> > (if
> > > > explicitly
> > > > >>>>>>>>> configured) in startup scripts / active resource managers,
> > and
> > > > set the
> > > > >>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> scripts
> > > and
> > > > >>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > > > managers (if the
> > > > >>>>>>>>> new config options are not configured). We can provide util
> > > > methods in
> > > > >>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all
> > the
> > > > >>>>>>>>> configuration logics at one place.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> I agree that the problem is that the legacy option
> > > > >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> > > > standalone/container.
> > > > >>>>>>>>> We had it initially falling back to
> > > > 'taskmanager.memory.total-flink.size’
> > > > >>>>>>>>> but I changed that to align it with container cut-off. Now
> I
> > > see
> > > > it changes
> > > > >>>>>>>>> standalone setup then.
> > > > >>>>>>>>> +1 for supporting its backwards compatibility differently
> for
> > > > >>>>>>>>> standalone/container setups.
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>> Which configuration option will be set in Flink's default
> > > > >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour
> > it
> > > > would
> > > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > > option.
> > > > If we
> > > > >>>>>> are ok with Yarn requesting slightly larger containers, then
> it
> > > > could also
> > > > >>>>>> be taskmanager.memory.total-flink.size.
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> pools
> > > as
> > > > a
> > > > >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > wondering if it is
> > > > >>>>>>>>>> the right approach here.
> > > > >>>>>>>>>>     For example: I started a relatively large JVM (large
> > heap
> > > > >>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> system
> > > > tries to
> > > > >>>>>>>>>> reserve an additional 6GB for managed memory which is more
> > > than
> > > > there is
> > > > >>>>>>>>>> memory left. When you see the error that no memory could
> be
> > > > allocated, you
> > > > >>>>>>>>>> need to understand the magic of how this is derived.
> > > > >>>>>>>>>>     I am trying to think about this from the perspective
> of
> > > > using
> > > > >>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
> > > > >>>>>>>>>>     When starting Flink out of a running process, we
> cannot
> > > > >>>>>>>>>> assume that we are the only users of that process and that
> > we
> > > > can mold the
> > > > >>>>>>>>>> process to our demands. I think a fix value for managed
> > memory
> > > > and network
> > > > >>>>>>>>>> memory would feel more natural in such a setup than a
> > > mechanism
> > > > that is
> > > > >>>>>>>>>> tailored towards exclusive use of the process.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> also +1 for that, if user has not specified any main
> options
> > to
> > > > >>>>>>>>> derive memory. We should also log this fixing of managed /
> > > > shuffle memory.
> > > > >>>>>>>>> And just noticed, we could also sanity check framework and
> if
> > > > >>>>>>>>> explicitly configured task heap against available JVM heap,
> > and
> > > > at least
> > > > >>>>>>>>> log inconsistencies.
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - Some off-heap memory goes into direct memory, some does
> > > not.
> > > > >>>>>>>>>> This confused me a bit. For example
> > > > >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> into
> > > > >>>>>>>>>> MaxDirectMemory while
> > "taskmanager.memory.task.off-heap.size"
> > > > is counted as
> > > > >>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> > > that.
> > > > There is no
> > > > >>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > > > >>>>>>>>> accounted in the max direct memory size limit. The
> confusion
> > > > probably comes
> > > > >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > explicitly mentioned
> > > > >>>>>>>>> that in its description while
> > > > "taskmanager.memory.task.off-heap.size"
> > > > >>>>>>>>> didn't. That's actually because the framework off-heap
> memory
> > > is
> > > > introduced
> > > > >>>>>>>>> later in a separate commit. We should fix that.
> > > > >>>>>>>>>
> > > > >>>>>>>>> For framework / task off-heap memory, we do not
> differentiate
> > > > >>>>>>>>> direct / native memory usage. That means the configure
> value
> > > for
> > > > these two
> > > > >>>>>>>>> options could be a mixture of direct / native memory. Since
> > we
> > > > do not know
> > > > >>>>>>>>> the portion of direct memory out of the configured value,
> we
> > > have
> > > > >>>>>>>>> to conservatively account it all into the max direct memory
> > > size
> > > > limit.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > MaxDirectMemory?
> > > > It is a
> > > > >>>>>>>>> difficult parameter, and the main reason to set it was (if
> I
> > > > recall
> > > > >>>>>>>>> correctly) to trigger GC based on direct memory allocation
> > (to
> > > > free heap
> > > > >>>>>>>>> structures that then in turn release direct memory). If the
> > > > limit is
> > > > >>>>>>>>> anyways too high (because we also count native memory in
> > there)
> > > > such that
> > > > >>>>>>>>> we can exceed the total process (container) memory, why do
> we
> > > > set it then?*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> I always also thought about it as providing more safety net
> > for
> > > > >>>>>>>>> direct allocations but GC thing looks more important.
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1 for fixing docs for
> > 'taskmanager.memory.task.off-heap.size’
> > > > and
> > > > >>>>>>>>> renaming to ‘direct' as this is what really happens
> > > > >>>>>>>>> if we want to support direct limit more exact than now.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I also think that it is hard to separate direct / native
> > memory
> > > > >>>>>>>>> unless we introduce even more options.
> > > > >>>>>>>>> If user wants to keep the direct limit tight to a certain
> > value
> > > > >>>>>>>>> but also use native memory outside of it,
> > > > >>>>>>>>> she would have to increase something else, like JVM
> overhead
> > to
> > > > >>>>>>>>> account for it and there is no other better way.
> > > > >>>>>>>>> Having more options to account for the native memory
> outside
> > of
> > > > >>>>>>>>> direct limit complicates things but can be introduced later
> > if
> > > > needed.
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - What do you think about renaming
> > > > >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> > > > "taskmanager.memory.flink.size"
> > > > >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> > > > >>>>>>>>>> "taskmanager.memory.process.size" (or
> > > > "taskmanager.memory.jvm.total"). I
> > > > >>>>>>>>>> think these keys may be a bit less clumsy (dropping the
> > > > "total-") without
> > > > >>>>>>>>>> loss of expressiveness.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1 on this.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1 as well. Also an option:
> > > > >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> > > > >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> > > > >>>>>>>>> although it can be also mentioned in docs that we mean JVM
> > > > process.
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>> I'd be in favour of Stephan's proposal for the config keys as
> > > > shorter
> > > > >>>>>> is usually better and they are still descriptive enough.
> Between
> > > > >>>>>> "taskmanager.memory.process.size" and
> > > > "taskmanager.memory.jvm.total" I
> > > > >>>>>> would slightly favour the first variant.
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>>   - The network memory keys are now called
> > > > >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle
> is
> > > > usually
> > > > >>>>>>>>>> understood as a redistribution (random, or maybe by hash
> of
> > > > key). As an
> > > > >>>>>>>>>> example, there are many discussions about "shuffle join
> > versus
> > > > broadcast
> > > > >>>>>>>>>> join", where "shuffle" is the synonym for
> "re-partitioning".
> > > We
> > > > use that
> > > > >>>>>>>>>> memory however for all network operations, like forward
> > pipes,
> > > > broadcasts,
> > > > >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> name
> > > > "*.shuffle.*"
> > > > >>>>>>>>>> confusing, I am wondering if users would find that as
> well.
> > So
> > > > throwing in
> > > > >>>>>>>>>> the suggestion to call the options
> > > > "taskmanager.memory.network.*".
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*".
> > On
> > > > the
> > > > >>>>>>>>> other hand, one can also argue that this part of memory is
> > used
> > > > by
> > > > >>>>>>>>> ShuffleEnvironment, and the key
> > "taskmanager.memory.shuffle.*"
> > > > points more
> > > > >>>>>>>>> directly to the shuffle service components.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a
> > bit
> > > > >>>>>>>>> incorrect, because it is doing not only shuffles as well.
> The
> > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is
> not
> > > too
> > > > critical.
> > > > >>>>>>>>> This isn't super high priority for me, but if we want to
> > adjust
> > > > it, better
> > > > >>>>>>>>> earlier than later.*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> This is also a bit controversial topic for me. Indeed, we
> > have
> > > > >>>>>>>>> always used ’network’ for this concept of task data
> shuffling
> > > > over the
> > > > >>>>>>>>> network and this can confuse existing users.
> > > > >>>>>>>>>
> > > > >>>>>>>>> On the other hand for the new users and in a long term,
> > > ’network’
> > > > >>>>>>>>> can delude into a conclusion that all network memory is
> > managed
> > > > by this
> > > > >>>>>>>>> option.
> > > > >>>>>>>>> Also other types of shuffle might not directly deal with
> > > network
> > > > >>>>>>>>> at all.
> > > > >>>>>>>>>
> > > > >>>>>>>>> By calling it shuffle, we were somewhat biased by
> > understanding
> > > > it
> > > > >>>>>>>>> in term of map/reduce. This is rather an inter-task data
> > > > exchange.
> > > > >>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
> > > > >>>>>>>>>
> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>  From a user's perspective I believe
> > "taskmanager.memory.network"
> > > > >>>>>> would be easier to understand as not everyone knows exactly
> what
> > > the
> > > > >>>>>> shuffle service is. I see the point that it would be a bit
> > > > imprecise as we
> > > > >>>>>> can have different shuffle implementations but I would go with
> > the
> > > > ease of
> > > > >>>>>> use/understanding here. Moreover, I think that we won't have
> > many
> > > > different
> > > > >>>>>> shuffle service implementations in the foreseeable future.
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>>   - The descriptions for the
> > > "taskmanager.memory.jvm-overhead.*"
> > > > >>>>>>>>>> keys say that it also accounts for I/O direct memory, but
> > the
> > > > parameter is
> > > > >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> True. Since we already have framework off-heap memory
> > accounted
> > > > >>>>>>>>> for ad hoc direct memory usages, accounting all of
> > jvm-overhead
> > > > also into
> > > > >>>>>>>>> max direct memory limit seems not necessary. I would
> suggest
> > to
> > > > remove "I/O
> > > > >>>>>>>>> direct memory" from the description, and explicitly mention
> > > that
> > > > this
> > > > >>>>>>>>> option does not account for direct memory and will not be
> > > > accounted into
> > > > >>>>>>>>> max direct memory limit. WDYT?
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - Can make the new ConfigOptions strongly typed with the
> > new
> > > > >>>>>>>>>> configuration options. For example, directly use
> MemorySize
> > > > typed options.
> > > > >>>>>>>>>> That makes validation automatic and saves us from breaking
> > the
> > > > options
> > > > >>>>>>>>>> later.
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> +1
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> I would appreciate it.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Also small side note, we extensively use MemorySize in logs
> > now
> > > > >>>>>>>>> but it might be not always readable as its string
> > > representation
> > > > is only in
> > > > >>>>>>>>> bytes atm
> > > > >>>>>>>>> and does not reduce it to kb/mb/etc in case of big bytes
> > value.
> > > > We
> > > > >>>>>>>>> could have at least some .prettyPrint function to use in
> > logs.
> > > > >>>>>>>>> And .fromMegabytes/etc factory methods would improve code
> > > > >>>>>>>>> readability instead of .parse(int + “m”).
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best,
> > > > >>>>>>>>> Andrey
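
As a hypothetical sketch of the two helpers suggested in the side note above
(neither exists on MemorySize at this point; names and behaviour are only
illustrative):

    import org.apache.flink.configuration.MemorySize;

    public final class MemorySizeSketch {

        // Instead of MemorySize.parse(megabytes + "m").
        public static MemorySize fromMegabytes(long megabytes) {
            return new MemorySize(megabytes << 20);
        }

        // Renders the size with the largest unit that divides it evenly,
        // e.g. 1610612736 bytes -> "1536 mb" instead of raw bytes.
        public static String prettyPrint(MemorySize size) {
            String[] units = {"bytes", "kb", "mb", "gb", "tb"};
            long value = size.getBytes();
            int unit = 0;
            while (value != 0 && value % 1024 == 0 && unit < units.length - 1) {
                value /= 1024;
                unit++;
            }
            return value + " " + units[unit];
        }
    }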
> > > > >>>>>>>>>
> > > > >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org>
> > > wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> Hi Xintong!
> > > > >>>>>>>>>
> > > > >>>>>>>>> Please find my answers inline:
> > > > >>>>>>>>>
> > > > >>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > > > >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size" which
> > has
> > > > a different
> > > > >>>>>>>>>>> meaning in standalone setups. The old option did not
> > subtract
> > > > metaspace and
> > > > >>>>>>>>>>> other overhead, while the new option does. That means
> that
> > > > with the default
> > > > >>>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > > > (independent of
> > > > >>>>>>>>>>> managed memory going off heap).
> > > > >>>>>>>>>>>     I am wondering if we could interpret
> > > > >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > > >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> > line
> > > > with the old
> > > > >>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > > >>>>>>>>>>>     The effect would be that the container size on
> > Yarn/Mesos
> > > > >>>>>>>>>>> increases, because from
> > > "taskmanager.memory.total-flink.size",
> > > > we need to
> > > > >>>>>>>>>>> add overhead and metaspace to reach the total process
> size,
> > > > rather than
> > > > >>>>>>>>>>> cutting off memory. But if we want, we could even adjust
> > for
> > > > that in the
> > > > >>>>>>>>>>> active resource manager, getting full backwards
> > compatibility
> > > > on that part.
> > > > >>>>>>>>>>>     Curious to hear more thoughts there.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> Yes*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>> I think the problem here is that the
> > > > >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > > standalone setups
> > > > >>>>>>>>>> and active yarn / mesos setups, and such different
> > calculation
> > > > logics and
> > > > >>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > > > Therefore, I'm
> > > > >>>>>>>>>> not in favor of treating
> > "taskmanager.memory.total-flink.size"
> > > > differently
> > > > >>>>>>>>>> for standalone and active setups.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I think what we really want is probably
> > > > >>>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > > options
> > > > in
> > > > >>>>>>>>>> different setups. How about we mark
> "taskmanager.heap.size"
> > as
> > > > deprecated
> > > > >>>>>>>>>> key for neither of "taskmanager.memory.total-process.size"
> > and
> > > > >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse
> it
> > > (if
> > > > explicitly
> > > > >>>>>>>>>> configured) in startup scripts / active resource managers,
> > and
> > > > set the
> > > > >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> > scripts
> > > > and
> > > > >>>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > > > managers (if the
> > > > >>>>>>>>>> new config options are not configured). We can provide
> util
> > > > methods in
> > > > >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep
> all
> > > the
> > > > >>>>>>>>>> configuration logics at one place.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> > > > >>>>>>>>> description was not very clear), so +1 for that mechanism*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory
> pools
> > > as
> > > > a
> > > > >>>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > > wondering if it is
> > > > >>>>>>>>>>> the right approach here.
> > > > >>>>>>>>>>>     For example: I started a relatively large JVM (large
> > heap
> > > > >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> > system
> > > > tries to
> > > > >>>>>>>>>>> reserve an additional 6GB for managed memory which is
> more
> > > > than there is
> > > > >>>>>>>>>>> memory left. When you see the error that no memory could
> be
> > > > allocated, you
> > > > >>>>>>>>>>> need to understand the magic of how this is derived.
> > > > >>>>>>>>>>>     I am trying to think about this from the perspective
> of
> > > > >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is
> close
> > > to.
> > > > >>>>>>>>>>>     When starting Flink out of a running process, we
> cannot
> > > > >>>>>>>>>>> assume that we are the only users of that process and
> that
> > we
> > > > can mold the
> > > > >>>>>>>>>>> process to our demands. I think a fix value for managed
> > > memory
> > > > and network
> > > > >>>>>>>>>>> memory would feel more natural in such a setup than a
> > > > mechanism that is
> > > > >>>>>>>>>>> tailored towards exclusive use of the process.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>>   - Some off-heap memory goes into direct memory, some
> does
> > > not.
> > > > >>>>>>>>>>> This confused me a bit. For example
> > > > >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> > into
> > > > >>>>>>>>>>> MaxDirectMemory while
> > "taskmanager.memory.task.off-heap.size"
> > > > is counted as
> > > > >>>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> > > > that. There is no
> > > > >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > > >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > > >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > > > >>>>>>>>>> accounted in the max direct memory size limit. The
> confusion
> > > > probably comes
> > > > >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > > explicitly mentioned
> > > > >>>>>>>>>> that in its description while
> > > > "taskmanager.memory.task.off-heap.size"
> > > > >>>>>>>>>> didn't. That's actually because the framework off-heap
> > memory
> > > > is introduced
> > > > >>>>>>>>>> later in a separate commit. We should fix that.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> For framework / task off-heap memory, we do not
> > differentiate
> > > > >>>>>>>>>> direct / native memory usage. That means the configure
> value
> > > > for these two
> > > > >>>>>>>>>> options could be a mixture of direct / native memory.
> Since
> > we
> > > > do not know
> > > > >>>>>>>>>> the portion of direct memory out of the configured value,
> we
> > > > have
> > > > >>>>>>>>>> to conservatively account it all into the max direct
> memory
> > > > size limit.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> > > > >>>>>>>>> calculation, it is fine. But why do we then set
> > > MaxDirectMemory?
> > > > It is a
> > > > >>>>>>>>> difficult parameter, and the main reason to set it was (if
> I
> > > > recall
> > > > >>>>>>>>> correctly) to trigger GC based on direct memory allocation
> > (to
> > > > free heap
> > > > >>>>>>>>> structures that then in turn release direct memory). If the
> > > > limit is
> > > > >>>>>>>>> anyways too high (because we also count native memory in
> > there)
> > > > such that
> > > > >>>>>>>>> we can exceed the total process (container) memory, why do
> we
> > > > set it then?*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>   - The network memory keys are now called
> > > > >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle
> is
> > > > usually
> > > > >>>>>>>>>>> understood as a redistribution (random, or maybe by hash
> of
> > > > key). As an
> > > > >>>>>>>>>>> example, there are many discussions about "shuffle join
> > > versus
> > > > broadcast
> > > > >>>>>>>>>>> join", where "shuffle" is the synonym for
> > "re-partitioning".
> > > > We use that
> > > > >>>>>>>>>>> memory however for all network operations, like forward
> > > pipes,
> > > > broadcasts,
> > > > >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> > name
> > > > "*.shuffle.*"
> > > > >>>>>>>>>>> confusing, I am wondering if users would find that as
> well.
> > > So
> > > > throwing in
> > > > >>>>>>>>>>> the suggestion to call the options
> > > > "taskmanager.memory.network.*".
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> +0 on this one. I'm ok with
> "taskmanager.memory.network.*".
> > On
> > > > >>>>>>>>>> the other hand, one can also argue that this part of
> memory
> > is
> > > > used by
> > > > >>>>>>>>>> ShuffleEnvironment, and the key
> > "taskmanager.memory.shuffle.*"
> > > > points more
> > > > >>>>>>>>>> directly to the shuffle service components.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a
> > bit
> > > > >>>>>>>>> incorrect, because it is doing not only shuffles as well.
> The
> > > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is
> not
> > > too
> > > > critical.
> > > > >>>>>>>>> This isn't super high priority for me, but if we want to
> > adjust
> > > > it, better
> > > > >>>>>>>>> earlier than later.*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>    - The descriptions for the
> > > "taskmanager.memory.jvm-overhead.*"
> > > > >>>>>>>>> keys say that it also accounts for I/O direct memory, but
> the
> > > > parameter is
> > > > >>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > > >>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> True. Since we already have framework off-heap memory
> > > accounted
> > > > >>>>>>>>>> for ad hoc direct memory usages, accounting all of
> > > jvm-overhead
> > > > also into
> > > > >>>>>>>>>> max direct memory limit seems not necessary. I would
> suggest
> > > to
> > > > remove "I/O
> > > > >>>>>>>>>> direct memory" from the description, and explicitly
> mention
> > > > that this
> > > > >>>>>>>>>> option does not account for direct memory and will not be
> > > > accounted into
> > > > >>>>>>>>>> max direct memory limit. WDYT?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> Sounds good. *
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the
> > new
> > > > >>>>>>>>>>> configuration options. For example, directly use
> MemorySize
> > > > typed options.
> > > > >>>>>>>>>>> That makes validation automatic and saves us from
> breaking
> > > the
> > > > options
> > > > >>>>>>>>>>> later.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best,
> > > > >>>>>>>>> Stephan
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > >
> > >
> >
> >
> > --
> > Best, Jingsong Lee
> >
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Andrey Zagrebin <az...@apache.org>.
Thank you for the further explanation, Stephan, and for the feedback, Jingsong,

I see the point now.
‘taskmanager.memory.flink.size’ is indeed simpler to understand for the
newcomers in the default config because it involves fewer types of memory to
consider.
Most of the time, beginners will only think about tweaking the heap size and
maybe the state size in their try-out jobs.
‘taskmanager.memory.flink.size’ is better scoped for those types of memory.

The process memory forces users to understand the total memory consumption,
which is more important for the next steps and will probably require
reading the docs in more depth anyway.
I agree that if we decide on ‘flink.size’, it is worth adding a pointer to
‘process.size' in its comment as an alternative to it.
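
As a rough sketch of that (the key names are the ones discussed in this
thread, the values are only placeholders, not agreed defaults), the relevant
part of the default flink-conf.yaml could look like:

    # Total memory used by Flink itself (JVM heap, managed memory, network
    # memory, framework/task off-heap memory). JVM metaspace and JVM overhead
    # come on top of this value.
    taskmanager.memory.flink.size: 1280m

    # Alternative: fix the total JVM process size instead, e.g. to match a
    # container size. Only one of the two should be configured.
    # taskmanager.memory.process.size: 1728m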

I am ok with ‘flink.size’ in the default config.

Best,
Andrey

On Mon, Dec 30, 2019 at 5:13 AM Jingsong Li <ji...@gmail.com> wrote:

> Thank you for your wonderful discussion.
>
> +1 for setting "taskmanager.memory.flink.size" in the default config.
> Maybe we can also mention and explain "taskmanager.memory.process.size" in the
> comments.
>
> Most of the time,
> - "trying out" users is less in-depth users and use standalone mode.
> - Production users use active setups.
>
> The default config is very important for "trying out".
> When trying out Flink, a novice Java user wants to configure how much memory
> his standalone process should get in total. In fact, it's hard for him to
> realize the JVM overhead, and there is no need to force him to know about it.
>
> > I don't think it is realistic that users set the process memory to full
> machine memory. There is a lot on the machine as well in most cases.
>
> +1. We often run Flink in a less clean environment, for example one that also
> runs HDFS. Whether we are testing or in production, we will not use all the
> memory of the machine, e.g. we always leave some memory for the PageCache.
>
> Best,
> Jingsong Lee
>
> On Sat, Dec 28, 2019 at 7:02 PM Stephan Ewen <se...@apache.org> wrote:
>
> > "taskmanager.memory.flink.size" in the default config has a few
> advantages.
> >
> >  - The value in the default config needs to be suitable for "trying out"
> > Flink, for a good "getting started" experience.
> >
> >   - For trying out Flink, standalone is the most common entry point
> (except
> > running in IDE).
> >
> >   - In a standalone setup, we subtract quite a bit from the total process
> > memory before we arrive at the usable memory. We also subtract managed
> > memory from the heap now. I fear we might end up with a heap that becomes
> > so small that it makes for a bad "getting started" experience (see the
> > rough numbers sketched after this list).
> >
> >   - I don't think it is realistic that users set the process memory to
> full
> > machine memory. There is a lot on the machine as well in most cases.
> >
> >   - In the JVM world, users are used to configuring the heap size and
> know
> > that there is additional memory overhead. The
> > "taskmanager.memory.flink.size" option fits well with that mindset.
> >
> >   - Once you start to think about the total process memory of Yarn
> > containers, you are already past the getting-started phase and in the
> > tuning phase.
> >
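
To put rough, purely illustrative numbers on the subtraction described in the
third bullet above (made-up round values, not the agreed defaults):

    total process size                 1728m  (what a container would be sized to)
      - JVM metaspace + JVM overhead   ~288m
      = total Flink memory             1440m
      - network memory                  144m
      - managed memory                  576m
      - framework heap                  128m
      = task heap left for user code    592m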
> >
> > On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <az...@apache.org>
> wrote:
> >
> > > Thanks for the summary, Xintong! It makes sense to me.
> > >
> > > How about putting "taskmanager.memory.flink.size" in the configuration?
> > > > Then new downloaded Flink behaves similar to the previous Standalone
> > > setups.
> > > > If someone upgrades the binaries, but re-uses their old
> configuration,
> > > > then they get the compatibility as discussed previously.
> > > > We used that approach previously with the fine-grained failover
> > recovery.
> > >
> > >
> > >
> > > > I'm trying to understand why "taskmanager.memory.flink.size" rather
> > than
> > > > "taskmanager.memory.process.size" in the default flink-conf.yaml. Or
> > put
> > > it
> > > > another way, why do we want the new downloaded Flink behaves similar
> to
> > > > previous Standalone setups rather than previous active mode setups?
> Is
> > > > there any special reason that I overlooked, which makes backwards
> > > > compatibility for standalone setups more important than for active
> > > setups?
> > > > IMO, "taskmanager.memory.process.size" is easier for newcomers.
> > For
> > > > standalone setups, users can simply configure it to their machines'
> > > > available memory size, without needing to worry about leaving enough
> > > space
> > > > for JVM overhead / metaspace. For containerized setups, it's more
> > > > predictable how much memory the containers / Flink could use, which
> is
> > > more
> > > > friendly for users to manage their resource quota.
> > > > Therefore, unless there is anything I overlooked, I'm in favor of
> > putting
> > > > "taskmanager.memory.process.size" rather than
> > > > "taskmanager.memory.flink.size" in the default configuration.
> > >
> > >
> > > I agree that having "taskmanager.memory.process.size" in the default
> > > config should be easier to understand and tweak for new users, because it
> > > is just what they are ready to spend on the Flink application.
> > > The problem is that when users upgrade Flink and use the new default
> > > configuration, the behaviour can change:
> > > - either, if we put the process memory, the Flink memory shrinks and the
> > > new default option contradicts their previous understanding,
> > > - or, if we put the Flink memory, a larger container is requested.
> > > The shrinking of memory is more implicit and therefore worse. The increased
> > > container request will, in the worst case, just fail, and the memory setup
> > > can be revisited.
> > > We could increase the default "taskmanager.memory.process.size" to better
> > > align it with the previous default standalone setup, but this would not
> > > remove the possible confusion for the old users. On the other hand, the
> > > option is new and we can add a comment on how to migrate from the old one.
> > >
> > > All in all, I now also tend towards having "taskmanager.memory.process.size"
> > > in the default config, unless there are stronger reasons to avoid confusing
> > > the old standalone users.
> > >
> > > Best,
> > > Andrey
> > >
> > > On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <to...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > It seems we have already reached consensus on most of the issues. Thanks
> > > > everyone for the good discussion.
> > > >
> > > > While there are still open questions under discussion, I'd like to
> > > > summarize the discussion so far, and list the action items that we
> > > already
> > > > reached consensus on. In this way, we can already start working on these
> > > items
> > > > while discussing the remaining open questions. Please correct me if
> the
> > > > summary does not reflect your argument.
> > > >
> > > > *Action Items that we have already reached consensus on:*
> > > >
> > > >    - Map "taskmanager.heap.size" to "taskmanager.memory.flink.size"
> for
> > > >    standalone setups, and "taskmanager.memory.process.size" for
> active
> > > setups.
> > > >    (should be mentioned in release notes)
> > > >    - If not explicitly configured, MiniCluster should have fixed
> > default
> > > >    network and managed memory sizes. (should be mentioned in docs &
> > > release
> > > >    notes)
> > > >    - Change the memory config options' type from String to MemorySize
> > > >    - Change the config option key
> "taskmanager.memory.total-flink.size"
> > > >    to "taskmanager.memory.flink.size"
> > > >    - Change the config option key
> > "taskmanager.memory.total-process.size"
> > > >    to "taskmanager.memory.process.size"
> > > >    - Update descriptions for
> > "taskmanager.memory.framework.off-heap.size"
> > > >    and "taskmanager.memory.task.off-heap.size" to explicitly state
> that
> > > >       - Both direct and native memory are accounted
> > > >       - Will be fully counted into MaxDirectMemorySize
> > > >    - Update descriptions for
> > > >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove
> "I/O
> > > direct
> > > >    memory" and explicitly state that it's not counted into
> > > MaxDirectMemorySize
> > > >    - Print MemorySize with proper unit. (non-blocker for 1.10)
> > > >
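
To illustrate the MemorySize-typed option item in the list above, a minimal
sketch (assuming the typed ConfigOptions builder this thread refers to; the
key and default are shown only as an example):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class TaskManagerMemoryOptionsSketch {

        // A MemorySize-typed option: values like "128m" are parsed and
        // validated by the configuration framework itself, so the memory
        // calculation code only ever sees MemorySize instances.
        public static final ConfigOption<MemorySize> FRAMEWORK_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.framework.heap.size")
                        .memoryType()
                        .defaultValue(MemorySize.parse("128m"))
                        .withDescription("Framework Heap Memory size for TaskExecutors.");
    }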
> > > >
> > > > *Questions that are still open:*
> > > >
> > > >    - Which config option do we put in the default flink-conf.yaml?
> > > >       - "taskmanager.memory.flink.size"
> > > >       - "taskmanager.memory.process.size"
> > > >       - Deprecated "taskmanager.heap.size"
> > > >    - What are the proper keys for network / shuffle memory
> > > >       - "taskmanager.memory.shuffle.*"
> > > >       - "taskmanager.memory.network.*"
> > > >       - "taskmanager.network.memory.*"
> > > >
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <tonysong820@gmail.com
> >
> > > > wrote:
> > > >
> > > >> How about putting "taskmanager.memory.flink.size" in the
> > configuration?
> > > >>> Then new downloaded Flink behaves similar to the previous
> Standalone
> > > setups.
> > > >>> If someone upgrades the binaries, but re-uses their old
> > configuration,
> > > >>> then they get the compatibility as discussed previously.
> > > >>> We used that approach previously with the fine-grained failover
> > > recovery.
> > > >>
> > > >>
> > > >> I'm trying to understand why "taskmanager.memory.flink.size" rather
> > than
> > > >> "taskmanager.memory.process.size" in the default flink-conf.yaml. Or
> > > put it
> > > >> another way, why do we want the new downloaded Flink behaves similar
> > to
> > > >> previous Standalone setups rather than previous active mode setups?
> Is
> > > >> there any special reason that I overlooked, which makes backwards
> > > >> compatibility for standalone setups more important than for active
> > > setups?
> > > >>
> > > >> IMO, "taskmanager.memory.process.size" is easier for newcomers.
> > For
> > > >> standalone setups, users can simply configure it to their machines'
> > > >> available memory size, without needing to worry about leaving enough
> > > space
> > > >> for JVM overhead / metaspace. For containerized setups, it's more
> > > >> predictable how much memory the containers / Flink could use, which
> is
> > > more
> > > >> friendly for users to manage their resource quota.
> > > >>
> > > >> Therefore, unless there is anything I overlooked, I'm in favor of
> > > putting
> > > >> "taskmanager.memory.process.size" rather than
> > > >> "taskmanager.memory.flink.size" in the default configuration.
> > > >>
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <
> azagrebin@apache.org
> > >
> > > >> wrote:
> > > >>
> > > >>> How about putting "taskmanager.memory.flink.size" in the
> > configuration?
> > > >>>> Then new downloaded Flink behaves similar to the previous
> Standalone
> > > setups.
> > > >>>> If someone upgrades the binaries, but re-uses their old
> > configuration,
> > > >>>> then they get the compatibility as discussed previously.
> > > >>>> We used that approach previously with the fine-grained failover
> > > >>>> recovery.
> > > >>>
> > > >>> +1, this sounds like a good compromise.
> > > >>>
> > > >>> +1 to not have more options for off-heap memory, that would get
> > > >>>> confusing fast. We can keep the name "off-heap" for both task and
> > > >>>> framework, as long as they mean the same thing: native plus
> direct,
> > > and
> > > >>>> fully counted into MaxDirectMemory. I would suggest to update the
> > > config
> > > >>>> descriptions then to reflect that.
> > > >>>>
> > > >>> True, this should be explained in the config descriptions.
> > > >>>
> > > >>> looks good to me
> > > >>>
> > > >>> From a user's perspective I believe "taskmanager.memory.network"
> > would
> > > >>>> be easier to understand as not everyone knows exactly what the
> > shuffle
> > > >>>> service is. I see the point that it would be a bit imprecise as we
> > > can have
> > > >>>> different shuffle implementations but I would go with the ease of
> > > >>>> use/understanding here. Moreover, I think that we won't have many
> > > different
> > > >>>> shuffle service implementations in the foreseeable future.
> > > >>>
> > > >>> I agree that if we cannot find any other convincing names for the
> > > >>> options, we should keep what we already have and change it if the
> > > >>> alternative is convincing enough.
> > > >>> The question is also whether we still want to rename it because it
> > was
> > > >>> "taskmanager.network.*memory*.*" in 1.9 but
> > > "taskmanager.*memory*.network.*"
> > > >>> is more aligned with other new memory option names.
> > > >>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
> > > >>>
> > > >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org>
> > wrote:
> > > >>>
> > > >>>> How about putting "taskmanager.memory.flink.size" in the
> > > configuration?
> > > >>>> Then new downloaded Flink behaves similar to the previous
> Standalone
> > > setups.
> > > >>>>
> > > >>>> If someone upgrades the binaries, but re-uses their old
> > configuration,
> > > >>>> then they get the compatibility as discussed previously.
> > > >>>> We used that approach previously with the fine-grained failover
> > > >>>> recovery.
> > > >>>>
> > > >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <
> tonysong820@gmail.com
> > >
> > > >>>> wrote:
> > > >>>>
> > > >>>>> +1 to not have more options for off-heap memory, that would get
> > > >>>>>> confusing fast. We can keep the name "off-heap" for both task
> and
> > > >>>>>> framework, as long as they mean the same thing: native plus
> > direct,
> > > and
> > > >>>>>> fully counted into MaxDirectMemory. I would suggest to update
> the
> > > config
> > > >>>>>> descriptions then to reflect that.
> > > >>>>>>
> > > >>>>> True, this should be explained in the config descriptions.
> > > >>>>>
> > > >>>>> Which configuration option will be set in Flink's default
> > > >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour
> it
> > > would
> > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > option.
> > > If we
> > > >>>>>> are ok with Yarn requesting slightly larger containers, then it
> > > could also
> > > >>>>>> be taskmanager.memory.total-flink.size.
> > > >>>>>>
> > > >>>>> Good point. Currently, we have
> > > >>>>> "taskmanager.memory.total-process.size". In order to preserve the
> > > previous
> > > >>>>> behavior, we need to have "taskmanager.heap.size" so it can be
> > > mapped to
> > > >>>>> different new options in standalone / active setups.
> > > >>>>> I think we can have the deprecated "taskmanager.heap.size" in the
> > > >>>>> default flink-conf.yaml, and also have the
> > > >>>>> new "taskmanager.memory.total-process.size" in a commented line.
> We
> > > can
> > > >>>>> explain how the deprecated config option behaves differently in
> the
> > > >>>>> comments, so that user can switch to the new config options if
> they
> > > want to.
> > > >>>>>
> > > >>>>> Thank you~
> > > >>>>>
> > > >>>>> Xintong Song
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <
> > trohrmann@apache.org>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Thanks for the feedback and good discussion everyone. I left
> some
> > > >>>>>> comments inline.
> > > >>>>>>
> > > >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> +1 to not have more options for off-heap memory, that would get
> > > >>>>>>> confusing fast. We can keep the name "off-heap" for both task
> and
> > > >>>>>>> framework, as long as they mean the same thing: native plus
> > > direct, and
> > > >>>>>>> fully counted into MaxDirectMemory. I would suggest to update
> the
> > > config
> > > >>>>>>> descriptions then to reflect that.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> > > tonysong820@gmail.com>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Regarding the framework/task direct/native memory options, I
> > tend
> > > >>>>>>>> to think about it differently. I'm in favor of keeping the
> > > "*.off-heap.size" for the
> > > >>>>>>>> config option keys.
> > > >>>>>>>>
> > > >>>>>>>>    - It's not necessary IMO to expose the difference concepts
> of
> > > >>>>>>>>    direct / native memory to the users.
> > > >>>>>>>>    - I would avoid introducing more options for native memory
> if
> > > >>>>>>>>    possible. Taking fine grained resource management and
> dynamic
> > > slot
> > > >>>>>>>>    allocation into consideration, that also means introduce
> more
> > > fields into
> > > >>>>>>>>    ResourceSpec / ResourceProfile.
> > > >>>>>>>>    - My gut feeling is that having a relatively loose
> > > >>>>>>>>    MaxDirectMemory should not be a big problem.
> > > >>>>>>>>    - In most cases, the task / framework off-heap memory
> should
> > be
> > > >>>>>>>>       mainly (if not all) direct memory, so the difference
> > > between derived
> > > >>>>>>>>       MaxDirectMemory and the ideal direct memory limit should
> > > not be too much.
> > > >>>>>>>>       - We do not have a good way to know the exact size
> needed
> > > >>>>>>>>       for jvm overhead / metaspace and framework / task
> off-heap
> > > memory, thus
> > > >>>>>>>>       having to conservatively reserve slightly more memory
> than
> > > what actually
> > > >>>>>>>>       needed. Such reserved but not used memory can cover for
> > the
> > > small
> > > >>>>>>>>       MaxDirectMemory error.
> > > >>>>>>>>       -
> > > >>>>>>>>       - MaxDirectMemory is not the only way to trigger full
> gc.
> > We
> > > >>>>>>>>       still have heap activities that can also trigger the gc.
> > > >>>>>>>>
> > > >>>>>>>> Regarding the memory type config options, I've looked into the
> > > >>>>>>>> latest ConfigOptions changes. I think it shouldn't be too
> > > complicated to
> > > >>>>>>>> change the config options to use memory type, and I can handle
> > it
> > > maybe
> > > >>>>>>>> during your vacations.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Also agree with improving MemorySize logging and parsing. This
> > > >>>>>>>> should not be a blocker that has to be done in 1.10. I would
> say
> > > we finish
> > > >>>>>>>> other works (testability, documentation and those discussed in
> > > this thread)
> > > >>>>>>>> first, and get to this only if we have time.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Thank you~
> > > >>>>>>>>
> > > >>>>>>>> Xintong Song
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> > > >>>>>>>> azagrebin.apache@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Stephan and Xintong,
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> > > >>>>>>>>>
> > > >>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > > replaced
> > > >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> > > different meaning in
> > > >>>>>>>>>> standalone setups. The old option did not subtract metaspace
> > > and other
> > > >>>>>>>>>> overhead, while the new option does. That means that with
> the
> > > default
> > > >>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > > (independent of
> > > >>>>>>>>>> managed memory going off heap).
> > > >>>>>>>>>>     I am wondering if we could interpret
> > > >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line
> > > with the old
> > > >>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > >>>>>>>>>>     The effect would be that the container size on
> Yarn/Mesos
> > > >>>>>>>>>> increases, because from
> "taskmanager.memory.total-flink.size",
> > > we need to
> > > >>>>>>>>>> add overhead and metaspace to reach the total process size,
> > > rather than
> > > >>>>>>>>>> cutting off memory. But if we want, we could even adjust for
> > > that in the
> > > >>>>>>>>>> active resource manager, getting full backwards
> compatibility
> > > on that part.
> > > >>>>>>>>>>     Curious to hear more thoughts there.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > >>>>>>>>>
> > > >>>>>>>>> I think the problem here is that the
> > > >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > standalone setups
> > > >>>>>>>>> and active yarn / mesos setups, and such different
> calculation
> > > logics and
> > > >>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > > Therefore, I'm
> > > >>>>>>>>> not in favor of treating
> "taskmanager.memory.total-flink.size"
> > > differently
> > > >>>>>>>>> for standalone and active setups.
> > > >>>>>>>>>
> > > >>>>>>>>> I think what we really want is probably
> > > >>>>>>>>> mapping "taskmanager.heap.size" to different new config
> options
> > > in
> > > >>>>>>>>> different setups. How about we mark "taskmanager.heap.size"
> as
> > > deprecated
> > > >>>>>>>>> key for neither of "taskmanager.memory.total-process.size"
> and
> > > >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it
> (if
> > > explicitly
> > > >>>>>>>>> configured) in startup scripts / active resource managers,
> and
> > > set the
> > > >>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts
> > and
> > > >>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > > managers (if the
> > > >>>>>>>>> new config options are not configured). We can provide util
> > > methods in
> > > >>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all
> the
> > > >>>>>>>>> configuration logics at one place.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> I agree that the problem is that the legacy option
> > > >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> > > standalone/container.
> > > >>>>>>>>> We had it initially falling back to
> > > 'taskmanager.memory.total-flink.size’
> > > >>>>>>>>> but I changed that to align it with container cut-off. Now I
> > see
> > > it changes
> > > >>>>>>>>> standalone setup then.
> > > >>>>>>>>> +1 for supporting its backwards compatibility differently for
> > > >>>>>>>>> standalone/container setups.
> > > >>>>>>>>>
> > > >>>>>>>>
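In code, the proposed conversion boils down to something like the following (a plain Java sketch with made-up class and method names, not actual Flink code; the real logic would live in the startup scripts and in TaskExecutorResourceUtils, and would only kick in if the new options are not configured explicitly):

    import java.util.AbstractMap;
    import java.util.Map;

    // Illustrative sketch only, not actual Flink code: translate the deprecated
    // "taskmanager.heap.size" into one of the new options, depending on the setup.
    final class LegacyHeapOptionMapper {

        static Map.Entry<String, String> mapLegacyHeapSize(
                String legacyHeapValue, boolean activeResourceManager) {
            // Active setups (Yarn/Mesos) interpret the legacy value as the total
            // process size; standalone scripts interpret it as the total Flink size.
            String newKey = activeResourceManager
                    ? "taskmanager.memory.total-process.size"
                    : "taskmanager.memory.total-flink.size";
            return new AbstractMap.SimpleEntry<>(newKey, legacyHeapValue);
        }
    }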
> > > >>>>>> Which configuration option will be set in Flink's default
> > > >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour
> it
> > > would
> > > >>>>>> have to be the then deprecated taskmanager.heap.size config
> > option.
> > > If we
> > > >>>>>> are ok with Yarn requesting slightly larger containers, then it
> > > could also
> > > >>>>>> be taskmanager.memory.total-flink.size.
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools
> > as
> > > a
> > > >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > wondering if it is
> > > >>>>>>>>>> the right approach here.
> > > >>>>>>>>>>     For example: I started a relatively large JVM (large
> heap
> > > >>>>>>>>>> size of 10 GB) as a test. With the current logic, the system
> > > tries to
> > > >>>>>>>>>> reserve an additional 6GB for managed memory which is more
> > than
> > > there is
> > > >>>>>>>>>> memory left. When you see the error that no memory could be
> > > allocated, you
> > > >>>>>>>>>> need to understand the magic of how this is derived.
> > > >>>>>>>>>>     I am trying to think about this from the perspective of
> > > using
> > > >>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
> > > >>>>>>>>>>     When starting Flink out of a running process, we cannot
> > > >>>>>>>>>> assume that we are the only users of that process and that
> we
> > > can mold the
> > > >>>>>>>>>> process to our demands. I think a fix value for managed
> memory
> > > and network
> > > >>>>>>>>>> memory would feel more natural in such a setup than a
> > mechanism
> > > that is
> > > >>>>>>>>>> tailored towards exclusive use of the process.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> also +1 for that, if the user has not specified any of the main
> > > >>>>>>>>> options from which memory is derived. We should also log this
> > > >>>>>>>>> fixing of managed / shuffle memory.
> > > >>>>>>>>> And just noticed, we could also sanity check the framework heap
> > > >>>>>>>>> and, if explicitly configured, the task heap against the
> > > >>>>>>>>> available JVM heap, and at least log inconsistencies.
> > > >>>>>>>>>
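As a rough illustration of both ideas (fixed fallbacks plus a sanity check against the embedding JVM), a plain Java sketch; the names and the concrete sizes are made-up placeholders, not Flink's defaults:

    // Illustrative sketch only, not actual Flink code: fixed fallback sizes for
    // the MiniCluster / "Flink as a Library" case, plus a sanity check of the
    // configured heap against what the embedding JVM actually provides.
    final class MiniClusterMemorySketch {

        static final long FIXED_MANAGED_MEMORY_BYTES = 128L << 20; // placeholder
        static final long FIXED_NETWORK_MEMORY_BYTES = 64L << 20;  // placeholder

        static long managedMemoryOrDefault(Long explicitlyConfigured) {
            // Use a fixed value instead of deriving it from the JVM size, since
            // the embedding process may not belong exclusively to Flink.
            return explicitlyConfigured != null
                    ? explicitlyConfigured
                    : FIXED_MANAGED_MEMORY_BYTES;
        }

        static void sanityCheckConfiguredHeap(long configuredTaskHeapBytes) {
            long availableJvmHeap = Runtime.getRuntime().maxMemory();
            if (configuredTaskHeapBytes > availableJvmHeap) {
                // Only log the inconsistency; do not fail hard.
                System.err.println("Configured task heap (" + configuredTaskHeapBytes
                        + " bytes) exceeds the available JVM heap ("
                        + availableJvmHeap + " bytes).");
            }
        }
    }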
> > > >>>>>>>>>   - Some off-heap memory goes into direct memory, some does
> > not.
> > > >>>>>>>>>> This confused me a bit. For example
> > > >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
> > > >>>>>>>>>> MaxDirectMemory while
> "taskmanager.memory.task.off-heap.size"
> > > is counted as
> > > >>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> > that.
> > > There is no
> > > >>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > > >>>>>>>>> accounted in the max direct memory size limit. The confusion
> > > probably comes
> > > >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > explicitly mentioned
> > > >>>>>>>>> that in its description while
> > > "taskmanager.memory.task.off-heap.size"
> > > >>>>>>>>> didn't. That's actually because the framework off-heap memory
> > is
> > > introduced
> > > >>>>>>>>> later in a separate commit. We should fix that.
> > > >>>>>>>>>
> > > >>>>>>>>> For framework / task off-heap memory, we do not differentiate
> > > >>>>>>>>> direct / native memory usage. That means the configure value
> > for
> > > these two
> > > >>>>>>>>> options could be a mixture of direct / native memory. Since
> we
> > > do not know
> > > >>>>>>>>> the portion of direct memory out of the configured value, we
> > have
> > > >>>>>>>>> to conservatively account it all into the max direct memory
> > size
> > > limit.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> > > >>>>>>>>> calculation, it is fine. But why do we then set
> > MaxDirectMemory?
> > > It is a
> > > >>>>>>>>> difficult parameter, and the main reason to set it was (if I
> > > recall
> > > >>>>>>>>> correctly) to trigger GC based on direct memory allocation
> (to
> > > free heap
> > > >>>>>>>>> structures that then in turn release direct memory). If the
> > > limit is
> > > >>>>>>>>> anyways too high (because we also count native memory in
> there)
> > > such that
> > > >>>>>>>>> we can exceed the total process (container) memory, why do we
> > > set it then?*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> I always also thought about it as providing more safety net
> for
> > > >>>>>>>>> direct allocations but GC thing looks more important.
> > > >>>>>>>>>
> > > >>>>>>>>> +1 for fixing the docs for ‘taskmanager.memory.task.off-heap.size’
> > > >>>>>>>>> and renaming it to ‘direct', as this is what really happens,
> > > >>>>>>>>> if we want to support the direct limit more exactly than now.
> > > >>>>>>>>>
> > > >>>>>>>>> I also think that it is hard to separate direct / native
> memory
> > > >>>>>>>>> unless we introduce even more options.
> > > >>>>>>>>> If user wants to keep the direct limit tight to a certain
> value
> > > >>>>>>>>> but also use native memory outside of it,
> > > >>>>>>>>> she would have to increase something else, like JVM overhead
> to
> > > >>>>>>>>> account for it and there is no other better way.
> > > >>>>>>>>> Having more options to account for the native memory outside
> of
> > > >>>>>>>>> direct limit complicates things but can be introduced later
> if
> > > needed.
> > > >>>>>>>>>
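For reference, the accounting discussed here roughly boils down to the following plain Java sketch (made-up names; it assumes framework off-heap, task off-heap and network memory are all counted into the limit, while JVM overhead is not):

    // Illustrative sketch only, not actual Flink code.
    final class MaxDirectMemorySketch {

        // Conservative limit: task/framework off-heap may in fact be partly
        // native memory, but we cannot know the split, so it is counted in full.
        static long maxDirectMemoryBytes(
                long frameworkOffHeapBytes, long taskOffHeapBytes, long networkBytes) {
            return frameworkOffHeapBytes + taskOffHeapBytes + networkBytes;
        }

        static String asJvmArg(long maxDirectMemoryBytes) {
            return "-XX:MaxDirectMemorySize=" + maxDirectMemoryBytes;
        }
    }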
> > > >>>>>>>>>   - What do you think about renaming
> > > >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> > > "taskmanager.memory.flink.size"
> > > >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> > > >>>>>>>>>> "taskmanager.memory.process.size" (or
> > > "taskmanager.memory.jvm.total"). I
> > > >>>>>>>>>> think these keys may be a bit less clumsy (dropping the
> > > "total-") without
> > > >>>>>>>>>> loss of expressiveness.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> +1 on this.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> +1 as well. Also an option:
> > > >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> > > >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> > > >>>>>>>>> although it can be also mentioned in docs that we mean JVM
> > > process.
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>> I'd be in favour of Stephan's proposal for the config keys as
> > > shorter
> > > >>>>>> is usually better and they are still descriptive enough. Between
> > > >>>>>> "taskmanager.memory.process.size" and
> > > "taskmanager.memory.jvm.total" I
> > > >>>>>> would slightly favour the first variant.
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>>>>   - The network memory keys are now called
> > > >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is
> > > usually
> > > >>>>>>>>>> understood as a redistribution (random, or maybe by hash of
> > > key). As an
> > > >>>>>>>>>> example, there are many discussions about "shuffle join
> versus
> > > broadcast
> > > >>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning".
> > We
> > > use that
> > > >>>>>>>>>> memory however for all network operations, like forward
> pipes,
> > > broadcasts,
> > > >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name
> > > "*.shuffle.*"
> > > >>>>>>>>>> confusing, I am wondering if users would find that as well.
> So
> > > throwing in
> > > >>>>>>>>>> the suggestion to call the options
> > > "taskmanager.memory.network.*".
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*".
> On
> > > the
> > > >>>>>>>>> other hand, one can also argue that this part of memory is
> used
> > > by
> > > >>>>>>>>> ShuffleEnvironment, and the key
> "taskmanager.memory.shuffle.*"
> > > points more
> > > >>>>>>>>> directly to the shuffle service components.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
> > > >>>>>>>>> of a misnomer, because it does more than just shuffles. The
> > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too
> > > >>>>>>>>> critical.
> > > >>>>>>>>> This isn't super high priority for me, but if we want to
> adjust
> > > it, better
> > > >>>>>>>>> earlier than later.*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> This is also a bit controversial topic for me. Indeed, we
> have
> > > >>>>>>>>> always used ’network’ for this concept of task data shuffling
> > > over the
> > > >>>>>>>>> network and this can confuse existing users.
> > > >>>>>>>>>
> > > >>>>>>>>> On the other hand for the new users and in a long term,
> > ’network’
> > > >>>>>>>>> can delude into a conclusion that all network memory is
> managed
> > > by this
> > > >>>>>>>>> option.
> > > >>>>>>>>> Also other types of shuffle might not directly deal with
> > network
> > > >>>>>>>>> at all.
> > > >>>>>>>>>
> > > >>>>>>>>> By calling it shuffle, we were somewhat biased by
> understanding
> > > it
> > > >>>>>>>>> in term of map/reduce. This is rather an inter-task data
> > > exchange.
> > > >>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
> > > >>>>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>  From a user's perspective I believe
> "taskmanager.memory.network"
> > > >>>>>> would be easier to understand as not everyone knows exactly what
> > the
> > > >>>>>> shuffle service is. I see the point that it would be a bit
> > > imprecise as we
> > > >>>>>> can have different shuffle implementations but I would go with
> the
> > > ease of
> > > >>>>>> use/understanding here. Moreover, I think that we won't have
> many
> > > different
> > > >>>>>> shuffle service implementations in the foreseeable future.
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>>>>   - The descriptions for the
> > "taskmanager.memory.jvm-overhead.*"
> > > >>>>>>>>>> keys say that it also accounts for I/O direct memory, but
> the
> > > parameter is
> > > >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> True. Since we already have framework off-heap memory
> accounted
> > > >>>>>>>>> for ad hoc direct memory usages, accounting all of
> jvm-overhead
> > > also into
> > > >>>>>>>>> max direct memory limit seems not necessary. I would suggest
> to
> > > remove "I/O
> > > >>>>>>>>> direct memory" from the description, and explicitly mention
> > that
> > > this
> > > >>>>>>>>> option does not account for direct memory and will not be
> > > accounted into
> > > >>>>>>>>> max direct memory limit. WDYT?
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> +1
> > > >>>>>>>>>
> > > >>>>>>>>>   - Can make the new ConfigOptions strongly typed with the
> new
> > > >>>>>>>>>> configuration options. For example, directly use MemorySize
> > > typed options.
> > > >>>>>>>>>> That makes validation automatic and saves us from breaking
> the
> > > options
> > > >>>>>>>>>> later.
> > > >>>>>>>>>
> > > >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> +1
> > > >>>>>>>>>
> > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> I would appreciate it.
> > > >>>>>>>>>
> > > >>>>>>>>> Also a small side note: we extensively use MemorySize in logs now,
> > > >>>>>>>>> but it might not always be readable, as its string representation
> > > >>>>>>>>> is currently only in bytes and is not reduced to kb/mb/etc for
> > > >>>>>>>>> large byte values. We could have at least some .prettyPrint
> > > >>>>>>>>> function to use in logs. And .fromMegabytes/etc factory methods
> > > >>>>>>>>> would improve code readability compared to .parse(int + “m”).
> > > >>>>>>>>>
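A small sketch of what such helpers could look like (a standalone illustration in plain Java, not the actual MemorySize class):

    // Illustrative sketch only, not Flink's MemorySize implementation.
    final class MemorySizeSketch {

        private final long bytes;

        private MemorySizeSketch(long bytes) {
            this.bytes = bytes;
        }

        static MemorySizeSketch ofMebiBytes(long mebiBytes) {
            return new MemorySizeSketch(mebiBytes << 20);
        }

        // Prints "1536 mb" instead of "1610612736 bytes" when the value divides
        // evenly into a larger unit; otherwise falls back to bytes.
        String prettyPrint() {
            if (bytes % (1L << 30) == 0) {
                return (bytes >> 30) + " gb";
            }
            if (bytes % (1L << 20) == 0) {
                return (bytes >> 20) + " mb";
            }
            if (bytes % (1L << 10) == 0) {
                return (bytes >> 10) + " kb";
            }
            return bytes + " bytes";
        }
    }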
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Andrey
> > > >>>>>>>>>
> > > >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org>
> > wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Xintong!
> > > >>>>>>>>>
> > > >>>>>>>>> Please find my answers inline:
> > > >>>>>>>>>
> > > >>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > > >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size" which
> has
> > > a different
> > > >>>>>>>>>>> meaning in standalone setups. The old option did not
> subtract
> > > metaspace and
> > > >>>>>>>>>>> other overhead, while the new option does. That means that
> > > with the default
> > > >>>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > > (independent of
> > > >>>>>>>>>>> managed memory going off heap).
> > > >>>>>>>>>>>     I am wondering if we could interpret
> > > >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > > >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in
> line
> > > with the old
> > > >>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > > >>>>>>>>>>>     The effect would be that the container size on
> Yarn/Mesos
> > > >>>>>>>>>>> increases, because from
> > "taskmanager.memory.total-flink.size",
> > > we need to
> > > >>>>>>>>>>> add overhead and metaspace to reach the total process size,
> > > rather than
> > > >>>>>>>>>>> cutting off memory. But if we want, we could even adjust
> for
> > > that in the
> > > >>>>>>>>>>> active resource manager, getting full backwards
> compatibility
> > > on that part.
> > > >>>>>>>>>>>     Curious to hear more thoughts there.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> Yes*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>> I think the problem here is that the
> > > >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > > standalone setups
> > > >>>>>>>>>> and active yarn / mesos setups, and such different
> calculation
> > > logics and
> > > >>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > > Therefore, I'm
> > > >>>>>>>>>> not in favor of treating
> "taskmanager.memory.total-flink.size"
> > > differently
> > > >>>>>>>>>> for standalone and active setups.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think what we really want is probably
> > > >>>>>>>>>> mapping "taskmanager.heap.size" to different new config
> > options
> > > in
> > > >>>>>>>>>> different setups. How about we mark "taskmanager.heap.size"
> as
> > > deprecated
> > > >>>>>>>>>> key for neither of "taskmanager.memory.total-process.size"
> and
> > > >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it
> > (if
> > > explicitly
> > > >>>>>>>>>> configured) in startup scripts / active resource managers,
> and
> > > set the
> > > >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the
> scripts
> > > and
> > > >>>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > > managers (if the
> > > >>>>>>>>>> new config options are not configured). We can provide util
> > > methods in
> > > >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all
> > the
> > > >>>>>>>>>> configuration logics at one place.
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> > > >>>>>>>>> description was not very clear), so +1 for that mechanism*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools
> > as
> > > a
> > > >>>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > > wondering if it is
> > > >>>>>>>>>>> the right approach here.
> > > >>>>>>>>>>>     For example: I started a relatively large JVM (large
> heap
> > > >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the
> system
> > > tries to
> > > >>>>>>>>>>> reserve an additional 6GB for managed memory which is more
> > > than there is
> > > >>>>>>>>>>> memory left. When you see the error that no memory could be
> > > allocated, you
> > > >>>>>>>>>>> need to understand the magic of how this is derived.
> > > >>>>>>>>>>>     I am trying to think about this from the perspective of
> > > >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is close
> > to.
> > > >>>>>>>>>>>     When starting Flink out of a running process, we cannot
> > > >>>>>>>>>>> assume that we are the only users of that process and that
> we
> > > can mold the
> > > >>>>>>>>>>> process to our demands. I think a fix value for managed
> > memory
> > > and network
> > > >>>>>>>>>>> memory would feel more natural in such a setup than a
> > > mechanism that is
> > > >>>>>>>>>>> tailored towards exclusive use of the process.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>>   - Some off-heap memory goes into direct memory, some does
> > not.
> > > >>>>>>>>>>> This confused me a bit. For example
> > > >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted
> into
> > > >>>>>>>>>>> MaxDirectMemory while
> "taskmanager.memory.task.off-heap.size"
> > > is counted as
> > > >>>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> > > that. There is no
> > > >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> > > >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > > >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > > >>>>>>>>>> accounted in the max direct memory size limit. The confusion
> > > probably comes
> > > >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > > explicitly mentioned
> > > >>>>>>>>>> that in its description while
> > > "taskmanager.memory.task.off-heap.size"
> > > >>>>>>>>>> didn't. That's actually because the framework off-heap
> memory
> > > is introduced
> > > >>>>>>>>>> later in a separate commit. We should fix that.
> > > >>>>>>>>>>
> > > >>>>>>>>>> For framework / task off-heap memory, we do not
> differentiate
> > > >>>>>>>>>> direct / native memory usage. That means the configure value
> > > for these two
> > > >>>>>>>>>> options could be a mixture of direct / native memory. Since
> we
> > > do not know
> > > >>>>>>>>>> the portion of direct memory out of the configured value, we
> > > have
> > > >>>>>>>>>> to conservatively account it all into the max direct memory
> > > size limit.
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> > > >>>>>>>>> calculation, it is fine. But why do we then set
> > MaxDirectMemory?
> > > It is a
> > > >>>>>>>>> difficult parameter, and the main reason to set it was (if I
> > > recall
> > > >>>>>>>>> correctly) to trigger GC based on direct memory allocation
> (to
> > > free heap
> > > >>>>>>>>> structures that then in turn release direct memory). If the
> > > limit is
> > > >>>>>>>>> anyways too high (because we also count native memory in
> there)
> > > such that
> > > >>>>>>>>> we can exceed the total process (container) memory, why do we
> > > set it then?*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>   - The network memory keys are now called
> > > >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is
> > > usually
> > > >>>>>>>>>>> understood as a redistribution (random, or maybe by hash of
> > > key). As an
> > > >>>>>>>>>>> example, there are many discussions about "shuffle join
> > versus
> > > broadcast
> > > >>>>>>>>>>> join", where "shuffle" is the synonym for
> "re-partitioning".
> > > We use that
> > > >>>>>>>>>>> memory however for all network operations, like forward
> > pipes,
> > > broadcasts,
> > > >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the
> name
> > > "*.shuffle.*"
> > > >>>>>>>>>>> confusing, I am wondering if users would find that as well.
> > So
> > > throwing in
> > > >>>>>>>>>>> the suggestion to call the options
> > > "taskmanager.memory.network.*".
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*".
> On
> > > >>>>>>>>>> the other hand, one can also argue that this part of memory
> is
> > > used by
> > > >>>>>>>>>> ShuffleEnvironment, and the key
> "taskmanager.memory.shuffle.*"
> > > points more
> > > >>>>>>>>>> directly to the shuffle service components.
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a
> bit
> > > >>>>>>>>> incorrect, because it is doing not only shuffles as well. The
> > > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is not
> > too
> > > critical.
> > > >>>>>>>>> This isn't super high priority for me, but if we want to
> adjust
> > > it, better
> > > >>>>>>>>> earlier than later.*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>    - The descriptions for the
> > "taskmanager.memory.jvm-overhead.*"
> > > >>>>>>>>> keys say that it also accounts for I/O direct memory, but the
> > > parameter is
> > > >>>>>>>>> not counted into the MaxDirectMemory parameter.
> > > >>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> True. Since we already have framework off-heap memory
> > accounted
> > > >>>>>>>>>> for ad hoc direct memory usages, accounting all of
> > jvm-overhead
> > > also into
> > > >>>>>>>>>> max direct memory limit seems not necessary. I would suggest
> > to
> > > remove "I/O
> > > >>>>>>>>>> direct memory" from the description, and explicitly mention
> > > that this
> > > >>>>>>>>>> option does not account for direct memory and will not be
> > > accounted into
> > > >>>>>>>>>> max direct memory limit. WDYT?
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> Sounds good. *
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the
> new
> > > >>>>>>>>>>> configuration options. For example, directly use MemorySize
> > > typed options.
> > > >>>>>>>>>>> That makes validation automatic and saves us from breaking
> > the
> > > options
> > > >>>>>>>>>>> later.
> > > >>>>>>>>>>
> > > >>>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Stephan
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > >
> >
>
>
> --
> Best, Jingsong Lee
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Jingsong Li <ji...@gmail.com>.
Thank you for your wonderful discussion.

+1 for setting "taskmanager.memory.flink.size" in the default config.
Maybe we can also describe and explain "taskmanager.memory.process.size" in
the comments.

Most of the time:
- "trying out" users are less in-depth users and use standalone mode.
- Production users use active setups.

The default config is very important for "trying out".
When trying out Flink, a novice Java user wants to configure how much memory
his standalone process should get. In fact, it is hard for him to account for
the JVM overhead, and there is no need to make him do so.
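To illustrate with purely made-up numbers (these are not Flink's actual defaults), this is the kind of arithmetic a newcomer would otherwise have to do in his head:

    // Illustrative numbers only, not Flink defaults: configuring the total
    // process size means metaspace and JVM overhead are cut off first, and only
    // the remainder is left as "Flink memory" (heap, managed, network, ...).
    public final class GettingStartedMemoryExample {

        public static void main(String[] args) {
            long processSizeMb = 1024;                          // what a novice might configure
            long metaspaceMb = 96;                              // placeholder
            long jvmOverheadMb = (long) (processSizeMb * 0.1);  // placeholder fraction

            long flinkMemoryMb = processSizeMb - metaspaceMb - jvmOverheadMb;
            System.out.println(flinkMemoryMb + " mb of " + processSizeMb
                    + " mb are left as Flink memory");
        }
    }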

> I don't think it is realistic that users set the process memory to full
machine memory. There is a lot on the machine as well in most cases.

+1. We often run Flink in a less clean environment, for example on machines
that also run HDFS. Whether we are testing or in production, we will not use
all of the machine's memory; we always leave some memory for the page cache.

Best,
Jingsong Lee

On Sat, Dec 28, 2019 at 7:02 PM Stephan Ewen <se...@apache.org> wrote:

> "taskmanager.memory.flink.size" in the default config has a few advantages.
>
>  - The value in the default config needs to be suitable for "trying out"
> Flink, for a good "getting started" experience.
>
>   - For trying out Flink, standalone is the most common entry point (except
> running in IDE).
>
>   - In standalone setup, from total process memory, we subtract quite a bit
> before we arrive at the usable memory. We also subtract managed memory from
> the heap now. I fear we might end up at a heap that becomes so small that
> it makes for a bad "getting started" experience.
>
>   - I don't think it is realistic that users set the process memory to full
> machine memory. There is a lot on the machine as well in most cases.
>
>   - In the JVM world, users are used to configuring the heap size and know
> that there is additional memory overhead. The
> "taskmanager.memory.flink.size" option fits well with that mindset.
>
>   - Once you start to think about the total process memory of Yarn
> containers, you are already past the getting-started phase and into the
> tuning phase.
>
>
> On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <az...@apache.org> wrote:
>
> > Thanks for the summary, Xintong! It makes sense to me.
> >
> > How about putting "taskmanager.memory.flink.size" in the configuration?
> > > Then new downloaded Flink behaves similar to the previous Standalone
> > setups.
> > > If someone upgrades the binaries, but re-uses their old configuration,
> > > then they get the compatibility as discussed previously.
> > > We used that approach previously with the fine-grained failover
> recovery.
> >
> >
> >
> > > I'm trying to understand why "taskmanager.memory.flink.size" rather
> than
> > > "taskmanager.memory.process.size" in the default flink-conf.yaml. Or
> put
> > it
> > > another way, why do we want the new downloaded Flink behaves similar to
> > > previous Standalone setups rather than previous active mode setups? Is
> > > there any special reason that I overlooked, which makes backwards
> > > compatibility for standalone setups more important than for active
> > setups?
> > > IMO, "taskmanager.memory.process.size" is easier for the new comers.
> For
> > > standalone setups, users can simply configure it to their machines'
> > > available memory size, without needing to worry about leaving enough
> > space
> > > for JVM overehead / metaspace. For containerized setups, it's more
> > > predictable how many memory the containers / Flink could use, which is
> > more
> > > friendly for users to manage their resource quota.
> > > Therefore, unless there is anything I overlooked, I'm in favor of
> putting
> > > "taskmanager.memory.process.size" rather than
> > > "taskmanager.memory.flink.size" in the default configuration.
> >
> >
> > I agree that having "taskmanager.memory.process.size" in the default config
> > should be easier to understand and tweak for new users, because it is just
> > what they are ready to spend on the Flink application.
> > The problem is that when users upgrade Flink and use the new default
> > configuration, the behaviour can change:
> > - either, if we put the process memory there, the Flink memory shrinks and
> > the new default option contradicts their previous understanding,
> > - or, if we put the Flink memory there, a larger container is requested.
> > The shrinking of memory is the more implicit and therefore worse of the two.
> > The increased container request will, in the worst case, just fail, and the
> > memory setup can then be revisited.
> > We could increase the default "taskmanager.memory.process.size" to better
> > align it with the previous default setup for standalone, but this would not
> > remove the possible confusion for old users; on the other hand, the option
> > is new and we can add a comment on how to migrate from the old one.
> >
> > All in all, now I also tend to have "taskmanager.memory.process.size" in
> > the default config unless there are more reasons for having less
> confusion
> > for the old standalone users.
> >
> > Best,
> > Andrey
> >
> > On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <to...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > It seems we have already get consensus on most of the issues. Thanks
> > > everyone for the good discussion.
> > >
> > > While there are still open questions under discussion, I'd like to
> > > summarize the discussion so far, and list the action items that we
> > already
> > > get consensus on. In this way, we can already start working on these
> > items
> > > while discussing the remaining open questions. Please correct me if the
> > > summary does not reflect your argument.
> > >
> > > *Action Items that we already get consensus on:*
> > >
> > >    - Map "taskmanager.heap.size" to "taskmanager.memory.flink.size" for
> > >    standalone setups, and "taskmanager.memory.process.size" for active
> > setups.
> > >    (should be mentioned in release notes)
> > >    - If not explicitly configured, MiniCluster should have fixed
> default
> > >    network and managed memory sizes. (should be mentioned in docs &
> > release
> > >    notes)
> > >    - Change the memory config options' type from String to MemorySize
> > >    - Change the config option key "taskmanager.memory.total-flink.size"
> > >    to "taskmanager.memory.flink.size"
> > >    - Change the config option key
> "taskmanager.memory.total-process.size"
> > >    to "taskmanager.memory.process.size"
> > >    - Update descriptions for
> "taskmanager.memory.framework.off-heap.size"
> > >    and "taskmanager.memory.task.off-heap.size" to explicitly state that
> > >       - Both direct and native memory are accounted
> > >       - Will be fully counted into MaxDirectMemorySize
> > >    - Update descriptions for
> > >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove "I/O
> > direct
> > >    memory" and explicitly state that it's not counted into
> > MaxDirectMemorySize
> > >    - Print MemorySize with proper unit. (non-blocker for 1.10)
> > >
> > >
> > > *Questions that are still open:*
> > >
> > >    - Which config option do we put in the default flink-conf.yaml?
> > >       - "taskmanager.memory.flink.size"
> > >       - "taskmanager.memory.process.size"
> > >       - Deprecated "taskmanager.heap.size"
> > >    - What are the proper keys for network / shuffle memory
> > >       - "taskmanager.memory.shuffle.*"
> > >       - "taskmanager.memory.network.*"
> > >       - "taskmanager.network.memory.*"
> > >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <to...@gmail.com>
> > > wrote:
> > >
> > >> How about putting "taskmanager.memory.flink.size" in the
> configuration?
> > >>> Then new downloaded Flink behaves similar to the previous Standalone
> > setups.
> > >>> If someone upgrades the binaries, but re-uses their old
> configuration,
> > >>> then they get the compatibility as discussed previously.
> > >>> We used that approach previously with the fine-grained failover
> > recovery.
> > >>
> > >>
> > >> I'm trying to understand why "taskmanager.memory.flink.size" rather than
> > >> "taskmanager.memory.process.size" in the default flink-conf.yaml. Or, put
> > >> another way, why do we want a newly downloaded Flink to behave similarly
> > >> to previous Standalone setups rather than previous active mode setups? Is
> > >> there any special reason that I overlooked, which makes backwards
> > >> compatibility for standalone setups more important than for active setups?
> > >>
> > >> IMO, "taskmanager.memory.process.size" is easier for newcomers. For
> > >> standalone setups, users can simply configure it to their machines'
> > >> available memory size, without needing to worry about leaving enough space
> > >> for JVM overhead / metaspace. For containerized setups, it's more
> > >> predictable how much memory the containers / Flink could use, which is
> > >> more friendly for users managing their resource quota.
> > >>
> > >> Therefore, unless there is anything I overlooked, I'm in favor of
> > putting
> > >> "taskmanager.memory.process.size" rather than
> > >> "taskmanager.memory.flink.size" in the default configuration.
> > >>
> > >>
> > >> Thank you~
> > >>
> > >> Xintong Song
> > >>
> > >>
> > >>
> > >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <azagrebin@apache.org
> >
> > >> wrote:
> > >>
> > >>> How about putting "taskmanager.memory.flink.size" in the
> configuration?
> > >>>> Then new downloaded Flink behaves similar to the previous Standalone
> > setups.
> > >>>> If someone upgrades the binaries, but re-uses their old
> configuration,
> > >>>> then they get the compatibility as discussed previously.
> > >>>> We used that approach previously with the fine-grained failover
> > >>>> recovery.
> > >>>
> > >>> +1, this sounds like a good compromise.
> > >>>
> > >>> +1 to not have more options for off-heap memory, that would get
> > >>>> confusing fast. We can keep the name "off-heap" for both task and
> > >>>> framework, as long as they mean the same thing: native plus direct,
> > and
> > >>>> fully counted into MaxDirectMemory. I would suggest to update the
> > config
> > >>>> descriptions then to reflect that.
> > >>>>
> > >>> True, this should be explained in the config descriptions.
> > >>>
> > >>> looks good to me
> > >>>
> > >>> From a user's perspective I believe "taskmanager.memory.network"
> would
> > >>>> be easier to understand as not everyone knows exactly what the
> shuffle
> > >>>> service is. I see the point that it would be a bit imprecise as we
> > can have
> > >>>> different shuffle implementations but I would go with the ease of
> > >>>> use/understanding here. Moreover, I think that we won't have many
> > different
> > >>>> shuffle service implementations in the foreseeable future.
> > >>>
> > >>> I agree that if we cannot find any other convincing names for the
> > >>> options, we should keep what we already have and change it if the
> > >>> alternative is convincing enough.
> > >>> The question is also whether we still want to rename it because it
> was
> > >>> "taskmanager.network.*memory*.*" in 1.9 but
> > "taskmanager.*memory*.network.*"
> > >>> is more aligned with other new memory option names.
> > >>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
> > >>>
> > >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org>
> wrote:
> > >>>
> > >>>> How about putting "taskmanager.memory.flink.size" in the
> > configuration?
> > >>>> Then a freshly downloaded Flink behaves similarly to the previous
> > >>>> Standalone setups.
> > >>>>
> > >>>> If someone upgrades the binaries, but re-uses their old
> configuration,
> > >>>> then they get the compatibility as discussed previously.
> > >>>> We used that approach previously with the fine-grained failover
> > >>>> recovery.
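A quick sketch of what that would mean for active setups (plain Java with made-up names and placeholder numbers, not actual Flink code): the container request is derived by adding metaspace and overhead on top of the configured Flink size, rather than cutting them off.

    // Illustrative sketch only, not actual Flink code.
    final class ProcessSizeFromFlinkSize {

        // With "taskmanager.memory.flink.size" as the entry point, an active
        // resource manager adds metaspace and JVM overhead to obtain the
        // container / total process size.
        static long processSizeMb(long flinkSizeMb, long metaspaceMb, long jvmOverheadMb) {
            return flinkSizeMb + metaspaceMb + jvmOverheadMb;
        }

        public static void main(String[] args) {
            // Placeholder numbers for illustration.
            System.out.println(processSizeMb(1024, 96, 128)
                    + " mb requested for the container");
        }
    }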
> > >>>>
> > >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <tonysong820@gmail.com
> >
> > >>>> wrote:
> > >>>>
> > >>>>> +1 to not have more options for off-heap memory, that would get
> > >>>>>> confusing fast. We can keep the name "off-heap" for both task and
> > >>>>>> framework, as long as they mean the same thing: native plus
> direct,
> > and
> > >>>>>> fully counted into MaxDirectMemory. I would suggest to update the
> > config
> > >>>>>> descriptions then to reflect that.
> > >>>>>>
> > >>>>> True, this should be explained in the config descriptions.
> > >>>>>
> > >>>>> Which configuration option will be set in Flink's default
> > >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it
> > would
> > >>>>>> have to be the then deprecated taskmanager.heap.size config
> option.
> > If we
> > >>>>>> are ok with Yarn requesting slightly larger containers, then it
> > could also
> > >>>>>> be taskmanager.memory.total-flink.size.
> > >>>>>>
> > >>>>> Good point. Currently, we have
> > >>>>> "taskmanager.memory.total-process.size". In order to preserve the
> > previous
> > >>>>> behavior, we need to have "taskmanager.heap.size" so it can be
> > mapped to
> > >>>>> different new options in standalone / active setups.
> > >>>>> I think we can have the deprecated "taskmanager.heap.size" in the
> > >>>>> default flink-conf.yaml, and also have the
> > >>>>> new "taskmanager.memory.total-process.size" in a commented line. We
> > can
> > >>>>> explain how the deprecated config option behaves differently in the
> > >>>>> comments, so that user can switch to the new config options if they
> > want to.
> > >>>>>
> > >>>>> Thank you~
> > >>>>>
> > >>>>> Xintong Song
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <
> trohrmann@apache.org>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Thanks for the feedback and good discussion everyone. I left some
> > >>>>>> comments inline.
> > >>>>>>
> > >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> +1 to not have more options for off-heap memory, that would get
> > >>>>>>> confusing fast. We can keep the name "off-heap" for both task and
> > >>>>>>> framework, as long as they mean the same thing: native plus
> > direct, and
> > >>>>>>> fully counted into MaxDirectMemory. I would suggest to update the
> > config
> > >>>>>>> descriptions then to reflect that.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> > tonysong820@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Regarding the framework/task direct/native memory options, I tend
> > >>>>>>>> to think about it differently. I'm in favor of keeping the
> > >>>>>>>> "*.off-heap.size" config option keys.
> > >>>>>>>>
> > >>>>>>>>    - It's not necessary IMO to expose the different concepts of
> > >>>>>>>>    direct / native memory to the users.
> > >>>>>>>>    - I would avoid introducing more options for native memory if
> > >>>>>>>>    possible. Taking fine grained resource management and dynamic
> > slot
> > >>>>>>>>    allocation into consideration, that also means introduce more
> > fields into
> > >>>>>>>>    ResourceSpec / ResourceProfile.
> > >>>>>>>>    - My gut feeling is that having a relatively loose
> > >>>>>>>>    MaxDirectMemory should not be a big problem.
> > >>>>>>>>    - In most cases, the task / framework off-heap memory should
> be
> > >>>>>>>>       mainly (if not all) direct memory, so the difference
> > between derived
> > >>>>>>>>       MaxDirectMemory and the ideal direct memory limit should
> > not be too much.
> > >>>>>>>>       - We do not have a good way to know the exact size needed
> > >>>>>>>>       for JVM overhead / metaspace and framework / task off-heap
> > >>>>>>>>       memory, thus we have to conservatively reserve slightly more
> > >>>>>>>>       memory than what is actually needed. Such reserved but unused
> > >>>>>>>>       memory can cover the small MaxDirectMemory error.
> > >>>>>>>>       - MaxDirectMemory is not the only way to trigger full GC. We
> > >>>>>>>>       still have heap activities that can also trigger GC.
> > >>>>>>>>
> > >>>>>>>> Regarding the memory type config options, I've looked into the
> > >>>>>>>> latest ConfigOptions changes. I think it shouldn't be too
> > complicated to
> > >>>>>>>> change the config options to use memory type, and I can handle
> it
> > maybe
> > >>>>>>>> during your vacations.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Also agree with improving MemorySize logging and parsing. This
> > >>>>>>>> should not be a blocker that has to be done in 1.10. I would say
> > we finish
> > >>>>>>>> other works (testability, documentation and those discussed in
> > this thread)
> > >>>>>>>> first, and get to this only if we have time.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thank you~
> > >>>>>>>>
> > >>>>>>>> Xintong Song
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> > >>>>>>>> azagrebin.apache@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Stephan and Xintong,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> > >>>>>>>>>
> > >>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > replaced
> > >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> > different meaning in
> > >>>>>>>>>> standalone setups. The old option did not subtract metaspace
> > and other
> > >>>>>>>>>> overhead, while the new option does. That means that with the
> > default
> > >>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > (independent of
> > >>>>>>>>>> managed memory going off heap).
> > >>>>>>>>>>     I am wondering if we could interpret
> > >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line
> > with the old
> > >>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > >>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
> > >>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size",
> > we need to
> > >>>>>>>>>> add overhead and metaspace to reach the total process size,
> > rather than
> > >>>>>>>>>> cutting off memory. But if we want, we could even adjust for
> > that in the
> > >>>>>>>>>> active resource manager, getting full backwards compatibility
> > on that part.
> > >>>>>>>>>>     Curious to hear more thoughts there.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I believe you mean "taskmanager.heap.size".
> > >>>>>>>>>
> > >>>>>>>>> I think the problem here is that the
> > >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > standalone setups
> > >>>>>>>>> and active yarn / mesos setups, and such different calculation
> > logics and
> > >>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > Therefore, I'm
> > >>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size"
> > differently
> > >>>>>>>>> for standalone and active setups.
> > >>>>>>>>>
> > >>>>>>>>> I think what we really want is probably
> > >>>>>>>>> mapping "taskmanager.heap.size" to different new config options
> > in
> > >>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as
> > deprecated
> > >>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
> > >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if
> > explicitly
> > >>>>>>>>> configured) in startup scripts / active resource managers, and
> > set the
> > >>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts
> and
> > >>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > managers (if the
> > >>>>>>>>> new config options are not configured). We can provide util
> > methods in
> > >>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
> > >>>>>>>>> configuration logics at one place.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I agree that the problem is that the legacy option
> > >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> > standalone/container.
> > >>>>>>>>> We had it initially falling back to
> > 'taskmanager.memory.total-flink.size’
> > >>>>>>>>> but I changed that to align it with container cut-off. Now I
> see
> > it changes
> > >>>>>>>>> standalone setup then.
> > >>>>>>>>> +1 for supporting its backwards compatibility differently for
> > >>>>>>>>> standalone/container setups.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>> Which configuration option will be set in Flink's default
> > >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it
> > would
> > >>>>>> have to be the then deprecated taskmanager.heap.size config
> option.
> > If we
> > >>>>>> are ok with Yarn requesting slightly larger containers, then it
> > could also
> > >>>>>> be taskmanager.memory.total-flink.size.
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools
> as
> > a
> > >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > wondering if it is
> > >>>>>>>>>> the right approach here.
> > >>>>>>>>>>     For example: I started a relatively large JVM (large heap
> > >>>>>>>>>> size of 10 GB) as a test. With the current logic, the system
> > tries to
> > >>>>>>>>>> reserve an additional 6GB for managed memory which is more
> than
> > there is
> > >>>>>>>>>> memory left. When you see the error that no memory could be
> > allocated, you
> > >>>>>>>>>> need to understand the magic of how this is derived.
> > >>>>>>>>>>     I am trying to think about this from the perspective of
> > using
> > >>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
> > >>>>>>>>>>     When starting Flink out of a running process, we cannot
> > >>>>>>>>>> assume that we are the only users of that process and that we
> > can mold the
> > >>>>>>>>>> process to our demands. I think a fix value for managed memory
> > and network
> > >>>>>>>>>> memory would feel more natural in such a setup than a
> mechanism
> > that is
> > >>>>>>>>>> tailored towards exclusive use of the process.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> also +1 for that, if user has not specified any main options to
> > >>>>>>>>> derive memory. We should also log this fixing of managed /
> > shuffle memory.
> > >>>>>>>>> And just noticed, we could also sanity check framework and if
> > >>>>>>>>> explicitly configured task heap against available JVM heap, and
> > at least
> > >>>>>>>>> log inconsistencies.
> > >>>>>>>>>
> > >>>>>>>>>   - Some off-heap memory goes into direct memory, some does
> not.
> > >>>>>>>>>> This confused me a bit. For example
> > >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
> > >>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size"
> > is counted as
> > >>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> that.
> > There is no
> > >>>>>>>>>> one "off heap" memory type after all. Maybe use
> > >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > >>>>>>>>> accounted in the max direct memory size limit. The confusion
> > probably comes
> > >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > explicitly mentioned
> > >>>>>>>>> that in its description while
> > "taskmanager.memory.task.off-heap.size"
> > >>>>>>>>> didn't. That's actually because the framework off-heap memory
> is
> > introduced
> > >>>>>>>>> later in a separate commit. We should fix that.
> > >>>>>>>>>
> > >>>>>>>>> For framework / task off-heap memory, we do not differentiate
> > >>>>>>>>> direct / native memory usage. That means the configure value
> for
> > these two
> > >>>>>>>>> options could be a mixture of direct / native memory. Since we
> > do not know
> > >>>>>>>>> the portion of direct memory out of the configured value, we
> have
> > >>>>>>>>> to conservatively account it all into the max direct memory
> size
> > limit.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> > >>>>>>>>> calculation, it is fine. But why do we then set
> MaxDirectMemory?
> > It is a
> > >>>>>>>>> difficult parameter, and the main reason to set it was (if I
> > recall
> > >>>>>>>>> correctly) to trigger GC based on direct memory allocation (to
> > free heap
> > >>>>>>>>> structures that then in turn release direct memory). If the
> > limit is
> > >>>>>>>>> anyways too high (because we also count native memory in there)
> > such that
> > >>>>>>>>> we can exceed the total process (container) memory, why do we
> > set it then?*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I always also thought about it as providing more safety net for
> > >>>>>>>>> direct allocations but GC thing looks more important.
> > >>>>>>>>>
> > >>>>>>>>> +1 for fixing docs for 'taskmanager.memory.task.off-heap.size’
> > and
> > >>>>>>>>> renaming to ‘direct' as this is what really happens
> > >>>>>>>>> if we want to support direct limit more exact than now.
> > >>>>>>>>>
> > >>>>>>>>> I also think that it is hard to separate direct / native memory
> > >>>>>>>>> unless we introduce even more options.
> > >>>>>>>>> If user wants to keep the direct limit tight to a certain value
> > >>>>>>>>> but also use native memory outside of it,
> > >>>>>>>>> she would have to increase something else, like JVM overhead to
> > >>>>>>>>> account for it and there is no other better way.
> > >>>>>>>>> Having more options to account for the native memory outside of
> > >>>>>>>>> direct limit complicates things but can be introduced later if
> > needed.
> > >>>>>>>>>
> > >>>>>>>>>   - What do you think about renaming
> > >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> > "taskmanager.memory.flink.size"
> > >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> > >>>>>>>>>> "taskmanager.memory.process.size" (or
> > "taskmanager.memory.jvm.total"). I
> > >>>>>>>>>> think these keys may be a bit less clumsy (dropping the
> > "total-") without
> > >>>>>>>>>> loss of expressiveness.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> +1 on this.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> +1 as well. Also an option:
> > >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> > >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> > >>>>>>>>> although it can be also mentioned in docs that we mean JVM
> > process.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>> I'd be in favour of Stephan's proposal for the config keys as
> > shorter
> > >>>>>> is usually better and they are still descriptive enough. Between
> > >>>>>> "taskmanager.memory.process.size" and
> > "taskmanager.memory.jvm.total" I
> > >>>>>> would slightly favour the first variant.
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>>   - The network memory keys are now called
> > >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is
> > usually
> > >>>>>>>>>> understood as a redistribution (random, or maybe by hash of
> > key). As an
> > >>>>>>>>>> example, there are many discussions about "shuffle join versus
> > broadcast
> > >>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning".
> We
> > use that
> > >>>>>>>>>> memory however for all network operations, like forward pipes,
> > broadcasts,
> > >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name
> > "*.shuffle.*"
> > >>>>>>>>>> confusing, I am wondering if users would find that as well. So
> > throwing in
> > >>>>>>>>>> the suggestion to call the options
> > "taskmanager.memory.network.*".
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On
> > the
> > >>>>>>>>> other hand, one can also argue that this part of memory is used
> > by
> > >>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*"
> > points more
> > >>>>>>>>> directly to the shuffle service components.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
> > >>>>>>>>> incorrect, because it is doing not only shuffles as well. The
> > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is not
> too
> > critical.
> > >>>>>>>>> This isn't super high priority for me, but if we want to adjust
> > it, better
> > >>>>>>>>> earlier than later.*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> This is also a bit controversial topic for me. Indeed, we have
> > >>>>>>>>> always used ’network’ for this concept of task data shuffling
> > over the
> > >>>>>>>>> network and this can confuse existing users.
> > >>>>>>>>>
> > >>>>>>>>> On the other hand for the new users and in a long term,
> ’network’
> > >>>>>>>>> can delude into a conclusion that all network memory is managed
> > by this
> > >>>>>>>>> option.
> > >>>>>>>>> Also other types of shuffle might not directly deal with
> network
> > >>>>>>>>> at all.
> > >>>>>>>>>
> > >>>>>>>>> By calling it shuffle, we were somewhat biased by understanding
> > it
> > >>>>>>>>> in term of map/reduce. This is rather an inter-task data
> > exchange.
> > >>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
> > >>>>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>  From a user's perspective I believe "taskmanager.memory.network"
> > >>>>>> would be easier to understand as not everyone knows exactly what
> the
> > >>>>>> shuffle service is. I see the point that it would be a bit
> > imprecise as we
> > >>>>>> can have different shuffle implementations but I would go with the
> > ease of
> > >>>>>> use/understanding here. Moreover, I think that we won't have many
> > different
> > >>>>>> shuffle service implementations in the foreseeable future.
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>>   - The descriptions for the
> "taskmanager.memory.jvm-overhead.*"
> > >>>>>>>>>> keys say that it also accounts for I/O direct memory, but the
> > parameter is
> > >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> True. Since we already have framework off-heap memory accounted
> > >>>>>>>>> for ad hoc direct memory usages, accounting all of jvm-overhead
> > also into
> > >>>>>>>>> max direct memory limit seems not necessary. I would suggest to
> > remove "I/O
> > >>>>>>>>> direct memory" from the description, and explicitly mention
> that
> > this
> > >>>>>>>>> option does not account for direct memory and will not be
> > accounted into
> > >>>>>>>>> max direct memory limit. WDYT?
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> +1
> > >>>>>>>>>
> > >>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
> > >>>>>>>>>> configuration options. For example, directly use MemorySize
> > typed options.
> > >>>>>>>>>> That makes validation automatic and saves us from breaking the
> > options
> > >>>>>>>>>> later.
> > >>>>>>>>>
> > >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> +1
> > >>>>>>>>>
> > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I would appreciate it.
> > >>>>>>>>>
> > >>>>>>>>> Also small side note, we extensively use MemorySize in logs now
> > >>>>>>>>> but it might be not always readable as its string
> representation
> > is only in
> > >>>>>>>>> bytes atm
> > >>>>>>>>> and does not reduce it to kb/mb/etc in case of big bytes value.
> > We
> > >>>>>>>>> could have at least some .prettyPrint function to use in logs.
> > >>>>>>>>> And .fromMegabytes/etc factory methods would improve code
> > >>>>>>>>> readability instead of .parse(int + “m”).
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Andrey
> > >>>>>>>>>
> > >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org>
> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Xintong!
> > >>>>>>>>>
> > >>>>>>>>> Please find my answers inline:
> > >>>>>>>>>
> > >>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> > >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size" which has
> > a different
> > >>>>>>>>>>> meaning in standalone setups. The old option did not subtract
> > metaspace and
> > >>>>>>>>>>> other overhead, while the new option does. That means that
> > with the default
> > >>>>>>>>>>> config, standalone clusters get quite a bit less memory.
> > (independent of
> > >>>>>>>>>>> managed memory going off heap).
> > >>>>>>>>>>>     I am wondering if we could interpret
> > >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> > >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line
> > with the old
> > >>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> > >>>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
> > >>>>>>>>>>> increases, because from
> "taskmanager.memory.total-flink.size",
> > we need to
> > >>>>>>>>>>> add overhead and metaspace to reach the total process size,
> > rather than
> > >>>>>>>>>>> cutting off memory. But if we want, we could even adjust for
> > that in the
> > >>>>>>>>>>> active resource manager, getting full backwards compatibility
> > on that part.
> > >>>>>>>>>>>     Curious to hear more thoughts there.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> Yes*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> I think the problem here is that the
> > >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> > standalone setups
> > >>>>>>>>>> and active yarn / mesos setups, and such different calculation
> > logics and
> > >>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> > Therefore, I'm
> > >>>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size"
> > differently
> > >>>>>>>>>> for standalone and active setups.
> > >>>>>>>>>>
> > >>>>>>>>>> I think what we really want is probably
> > >>>>>>>>>> mapping "taskmanager.heap.size" to different new config
> options
> > in
> > >>>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as
> > deprecated
> > >>>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
> > >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it
> (if
> > explicitly
> > >>>>>>>>>> configured) in startup scripts / active resource managers, and
> > set the
> > >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts
> > and
> > >>>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> > managers (if the
> > >>>>>>>>>> new config options are not configured). We can provide util
> > methods in
> > >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all
> the
> > >>>>>>>>>> configuration logics at one place.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> > >>>>>>>>> description was not very clear), so +1 for that mechanism*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools
> as
> > a
> > >>>>>>>>>>> standalone setup. I get the idea behind that, but I am
> > wondering if it is
> > >>>>>>>>>>> the right approach here.
> > >>>>>>>>>>>     For example: I started a relatively large JVM (large heap
> > >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the system
> > tries to
> > >>>>>>>>>>> reserve an additional 6GB for managed memory which is more
> > than there is
> > >>>>>>>>>>> memory left. When you see the error that no memory could be
> > allocated, you
> > >>>>>>>>>>> need to understand the magic of how this is derived.
> > >>>>>>>>>>>     I am trying to think about this from the perspective of
> > >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is close
> to.
> > >>>>>>>>>>>     When starting Flink out of a running process, we cannot
> > >>>>>>>>>>> assume that we are the only users of that process and that we
> > can mold the
> > >>>>>>>>>>> process to our demands. I think a fix value for managed
> memory
> > and network
> > >>>>>>>>>>> memory would feel more natural in such a setup than a
> > mechanism that is
> > >>>>>>>>>>> tailored towards exclusive use of the process.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>   - Some off-heap memory goes into direct memory, some does
> not.
> > >>>>>>>>>>> This confused me a bit. For example
> > >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
> > >>>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size"
> > is counted as
> > >>>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> > that. There is no
> > >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> > >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> > >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> > >>>>>>>>>> accounted in the max direct memory size limit. The confusion
> > probably comes
> > >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> > explicitly mentioned
> > >>>>>>>>>> that in its description while
> > "taskmanager.memory.task.off-heap.size"
> > >>>>>>>>>> didn't. That's actually because the framework off-heap memory
> > is introduced
> > >>>>>>>>>> later in a separate commit. We should fix that.
> > >>>>>>>>>>
> > >>>>>>>>>> For framework / task off-heap memory, we do not differentiate
> > >>>>>>>>>> direct / native memory usage. That means the configure value
> > for these two
> > >>>>>>>>>> options could be a mixture of direct / native memory. Since we
> > do not know
> > >>>>>>>>>> the portion of direct memory out of the configured value, we
> > have
> > >>>>>>>>>> to conservatively account it all into the max direct memory
> > size limit.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> > >>>>>>>>> calculation, it is fine. But why do we then set
> MaxDirectMemory?
> > It is a
> > >>>>>>>>> difficult parameter, and the main reason to set it was (if I
> > recall
> > >>>>>>>>> correctly) to trigger GC based on direct memory allocation (to
> > free heap
> > >>>>>>>>> structures that then in turn release direct memory). If the
> > limit is
> > >>>>>>>>> anyways too high (because we also count native memory in there)
> > such that
> > >>>>>>>>> we can exceed the total process (container) memory, why do we
> > set it then?*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>   - The network memory keys are now called
> > >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is
> > usually
> > >>>>>>>>>>> understood as a redistribution (random, or maybe by hash of
> > key). As an
> > >>>>>>>>>>> example, there are many discussions about "shuffle join
> versus
> > broadcast
> > >>>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning".
> > We use that
> > >>>>>>>>>>> memory however for all network operations, like forward
> pipes,
> > broadcasts,
> > >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name
> > "*.shuffle.*"
> > >>>>>>>>>>> confusing, I am wondering if users would find that as well.
> So
> > throwing in
> > >>>>>>>>>>> the suggestion to call the options
> > "taskmanager.memory.network.*".
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On
> > >>>>>>>>>> the other hand, one can also argue that this part of memory is
> > used by
> > >>>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*"
> > points more
> > >>>>>>>>>> directly to the shuffle service components.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
> > >>>>>>>>> incorrect, because it does more than just shuffles. The
> > >>>>>>>>> ShuffleEnvironment is also more internal, so the name is not
> too
> > critical.
> > >>>>>>>>> This isn't super high priority for me, but if we want to adjust
> > it, better
> > >>>>>>>>> earlier than later.*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>    - The descriptions for the
> "taskmanager.memory.jvm-overhead.*"
> > >>>>>>>>> keys say that it also accounts for I/O direct memory, but the
> > parameter is
> > >>>>>>>>> not counted into the MaxDirectMemory parameter.
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> True. Since we already have framework off-heap memory
> accounted
> > >>>>>>>>>> for ad hoc direct memory usages, accounting all of
> jvm-overhead
> > also into
> > >>>>>>>>>> max direct memory limit seems not necessary. I would suggest
> to
> > remove "I/O
> > >>>>>>>>>> direct memory" from the description, and explicitly mention
> > that this
> > >>>>>>>>>> option does not account for direct memory and will not be
> > accounted into
> > >>>>>>>>>> max direct memory limit. WDYT?
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> Sounds good. *
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
> > >>>>>>>>>>> configuration options. For example, directly use MemorySize
> > typed options.
> > >>>>>>>>>>> That makes validation automatic and saves us from breaking
> the
> > options
> > >>>>>>>>>>> later.
> > >>>>>>>>>>
> > >>>>>>>>>> +1. Wasn't aware of the new memory type config options.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Stephan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> >
>


-- 
Best, Jingsong Lee

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Stephan Ewen <se...@apache.org>.
"taskmanager.memory.flink.size" in the default config has a few advantages.

 - The value in the default config needs to be suitable for "trying out"
Flink, for a good "getting started" experience.

  - For trying out Flink, standalone is the most common entry point (except
running in IDE).

  - In standalone setup, from total process memory, we subtract quite a bit
before we arrive at the usable memory. We also subtract managed memory from
the heap now. I fear we might end up at a heap that becomes so small that
it makes for a bad "getting started" experience.

  - I don't think it is realistic that users set the process memory to the full
machine memory. In most cases there is a lot else running on the machine as well.

  - In the JVM world, users are used to configuring the heap size and know
that there is additional memory overhead. The
"taskmanager.memory.flink.size" option fits well with that mindset.

  - Once you start to think about the total process memory of Yarn
containers, you are already past the getting-started phase and in the
tuning phase.
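
To make the trade-off tangible, here is a small back-of-the-envelope sketch.
All sizes and fractions below are made-up example values for illustration,
not the actual defaults:

    // Back-of-the-envelope sketch with assumed example values, not the real defaults.
    public class MemoryBudgetExample {
        public static void main(String[] args) {
            long processSizeMb = 1728;   // assumed total process size
            long metaspaceMb   = 96;     // assumed JVM metaspace
            long overheadMb    = 172;    // assumed JVM overhead (~10% of process size)
            long flinkSizeMb   = processSizeMb - metaspaceMb - overheadMb;   // 1460

            double managedFraction = 0.4;   // assumed managed memory fraction
            double networkFraction = 0.1;   // assumed network memory fraction
            long managedMb = (long) (flinkSizeMb * managedFraction);         // 584
            long networkMb = (long) (flinkSizeMb * networkFraction);         // 146
            long heapAndOffHeapMb = flinkSizeMb - managedMb - networkMb;     // 730

            System.out.println("flink size:           " + flinkSizeMb + " mb");
            System.out.println("managed memory:       " + managedMb + " mb");
            System.out.println("network memory:       " + networkMb + " mb");
            System.out.println("heap + off-heap left: " + heapAndOffHeapMb + " mb");
        }
    }

With assumptions like these, only a fraction of the configured process size is
left for the JVM heap, which is why the entry in the default config matters so
much for the "getting started" experience.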


On Tue, Dec 24, 2019, 10:25 Andrey Zagrebin <az...@apache.org> wrote:

> Thanks for the summary, Xintong! It makes sense to me.
>
> How about putting "taskmanager.memory.flink.size" in the configuration?
> > Then new downloaded Flink behaves similar to the previous Standalone
> setups.
> > If someone upgrades the binaries, but re-uses their old configuration,
> > then they get the compatibility as discussed previously.
> > We used that approach previously with the fine-grained failover recovery.
>
>
>
> > I'm trying to understand why "taskmanager.memory.flink.size" rather than
> > "taskmanager.memory.process.size" in the default flink-conf.yaml. Or put
> it
> > another way, why do we want the new downloaded Flink behaves similar to
> > previous Standalone setups rather than previous active mode setups? Is
> > there any special reason that I overlooked, which makes backwards
> > compatibility for standalone setups more important than for active
> setups?
> > IMO, "taskmanager.memory.process.size" is easier for the newcomers. For
> > standalone setups, users can simply configure it to their machines'
> > available memory size, without needing to worry about leaving enough
> space
> > for JVM overhead / metaspace. For containerized setups, it's more
> > predictable how much memory the containers / Flink could use, which is
> more
> > friendly for users to manage their resource quota.
> > Therefore, unless there is anything I overlooked, I'm in favor of putting
> > "taskmanager.memory.process.size" rather than
> > "taskmanager.memory.flink.size" in the default configuration.
>
>
> I agree that having "taskmanager.memory.process.size" in the default config
> should be easier to understand and tweak for new users, because it is
> just what they are ready to spend for the Flink application.
> The problem is that when users upgrade Flink and use the new default
> configuration, the behaviour can change:
> - either, if we put the process memory in the config, the Flink memory
> shrinks and the new default option contradicts their previous understanding,
> - or, if we put the Flink memory in the config, a larger container is requested.
> The shrinking of memory sounds more implicit and worse. The increased
> container request will just fail in the worst case and the memory setup can
> be revisited.
> We could increase the default "taskmanager.memory.process.size" to better
> align it with the previous default setup for standalone
> but this would not remove the possible confusion for the old users.
> On the other hand, the option is new and we can add a comment on how to migrate
> from the old one.
>
> All in all, now I also tend to have "taskmanager.memory.process.size" in
> the default config unless there are more reasons for having less confusion
> for the old standalone users.
>
> Best,
> Andrey
>
> On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <to...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > It seems we have already reached consensus on most of the issues. Thanks
> > everyone for the good discussion.
> >
> > While there are still open questions under discussion, I'd like to
> > summarize the discussion so far, and list the action items that we
> already
> > reached consensus on. In this way, we can already start working on these
> items
> > while discussing the remaining open questions. Please correct me if the
> > summary does not reflect your argument.
> >
> > *Action Items that we have already reached consensus on:*
> >
> >    - Map "taskmanager.heap.size" to "taskmanager.memory.flink.size" for
> >    standalone setups, and "taskmanager.memory.process.size" for active
> setups.
> >    (should be mentioned in release notes)
> >    - If not explicitly configured, MiniCluster should have fixed default
> >    network and managed memory sizes. (should be mentioned in docs &
> release
> >    notes)
> >    - Change the memory config options' type from String to MemorySize
> >    - Change the config option key "taskmanager.memory.total-flink.size"
> >    to "taskmanager.memory.flink.size"
> >    - Change the config option key "taskmanager.memory.total-process.size"
> >    to "taskmanager.memory.process.size"
> >    - Update descriptions for "taskmanager.memory.framework.off-heap.size"
> >    and "taskmanager.memory.task.off-heap.size" to explicitly state that
> >       - Both direct and native memory are accounted
> >       - Will be fully counted into MaxDirectMemorySize
> >    - Update descriptions for
> >    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove "I/O
> direct
> >    memory" and explicitly state that it's not counted into
> MaxDirectMemorySize
> >    - Print MemorySize with proper unit. (non-blocker for 1.10)
> >
> >
> > *Questions that are still open:*
> >
> >    - Which config option do we put in the default flink-conf.yaml?
> >       - "taskmanager.memory.flink.size"
> >       - "taskmanager.memory.process.size"
> >       - Deprecated "taskmanager.heap.size"
> >    - What are the proper keys for network / shuffle memory
> >       - "taskmanager.memory.shuffle.*"
> >       - "taskmanager.memory.network.*"
> >       - "taskmanager.network.memory.*"
> >
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <to...@gmail.com>
> > wrote:
> >
> >> How about putting "taskmanager.memory.flink.size" in the configuration?
> >>> Then new downloaded Flink behaves similar to the previous Standalone
> setups.
> >>> If someone upgrades the binaries, but re-uses their old configuration,
> >>> then they get the compatibility as discussed previously.
> >>> We used that approach previously with the fine-grained failover
> recovery.
> >>
> >>
> >> I'm trying to understand why "taskmanager.memory.flink.size" rather than
> >> "taskmanager.memory.process.size" in the default flink-conf.yaml. Or
> put it
> >> another way, why do we want the new downloaded Flink behaves similar to
> >> previous Standalone setups rather than previous active mode setups? Is
> >> there any special reason that I overlooked, which makes backwards
> >> compatibility for standalone setups more important than for active
> setups?
> >>
> >> IMO, "taskmanager.memory.process.size" is easier for the newcomers. For
> >> standalone setups, users can simply configure it to their machines'
> >> available memory size, without needing to worry about leaving enough
> space
> >> for JVM overhead / metaspace. For containerized setups, it's more
> >> predictable how much memory the containers / Flink could use, which is
> more
> >> friendly for users to manage their resource quota.
> >>
> >> Therefore, unless there is anything I overlooked, I'm in favor of
> putting
> >> "taskmanager.memory.process.size" rather than
> >> "taskmanager.memory.flink.size" in the default configuration.
> >>
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >>
> >>
> >> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <az...@apache.org>
> >> wrote:
> >>
> >>> How about putting "taskmanager.memory.flink.size" in the configuration?
> >>>> Then new downloaded Flink behaves similar to the previous Standalone
> setups.
> >>>> If someone upgrades the binaries, but re-uses their old configuration,
> >>>> then they get the compatibility as discussed previously.
> >>>> We used that approach previously with the fine-grained failover
> >>>> recovery.
> >>>
> >>> +1, this sounds like a good compromise.
> >>>
> >>> +1 to not have more options for off-heap memory, that would get
> >>>> confusing fast. We can keep the name "off-heap" for both task and
> >>>> framework, as long as they mean the same thing: native plus direct,
> and
> >>>> fully counted into MaxDirectMemory. I would suggest to update the
> config
> >>>> descriptions then to reflect that.
> >>>>
> >>> True, this should be explained in the config descriptions.
> >>>
> >>> looks good to me
> >>>
> >>> From a user's perspective I believe "taskmanager.memory.network" would
> >>>> be easier to understand as not everyone knows exactly what the shuffle
> >>>> service is. I see the point that it would be a bit imprecise as we
> can have
> >>>> different shuffle implementations but I would go with the ease of
> >>>> use/understanding here. Moreover, I think that we won't have many
> different
> >>>> shuffle service implementations in the foreseeable future.
> >>>
> >>> I agree that if we cannot find any other convincing names for the
> >>> options, we should keep what we already have and change it if the
> >>> alternative is convincing enough.
> >>> The question is also whether we still want to rename it because it was
> >>> "taskmanager.network.*memory*.*" in 1.9 but
> "taskmanager.*memory*.network.*"
> >>> is more aligned with other new memory option names.
> >>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
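
If we do rename, the switch can be made backwards compatible by declaring the
old keys as deprecated keys of the new option. A sketch (the default value and
description are placeholders, and I am assuming the new typed builder is used
roughly like this):

    // Sketch: renamed option that still accepts the 1.9 and current FLIP-49 keys.
    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class NetworkMemoryOptions {

        public static final ConfigOption<MemorySize> NETWORK_MEMORY_MIN =
                ConfigOptions.key("taskmanager.memory.network.min")
                        .memoryType()
                        .defaultValue(MemorySize.parse("64m"))
                        .withDeprecatedKeys(
                                "taskmanager.memory.shuffle.min",
                                "taskmanager.network.memory.min")
                        .withDescription(
                                "Minimum memory reserved for inter-task data exchange"
                                        + " (forward pipes, broadcasts, re-partitioning, ...).");
    }

Old configurations would then keep working while the documentation only needs
to show the new key.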
> >>>
> >>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org> wrote:
> >>>
> >>>> How about putting "taskmanager.memory.flink.size" in the
> configuration?
> >>>> Then new downloaded Flink behaves similar to the previous Standalone
> setups.
> >>>>
> >>>> If someone upgrades the binaries, but re-uses their old configuration,
> >>>> then they get the compatibility as discussed previously.
> >>>> We used that approach previously with the fine-grained failover
> >>>> recovery.
> >>>>
> >>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <to...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> +1 to not have more options for off-heap memory, that would get
> >>>>>> confusing fast. We can keep the name "off-heap" for both task and
> >>>>>> framework, as long as they mean the same thing: native plus direct,
> and
> >>>>>> fully counted into MaxDirectMemory. I would suggest to update the
> config
> >>>>>> descriptions then to reflect that.
> >>>>>>
> >>>>> True, this should be explained in the config descriptions.
> >>>>>
> >>>>> Which configuration option will be set in Flink's default
> >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it
> would
> >>>>>> have to be the then deprecated taskmanager.heap.size config option.
> If we
> >>>>>> are ok with Yarn requesting slightly larger containers, then it
> could also
> >>>>>> be taskmanager.memory.total-flink.size.
> >>>>>>
> >>>>> Good point. Currently, we have
> >>>>> "taskmanager.memory.total-process.size". In order to preserve the
> previous
> >>>>> behavior, we need to have "taskmanager.heap.size" so it can be
> mapped to
> >>>>> different new options in standalone / active setups.
> >>>>> I think we can have the deprecated "taskmanager.heap.size" in the
> >>>>> default flink-conf.yaml, and also have the
> >>>>> new "taskmanager.memory.total-process.size" in a commented line. We
> can
> >>>>> explain how the deprecated config option behaves differently in the
> >>>>> comments, so that user can switch to the new config options if they
> want to.
> >>>>>
> >>>>> Thank you~
> >>>>>
> >>>>> Xintong Song
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <tr...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks for the feedback and good discussion everyone. I left some
> >>>>>> comments inline.
> >>>>>>
> >>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> +1 to not have more options for off-heap memory, that would get
> >>>>>>> confusing fast. We can keep the name "off-heap" for both task and
> >>>>>>> framework, as long as they mean the same thing: native plus
> direct, and
> >>>>>>> fully counted into MaxDirectMemory. I would suggest to update the
> config
> >>>>>>> descriptions then to reflect that.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <
> tonysong820@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Regarding the framework/task direct/native memory options, I tend
> >>>>>>>> to think about it differently. I'm in favor of keeping the
> "*.off-heap.size" for the
> >>>>>>>> config option keys.
> >>>>>>>>
> >>>>>>>>    - It's not necessary IMO to expose the different concepts of
> >>>>>>>>    direct / native memory to the users.
> >>>>>>>>    - I would avoid introducing more options for native memory if
> >>>>>>>>    possible. Taking fine grained resource management and dynamic
> slot
> >>>>>>>>    allocation into consideration, that also means introducing more
> fields into
> >>>>>>>>    ResourceSpec / ResourceProfile.
> >>>>>>>>    - My gut feeling is that having a relatively loose
> >>>>>>>>    MaxDirectMemory should not be a big problem.
> >>>>>>>>    - In most cases, the task / framework off-heap memory should be
> >>>>>>>>       mainly (if not all) direct memory, so the difference
> between derived
> >>>>>>>>       MaxDirectMemory and the ideal direct memory limit should
> not be too much.
> >>>>>>>>       - We do not have a good way to know the exact size needed
> >>>>>>>>       for jvm overhead / metaspace and framework / task off-heap
> memory, thus
> >>>>>>>>       having to conservatively reserve slightly more memory than
> what is actually
> >>>>>>>>       needed. Such reserved but not used memory can cover for the
> small
> >>>>>>>>       MaxDirectMemory error.
> >>>>>>>>       -
> >>>>>>>>       - MaxDirectMemory is not the only way to trigger full gc. We
> >>>>>>>>       still have heap activity that can also trigger the gc.
> >>>>>>>>
> >>>>>>>> Regarding the memory type config options, I've looked into the
> >>>>>>>> latest ConfigOptions changes. I think it shouldn't be too
> complicated to
> >>>>>>>> change the config options to use memory type, and I can handle it
> maybe
> >>>>>>>> during your vacations.
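
As a rough illustration of what the change amounts to (the key and default
below are placeholders, and the exact builder calls are my understanding of
the new typed ConfigOptions API):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.MemorySize;

    public class TypedMemoryOptionExample {

        // Before: a String-typed option that every caller parses and validates itself.
        static final ConfigOption<String> TASK_HEAP_AS_STRING =
                ConfigOptions.key("taskmanager.memory.task.heap.size")
                        .defaultValue("128m");

        // After: a MemorySize-typed option; parsing and validation happen in one place.
        static final ConfigOption<MemorySize> TASK_HEAP_TYPED =
                ConfigOptions.key("taskmanager.memory.task.heap.size")
                        .memoryType()
                        .defaultValue(MemorySize.parse("128m"));

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setString("taskmanager.memory.task.heap.size", "512m");
            MemorySize taskHeap = conf.get(TASK_HEAP_TYPED);   // already validated
            System.out.println(taskHeap.getMebiBytes() + " mb");
        }
    }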
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Also agree with improving MemorySize logging and parsing. This
> >>>>>>>> should not be a blocker that has to be done in 1.10. I would say
> we finish
> >>>>>>>> other works (testability, documentation and those discussed in
> this thread)
> >>>>>>>> first, and get to this only if we have time.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thank you~
> >>>>>>>>
> >>>>>>>> Xintong Song
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> >>>>>>>> azagrebin.apache@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Stephan and Xintong,
> >>>>>>>>>
> >>>>>>>>> Thanks for the further FLIP-49 feedbacks.
> >>>>>>>>>
> >>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> replaced
> >>>>>>>>>> by "taskmanager.memory.total-process.size" which has a
> different meaning in
> >>>>>>>>>> standalone setups. The old option did not subtract metaspace
> and other
> >>>>>>>>>> overhead, while the new option does. That means that with the
> default
> >>>>>>>>>> config, standalone clusters get quite a bit less memory.
> (independent of
> >>>>>>>>>> managed memory going off heap).
> >>>>>>>>>>     I am wondering if we could interpret
> >>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> >>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line
> with the old
> >>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> >>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
> >>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size",
> we need to
> >>>>>>>>>> add overhead and metaspace to reach the total process size,
> rather than
> >>>>>>>>>> cutting off memory. But if we want, we could even adjust for
> that in the
> >>>>>>>>>> active resource manager, getting full backwards compatibility
> on that part.
> >>>>>>>>>>     Curious to hear more thoughts there.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I believe you mean "taskmanager.heap.size".
> >>>>>>>>>
> >>>>>>>>> I think the problem here is that the
> >>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> standalone setups
> >>>>>>>>> and active yarn / mesos setups, and such different calculation
> logics and
> >>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> Therefore, I'm
> >>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size"
> differently
> >>>>>>>>> for standalone and active setups.
> >>>>>>>>>
> >>>>>>>>> I think what we really want is probably
> >>>>>>>>> mapping "taskmanager.heap.size" to different new config options
> in
> >>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as
> deprecated
> >>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
> >>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if
> explicitly
> >>>>>>>>> configured) in startup scripts / active resource managers, and
> set the
> >>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
> >>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> managers (if the
> >>>>>>>>> new config options are not configured). We can provide util
> methods in
> >>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
> >>>>>>>>> configuration logics at one place.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I agree that the problem is that the legacy option
> >>>>>>>>> ‘taskmanager.heap.size’ has different semantics for
> standalone/container.
> >>>>>>>>> We had it initially falling back to
> "taskmanager.memory.total-flink.size"
> >>>>>>>>> but I changed that to align it with the container cut-off. Now I see
> that it changes
> >>>>>>>>> the standalone setup then.
> >>>>>>>>> +1 for supporting its backwards compatibility differently for
> >>>>>>>>> standalone/container setups.
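
To make the proposed mapping concrete, here is a rough sketch of such a
conversion helper. The class and method names and the precedence rules are
illustrative assumptions, not the actual TaskExecutorResourceUtils code:

    // Sketch only: map the legacy key onto the new option that preserves the old behavior.
    import org.apache.flink.configuration.Configuration;

    public final class LegacyMemoryOptionMapper {

        private static final String LEGACY_KEY = "taskmanager.heap.size";
        private static final String FLINK_SIZE_KEY = "taskmanager.memory.total-flink.size";
        private static final String PROCESS_SIZE_KEY = "taskmanager.memory.total-process.size";

        /**
         * If only the legacy key is configured, copy its value to the new key that
         * keeps the old semantics: Flink size for standalone setups, process size
         * for active (Yarn / Mesos) setups.
         */
        public static void mapLegacyKey(Configuration conf, boolean isActiveDeployment) {
            if (!conf.containsKey(LEGACY_KEY)
                    || conf.containsKey(FLINK_SIZE_KEY)
                    || conf.containsKey(PROCESS_SIZE_KEY)) {
                return; // nothing configured, or the new options take precedence
            }
            String legacyValue = conf.getString(LEGACY_KEY, null);
            if (isActiveDeployment) {
                conf.setString(PROCESS_SIZE_KEY, legacyValue);
            } else {
                conf.setString(FLINK_SIZE_KEY, legacyValue);
            }
        }
    }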
> >>>>>>>>>
> >>>>>>>>
> >>>>>> Which configuration option will be set in Flink's default
> >>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it
> would
> >>>>>> have to be the then deprecated taskmanager.heap.size config option.
> If we
> >>>>>> are ok with Yarn requesting slightly larger containers, then it
> could also
> >>>>>> be taskmanager.memory.total-flink.size.
> >>>>>>
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as
> a
> >>>>>>>>>> standalone setup. I get the idea behind that, but I am
> wondering if it is
> >>>>>>>>>> the right approach here.
> >>>>>>>>>>     For example: I started a relatively large JVM (large heap
> >>>>>>>>>> size of 10 GB) as a test. With the current logic, the system
> tries to
> >>>>>>>>>> reserve an additional 6GB for managed memory which is more than
> there is
> >>>>>>>>>> memory left. When you see the error that no memory could be
> allocated, you
> >>>>>>>>>> need to understand the magic of how this is derived.
> >>>>>>>>>>     I am trying to think about this from the perspective of
> using
> >>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
> >>>>>>>>>>     When starting Flink out of a running process, we cannot
> >>>>>>>>>> assume that we are the only users of that process and that we
> can mold the
> >>>>>>>>>> process to our demands. I think a fix value for managed memory
> and network
> >>>>>>>>>> memory would feel more natural in such a setup than a mechanism
> that is
> >>>>>>>>>> tailored towards exclusive use of the process.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> also +1 for that, if the user has not specified any main options to
> >>>>>>>>> derive memory from. We should also log this fixing of managed /
> shuffle memory.
> >>>>>>>>> And just noticed: we could also sanity-check the framework heap and, if
> >>>>>>>>> explicitly configured, the task heap against the available JVM heap, and
> at least
> >>>>>>>>> log inconsistencies.
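
For illustration, a sketch of what such fixed defaults plus a sanity check
could look like. The constants, method names and log wording are assumptions
for discussion, not the actual MiniCluster code:

    // Sketch only: fixed defaults for embedded use plus a JVM-heap sanity check.
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public final class MiniClusterMemoryDefaults {

        private static final Logger LOG =
                LoggerFactory.getLogger(MiniClusterMemoryDefaults.class);

        // Assumed fixed defaults when the user has not configured anything explicitly.
        static final long DEFAULT_MANAGED_MEMORY_BYTES = 128L << 20; // 128 mb
        static final long DEFAULT_NETWORK_MEMORY_BYTES = 64L << 20;  // 64 mb

        /** Logs the fixed sizes and warns if a configured task heap cannot fit the JVM heap. */
        static void applyAndCheck(Long configuredTaskHeapBytes) {
            LOG.info("MiniCluster: using fixed managed memory of {} bytes and network memory of {} bytes.",
                    DEFAULT_MANAGED_MEMORY_BYTES, DEFAULT_NETWORK_MEMORY_BYTES);

            long maxJvmHeap = Runtime.getRuntime().maxMemory();
            if (configuredTaskHeapBytes != null && configuredTaskHeapBytes > maxJvmHeap) {
                LOG.warn("Configured task heap ({} bytes) exceeds the available JVM heap ({} bytes).",
                        configuredTaskHeapBytes, maxJvmHeap);
            }
        }
    }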
> >>>>>>>>>
> >>>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
> >>>>>>>>>> This confused me a bit. For example
> >>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
> >>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size"
> is counted as
> >>>>>>>>>> native memory. Maybe we should rename the keys to reflect that.
> There is no
> >>>>>>>>>> one "off heap" memory type after all. Maybe use
> >>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> >>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> >>>>>>>>> accounted in the max direct memory size limit. The confusion
> probably comes
> >>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> explicitly mentioned
> >>>>>>>>> that in its description while
> "taskmanager.memory.task.off-heap.size"
> >>>>>>>>> didn't. That's actually because the framework off-heap memory is
> introduced
> >>>>>>>>> later in a separate commit. We should fix that.
> >>>>>>>>>
> >>>>>>>>> For framework / task off-heap memory, we do not differentiate
> >>>>>>>>> direct / native memory usage. That means the configure value for
> these two
> >>>>>>>>> options could be a mixture of direct / native memory. Since we
> do not know
> >>>>>>>>> the portion of direct memory out of the configured value, we have
> >>>>>>>>> to conservatively account it all into the max direct memory size
> limit.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> >>>>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory?
> It is a
> >>>>>>>>> difficult parameter, and the main reason to set it was (if I
> recall
> >>>>>>>>> correctly) to trigger GC based on direct memory allocation (to
> free heap
> >>>>>>>>> structures that then in turn release direct memory). If the
> limit is
> >>>>>>>>> anyways too high (because we also count native memory in there)
> such that
> >>>>>>>>> we can exceed the total process (container) memory, why do we
> set it then?*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I always also thought about it as providing more of a safety net for
> >>>>>>>>> direct allocations, but the GC aspect looks more important.
> >>>>>>>>>
> >>>>>>>>> +1 for fixing the docs for "taskmanager.memory.task.off-heap.size" and
> >>>>>>>>> renaming to "direct", as this is what really happens,
> >>>>>>>>> if we want to enforce the direct limit more exactly than now.
> >>>>>>>>>
> >>>>>>>>> I also think that it is hard to separate direct / native memory
> >>>>>>>>> unless we introduce even more options.
> >>>>>>>>> If a user wants to keep the direct limit tight to a certain value
> >>>>>>>>> but also use native memory outside of it,
> >>>>>>>>> she would have to increase something else, like the JVM overhead, to
> >>>>>>>>> account for it, and there is no better way.
> >>>>>>>>> Having more options to account for the native memory outside of
> >>>>>>>>> the direct limit complicates things but can be introduced later if
> >>>>>>>>> needed.
> needed.
> >>>>>>>>>
> >>>>>>>>>   - What do you think about renaming
> >>>>>>>>>> "taskmanager.memory.total-flink.size" to
> "taskmanager.memory.flink.size"
> >>>>>>>>>> and "taskmanager.memory.total-process.size" to
> >>>>>>>>>> "taskmanager.memory.process.size" (or
> "taskmanager.memory.jvm.total"). I
> >>>>>>>>>> think these keys may be a bit less clumsy (dropping the
> "total-") without
> >>>>>>>>>> loss of expressiveness.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> +1 on this.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> +1 as well. Also an option:
> >>>>>>>>> 'taskmanager.memory.total-process.size’ ->
> >>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
> >>>>>>>>> although it can be also mentioned in docs that we mean JVM
> process.
> >>>>>>>>>
> >>>>>>>>
> >>>>>> I'd be in favour of Stephan's proposal for the config keys as
> shorter
> >>>>>> is usually better and they are still descriptive enough. Between
> >>>>>> "taskmanager.memory.process.size" and
> "taskmanager.memory.jvm.total" I
> >>>>>> would slightly favour the first variant.
> >>>>>>
> >>>>>>>
> >>>>>>>>>   - The network memory keys are now called
> >>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is
> usually
> >>>>>>>>>> understood as a redistribution (random, or maybe by hash of
> key). As an
> >>>>>>>>>> example, there are many discussions about "shuffle join versus
> broadcast
> >>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We
> use that
> >>>>>>>>>> memory however for all network operations, like forward pipes,
> broadcasts,
> >>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name
> "*.shuffle.*"
> >>>>>>>>>> confusing, I am wondering if users would find that as well. So
> throwing in
> >>>>>>>>>> the suggestion to call the options
> "taskmanager.memory.network.*".
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On
> the
> >>>>>>>>> other hand, one can also argue that this part of memory is used
> by
> >>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*"
> points more
> >>>>>>>>> directly to the shuffle service components.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
> >>>>>>>>> incorrect, because it does more than just shuffles. The
> >>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too
> critical.
> >>>>>>>>> This isn't super high priority for me, but if we want to adjust
> it, better
> >>>>>>>>> earlier than later.*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This is also a bit of a controversial topic for me. Indeed, we have
> >>>>>>>>> always used 'network' for this concept of task data shuffling
> over the
> >>>>>>>>> network and this can confuse existing users.
> >>>>>>>>>
> >>>>>>>>> On the other hand, for new users and in the long term, 'network'
> >>>>>>>>> can mislead them into concluding that all network memory is managed
> by this
> >>>>>>>>> option.
> >>>>>>>>> Also other types of shuffle might not directly deal with network
> >>>>>>>>> at all.
> >>>>>>>>>
> >>>>>>>>> By calling it shuffle, we were somewhat biased by understanding
> it
> >>>>>>>>> in terms of map/reduce. This is rather an inter-task data
> exchange.
> >>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
> >>>>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>  From a user's perspective I believe "taskmanager.memory.network"
> >>>>>> would be easier to understand as not everyone knows exactly what the
> >>>>>> shuffle service is. I see the point that it would be a bit
> imprecise as we
> >>>>>> can have different shuffle implementations but I would go with the
> ease of
> >>>>>> use/understanding here. Moreover, I think that we won't have many
> different
> >>>>>> shuffle service implementations in the foreseeable future.
> >>>>>>
> >>>>>>>
> >>>>>>>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*"
> >>>>>>>>>> keys say that it also accounts for I/O direct memory, but the
> parameter is
> >>>>>>>>>> not counted into the MaxDirectMemory parameter.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> True. Since we already have framework off-heap memory accounted
> >>>>>>>>> for ad hoc direct memory usages, accounting all of jvm-overhead
> also into
> >>>>>>>>> max direct memory limit seems not necessary. I would suggest to
> remove "I/O
> >>>>>>>>> direct memory" from the description, and explicitly mention that
> this
> >>>>>>>>> option does not account for direct memory and will not be
> accounted into
> >>>>>>>>> max direct memory limit. WDYT?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
> >>>>>>>>>> configuration options. For example, directly use MemorySize
> typed options.
> >>>>>>>>>> That makes validation automatic and saves us from breaking the
> options
> >>>>>>>>>> later.
> >>>>>>>>>
> >>>>>>>>> +1. Wasn't aware of the new memory type config options.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I would appreciate it.
> >>>>>>>>>
> >>>>>>>>> Also small side note, we extensively use MemorySize in logs now
> >>>>>>>>> but it might not always be readable, as its string representation
> is only in
> >>>>>>>>> bytes atm
> >>>>>>>>> and does not reduce it to kb/mb/etc for large byte values.
> We
> >>>>>>>>> could have at least some .prettyPrint function to use in logs.
> >>>>>>>>> And .fromMegabytes/etc factory methods would improve code
> >>>>>>>>> readability instead of .parse(int + “m”).
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Andrey
> >>>>>>>>>
> >>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Xintong!
> >>>>>>>>>
> >>>>>>>>> Please find my answers inline:
> >>>>>>>>>
> >>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
> >>>>>>>>>>> replaced by "taskmanager.memory.total-process.size" which has
> a different
> >>>>>>>>>>> meaning in standalone setups. The old option did not subtract
> metaspace and
> >>>>>>>>>>> other overhead, while the new option does. That means that
> with the default
> >>>>>>>>>>> config, standalone clusters get quite a bit less memory.
> (independent of
> >>>>>>>>>>> managed memory going off heap).
> >>>>>>>>>>>     I am wondering if we could interpret
> >>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
> >>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line
> with the old
> >>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
> >>>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
> >>>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size",
> we need to
> >>>>>>>>>>> add overhead and metaspace to reach the total process size,
> rather than
> >>>>>>>>>>> cutting off memory. But if we want, we could even adjust for
> that in the
> >>>>>>>>>>> active resource manager, getting full backwards compatibility
> on that part.
> >>>>>>>>>>>     Curious to hear more thoughts there.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I believe you mean "taskmanager.heap.size".
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> Yes*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I think the problem here is that the
> >>>>>>>>>> legacy "taskmanager.heap.size" was used differently in
> standalone setups
> >>>>>>>>>> and active yarn / mesos setups, and such different calculation
> logics and
> >>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49.
> Therefore, I'm
> >>>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size"
> differently
> >>>>>>>>>> for standalone and active setups.
> >>>>>>>>>>
> >>>>>>>>>> I think what we really want is probably
> >>>>>>>>>> mapping "taskmanager.heap.size" to different new config options
> in
> >>>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as
> deprecated
> >>>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
> >>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if
> explicitly
> >>>>>>>>>> configured) in startup scripts / active resource managers, and
> set the
> >>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts
> and
> >>>>>>>>>> "taskmanager.memory.total-process.size" in active resource
> managers (if the
> >>>>>>>>>> new config options are not configured). We can provide util
> methods in
> >>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
> >>>>>>>>>> configuration logics at one place.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> This is pretty much what I meant as well (maybe my
> >>>>>>>>> description was not very clear), so +1 for that mechanism*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as
> a
> >>>>>>>>>>> standalone setup. I get the idea behind that, but I am
> wondering if it is
> >>>>>>>>>>> the right approach here.
> >>>>>>>>>>>     For example: I started a relatively large JVM (large heap
> >>>>>>>>>>> size of 10 GB) as a test. With the current logic, the system
> tries to
> >>>>>>>>>>> reserve an additional 6GB for managed memory which is more
> than there is
> >>>>>>>>>>> memory left. When you see the error that no memory could be
> allocated, you
> >>>>>>>>>>> need to understand the magic of how this is derived.
> >>>>>>>>>>>     I am trying to think about this from the perspective of
> >>>>>>>>>>> using "Flink as a Library", which the MiniCluster is close to.
> >>>>>>>>>>>     When starting Flink out of a running process, we cannot
> >>>>>>>>>>> assume that we are the only users of that process and that we
> can mold the
> >>>>>>>>>>> process to our demands. I think a fix value for managed memory
> and network
> >>>>>>>>>>> memory would feel more natural in such a setup than a
> mechanism that is
> >>>>>>>>>>> tailored towards exclusive use of the process.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
> >>>>>>>>>>> This confused me a bit. For example
> >>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
> >>>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size"
> is counted as
> >>>>>>>>>>> native memory. Maybe we should rename the keys to reflect
> that. There is no
> >>>>>>>>>>> one "off heap" memory type after all. Maybe use
> >>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
> >>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
> >>>>>>>>>> accounted in the max direct memory size limit. The confusion
> probably comes
> >>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size"
> explicitly mentioned
> >>>>>>>>>> that in its description while
> "taskmanager.memory.task.off-heap.size"
> >>>>>>>>>> didn't. That's actually because the framework off-heap memory
> is introduced
> >>>>>>>>>> later in a separate commit. We should fix that.
> >>>>>>>>>>
> >>>>>>>>>> For framework / task off-heap memory, we do not differentiate
> >>>>>>>>>> direct / native memory usage. That means the configure value
> for these two
> >>>>>>>>>> options could be a mixture of direct / native memory. Since we
> do not know
> >>>>>>>>>> the portion of direct memory out of the configured value, we
> have
> >>>>>>>>>> to conservatively account it all into the max direct memory
> size limit.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==>  In that case, I am a bit confused. For the total size
> >>>>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory?
> It is a
> >>>>>>>>> difficult parameter, and the main reason to set it was (if I
> recall
> >>>>>>>>> correctly) to trigger GC based on direct memory allocation (to
> free heap
> >>>>>>>>> structures that then in turn release direct memory). If the
> limit is
> >>>>>>>>> anyways too high (because we also count native memory in there)
> such that
> >>>>>>>>> we can exceed the total process (container) memory, why do we
> set it then?*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>   - The network memory keys are now called
> >>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is
> usually
> >>>>>>>>>>> understood as a redistribution (random, or maybe by hash of
> key). As an
> >>>>>>>>>>> example, there are many discussions about "shuffle join versus
> broadcast
> >>>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning".
> We use that
> >>>>>>>>>>> memory however for all network operations, like forward pipes,
> broadcasts,
> >>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name
> "*.shuffle.*"
> >>>>>>>>>>> confusing, I am wondering if users would find that as well. So
> throwing in
> >>>>>>>>>>> the suggestion to call the options
> "taskmanager.memory.network.*".
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On
> >>>>>>>>>> the other hand, one can also argue that this part of memory is
> used by
> >>>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*"
> points more
> >>>>>>>>>> directly to the shuffle service components.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
> >>>>>>>>> incorrect, because it does more than just shuffles. The
> >>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too
> critical.
> >>>>>>>>> This isn't super high priority for me, but if we want to adjust
> it, better
> >>>>>>>>> earlier than later.*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*"
> >>>>>>>>> keys say that it also accounts for I/O direct memory, but the
> parameter is
> >>>>>>>>> not counted into the MaxDirectMemory parameter.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> True. Since we already have framework off-heap memory accounted
> >>>>>>>>>> for ad hoc direct memory usages, accounting all of jvm-overhead
> also into
> >>>>>>>>>> max direct memory limit seems not necessary. I would suggest to
> remove "I/O
> >>>>>>>>>> direct memory" from the description, and explicitly mention
> that this
> >>>>>>>>>> option does not account for direct memory and will not be
> accounted into
> >>>>>>>>>> max direct memory limit. WDYT?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> Sounds good. *
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
> >>>>>>>>>>> configuration options. For example, directly use MemorySize
> typed options.
> >>>>>>>>>>> That makes validation automatic and saves us from breaking the
> options
> >>>>>>>>>>> later.
> >>>>>>>>>>
> >>>>>>>>>> +1. Wasn't aware of the new memory type config options.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Stephan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Andrey Zagrebin <az...@apache.org>.
Thanks for the summary, Xintong! It makes sense to me.

How about putting "taskmanager.memory.flink.size" in the configuration?
> Then new downloaded Flink behaves similar to the previous Standalone setups.
> If someone upgrades the binaries, but re-uses their old configuration,
> then they get the compatibility as discussed previously.
> We used that approach previously with the fine-grained failover recovery.



> I'm trying to understand why "taskmanager.memory.flink.size" rather than
> "taskmanager.memory.process.size" in the default flink-conf.yaml. Or put it
> another way, why do we want the new downloaded Flink behaves similar to
> previous Standalone setups rather than previous active mode setups? Is
> there any special reason that I overlooked, which makes backwards
> compatibility for standalone setups more important than for active setups?
> IMO, "taskmanager.memory.process.size" is easier for the newcomers. For
> standalone setups, users can simply configure it to their machines'
> available memory size, without needing to worry about leaving enough space
> for JVM overhead / metaspace. For containerized setups, it's more
> predictable how much memory the containers / Flink could use, which is more
> friendly for users to manage their resource quota.
> Therefore, unless there is anything I overlooked, I'm in favor of putting
> "taskmanager.memory.process.size" rather than
> "taskmanager.memory.flink.size" in the default configuration.


I agree that having "taskmanager.memory.process.size" in the default config
should be easier to understand and tweak for new users, because it is
just what they are ready to spend for the Flink application.
The problem is that when users upgrade Flink and use the new default
configuration, the behaviour can change:
- either, if we put the process memory in the config, the Flink memory
shrinks and the new default option contradicts their previous understanding,
- or, if we put the Flink memory in the config, a larger container is requested.
The shrinking of memory sounds more implicit and worse. The increased
container request will just fail in the worst case and the memory setup can
be revisited.
We could increase the default "taskmanager.memory.process.size" to better
align it with the previous default setup for standalone
but this would not remove the possible confusion for the old users.
On the other hand, the option is new and we can add a comment on how to migrate
from the old one.

All in all, now I also tend to have "taskmanager.memory.process.size" in
the default config unless there are more reasons for having less confusion
for the old standalone users.

Best,
Andrey

On Tue, Dec 24, 2019 at 5:50 AM Xintong Song <to...@gmail.com> wrote:

> Hi all,
>
> It seems we have already reached consensus on most of the issues. Thanks
> everyone for the good discussion.
>
> While there are still open questions under discussion, I'd like to
> summarize the discussion so far, and list the action items that we already
> reached consensus on. In this way, we can already start working on these items
> while discussing the remaining open questions. Please correct me if the
> summary does not reflect your argument.
>
> *Action Items that we have already reached consensus on:*
>
>    - Map "taskmanager.heap.size" to "taskmanager.memory.flink.size" for
>    standalone setups, and "taskmanager.memory.process.size" for active setups.
>    (should be mentioned in release notes)
>    - If not explicitly configured, MiniCluster should have fixed default
>    network and managed memory sizes. (should be mentioned in docs & release
>    notes)
>    - Change the memory config options' type from String to MemorySize
>    - Change the config option key "taskmanager.memory.total-flink.size"
>    to "taskmanager.memory.flink.size"
>    - Change the config option key "taskmanager.memory.total-process.size"
>    to "taskmanager.memory.process.size"
>    - Update descriptions for "taskmanager.memory.framework.off-heap.size"
>    and "taskmanager.memory.task.off-heap.size" to explicitly state that
>       - Both direct and native memory are accounted
>       - Will be fully counted into MaxDirectMemorySize
>    - Update descriptions for
>    "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove "I/O direct
>    memory" and explicitly state that it's not counted into MaxDirectMemorySize
>    - Print MemorySize with proper unit. (non-blocker for 1.10)
>
>
> *Questions that are still open:*
>
>    - Which config option do we put in the default flink-conf.yaml?
>       - "taskmanager.memory.flink.size"
>       - "taskmanager.memory.process.size"
>       - Deprecated "taskmanager.heap.size"
>    - What are the proper keys for network / shuffle memory
>       - "taskmanager.memory.shuffle.*"
>       - "taskmanager.memory.network.*"
>       - "taskmanager.network.memory.*"
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <to...@gmail.com>
> wrote:
>
>> How about putting "taskmanager.memory.flink.size" in the configuration?
>>> Then new downloaded Flink behaves similar to the previous Standalone setups.
>>> If someone upgrades the binaries, but re-uses their old configuration,
>>> then they get the compatibility as discussed previously.
>>> We used that approach previously with the fine-grained failover recovery.
>>
>>
>> I'm trying to understand why "taskmanager.memory.flink.size" rather than
>> "taskmanager.memory.process.size" in the default flink-conf.yaml. Or put it
>> another way, why do we want the new downloaded Flink behaves similar to
>> previous Standalone setups rather than previous active mode setups? Is
>> there any special reason that I overlooked, which makes backwards
>> compatibility for standalone setups more important than for active setups?
>>
>> IMO, "taskmanager.memory.process.size" is easier for the newcomers. For
>> standalone setups, users can simply configure it to their machines'
>> available memory size, without needing to worry about leaving enough space
>> for JVM overhead / metaspace. For containerized setups, it's more
>> predictable how much memory the containers / Flink could use, which is more
>> friendly for users to manage their resource quota.
>>
>> Therefore, unless there is anything I overlooked, I'm in favor of putting
>> "taskmanager.memory.process.size" rather than
>> "taskmanager.memory.flink.size" in the default configuration.
>>
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <az...@apache.org>
>> wrote:
>>
>>> How about putting "taskmanager.memory.flink.size" in the configuration?
>>>> Then new downloaded Flink behaves similar to the previous Standalone setups.
>>>> If someone upgrades the binaries, but re-uses their old configuration,
>>>> then they get the compatibility as discussed previously.
>>>> We used that approach previously with the fine-grained failover
>>>> recovery.
>>>
>>> +1, this sounds like a good compromise.
>>>
>>> +1 to not have more options for off-heap memory, that would get
>>>> confusing fast. We can keep the name "off-heap" for both task and
>>>> framework, as long as they mean the same thing: native plus direct, and
>>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>>> descriptions then to reflect that.
>>>>
>>> True, this should be explained in the config descriptions.
>>>
>>> looks good to me
>>>
>>> From a user's perspective I believe "taskmanager.memory.network" would
>>>> be easier to understand as not everyone knows exactly what the shuffle
>>>> service is. I see the point that it would be a bit imprecise as we can have
>>>> different shuffle implementations but I would go with the ease of
>>>> use/understanding here. Moreover, I think that we won't have many different
>>>> shuffle service implementations in the foreseeable future.
>>>
>>> I agree that if we cannot find any other convincing names for the
>>> options, we should keep what we already have and change it if the
>>> alternative is convincing enough.
>>> The question is also whether we still want to rename it because it was
>>> "taskmanager.network.*memory*.*" in 1.9 but "taskmanager.*memory*.network.*"
>>> is more aligned with other new memory option names.
>>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
>>>
>>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> How about putting "taskmanager.memory.flink.size" in the configuration?
>>>> Then new downloaded Flink behaves similar to the previous Standalone setups.
>>>>
>>>> If someone upgrades the binaries, but re-uses their old configuration,
>>>> then they get the compatibility as discussed previously.
>>>> We used that approach previously with the fine-grained failover
>>>> recovery.
>>>>
>>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <to...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 to not have more options for off-heap memory, that would get
>>>>>> confusing fast. We can keep the name "off-heap" for both task and
>>>>>> framework, as long as they mean the same thing: native plus direct, and
>>>>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>>>>> descriptions then to reflect that.
>>>>>>
>>>>> True, this should be explained in the config descriptions.
>>>>>
>>>>> Which configuration option will be set in Flink's default
>>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>>>>>> have to be the then deprecated taskmanager.heap.size config option. If we
>>>>>> are ok with Yarn requesting slightly larger containers, then it could also
>>>>>> be taskmanager.memory.total-flink.size.
>>>>>>
>>>>> Good point. Currently, we have
>>>>> "taskmanager.memory.total-process.size". In order to preserve the previous
>>>>> behavior, we need to have "taskmanager.heap.size" so it can be mapped to
>>>>> different new options in standalone / active setups.
>>>>> I think we can have the deprecated "taskmanager.heap.size" in the
>>>>> default flink-conf.yaml, and also have the
>>>>> new "taskmanager.memory.total-process.size" in a commented line. We can
>>>>> explain how the deprecated config option behaves differently in the
>>>>> comments, so that user can switch to the new config options if they want to.
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <tr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the feedback and good discussion everyone. I left some
>>>>>> comments inline.
>>>>>>
>>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to not have more options for off-heap memory, that would get
>>>>>>> confusing fast. We can keep the name "off-heap" for both task and
>>>>>>> framework, as long as they mean the same thing: native plus direct, and
>>>>>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>>>>>> descriptions then to reflect that.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Regarding the framework/task direct/native memory options, I tend
>>>>>>>> to think of it differently. I'm in favor of keeping the "*.off-heap.size" for the
>>>>>>>> config option keys.
>>>>>>>>
>>>>>>>>    - It's not necessary IMO to expose the different concepts of
>>>>>>>>    direct / native memory to the users.
>>>>>>>>    - I would avoid introducing more options for native memory if
>>>>>>>>    possible. Taking fine-grained resource management and dynamic slot
>>>>>>>>    allocation into consideration, that also means introducing more fields into
>>>>>>>>    ResourceSpec / ResourceProfile.
>>>>>>>>    - My gut feeling is that having a relatively loose
>>>>>>>>    MaxDirectMemory should not be a big problem.
>>>>>>>>    - In most cases, the task / framework off-heap memory should be
>>>>>>>>       mainly (if not all) direct memory, so the difference between the derived
>>>>>>>>       MaxDirectMemory and the ideal direct memory limit should not be too large.
>>>>>>>>       - We do not have a good way to know the exact size needed
>>>>>>>>       for jvm overhead / metaspace and framework / task off-heap memory, thus
>>>>>>>>       having to conservatively reserve slightly more memory than what is actually
>>>>>>>>       needed. Such reserved but unused memory can cover the small
>>>>>>>>       MaxDirectMemory error.
>>>>>>>>       - MaxDirectMemory is not the only way to trigger full gc. We
>>>>>>>>       still have heap activities that can also trigger the gc.
>>>>>>>>
>>>>>>>> Regarding the memory type config options, I've looked into the
>>>>>>>> latest ConfigOptions changes. I think it shouldn't be too complicated to
>>>>>>>> change the config options to use memory type, and I can handle it maybe
>>>>>>>> during your vacations.
>>>>>>>>
>>>>>>>>
>>>>>>>> Also agree with improving MemorySize logging and parsing. This
>>>>>>>> should not be a blocker that has to be done in 1.10. I would say we finish
>>>>>>>> other works (testability, documentation and those discussed in this thread)
>>>>>>>> first, and get to this only if we have time.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you~
>>>>>>>>
>>>>>>>> Xintong Song
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>>>>>>>> azagrebin.apache@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Stephan and Xintong,
>>>>>>>>>
>>>>>>>>> Thanks for the further FLIP-49 feedbacks.
>>>>>>>>>
>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is replaced
>>>>>>>>>> by "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>>>>>> overhead, while the new option does. That means that with the default
>>>>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>>>>> managed memory going off heap).
>>>>>>>>>>     I am wondering if we could interpret
>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line with the old
>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
>>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>>>>     Curious to hear more thoughts there.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>>>>
>>>>>>>>> I think the problem here is that the
>>>>>>>>> legacy "taskmanager.heap.size" was used differently in standalone setups
>>>>>>>>> and active yarn / mesos setups, and such different calculation logics and
>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49. Therefore, I'm
>>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size" differently
>>>>>>>>> for standalone and active setups.
>>>>>>>>>
>>>>>>>>> I think what we really want is probably
>>>>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>>>>> new config options are not configured). We can provide util methods in
>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>>>>> configuration logics at one place.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I agree that the problem is that the legacy option
>>>>>>>>> ‘taskmanager.heap.size’ has different semantics for standalone/container.
>>>>>>>>> We had it initially falling back to 'taskmanager.memory.total-flink.size’
>>>>>>>>> but I changed that to align it with container cut-off. Now I see it changes
>>>>>>>>> standalone setup then.
>>>>>>>>> +1 for supporting its backwards compatibility differently for
>>>>>>>>> standalone/container setups.
>>>>>>>>>
>>>>>>>>
>>>>>> Which configuration option will be set in Flink's default
>>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>>>>>> have to be the then deprecated taskmanager.heap.size config option. If we
>>>>>> are ok with Yarn requesting slightly larger containers, then it could also
>>>>>> be taskmanager.memory.total-flink.size.
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>>>>> the right approach here.
>>>>>>>>>>     For example: I started a relatively large JVM (large heap
>>>>>>>>>> size of 10 GB) as a test. With the current logic, the system tries to
>>>>>>>>>> reserve an additional 6GB for managed memory which is more than there is
>>>>>>>>>> memory left. When you see the error that no memory could be allocated, you
>>>>>>>>>> need to understand the magic of how this is derived.
>>>>>>>>>>     I am trying to think about this from the perspective of using
>>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>>>>>     When starting Flink out of a running process, we cannot
>>>>>>>>>> assume that we are the only users of that process and that we can mold the
>>>>>>>>>> process to our demands. I think a fix value for managed memory and network
>>>>>>>>>> memory would feel more natural in such a setup than a mechanism that is
>>>>>>>>>> tailored towards exclusive use of the process.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> also +1 for that, if the user has not specified any main options to
>>>>>>>>> derive memory from. We should also log this fixing of managed / shuffle memory.
>>>>>>>>> And just noticed, we could also sanity-check the framework heap and, if
>>>>>>>>> explicitly configured, the task heap against the available JVM heap, and at least
>>>>>>>>> log inconsistencies.
>>>>>>>>>
>>>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
>>>>>>>>>> This confused me a bit. For example
>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
>>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as
>>>>>>>>>> native memory. Maybe we should rename the keys to reflect that. There is no
>>>>>>>>>> one "off heap" memory type after all. Maybe use
>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
>>>>>>>>> accounted in the max direct memory size limit. The confusion probably comes
>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size" explicitly mentioned
>>>>>>>>> that in its description while "taskmanager.memory.task.off-heap.size"
>>>>>>>>> didn't. That's actually because the framework off-heap memory is introduced
>>>>>>>>> later in a separate commit. We should fix that.
>>>>>>>>>
>>>>>>>>> For framework / task off-heap memory, we do not differentiate
>>>>>>>>> direct / native memory usage. That means the configured value for these two
>>>>>>>>> options could be a mixture of direct / native memory. Since we do not know
>>>>>>>>> the portion of direct memory out of the configured value, we have
>>>>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>>>>>> structures that then in turn release direct memory). If the limit is
>>>>>>>>> anyways too high (because we also count native memory in there) such that
>>>>>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I always also thought about it as providing more of a safety net for
>>>>>>>>> direct allocations, but the GC thing looks more important.
>>>>>>>>>
>>>>>>>>> +1 for fixing docs for 'taskmanager.memory.task.off-heap.size’ and
>>>>>>>>> renaming to ‘direct' as this is what really happens,
>>>>>>>>> if we want to support the direct limit more exactly than now.
>>>>>>>>>
>>>>>>>>> I also think that it is hard to separate direct / native memory
>>>>>>>>> unless we introduce even more options.
>>>>>>>>> If a user wants to keep the direct limit tight to a certain value
>>>>>>>>> but also use native memory outside of it,
>>>>>>>>> she would have to increase something else, like JVM overhead, to
>>>>>>>>> account for it, and there is no better way.
>>>>>>>>> Having more options to account for the native memory outside of the
>>>>>>>>> direct limit complicates things but can be introduced later if needed.
>>>>>>>>>
>>>>>>>>>   - What do you think about renaming
>>>>>>>>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>>>>>>>>> and "taskmanager.memory.total-process.size" to
>>>>>>>>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>>>>>>>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>>>>>>>>> loss of expressiveness.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 on this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 as well. Also an option:
>>>>>>>>> 'taskmanager.memory.total-process.size’ ->
>>>>>>>>> ‘taskmanager.memory.jvm.process.size’,
>>>>>>>>> although it can be also mentioned in docs that we mean JVM process.
>>>>>>>>>
>>>>>>>>
>>>>>> I'd be in favour of Stephan's proposal for the config keys as shorter
>>>>>> is usually better and they are still descriptive enough. Between
>>>>>> "taskmanager.memory.process.size" and "taskmanager.memory.jvm.total" I
>>>>>> would slightly favour the first variant.
>>>>>>
>>>>>>>
>>>>>>>>>   - The network memory keys are now called
>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>>>>>> other hand, one can also argue that this part of memory is used by
>>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>>>>> directly to the shuffle service components.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>>>>>> incorrect, because it does not only do shuffles. The
>>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>>>>>> earlier than later.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is also a bit of a controversial topic for me. Indeed, we have
>>>>>>>>> always used ’network’ for this concept of task data shuffling over the
>>>>>>>>> network, and this can confuse existing users.
>>>>>>>>>
>>>>>>>>> On the other hand, for new users and in the long term, ’network’
>>>>>>>>> can mislead them into concluding that all network memory is managed by this
>>>>>>>>> option.
>>>>>>>>> Also, other types of shuffle might not directly deal with network
>>>>>>>>> at all.
>>>>>>>>>
>>>>>>>>> By calling it shuffle, we were somewhat biased by understanding it
>>>>>>>>> in terms of map/reduce. This is rather an inter-task data exchange.
>>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>>>>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>>>>>>>>
>>>>>>>>
>>>>>>  From a user's perspective I believe "taskmanager.memory.network"
>>>>>> would be easier to understand as not everyone knows exactly what the
>>>>>> shuffle service is. I see the point that it would be a bit imprecise as we
>>>>>> can have different shuffle implementations but I would go with the ease of
>>>>>> use/understanding here. Moreover, I think that we won't have many different
>>>>>> shuffle service implementations in the foreseeable future.
>>>>>>
>>>>>>>
>>>>>>>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*"
>>>>>>>>>> keys say that it also accounts for I/O direct memory, but the parameter is
>>>>>>>>>> not counted into the MaxDirectMemory parameter.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> True. Since we already have framework off-heap memory accounted
>>>>>>>>> for ad hoc direct memory usages, accounting all of jvm-overhead also into
>>>>>>>>> max direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>>>>>> direct memory" from the description, and explicitly mention that this
>>>>>>>>> option does not account for direct memory and will not be accounted into
>>>>>>>>> max direct memory limit. WDYT?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would appreciate it.
>>>>>>>>>
>>>>>>>>> Also a small side note: we extensively use MemorySize in logs now,
>>>>>>>>> but it might not always be readable as its string representation is only in
>>>>>>>>> bytes atm
>>>>>>>>> and does not reduce it to kb/mb/etc in case of big byte values. We
>>>>>>>>> could have at least some .prettyPrint function to use in logs.
>>>>>>>>> And .fromMegabytes/etc factory methods would improve code
>>>>>>>>> readability instead of .parse(int + “m”).
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Hi Xintong!
>>>>>>>>>
>>>>>>>>> Please find my answers inline:
>>>>>>>>>
>>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is
>>>>>>>>>>> replaced by "taskmanager.memory.total-process.size" which has a different
>>>>>>>>>>> meaning in standalone setups. The old option did not subtract metaspace and
>>>>>>>>>>> other overhead, while the new option does. That means that with the default
>>>>>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>>>>>> managed memory going off heap).
>>>>>>>>>>>     I am wondering if we could interpret
>>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
>>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line with the old
>>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
>>>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>>>>>     Curious to hear more thoughts there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> Yes*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I think the problem here is that the
>>>>>>>>>> legacy "taskmanager.heap.size" was used differently in standalone setups
>>>>>>>>>> and active yarn / mesos setups, and such different calculation logics and
>>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49. Therefore, I'm
>>>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size" differently
>>>>>>>>>> for standalone and active setups.
>>>>>>>>>>
>>>>>>>>>> I think what we really want is probably
>>>>>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>>>>>> new config options are not configured). We can provide util methods in
>>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>>>>>> configuration logics at one place.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> This is pretty much what I meant as well (maybe my
>>>>>>>>> description was not very clear), so +1 for that mechanism*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>>>>>> the right approach here.
>>>>>>>>>>>     For example: I started a relatively large JVM (large heap
>>>>>>>>>>> size of 10 GB) as a test. With the current logic, the system tries to
>>>>>>>>>>> reserve an additional 6GB for managed memory which is more than there is
>>>>>>>>>>> memory left. When you see the error that no memory could be allocated, you
>>>>>>>>>>> need to understand the magic of how this is derived.
>>>>>>>>>>>     I am trying to think about this from the perspective of
>>>>>>>>>>> using "Flink as a Library", which the MiniCluster is close to.
>>>>>>>>>>>     When starting Flink out of a running process, we cannot
>>>>>>>>>>> assume that we are the only users of that process and that we can mold the
>>>>>>>>>>> process to our demands. I think a fix value for managed memory and network
>>>>>>>>>>> memory would feel more natural in such a setup than a mechanism that is
>>>>>>>>>>> tailored towards exclusive use of the process.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
>>>>>>>>>>> This confused me a bit. For example
>>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
>>>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as
>>>>>>>>>>> native memory. Maybe we should rename the keys to reflect that. There is no
>>>>>>>>>>> one "off heap" memory type after all. Maybe use
>>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
>>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
>>>>>>>>>> accounted in the max direct memory size limit. The confusion probably comes
>>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size" explicitly mentioned
>>>>>>>>>> that in its description while "taskmanager.memory.task.off-heap.size"
>>>>>>>>>> didn't. That's actually because the framework off-heap memory is introduced
>>>>>>>>>> later in a separate commit. We should fix that.
>>>>>>>>>>
>>>>>>>>>> For framework / task off-heap memory, we do not differentiate
>>>>>>>>>> direct / native memory usage. That means the configured value for these two
>>>>>>>>>> options could be a mixture of direct / native memory. Since we do not know
>>>>>>>>>> the portion of direct memory out of the configured value, we have
>>>>>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>>>>>> structures that then in turn release direct memory). If the limit is
>>>>>>>>> anyways too high (because we also count native memory in there) such that
>>>>>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   - The network memory keys are now called
>>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On
>>>>>>>>>> the other hand, one can also argue that this part of memory is used by
>>>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>>>>>> directly to the shuffle service components.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>>>>>> incorrect, because it does not only do shuffles. The
>>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>>>>>> earlier than later.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*"
>>>>>>>>> keys say that it also accounts for I/O direct memory, but the parameter is
>>>>>>>>> not counted into the MaxDirectMemory parameter.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> True. Since we already have framework off-heap memory accounted
>>>>>>>>>> for ad hoc direct memory usages, accounting all of jvm-overhead also into
>>>>>>>>>> max direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>>>>>>> direct memory" from the description, and explicitly mention that this
>>>>>>>>>> option does not account for direct memory and will not be accounted into
>>>>>>>>>> max direct memory limit. WDYT?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> Sounds good. *
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>>>>>> later.
>>>>>>>>>>
>>>>>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stephan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
Hi all,

It seems we have already reached consensus on most of the issues. Thanks
everyone for the good discussion.

While there are still open questions under discussion, I'd like to
summarize the discussion so far and list the action items that we already
have consensus on. That way, we can start working on these items
while discussing the remaining open questions. Please correct me if the
summary does not reflect your arguments.

*Action items that we already have consensus on:*

   - Map "taskmanager.heap.size" to "taskmanager.memory.flink.size" for
   standalone setups, and "taskmanager.memory.process.size" for active setups.
   (should be mentioned in release notes)
   - If not explicitly configured, MiniCluster should have fixed default
   network and managed memory sizes. (should be mentioned in docs & release
   notes)
   - Change the memory config options' type from String to MemorySize (see
   the Java sketch after this list)
   - Change the config option key "taskmanager.memory.total-flink.size" to
   "taskmanager.memory.flink.size"
   - Change the config option key "taskmanager.memory.total-process.size"
   to "taskmanager.memory.process.size"
   - Update descriptions for "taskmanager.memory.framework.off-heap.size"
   and "taskmanager.memory.task.off-heap.size" to explicitly state that
      - Both direct and native memory are accounted for
      - The configured value is fully counted into MaxDirectMemorySize
   - Update descriptions for
   "taskmanager.memory.jvm-overhead.[min|max|fraction]" to remove "I/O direct
   memory" and explicitly state that it's not counted into MaxDirectMemorySize
   - Print MemorySize values with a proper unit. (non-blocker for 1.10)
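
As a concrete illustration of the two MemorySize-related items above, a
strongly typed option and a more readable log output could look roughly like
the sketch below. This is only a sketch: it assumes the new memoryType()
option builder discussed in this thread, and the pretty(...) helper only
stands in for the proposed (not yet existing) prettyPrint-style method.

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.MemorySize;

public class MemoryOptionsSketch {

    // Strongly typed option instead of a plain String option; the key follows
    // the renaming agreed on above.
    public static final ConfigOption<MemorySize> TOTAL_PROCESS_MEMORY =
            ConfigOptions.key("taskmanager.memory.process.size")
                    .memoryType()
                    .noDefaultValue()
                    .withDescription("Total memory of the TaskExecutor JVM process.");

    // Hypothetical helper standing in for the proposed .prettyPrint(): reduces
    // big byte values to a mebibyte representation for logging.
    static String pretty(MemorySize size) {
        return size.getMebiBytes() > 0
                ? size.getMebiBytes() + " mb (" + size.getBytes() + " bytes)"
                : size.getBytes() + " bytes";
    }

    public static void main(String[] args) {
        MemorySize size = MemorySize.parse("1536m");
        System.out.println(pretty(size)); // "1536 mb (1610612736 bytes)"
    }
}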


*Questions that are still open:*

   - Which config option do we put in the default flink-conf.yaml? (a possible
   layout is sketched after this list)
      - "taskmanager.memory.flink.size"
      - "taskmanager.memory.process.size"
      - Deprecated "taskmanager.heap.size"
   - What are the proper keys for network / shuffle memory?
      - "taskmanager.memory.shuffle.*"
      - "taskmanager.memory.network.*"
      - "taskmanager.network.memory.*"


Thank you~

Xintong Song



On Tue, Dec 24, 2019 at 10:19 AM Xintong Song <to...@gmail.com> wrote:

> How about putting "taskmanager.memory.flink.size" in the configuration?
>> Then new downloaded Flink behaves similar to the previous Standalone setups.
>> If someone upgrades the binaries, but re-uses their old configuration,
>> then they get the compatibility as discussed previously.
>> We used that approach previously with the fine-grained failover recovery.
>
>
> I'm trying to understand why "taskmanager.memory.flink.size" rather than
> "taskmanager.memory.process.size" should go into the default flink-conf.yaml.
> Or to put it another way, why do we want a newly downloaded Flink to behave
> similarly to previous standalone setups rather than previous active mode
> setups? Is there any special reason that I overlooked which makes backwards
> compatibility for standalone setups more important than for active setups?
>
> IMO, "taskmanager.memory.process.size" is easier for newcomers. For
> standalone setups, users can simply configure it to their machines'
> available memory size, without needing to worry about leaving enough space
> for JVM overhead / metaspace. For containerized setups, it's more
> predictable how much memory the containers / Flink could use, which makes it
> easier for users to manage their resource quota.
>
> Therefore, unless there is anything I overlooked, I'm in favor of putting
> "taskmanager.memory.process.size" rather than
> "taskmanager.memory.flink.size" in the default configuration.
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Dec 24, 2019 at 4:27 AM Andrey Zagrebin <az...@apache.org>
> wrote:
>
>> How about putting "taskmanager.memory.flink.size" in the configuration?
>>> Then new downloaded Flink behaves similar to the previous Standalone setups.
>>> If someone upgrades the binaries, but re-uses their old configuration,
>>> then they get the compatibility as discussed previously.
>>> We used that approach previously with the fine-grained failover recovery.
>>
>> +1, this sounds like a good compromise.
>>
>> +1 to not have more options for off-heap memory, that would get confusing
>>> fast. We can keep the name "off-heap" for both task and framework, as long
>>> as they mean the same thing: native plus direct, and fully counted into
>>> MaxDirectMemory. I would suggest to update the config descriptions then to
>>> reflect that.
>>>
>> True, this should be explained in the config descriptions.
>>
>> looks good to me
>>
>> From a user's perspective I believe "taskmanager.memory.network" would be
>>> easier to understand as not everyone knows exactly what the shuffle service
>>> is. I see the point that it would be a bit imprecise as we can have
>>> different shuffle implementations but I would go with the ease of
>>> use/understanding here. Moreover, I think that we won't have many different
>>> shuffle service implementations in the foreseeable future.
>>
>> I agree that if we cannot find any other convincing names for the
>> options, we should keep what we already have and change it if the
>> alternative is convincing enough.
>> The question is also whether we still want to rename it because it was
>> "taskmanager.network.*memory*.*" in 1.9 but "taskmanager.*memory*.network.*"
>> is more aligned with other new memory option names.
>> Or we can just 'un'-deprecate "taskmanager.network.*memory*.*".
>>
>> On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org> wrote:
>>
>>> How about putting "taskmanager.memory.flink.size" in the configuration?
>>> Then new downloaded Flink behaves similar to the previous Standalone setups.
>>>
>>> If someone upgrades the binaries, but re-uses their old configuration,
>>> then they get the compatibility as discussed previously.
>>> We used that approach previously with the fine-grained failover recovery.
>>>
>>> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <to...@gmail.com>
>>> wrote:
>>>
>>>> +1 to not have more options for off-heap memory, that would get
>>>>> confusing fast. We can keep the name "off-heap" for both task and
>>>>> framework, as long as they mean the same thing: native plus direct, and
>>>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>>>> descriptions then to reflect that.
>>>>>
>>>> True, this should be explained in the config descriptions.
>>>>
>>>> Which configuration option will be set in Flink's default
>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>>>>> have to be the then deprecated taskmanager.heap.size config option. If we
>>>>> are ok with Yarn requesting slightly larger containers, then it could also
>>>>> be taskmanager.memory.total-flink.size.
>>>>>
>>>> Good point. Currently, we have "taskmanager.memory.total-process.size".
>>>> In order to preserve the previous behavior, we need to have
>>>> "taskmanager.heap.size" so it can be mapped to different new options in
>>>> standalone / active setups.
>>>> I think we can have the deprecated "taskmanager.heap.size" in the
>>>> default flink-conf.yaml, and also have the
>>>> new "taskmanager.memory.total-process.size" in a commented line. We can
>>>> explain how the deprecated config option behaves differently in the
>>>> comments, so that user can switch to the new config options if they want to.
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>>
>>>>
>>>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks for the feedback and good discussion everyone. I left some
>>>>> comments inline.
>>>>>
>>>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org> wrote:
>>>>>
>>>>>> +1 to not have more options for off-heap memory, that would get
>>>>>> confusing fast. We can keep the name "off-heap" for both task and
>>>>>> framework, as long as they mean the same thing: native plus direct, and
>>>>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>>>>> descriptions then to reflect that.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding the framework/task direct/native memory options, I tend to
>>>>>>> think of it differently. I'm in favor of keeping the "*.off-heap.size" for the
>>>>>>> config option keys.
>>>>>>>
>>>>>>>    - It's not necessary IMO to expose the different concepts of
>>>>>>>    direct / native memory to the users.
>>>>>>>    - I would avoid introducing more options for native memory if
>>>>>>>    possible. Taking fine-grained resource management and dynamic slot
>>>>>>>    allocation into consideration, that also means introducing more fields into
>>>>>>>    ResourceSpec / ResourceProfile.
>>>>>>>    - My gut feeling is that having a relatively loose MaxDirectMemory
>>>>>>>    should not be a big problem.
>>>>>>>    - In most cases, the task / framework off-heap memory should be
>>>>>>>       mainly (if not all) direct memory, so the difference between the derived
>>>>>>>       MaxDirectMemory and the ideal direct memory limit should not be too large.
>>>>>>>       - We do not have a good way to know the exact size needed for
>>>>>>>       jvm overhead / metaspace and framework / task off-heap memory, thus having
>>>>>>>       to conservatively reserve slightly more memory than what is actually needed.
>>>>>>>       Such reserved but unused memory can cover the small MaxDirectMemory
>>>>>>>       error.
>>>>>>>       - MaxDirectMemory is not the only way to trigger full gc. We
>>>>>>>       still have heap activities that can also trigger the gc.
>>>>>>>
>>>>>>> Regarding the memory type config options, I've looked into the
>>>>>>> latest ConfigOptions changes. I think it shouldn't be too complicated to
>>>>>>> change the config options to use memory type, and I can handle it maybe
>>>>>>> during your vacations.
>>>>>>>
>>>>>>>
>>>>>>> Also agree with improving MemorySize logging and parsing. This
>>>>>>> should not be a blocker that has to be done in 1.10. I would say we finish
>>>>>>> other works (testability, documentation and those discussed in this thread)
>>>>>>> first, and get to this only if we have time.
>>>>>>>
>>>>>>>
>>>>>>> Thank you~
>>>>>>>
>>>>>>> Xintong Song
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>>>>>>> azagrebin.apache@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Stephan and Xintong,
>>>>>>>>
>>>>>>>> Thanks for the further FLIP-49 feedbacks.
>>>>>>>>
>>>>>>>>   - "taskmanager.memory.size" (old main config option) is replaced
>>>>>>>>> by "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>>>>> overhead, while the new option does. That means that with the default
>>>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>>>> managed memory going off heap).
>>>>>>>>>     I am wondering if we could interpret "taskmanager.memory.size"
>>>>>>>>> as the deprecated key for "taskmanager.memory.total-flink.size". That would
>>>>>>>>> be in line with the old mechanism (assuming managed memory is set to off
>>>>>>>>> heap).
>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>>>     Curious to hear more thoughts there.
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>>>
>>>>>>>> I think the problem here is that the legacy "taskmanager.heap.size"
>>>>>>>> was used differently in standalone setups and active yarn / mesos setups,
>>>>>>>> and such different calculation logics and behaviors are exactly what we
>>>>>>>> want to avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>>>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>>>>>> setups.
>>>>>>>>
>>>>>>>> I think what we really want is probably
>>>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>>>> new config options are not configured). We can provide util methods in
>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>>>> configuration logics at one place.
>>>>>>>>
>>>>>>>>
>>>>>>>> I agree that the problem is that the legacy option
>>>>>>>> ‘taskmanager.heap.size’ has different semantics for standalone/container.
>>>>>>>> We had it initially falling back to 'taskmanager.memory.total-flink.size’
>>>>>>>> but I changed that to align it with container cut-off. Now I see it changes
>>>>>>>> standalone setup then.
>>>>>>>> +1 for supporting its backwards compatibility differently for
>>>>>>>> standalone/container setups.
>>>>>>>>
>>>>>>>
>>>>> Which configuration option will be set in Flink's default
>>>>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>>>>> have to be the then deprecated taskmanager.heap.size config option. If we
>>>>> are ok with Yarn requesting slightly larger containers, then it could also
>>>>> be taskmanager.memory.total-flink.size.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>>>> the right approach here.
>>>>>>>>>     For example: I started a relatively large JVM (large heap size
>>>>>>>>> of 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>>>>>> When you see the error that no memory could be allocated, you need to
>>>>>>>>> understand the magic of how this is derived.
>>>>>>>>>     I am trying to think about this from the perspective of using
>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>>>>     When starting Flink out of a running process, we cannot assume
>>>>>>>>> that we are the only users of that process and that we can mold the process
>>>>>>>>> to our demands. I think a fix value for managed memory and network memory
>>>>>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>>>>>> towards exclusive use of the process.
>>>>>>>>
>>>>>>>>
>>>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>>>
>>>>>>>>
>>>>>>>> also +1 for that, if the user has not specified any main options to
>>>>>>>> derive memory from. We should also log this fixing of managed / shuffle memory.
>>>>>>>> And just noticed, we could also sanity-check the framework heap and, if
>>>>>>>> explicitly configured, the task heap against the available JVM heap, and at least
>>>>>>>> log inconsistencies.
>>>>>>>>
>>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
>>>>>>>>> This confused me a bit. For example
>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as
>>>>>>>>> native memory. Maybe we should rename the keys to reflect that. There is no
>>>>>>>>> one "off heap" memory type after all. Maybe use
>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted
>>>>>>>> in the max direct memory size limit. The confusion probably comes from that
>>>>>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>>>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>>>>>> That's actually because the framework off-heap memory is introduced later
>>>>>>>> in a separate commit. We should fix that.
>>>>>>>>
>>>>>>>> For framework / task off-heap memory, we do not differentiate
>>>>>>>> direct / native memory usage. That means the configured value for these two
>>>>>>>> options could be a mixture of direct / native memory. Since we do not know
>>>>>>>> the portion of direct memory out of the configured value, we have
>>>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>>>
>>>>>>>>
>>>>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>>>>> structures that then in turn release direct memory). If the limit is
>>>>>>>> anyways too high (because we also count native memory in there) such that
>>>>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>>>>
>>>>>>>>
>>>>>>>> I always also thought about it as providing more of a safety net for
>>>>>>>> direct allocations, but the GC thing looks more important.
>>>>>>>>
>>>>>>>> +1 for fixing docs for 'taskmanager.memory.task.off-heap.size’ and
>>>>>>>> renaming to ‘direct' as this is what really happens,
>>>>>>>> if we want to support the direct limit more exactly than now.
>>>>>>>>
>>>>>>>> I also think that it is hard to separate direct / native memory
>>>>>>>> unless we introduce even more options.
>>>>>>>> If a user wants to keep the direct limit tight to a certain value but
>>>>>>>> also use native memory outside of it,
>>>>>>>> she would have to increase something else, like JVM overhead, to
>>>>>>>> account for it, and there is no better way.
>>>>>>>> Having more options to account for the native memory outside of the
>>>>>>>> direct limit complicates things but can be introduced later if needed.
>>>>>>>>
>>>>>>>>   - What do you think about renaming
>>>>>>>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>>>>>>>> and "taskmanager.memory.total-process.size" to
>>>>>>>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>>>>>>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>>>>>>>> loss of expressiveness.
>>>>>>>>
>>>>>>>>
>>>>>>>> +1 on this.
>>>>>>>>
>>>>>>>>
>>>>>>>> +1 as well. Also an option: 'taskmanager.memory.total-process.size’
>>>>>>>> -> ‘taskmanager.memory.jvm.process.size’,
>>>>>>>> although it can be also mentioned in docs that we mean JVM process.
>>>>>>>>
>>>>>>>
>>>>> I'd be in favour of Stephan's proposal for the config keys as shorter
>>>>> is usually better and they are still descriptive enough. Between
>>>>> "taskmanager.memory.process.size" and "taskmanager.memory.jvm.total" I
>>>>> would slightly favour the first variant.
>>>>>
>>>>>>
>>>>>>>>   - The network memory keys are now called
>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>>>
>>>>>>>>
>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>>>>> other hand, one can also argue that this part of memory is used by
>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>>>> directly to the shuffle service components.
>>>>>>>>
>>>>>>>>
>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>>>>> incorrect, because it does not only do shuffles. The
>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>>>>> earlier than later.*
>>>>>>>>
>>>>>>>>
>>>>>>>> This is also a bit of a controversial topic for me. Indeed, we have
>>>>>>>> always used ’network’ for this concept of task data shuffling over the
>>>>>>>> network, and this can confuse existing users.
>>>>>>>>
>>>>>>>> On the other hand, for new users and in the long term, ’network’
>>>>>>>> can mislead them into concluding that all network memory is managed by this
>>>>>>>> option.
>>>>>>>> Also, other types of shuffle might not directly deal with network at
>>>>>>>> all.
>>>>>>>>
>>>>>>>> By calling it shuffle, we were somewhat biased by understanding it
>>>>>>>> in terms of map/reduce. This is rather an inter-task data exchange.
>>>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>>>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>>>>>>>
>>>>>>>
>>>>>  From a user's perspective I believe "taskmanager.memory.network"
>>>>> would be easier to understand as not everyone knows exactly what the
>>>>> shuffle service is. I see the point that it would be a bit imprecise as we
>>>>> can have different shuffle implementations but I would go with the ease of
>>>>> use/understanding here. Moreover, I think that we won't have many different
>>>>> shuffle service implementations in the foreseeable future.
>>>>>
>>>>>>
>>>>>>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*"
>>>>>>>>> keys say that it also accounts for I/O direct memory, but the parameter is
>>>>>>>>> not counted into the MaxDirectMemory parameter.
>>>>>>>>
>>>>>>>>
>>>>>>>> True. Since we already have framework off-heap memory accounted for
>>>>>>>> ad hoc direct memory usages, accounting all of jvm-overhead also into max
>>>>>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>>>>> direct memory" from the description, and explicitly mention that this
>>>>>>>> option does not account for direct memory and will not be accounted into
>>>>>>>> max direct memory limit. WDYT?
>>>>>>>>
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>>>> later.
>>>>>>>>
>>>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>>>
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>>>>
>>>>>>>>
>>>>>>>> I would appreciate it.
>>>>>>>>
>>>>>>>> Also a small side note: we extensively use MemorySize in logs now, but
>>>>>>>> it might not always be readable as its string representation is only in
>>>>>>>> bytes atm
>>>>>>>> and does not reduce it to kb/mb/etc in case of big byte values. We
>>>>>>>> could have at least some .prettyPrint function to use in logs.
>>>>>>>> And .fromMegabytes/etc factory methods would improve code
>>>>>>>> readability instead of .parse(int + “m”).
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hi Xintong!
>>>>>>>>
>>>>>>>> Please find my answers inline:
>>>>>>>>
>>>>>>>>>   - "taskmanager.memory.size" (old main config option) is replaced
>>>>>>>>>> by "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>>>>>> overhead, while the new option does. That means that with the default
>>>>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>>>>> managed memory going off heap).
>>>>>>>>>>     I am wondering if we could interpret
>>>>>>>>>> "taskmanager.memory.size" as the deprecated key for
>>>>>>>>>> "taskmanager.memory.total-flink.size". That would be in line with the old
>>>>>>>>>> mechanism (assuming managed memory is set to off heap).
>>>>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>>>>     Curious to hear more thoughts there.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==> Yes*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> I think the problem here is that the
>>>>>>>>> legacy "taskmanager.heap.size" was used differently in standalone setups
>>>>>>>>> and active yarn / mesos setups, and such different calculation logics and
>>>>>>>>> behaviors are exactly what we want to avoid with FLIP-49. Therefore, I'm
>>>>>>>>> not in favor of treating "taskmanager.memory.total-flink.size" differently
>>>>>>>>> for standalone and active setups.
>>>>>>>>>
>>>>>>>>> I think what we really want is probably
>>>>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>>>>> new config options are not configured). We can provide util methods in
>>>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>>>>> configuration logics at one place.
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==> This is pretty much what I meant as well (maybe my description
>>>>>>>> was not very clear), so +1 for that mechanism*
>>>>>>>>
>>>>>>>>
>>>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>>>>> the right approach here.
>>>>>>>>>>     For example: I started a relatively large JVM (large heap
>>>>>>>>>> size of 10 GB) as a test. With the current logic, the system tries to
>>>>>>>>>> reserve an additional 6GB for managed memory which is more than there is
>>>>>>>>>> memory left. When you see the error that no memory could be allocated, you
>>>>>>>>>> need to understand the magic of how this is derived.
>>>>>>>>>>     I am trying to think about this from the perspective of using
>>>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>>>>>     When starting Flink out of a running process, we cannot
>>>>>>>>>> assume that we are the only users of that process and that we can mold the
>>>>>>>>>> process to our demands. I think a fix value for managed memory and network
>>>>>>>>>> memory would feel more natural in such a setup than a mechanism that is
>>>>>>>>>> tailored towards exclusive use of the process.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
>>>>>>>>>> This confused me a bit. For example
>>>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
>>>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as
>>>>>>>>>> native memory. Maybe we should rename the keys to reflect that. There is no
>>>>>>>>>> one "off heap" memory type after all. Maybe use
>>>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
>>>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also
>>>>>>>>> accounted in the max direct memory size limit. The confusion probably comes
>>>>>>>>> from that "taskmanager.memory.framework.off-heap.size" explicitly mentioned
>>>>>>>>> that in its description while "taskmanager.memory.task.off-heap.size"
>>>>>>>>> didn't. That's actually because the framework off-heap memory is introduced
>>>>>>>>> later in a separate commit. We should fix that.
>>>>>>>>>
>>>>>>>>> For framework / task off-heap memory, we do not differentiate
>>>>>>>>> direct / native memory usage. That means the configured value for these two
>>>>>>>>> options could be a mixture of direct / native memory. Since we do not know
>>>>>>>>> the portion of direct memory out of the configured value, we have
>>>>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>>>>> structures that then in turn release direct memory). If the limit is
>>>>>>>> anyways too high (because we also count native memory in there) such that
>>>>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>>>>
>>>>>>>>
>>>>>>>>   - The network memory keys are now called
>>>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>>>>>> other hand, one can also argue that this part of memory is used by
>>>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>>>>> directly to the shuffle service components.
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>>>>> incorrect, because it does more than just shuffles. The
>>>>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>>>>> earlier than later.*
>>>>>>>>
>>>>>>>>
>>>>>>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*"
>>>>>>>> keys say that this memory also accounts for I/O direct memory, but it is
>>>>>>>> not counted into the MaxDirectMemory limit.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> True. Since we already have framework off-heap memory accounted
>>>>>>>>> for ad hoc direct memory usages, accounting all of jvm-overhead also into the
>>>>>>>>> max direct memory limit seems unnecessary. I would suggest removing "I/O
>>>>>>>>> direct memory" from the description, and explicitly mentioning that this
>>>>>>>>> option does not account for direct memory and will not be counted towards the
>>>>>>>>> max direct memory limit. WDYT?
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==> Sounds good. *
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>>>>
>>>>>>>>
>>>>>>>> *==> Thanks. Do you need help with adjusting this?*
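
A sketch of what MemorySize-typed options could look like with the new typed
ConfigOptions builder; the keys, defaults, and description texts below are
placeholders based on this discussion, not the final definitions:

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class MemoryOptionsSketch {
        // MemorySize-typed option: values like "128m" or "1g" are parsed and
        // validated when the configuration is read, instead of being passed
        // around as free-form strings.
        public static final ConfigOption<MemorySize> TASK_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.task.heap.size")
                        .memoryType()
                        .noDefaultValue()
                        .withDescription("Task heap memory size for TaskExecutors"
                                + " (placeholder description).");

        public static final ConfigOption<MemorySize> FRAMEWORK_OFF_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.framework.off-heap.size")
                        .memoryType()
                        .defaultValue(MemorySize.parse("128m"))
                        .withDescription("Framework off-heap memory: direct plus"
                                + " native, fully counted into MaxDirectMemory"
                                + " (placeholder description).");
    }
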
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
>
> How about putting "taskmanager.memory.flink.size" in the configuration?
> Then a newly downloaded Flink behaves similarly to the previous standalone setups.
> If someone upgrades the binaries, but re-uses their old configuration,
> then they get the compatibility as discussed previously.
> We used that approach previously with the fine-grained failover recovery.


I'm trying to understand why we would put "taskmanager.memory.flink.size"
rather than "taskmanager.memory.process.size" in the default flink-conf.yaml.
Or, to put it another way, why do we want a newly downloaded Flink to behave
similarly to previous standalone setups rather than to previous active mode
setups? Is there any special reason that I have overlooked which makes
backwards compatibility for standalone setups more important than for active
setups?

IMO, "taskmanager.memory.process.size" is easier for the new comers. For
standalone setups, users can simply configure it to their machines'
available memory size, without needing to worry about leaving enough space
for JVM overehead / metaspace. For containerized setups, it's more
predictable how many memory the containers / Flink could use, which is more
friendly for users to manage their resource quota.

Therefore, unless there is anything I overlooked, I'm in favor of putting
"taskmanager.memory.process.size" rather than
"taskmanager.memory.flink.size" in the default configuration.


Thank you~

Xintong Song




Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Andrey Zagrebin <az...@apache.org>.
>
> How about putting "taskmanager.memory.flink.size" in the configuration?
> Then a newly downloaded Flink behaves similarly to the previous standalone setups.
> If someone upgrades the binaries, but re-uses their old configuration,
> then they get the compatibility as discussed previously.
> We used that approach previously with the fine-grained failover recovery.

+1, this sounds like a good compromise.

+1 to not have more options for off-heap memory, that would get confusing
> fast. We can keep the name "off-heap" for both task and framework, as long
> as they mean the same thing: native plus direct, and fully counted into
> MaxDirectMemory. I would suggest to update the config descriptions then to
> reflect that.
>
True, this should be explained in the config descriptions.

looks good to me

From a user's perspective I believe "taskmanager.memory.network" would be
> easier to understand as not everyone knows exactly what the shuffle service
> is. I see the point that it would be a bit imprecise as we can have
> different shuffle implementations but I would go with the ease of
> use/understanding here. Moreover, I think that we won't have many different
> shuffle service implementations in the foreseeable future.

I agree that if we cannot find any other convincing name for these options,
we should keep what we already have, and only change it if the alternative is
convincing enough.
The question is also whether we still want to rename it, because the key was
"taskmanager.network.memory.*" in 1.9, and "taskmanager.memory.network.*" is
more aligned with the other new memory option names.
Or we can just 'un'-deprecate "taskmanager.network.memory.*".
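
If the rename happens, a sketch of how the older keys could keep working
through the usual deprecated-keys mechanism; the exact key names, and whether
the network pool keeps separate min/max options, are assumptions based on this
thread:

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class NetworkMemoryOptionsSketch {
        // New key "taskmanager.memory.network.min" with the 1.9 key and the
        // intermediate "shuffle" key registered as deprecated keys, so existing
        // flink-conf.yaml files keep working after an upgrade.
        public static final ConfigOption<MemorySize> NETWORK_MEMORY_MIN =
                ConfigOptions.key("taskmanager.memory.network.min")
                        .memoryType()
                        .defaultValue(MemorySize.parse("64m"))
                        .withDeprecatedKeys(
                                "taskmanager.memory.shuffle.min",
                                "taskmanager.network.memory.min");
    }
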

On Mon, Dec 23, 2019 at 8:42 PM Stephan Ewen <se...@apache.org> wrote:

> How about putting "taskmanager.memory.flink.size" in the configuration?
> Then new downloaded Flink behaves similar to the previous Standalone setups.
>
> If someone upgrades the binaries, but re-uses their old configuration,
> then they get the compatibility as discussed previously.
> We used that approach previously with the fine-grained failover recovery.
>
> On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <to...@gmail.com>
> wrote:
>
>> +1 to not have more options for off-heap memory, that would get confusing
>>> fast. We can keep the name "off-heap" for both task and framework, as long
>>> as they mean the same thing: native plus direct, and fully counted into
>>> MaxDirectMemory. I would suggest to update the config descriptions then to
>>> reflect that.
>>>
>> True, this should be explained in the config descriptions.
>>
>> Which configuration option will be set in Flink's default
>>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>>> have to be the then deprecated taskmanager.heap.size config option. If we
>>> are ok with Yarn requesting slightly larger containers, then it could also
>>> be taskmanager.memory.total-flink.size.
>>>
>> Good point. Currently, we have "taskmanager.memory.total-process.size".
>> In order to preserve the previous behavior, we need to have
>> "taskmanager.heap.size" so it can be mapped to different new options in
>> standalone / active setups.
>> I think we can have the deprecated "taskmanager.heap.size" in the default
>> flink-conf.yaml, and also have the
>> new "taskmanager.memory.total-process.size" in a commented line. We can
>> explain how the deprecated config option behaves differently in the
>> comments, so that user can switch to the new config options if they want to.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Thanks for the feedback and good discussion everyone. I left some
>>> comments inline.
>>>
>>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> +1 to not have more options for off-heap memory, that would get
>>>> confusing fast. We can keep the name "off-heap" for both task and
>>>> framework, as long as they mean the same thing: native plus direct, and
>>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>>> descriptions then to reflect that.
>>>>
>>>>
>>>>
>>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com>
>>>> wrote:
>>>>
>>>>> Regarding the framework/task direct/native memory options, I tend to
>>>>> think it differently. I'm in favor of keep the "*.off-heap.size" for the
>>>>> config option keys.
>>>>>
>>>>>    - It's not necessary IMO to expose the difference concepts of
>>>>>    direct / native memory to the users.
>>>>>    - I would avoid introducing more options for native memory if
>>>>>    possible. Taking fine grained resource management and dynamic slot
>>>>>    allocation into consideration, that also means introduce more fields into
>>>>>    ResourceSpec / ResourceProfile.
>>>>>    - My gut feeling is that having a relative loose MaxDirectMemory
>>>>>    should not be a big problem.
>>>>>    - In most cases, the task / framework off-heap memory should be
>>>>>       mainly (if not all) direct memory, so the difference between derived
>>>>>       MaxDirectMemory and the ideal direct memory limit should not be too much.
>>>>>       - We do not have a good way to know the exact size needed for
>>>>>       jvm overhead / metaspace and framework / task off-heap memory, thus having
>>>>>       to conservatively reserve slightly more memory then what actually needed.
>>>>>       Such reserved but not used memory can cover for the small MaxDirectMemory
>>>>>       error.
>>>>>       -
>>>>>       - MaxDirectMemory is not the only way to trigger full gc. We
>>>>>       still heap activities that can also trigger the gc.
>>>>>
>>>>> Regarding the memory type config options, I've looked into the latest
>>>>> ConfigOptions changes. I think it shouldn't be too complicated to change
>>>>> the config options to use memory type, and I can handle it maybe during
>>>>> your vacations.
>>>>>
>>>>>
>>>>> Also agree with improving MemorySize logging and parsing. This should
>>>>> not be a blocker that has to be done in 1.10. I would say we finish other
>>>>> works (testability, documentation and those discussed in this thread)
>>>>> first, and get to this only if we have time.
>>>>>
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>>>>> azagrebin.apache@gmail.com> wrote:
>>>>>
>>>>>> Hi Stephan and Xintong,
>>>>>>
>>>>>> Thanks for the further FLIP-49 feedbacks.
>>>>>>
>>>>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>>> overhead, while the new option does. That means that with the default
>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>> managed memory going off heap).
>>>>>>>     I am wondering if we could interpret "taskmanager.memory.size"
>>>>>>> as the deprecated key for "taskmanager.memory.total-flink.size". That would
>>>>>>> be in line with the old mechanism (assuming managed memory is set to off
>>>>>>> heap).
>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>     Curious to hear more thoughts there.
>>>>>>
>>>>>>
>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>
>>>>>> I think the problem here is that the legacy "taskmanager.heap.size"
>>>>>> was used differently in standalone setups and active yarn / mesos setups,
>>>>>> and such different calculation logics and behaviors are exactly what we
>>>>>> want to avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>>>> setups.
>>>>>>
>>>>>> I think what we really want is probably
>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>> new config options are not configured). We can provide util methods in
>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>> configuration logics at one place.
>>>>>>
>>>>>>
>>>>>> I agree that the problem is that the legacy option
>>>>>> ‘taskmanager.heap.size’ has different semantics for standalone/container.
>>>>>> We had it initially falling back to 'taskmanager.memory.total-flink.size’
>>>>>> but I changed that to align it with container cut-off. Now I see it changes
>>>>>> standalone setup then.
>>>>>> +1 for supporting its backwards compatibility differently for
>>>>>> standalone/container setups.
>>>>>>
>>>>>
>>> Which configuration option will be set in Flink's default
>>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>>> have to be the then deprecated taskmanager.heap.size config option. If we
>>> are ok with Yarn requesting slightly larger containers, then it could also
>>> be taskmanager.memory.total-flink.size.
>>>
>>>>
>>>>>>
>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>> the right approach here.
>>>>>>>     For example: I started a relatively large JVM (large heap size
>>>>>>> of 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>>>> When you see the error that no memory could be allocated, you need to
>>>>>>> understand the magic of how this is derived.
>>>>>>>     I am trying to think about this from the perspective of using
>>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>>     When starting Flink out of a running process, we cannot assume
>>>>>>> that we are the only users of that process and that we can mold the process
>>>>>>> to our demands. I think a fix value for managed memory and network memory
>>>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>>>> towards exclusive use of the process.
>>>>>>
>>>>>>
>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>
>>>>>>
>>>>>> also +1 for that, if user has not specified any main options to
>>>>>> derive memory. We should also log this fixing of managed / shuffle memory.
>>>>>> And just noticed, we could also sanity check framework and if
>>>>>> explicitly configured task heap against available JVM heap, and at least
>>>>>> log inconsistencies.
>>>>>>
>>>>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>>>>> is counted into MaxDirectMemory while
>>>>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>
>>>>>>
>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted
>>>>>> in the max direct memory size limit. The confusion probably comes from that
>>>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>>>> That's actually because the framework off-heap memory is introduced later
>>>>>> in a separate commit. We should fix that.
>>>>>>
>>>>>> For framework / task off-heap memory, we do not differentiate direct
>>>>>> / native memory usage. That means the configure value for these two options
>>>>>> could be a mixture of direct / native memory. Since we do not know the
>>>>>> portion of direct memory out of the configured value, we have
>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>
>>>>>>
>>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>>> structures that then in turn release direct memory). If the limit is
>>>>>> anyways too high (because we also count native memory in there) such that
>>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>>
>>>>>>
>>>>>> I always also thought about it as providing more safety net for
>>>>>> direct allocations but GC thing looks more important.
>>>>>>
>>>>>> +1 for fixing docs for 'taskmanager.memory.task.off-heap.size’ and
>>>>>> renaming to ‘direct' as this is what really happens
>>>>>> if we want to support direct limit more exact than now.
>>>>>>
>>>>>> I also think that it is hard to separate direct / native memory
>>>>>> unless we introduce even more options.
>>>>>> If user wants to keep the direct limit tight to a certain value but
>>>>>> also use native memory outside of it,
>>>>>> she would have to increase something else, like JVM overhead to
>>>>>> account for it and there is no other better way.
>>>>>> Having more options to account for the native memory outside of
>>>>>> direct limit complicates things but can be introduced later if needed.
>>>>>>
>>>>>>   - What do you think about renaming
>>>>>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>>>>>> and "taskmanager.memory.total-process.size" to
>>>>>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>>>>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>>>>>> loss of expressiveness.
>>>>>>
>>>>>>
>>>>>> +1 on this.
>>>>>>
>>>>>>
>>>>>> +1 as well. Also an option: 'taskmanager.memory.total-process.size’
>>>>>> -> ‘taskmanager.memory.jvm.process.size’,
>>>>>> although it can be also mentioned in docs that we mean JVM process.
>>>>>>
>>>>>
>>> I'd be in favour of Stephan's proposal for the config keys as shorter is
>>> usually better and they are still descriptive enough. Between
>>> "taskmanager.memory.process.size" and "taskmanager.memory.jvm.total" I
>>> would slightly favour the first variant.
>>>
>>>>
>>>>>>   - The network memory keys are now called
>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>
>>>>>>
>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>>> other hand, one can also argue that this part of memory is used by
>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>> directly to the shuffle service components.
>>>>>>
>>>>>>
>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>>> incorrect, because it is doing not only shuffles as well. The
>>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>>> earlier than later.*
>>>>>>
>>>>>>
>>>>>> This is also a bit controversial topic for me. Indeed, we have always
>>>>>> used ’network’ for this concept of task data shuffling over the network and
>>>>>> this can confuse existing users.
>>>>>>
>>>>>> On the other hand for the new users and in a long term, ’network’ can
>>>>>> delude into a conclusion that all network memory is managed by this option.
>>>>>> Also other types of shuffle might not directly deal with network at
>>>>>> all.
>>>>>>
>>>>>> By calling it shuffle, we were somewhat biased by understanding it in
>>>>>> term of map/reduce. This is rather an inter-task data exchange.
>>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>>>>>
>>>>>
>>>  From a user's perspective I believe "taskmanager.memory.network" would
>>> be easier to understand as not everyone knows exactly what the shuffle
>>> service is. I see the point that it would be a bit imprecise as we can have
>>> different shuffle implementations but I would go with the ease of
>>> use/understanding here. Moreover, I think that we won't have many different
>>> shuffle service implementations in the foreseeable future.
>>>
>>>>
>>>>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>>>>>> say that it also accounts for I/O direct memory, but the parameter is not
>>>>>>> counted into the MaxDirectMemory parameter.
>>>>>>
>>>>>>
>>>>>> True. Since we already have framework off-heap memory accounted for
>>>>>> ad hoc direct memory usages, accounting all of jvm-overhead also into max
>>>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>>> direct memory" from the description, and explicitly mention that this
>>>>>> option does not account for direct memory and will not be accounted into
>>>>>> max direct memory limit. WDYT?
>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>> later.
>>>>>>
>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>>
>>>>>>
>>>>>> I would appreciate it.
>>>>>>
>>>>>> Also small side note, we extensively use MemorySize in logs now but
>>>>>> it might be not always readable as its string representation is only in
>>>>>> bytes atm
>>>>>> and does not reduce it to kb/mb/etc in case of big bytes value. We
>>>>>> could have at least some .prettyPrint function to use in logs.
>>>>>> And .fromMegabytes/etc factory methods would improve code readability
>>>>>> instead of .parse(int + “m”).
>>>>>>
>>>>>> Best,
>>>>>> Andrey
>>>>>>
>>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>>>>>
>>>>>> Hi Xintong!
>>>>>>
>>>>>> Please find my answers inline:
>>>>>>
>>>>>>>   - "taskmanager.memory.size" (old main config option) is replaced
>>>>>>>> by "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>>>> overhead, while the new option does. That means that with the default
>>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>>> managed memory going off heap).
>>>>>>>>     I am wondering if we could interpret "taskmanager.memory.size"
>>>>>>>> as the deprecated key for "taskmanager.memory.total-flink.size". That would
>>>>>>>> be in line with the old mechanism (assuming managed memory is set to off
>>>>>>>> heap).
>>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>>     Curious to hear more thoughts there.
>>>>>>>
>>>>>>>
>>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>>
>>>>>>
>>>>>> *==> Yes*
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I think the problem here is that the legacy "taskmanager.heap.size"
>>>>>>> was used differently in standalone setups and active yarn / mesos setups,
>>>>>>> and such different calculation logics and behaviors are exactly what we
>>>>>>> want to avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>>>>> setups.
>>>>>>>
>>>>>>> I think what we really want is probably
>>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>>> new config options are not configured). We can provide util methods in
>>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>>> configuration logics at one place.
>>>>>>>
>>>>>>
>>>>>> *==> This is pretty much what I meant as well (maybe my description
>>>>>> was not very clear), so +1 for that mechanism*
>>>>>>
>>>>>>
>>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>>> the right approach here.
>>>>>>>>     For example: I started a relatively large JVM (large heap size
>>>>>>>> of 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>>>>> When you see the error that no memory could be allocated, you need to
>>>>>>>> understand the magic of how this is derived.
>>>>>>>>     I am trying to think about this from the perspective of using
>>>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>>>     When starting Flink out of a running process, we cannot assume
>>>>>>>> that we are the only users of that process and that we can mold the process
>>>>>>>> to our demands. I think a fix value for managed memory and network memory
>>>>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>>>>> towards exclusive use of the process.
>>>>>>>
>>>>>>>
>>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>>
>>>>>>
>>>>>> *==> Cool, let's also see what Andrey and Till think here.*
>>>>>>
>>>>>>
>>>>>>
>>>>>>>   - Some off-heap memory goes into direct memory, some does not.
>>>>>>>> This confused me a bit. For example
>>>>>>>> "taskmanager.memory.framework.off-heap.size" is counted into
>>>>>>>> MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as
>>>>>>>> native memory. Maybe we should rename the keys to reflect that. There is no
>>>>>>>> one "off heap" memory type after all. Maybe use
>>>>>>>> "taskmanager.memory.task.native: XXXmb" and
>>>>>>>> "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>>
>>>>>>>
>>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted
>>>>>>> in the max direct memory size limit. The confusion probably comes from that
>>>>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>>>>> That's actually because the framework off-heap memory is introduced later
>>>>>>> in a separate commit. We should fix that.
>>>>>>>
>>>>>>> For framework / task off-heap memory, we do not differentiate direct
>>>>>>> / native memory usage. That means the configure value for these two options
>>>>>>> could be a mixture of direct / native memory. Since we do not know the
>>>>>>> portion of direct memory out of the configured value, we have
>>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>>
>>>>>>
>>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>>> structures that then in turn release direct memory). If the limit is
>>>>>> anyways too high (because we also count native memory in there) such that
>>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>>
>>>>>>
>>>>>>   - The network memory keys are now called
>>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>>
>>>>>>>
>>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>>>> other hand, one can also argue that this part of memory is used by
>>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>>> directly to the shuffle service components.
>>>>>>>
>>>>>>
>>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>>> incorrect, because it does more than just shuffles. The
>>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>>> earlier than later.*
>>>>>>
>>>>>>
>>>>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*"
>>>>>> keys say that it also accounts for I/O direct memory, but the parameter is
>>>>>> not counted into the MaxDirectMemory parameter.
>>>>>>
>>>>>>>
>>>>>>> True. Since we already have framework off-heap memory accounted for
>>>>>>> ad hoc direct memory usages, accounting all of jvm-overhead also into max
>>>>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>>>> direct memory" from the description, and explicitly mention that this
>>>>>>> option does not account for direct memory and will not be accounted into
>>>>>>> max direct memory limit. WDYT?
>>>>>>>
>>>>>>
>>>>>> *==> Sounds good. *
>>>>>>
>>>>>>
>>>>>>
>>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>>> later.
>>>>>>>
>>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>>
>>>>>>
>>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Stephan
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Stephan Ewen <se...@apache.org>.
How about putting "taskmanager.memory.flink.size" in the configuration?
Then a newly downloaded Flink behaves similarly to the previous standalone setups.

If someone upgrades the binaries, but re-uses their old configuration, then
they get the compatibility as discussed previously.
We used that approach previously with the fine-grained failover recovery.
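For example, the shipped flink-conf.yaml could then roughly look like this (the
sizes are only placeholders, not a proposal for the actual defaults):

  taskmanager.memory.flink.size: 1280m
  # taskmanager.memory.process.size: 1728m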

On Mon, Dec 23, 2019 at 3:27 AM Xintong Song <to...@gmail.com> wrote:

> +1 to not have more options for off-heap memory, that would get confusing
>> fast. We can keep the name "off-heap" for both task and framework, as long
>> as they mean the same thing: native plus direct, and fully counted into
>> MaxDirectMemory. I would suggest to update the config descriptions then to
>> reflect that.
>>
> True, this should be explained in the config descriptions.
>
> Which configuration option will be set in Flink's default flink-conf.yaml?
>> If we want to maintain the existing behaviour it would have to be the then
>> deprecated taskmanager.heap.size config option. If we are ok with Yarn
>> requesting slightly larger containers, then it could also
>> be taskmanager.memory.total-flink.size.
>>
> Good point. Currently, we have "taskmanager.memory.total-process.size". In
> order to preserve the previous behavior, we need to have
> "taskmanager.heap.size" so it can be mapped to different new options in
> standalone / active setups.
> I think we can have the deprecated "taskmanager.heap.size" in the default
> flink-conf.yaml, and also have the
> new "taskmanager.memory.total-process.size" in a commented line. We can
> explain how the deprecated config option behaves differently in the
> comments, so that users can switch to the new config options if they want to.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Thanks for the feedback and good discussion everyone. I left some
>> comments inline.
>>
>> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org> wrote:
>>
>>> +1 to not have more options for off-heap memory, that would get
>>> confusing fast. We can keep the name "off-heap" for both task and
>>> framework, as long as they mean the same thing: native plus direct, and
>>> fully counted into MaxDirectMemory. I would suggest to update the config
>>> descriptions then to reflect that.
>>>
>>>
>>>
>>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com>
>>> wrote:
>>>
>>>> Regarding the framework/task direct/native memory options, I tend to
>>>> think about it differently. I'm in favor of keeping the "*.off-heap.size" for the
>>>> config option keys.
>>>>
>>>>    - It's not necessary IMO to expose the different concepts of
>>>>    direct / native memory to the users.
>>>>    - I would avoid introducing more options for native memory if
>>>>    possible. Taking fine grained resource management and dynamic slot
>>>>    allocation into consideration, that also means introducing more fields into
>>>>    ResourceSpec / ResourceProfile.
>>>>    - My gut feeling is that having a relatively loose MaxDirectMemory
>>>>    should not be a big problem.
>>>>    - In most cases, the task / framework off-heap memory should be
>>>>       mainly (if not all) direct memory, so the difference between derived
>>>>       MaxDirectMemory and the ideal direct memory limit should not be too much.
>>>>       - We do not have a good way to know the exact size needed for
>>>>       jvm overhead / metaspace and framework / task off-heap memory, thus having
>>>>       to conservatively reserve slightly more memory than what is actually needed.
>>>>       Such reserved but not used memory can cover for the small MaxDirectMemory
>>>>       error.
>>>>       - MaxDirectMemory is not the only way to trigger full gc. We
>>>>       still have heap activities that can also trigger the gc.
>>>>
>>>> Regarding the memory type config options, I've looked into the latest
>>>> ConfigOptions changes. I think it shouldn't be too complicated to change
>>>> the config options to use memory type, and I can handle it maybe during
>>>> your vacations.
>>>>
>>>>
>>>> Also agree with improving MemorySize logging and parsing. This should
>>>> not be a blocker that has to be done in 1.10. I would say we finish other
>>>> works (testability, documentation and those discussed in this thread)
>>>> first, and get to this only if we have time.
>>>>
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>>
>>>>
>>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>>>> azagrebin.apache@gmail.com> wrote:
>>>>
>>>>> Hi Stephan and Xintong,
>>>>>
>>>>> Thanks for the further FLIP-49 feedbacks.
>>>>>
>>>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>> overhead, while the new option does. That means that with the default
>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>> managed memory going off heap).
>>>>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>>>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>>>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>     Curious to hear more thoughts there.
>>>>>
>>>>>
>>>>> I believe you mean "taskmanager.heap.size".
>>>>>
>>>>> I think the problem here is that the legacy "taskmanager.heap.size"
>>>>> was used differently in standalone setups and active yarn / mesos setups,
>>>>> and such different calculation logics and behaviors are exactly what we
>>>>> want to avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>>> setups.
>>>>>
>>>>> I think what we really want is probably
>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>> configured) in startup scripts / active resource managers, and set the
>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>> new config options are not configured). We can provide util methods in
>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>> configuration logics at one place.
>>>>>
>>>>>
>>>>> I agree that the problem is that the legacy option
>>>>> ‘taskmanager.heap.size’ has different semantics for standalone/container.
>>>>> We had it initially falling back to 'taskmanager.memory.total-flink.size’
>>>>> but I changed that to align it with the container cut-off. Now I see that it
>>>>> changes the standalone setup then.
>>>>> +1 for supporting its backwards compatibility differently for
>>>>> standalone/container setups.
>>>>>
>>>>
>> Which configuration option will be set in Flink's default
>> flink-conf.yaml? If we want to maintain the existing behaviour it would
>> have to be the then deprecated taskmanager.heap.size config option. If we
>> are ok with Yarn requesting slightly larger containers, then it could also
>> be taskmanager.memory.total-flink.size.
>>
>>>
>>>>>
>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>> the right approach here.
>>>>>>     For example: I started a relatively large JVM (large heap size of
>>>>>> 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>>> When you see the error that no memory could be allocated, you need to
>>>>>> understand the magic of how this is derived.
>>>>>>     I am trying to think about this from the perspective of using
>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>     When starting Flink out of a running process, we cannot assume
>>>>>> that we are the only users of that process and that we can mold the process
>>>>>> to our demands. I think a fixed value for managed memory and network memory
>>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>>> towards exclusive use of the process.
>>>>>
>>>>>
>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>
>>>>>
>>>>> also +1 for that, if the user has not specified any main options to derive
>>>>> the memory from. We should also log this fixing of managed / shuffle memory.
>>>>> And just noticed, we could also sanity check the framework heap and, if
>>>>> explicitly configured, the task heap against the available JVM heap, and at
>>>>> least log inconsistencies.
>>>>>
>>>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>>>> is counted into MaxDirectMemory while
>>>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>
>>>>>
>>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>>>>> the max direct memory size limit. The confusion probably comes from that
>>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>>> That's actually because the framework off-heap memory is introduced later
>>>>> in a separate commit. We should fix that.
>>>>>
>>>>> For framework / task off-heap memory, we do not differentiate direct /
>>>>> native memory usage. That means the configured value for these two options
>>>>> could be a mixture of direct / native memory. Since we do not know the
>>>>> portion of direct memory out of the configured value, we have
>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>
>>>>>
>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>> structures that then in turn release direct memory). If the limit is
>>>>> anyways too high (because we also count native memory in there) such that
>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>
>>>>>
>>>>> I always also thought about it as providing more of a safety net for direct
>>>>> allocations, but the GC aspect looks more important.
>>>>>
>>>>> +1 for fixing the docs for 'taskmanager.memory.task.off-heap.size’ and
>>>>> renaming it to ‘direct', as this is what really happens,
>>>>> if we want to support the direct limit more exactly than now.
>>>>>
>>>>> I also think that it is hard to separate direct / native memory unless
>>>>> we introduce even more options.
>>>>> If a user wants to keep the direct limit tight to a certain value but
>>>>> also use native memory outside of it,
>>>>> she would have to increase something else, like the JVM overhead, to
>>>>> account for it, and there is no better way.
>>>>> Having more options to account for the native memory outside of direct
>>>>> limit complicates things but can be introduced later if needed.
>>>>>
>>>>>   - What do you think about renaming
>>>>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>>>>> and "taskmanager.memory.total-process.size" to
>>>>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>>>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>>>>> loss of expressiveness.
>>>>>
>>>>>
>>>>> +1 on this.
>>>>>
>>>>>
>>>>> +1 as well. Also an option: 'taskmanager.memory.total-process.size’ ->
>>>>> ‘taskmanager.memory.jvm.process.size’,
>>>>> although it can be also mentioned in docs that we mean JVM process.
>>>>>
>>>>
>> I'd be in favour of Stephan's proposal for the config keys as shorter is
>> usually better and they are still descriptive enough. Between
>> "taskmanager.memory.process.size" and "taskmanager.memory.jvm.total" I
>> would slightly favour the first variant.
>>
>>>
>>>>>   - The network memory keys are now called
>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>
>>>>>
>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>> other hand, one can also argue that this part of memory is used by
>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>> directly to the shuffle service components.
>>>>>
>>>>>
>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>> incorrect, because it does more than just shuffles. The
>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>> earlier than later.*
>>>>>
>>>>>
>>>>> This is also a bit of a controversial topic for me. Indeed, we have always
>>>>> used ’network’ for this concept of task data shuffling over the network and
>>>>> this can confuse existing users.
>>>>>
>>>>> On the other hand, for new users and in the long term, ’network’ can
>>>>> mislead them into concluding that all network memory is managed by this option.
>>>>> Also other types of shuffle might not directly deal with network at
>>>>> all.
>>>>>
>>>>> By calling it shuffle, we were somewhat biased by understanding it in
>>>>> terms of map/reduce. This is rather an inter-task data exchange.
>>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>>>>
>>>>
>>  From a user's perspective I believe "taskmanager.memory.network" would
>> be easier to understand as not everyone knows exactly what the shuffle
>> service is. I see the point that it would be a bit imprecise as we can have
>> different shuffle implementations but I would go with the ease of
>> use/understanding here. Moreover, I think that we won't have many different
>> shuffle service implementations in the foreseeable future.
>>
>>>
>>>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>>>>> say that it also accounts for I/O direct memory, but the parameter is not
>>>>>> counted into the MaxDirectMemory parameter.
>>>>>
>>>>>
>>>>> True. Since we already have framework off-heap memory accounted for ad
>>>>> hoc direct memory usages, accounting all of jvm-overhead also into max
>>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>> direct memory" from the description, and explicitly mention that this
>>>>> option does not account for direct memory and will not be accounted into
>>>>> max direct memory limit. WDYT?
>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>> later.
>>>>>
>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>
>>>>>
>>>>> I would appreciate it.
>>>>>
>>>>> Also a small side note: we extensively use MemorySize in logs now, but it
>>>>> might not always be readable, as its string representation is only in bytes
>>>>> atm and does not reduce it to kb/mb/etc for large byte values. We
>>>>> could have at least some .prettyPrint function to use in logs.
>>>>> And .fromMegabytes/etc factory methods would improve code readability
>>>>> instead of .parse(int + “m”).
>>>>>
>>>>> Best,
>>>>> Andrey
>>>>>
>>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>>>>
>>>>> Hi Xintong!
>>>>>
>>>>> Please find my answers inline:
>>>>>
>>>>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>>> overhead, while the new option does. That means that with the default
>>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>>> managed memory going off heap).
>>>>>>>     I am wondering if we could interpret "taskmanager.memory.size"
>>>>>>> as the deprecated key for "taskmanager.memory.total-flink.size". That would
>>>>>>> be in line with the old mechanism (assuming managed memory is set to off
>>>>>>> heap).
>>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>>     Curious to hear more thoughts there.
>>>>>>
>>>>>>
>>>>>> I believe you mean "taskmanager.heap.size".
>>>>>>
>>>>>
>>>>> *==> Yes*
>>>>>
>>>>>
>>>>>
>>>>>> I think the problem here is that the legacy "taskmanager.heap.size"
>>>>>> was used differently in standalone setups and active yarn / mesos setups,
>>>>>> and such different calculation logics and behaviors are exactly what we
>>>>>> want to avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>>>> setups.
>>>>>>
>>>>>> I think what we really want is probably
>>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>>> configured) in startup scripts / active resource managers, and set the
>>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>>> new config options are not configured). We can provide util methods in
>>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>>> configuration logics at one place.
>>>>>>
>>>>>
>>>>> *==> This is pretty much what I meant as well (maybe my description
>>>>> was not very clear), so +1 for that mechanism*
>>>>>
>>>>>
>>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>>> the right approach here.
>>>>>>>     For example: I started a relatively large JVM (large heap size
>>>>>>> of 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>>>> When you see the error that no memory could be allocated, you need to
>>>>>>> understand the magic of how this is derived.
>>>>>>>     I am trying to think about this from the perspective of using
>>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>>     When starting Flink out of a running process, we cannot assume
>>>>>>> that we are the only users of that process and that we can mold the process
>>>>>>> to our demands. I think a fixed value for managed memory and network memory
>>>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>>>> towards exclusive use of the process.
>>>>>>
>>>>>>
>>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>>
>>>>>
>>>>> *==> Cool, let's also see what Andrey and Till think here.*
>>>>>
>>>>>
>>>>>
>>>>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>>>>> is counted into MaxDirectMemory while
>>>>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>>
>>>>>>
>>>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted
>>>>>> in the max direct memory size limit. The confusion probably comes from that
>>>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>>>> That's actually because the framework off-heap memory is introduced later
>>>>>> in a separate commit. We should fix that.
>>>>>>
>>>>>> For framework / task off-heap memory, we do not differentiate direct
>>>>>> / native memory usage. That means the configured value for these two options
>>>>>> could be a mixture of direct / native memory. Since we do not know the
>>>>>> portion of direct memory out of the configured value, we have
>>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>>
>>>>>
>>>>> *==>  In that case, I am a bit confused. For the total size
>>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>>> difficult parameter, and the main reason to set it was (if I recall
>>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>>> structures that then in turn release direct memory). If the limit is
>>>>> anyways too high (because we also count native memory in there) such that
>>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>>
>>>>>
>>>>>   - The network memory keys are now called
>>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>>
>>>>>>
>>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>>> other hand, one can also argue that this part of memory is used by
>>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>>> directly to the shuffle service components.
>>>>>>
>>>>>
>>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>>> incorrect, because it does more than just shuffles. The
>>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>>> earlier than later.*
>>>>>
>>>>>
>>>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>>>> say that it also accounts for I/O direct memory, but the parameter is not
>>>>> counted into the MaxDirectMemory parameter.
>>>>>
>>>>>>
>>>>>> True. Since we already have framework off-heap memory accounted for
>>>>>> ad hoc direct memory usages, accounting all of jvm-overhead also into max
>>>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>>> direct memory" from the description, and explicitly mention that this
>>>>>> option does not account for direct memory and will not be accounted into
>>>>>> max direct memory limit. WDYT?
>>>>>>
>>>>>
>>>>> *==> Sounds good. *
>>>>>
>>>>>
>>>>>
>>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>>> later.
>>>>>>
>>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>>
>>>>>
>>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>>
>>>>>
>>>>> Best,
>>>>> Stephan
>>>>>
>>>>>
>>>>>
>>>>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
>
> +1 to not have more options for off-heap memory, that would get confusing
> fast. We can keep the name "off-heap" for both task and framework, as long
> as they mean the same thing: native plus direct, and fully counted into
> MaxDirectMemory. I would suggest to update the config descriptions then to
> reflect that.
>
True, this should be explained in the config descriptions.
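I.e. the descriptions should make it explicit that, roughly,

  -XX:MaxDirectMemorySize = framework off-heap + task off-heap + shuffle memory

and that the jvm-overhead is not part of that limit.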

Which configuration option will be set in Flink's default flink-conf.yaml?
> If we want to maintain the existing behaviour it would have to be the then
> deprecated taskmanager.heap.size config option. If we are ok with Yarn
> requesting slightly larger containers, then it could also
> be taskmanager.memory.total-flink.size.
>
Good point. Currently, we have "taskmanager.memory.total-process.size". In
order to preserve the previous behavior, we need to have
"taskmanager.heap.size" so it can be mapped to different new options in
standalone / active setups.
I think we can have the deprecated "taskmanager.heap.size" in the default
flink-conf.yaml, and also have the
new "taskmanager.memory.total-process.size" in a commented line. We can
explain how the deprecated config option behaves differently in the
comments, so that users can switch to the new config options if they want to.
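Concretely, the shipped file could then roughly contain (values are only
placeholders):

  taskmanager.heap.size: 1024m
  # taskmanager.memory.total-process.size: 1728m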

Thank you~

Xintong Song



On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann <tr...@apache.org> wrote:

> Thanks for the feedback and good discussion everyone. I left some comments
> inline.
>
> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org> wrote:
>
>> +1 to not have more options for off-heap memory, that would get confusing
>> fast. We can keep the name "off-heap" for both task and framework, as long
>> as they mean the same thing: native plus direct, and fully counted into
>> MaxDirectMemory. I would suggest to update the config descriptions then to
>> reflect that.
>>
>>
>>
>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com>
>> wrote:
>>
>>> Regarding the framework/task direct/native memory options, I tend to
>>> think about it differently. I'm in favor of keeping the "*.off-heap.size" for the
>>> config option keys.
>>>
>>>    - It's not necessary IMO to expose the different concepts of direct
>>>    / native memory to the users.
>>>    - I would avoid introducing more options for native memory if
>>>    possible. Taking fine grained resource management and dynamic slot
>>>    allocation into consideration, that also means introducing more fields into
>>>    ResourceSpec / ResourceProfile.
>>>    - My gut feeling is that having a relatively loose MaxDirectMemory
>>>    should not be a big problem.
>>>    - In most cases, the task / framework off-heap memory should be
>>>       mainly (if not all) direct memory, so the difference between derived
>>>       MaxDirectMemory and the ideal direct memory limit should not be too much.
>>>       - We do not have a good way to know the exact size needed for jvm
>>>       overhead / metaspace and framework / task off-heap memory, thus having
>>>       to conservatively reserve slightly more memory than what is actually needed.
>>>       Such reserved but not used memory can cover for the small MaxDirectMemory
>>>       error.
>>>       - MaxDirectMemory is not the only way to trigger full gc. We
>>>       still have heap activities that can also trigger the gc.
>>>
>>> Regarding the memory type config options, I've looked into the latest
>>> ConfigOptions changes. I think it shouldn't be too complicated to change
>>> the config options to use memory type, and I can handle it maybe during
>>> your vacations.
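>>> E.g., with the new typed builder an option would roughly become (just a sketch):
>>>
>>>   public static final ConfigOption<MemorySize> TASK_HEAP_MEMORY =
>>>       ConfigOptions.key("taskmanager.memory.task.heap.size")
>>>           .memoryType()
>>>           .noDefaultValue()
>>>           .withDescription("Task Heap Memory size for TaskExecutors.");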
>>>
>>>
>>> Also agree with improving MemorySize logging and parsing. This should
>>> not be a blocker that has to be done in 1.10. I would say we finish other
>>> works (testability, documentation and those discussed in this thread)
>>> first, and get to this only if we have time.
>>>
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>>
>>>
>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>>> azagrebin.apache@gmail.com> wrote:
>>>
>>>> Hi Stephan and Xintong,
>>>>
>>>> Thanks for the further FLIP-49 feedbacks.
>>>>
>>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>> overhead, while the new option does. That means that with the default
>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>> managed memory going off heap).
>>>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>     Curious to hear more thoughts there.
>>>>
>>>>
>>>> I believe you mean "taskmanager.heap.size".
>>>>
>>>> I think the problem here is that the legacy "taskmanager.heap.size" was
>>>> used differently in standalone setups and active yarn / mesos setups, and
>>>> such different calculation logics and behaviors are exactly what we want to
>>>> avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>> setups.
>>>>
>>>> I think what we really want is probably mapping "taskmanager.heap.size"
>>>> to different new config options in different setups. How about we
>>>> mark "taskmanager.heap.size" as deprecated key for neither of
>>>> "taskmanager.memory.total-process.size" and
>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>> configured) in startup scripts / active resource managers, and set the
>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>> new config options are not configured). We can provide util methods in
>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>> configuration logics at one place.
>>>>
>>>>
>>>> I agree that the problem is that the legacy option
>>>> ‘taskmanager.heap.size’ has different semantics for standalone/container.
>>>> We had it initially falling back to 'taskmanager.memory.total-flink.size’
>>>> but I changed that to align it with the container cut-off. Now I see that it
>>>> changes the standalone setup then.
>>>> +1 for supporting its backwards compatibility differently for
>>>> standalone/container setups.
>>>>
>>>
> Which configuration option will be set in Flink's default flink-conf.yaml?
> If we want to maintain the existing behaviour it would have to be the then
> deprecated taskmanager.heap.size config option. If we are ok with Yarn
> requesting slightly larger containers, then it could also
> be taskmanager.memory.total-flink.size.
>
>>
>>>>
>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>> the right approach here.
>>>>>     For example: I started a relatively large JVM (large heap size of
>>>>> 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>> When you see the error that no memory could be allocated, you need to
>>>>> understand the magic of how this is derived.
>>>>>     I am trying to think about this from the perspective of using
>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>     When starting Flink out of a running process, we cannot assume
>>>>> that we are the only users of that process and that we can mold the process
>>>>> to our demands. I think a fixed value for managed memory and network memory
>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>> towards exclusive use of the process.
>>>>
>>>>
>>>> +1 on having fixed values for managed / shuffle memory.
>>>>
>>>>
>>>> also +1 for that, if the user has not specified any main options to derive
>>>> the memory from. We should also log this fixing of managed / shuffle memory.
>>>> And just noticed, we could also sanity check the framework heap and, if
>>>> explicitly configured, the task heap against the available JVM heap, and at
>>>> least log inconsistencies.
>>>>
>>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>>> is counted into MaxDirectMemory while
>>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>
>>>>
>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>>>> the max direct memory size limit. The confusion probably comes from that
>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>> That's actually because the framework off-heap memory is introduced later
>>>> in a separate commit. We should fix that.
>>>>
>>>> For framework / task off-heap memory, we do not differentiate direct /
>>>> native memory usage. That means the configured value for these two options
>>>> could be a mixture of direct / native memory. Since we do not know the
>>>> portion of direct memory out of the configured value, we have
>>>> to conservatively account it all into the max direct memory size limit.
>>>>
>>>>
>>>> *==>  In that case, I am a bit confused. For the total size
>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>> difficult parameter, and the main reason to set it was (if I recall
>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>> structures that then in turn release direct memory). If the limit is
>>>> anyways too high (because we also count native memory in there) such that
>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>
>>>>
>>>> I always also thought about it as providing more of a safety net for direct
>>>> allocations, but the GC aspect looks more important.
>>>>
>>>> +1 for fixing the docs for 'taskmanager.memory.task.off-heap.size’ and
>>>> renaming it to ‘direct', as this is what really happens,
>>>> if we want to support the direct limit more exactly than now.
>>>>
>>>> I also think that it is hard to separate direct / native memory unless
>>>> we introduce even more options.
>>>> If a user wants to keep the direct limit tight to a certain value but
>>>> also use native memory outside of it,
>>>> she would have to increase something else, like the JVM overhead, to account
>>>> for it, and there is no better way.
>>>> Having more options to account for the native memory outside of direct
>>>> limit complicates things but can be introduced later if needed.
>>>>
>>>>   - What do you think about renaming
>>>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>>>> and "taskmanager.memory.total-process.size" to
>>>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>>>> loss of expressiveness.
>>>>
>>>>
>>>> +1 on this.
>>>>
>>>>
>>>> +1 as well. Also an option: 'taskmanager.memory.total-process.size’ ->
>>>> ‘taskmanager.memory.jvm.process.size’,
>>>> although it can be also mentioned in docs that we mean JVM process.
>>>>
>>>
> I'd be in favour of Stephan's proposal for the config keys as shorter is
> usually better and they are still descriptive enough. Between
> "taskmanager.memory.process.size" and "taskmanager.memory.jvm.total" I
> would slightly favour the first variant.
>
>>
>>>>   - The network memory keys are now called
>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>
>>>>
>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>> other hand, one can also argue that this part of memory is used by
>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>> directly to the shuffle service components.
>>>>
>>>>
>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>> incorrect, because it does more than just shuffles. The
>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>> earlier than later.*
>>>>
>>>>
>>>> This is also a bit of a controversial topic for me. Indeed, we have always
>>>> used ’network’ for this concept of task data shuffling over the network and
>>>> this can confuse existing users.
>>>>
>>>> On the other hand, for new users and in the long term, ’network’ can
>>>> mislead them into concluding that all network memory is managed by this option.
>>>> Also other types of shuffle might not directly deal with network at all.
>>>>
>>>> By calling it shuffle, we were somewhat biased by understanding it in
>>>> terms of map/reduce. This is rather an inter-task data exchange.
>>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>>>
>>>
>  From a user's perspective I believe "taskmanager.memory.network" would be
> easier to understand as not everyone knows exactly what the shuffle service
> is. I see the point that it would be a bit imprecise as we can have
> different shuffle implementations but I would go with the ease of
> use/understanding here. Moreover, I think that we won't have many different
> shuffle service implementations in the foreseeable future.
>
>>
>>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>>>> say that it also accounts for I/O direct memory, but the parameter is not
>>>>> counted into the MaxDirectMemory parameter.
>>>>
>>>>
>>>> True. Since we already have framework off-heap memory accounted for ad
>>>> hoc direct memory usages, accounting all of jvm-overhead also into max
>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>> direct memory" from the description, and explicitly mention that this
>>>> option does not account for direct memory and will not be accounted into
>>>> max direct memory limit. WDYT?
>>>>
>>>>
>>>> +1
>>>>
>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>> That makes validation automatic and saves us from breaking the options
>>>>> later.
>>>>
>>>> +1. Wasn't aware of the new memory type config options.
>>>>
>>>>
>>>> +1
>>>>
>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>
>>>>
>>>> I would appreciate it.
>>>>
>>>> Also a small side note: we extensively use MemorySize in logs now, but it
>>>> might not always be readable, as its string representation is only in bytes
>>>> atm and does not reduce it to kb/mb/etc for large byte values. We
>>>> could have at least some .prettyPrint function to use in logs.
>>>> And .fromMegabytes/etc factory methods would improve code readability
>>>> instead of .parse(int + “m”).
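>>>> E.g. something along these lines (method names are only the proposed ones
>>>> and do not exist yet):
>>>>
>>>>   MemorySize managed = MemorySize.fromMegabytes(128);   // instead of MemorySize.parse(128 + "m")
>>>>   String pretty = managed.prettyPrint();                // e.g. "128 mb" instead of "134217728 bytes"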
>>>>
>>>> Best,
>>>> Andrey
>>>>
>>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>> Hi Xintong!
>>>>
>>>> Please find my answers inline:
>>>>
>>>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>>> overhead, while the new option does. That means that with the default
>>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>>> managed memory going off heap).
>>>>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>>>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>>>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>>     Curious to hear more thoughts there.
>>>>>
>>>>>
>>>>> I believe you mean "taskmanager.heap.size".
>>>>>
>>>>
>>>> *==> Yes*
>>>>
>>>>
>>>>
>>>>> I think the problem here is that the legacy "taskmanager.heap.size"
>>>>> was used differently in standalone setups and active yarn / mesos setups,
>>>>> and such different calculation logics and behaviors are exactly what we
>>>>> want to avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>>> setups.
>>>>>
>>>>> I think what we really want is probably
>>>>> mapping "taskmanager.heap.size" to different new config options in
>>>>> different setups. How about we mark "taskmanager.heap.size" as deprecated
>>>>> key for neither of "taskmanager.memory.total-process.size" and
>>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>>> configured) in startup scripts / active resource managers, and set the
>>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>>> new config options are not configured). We can provide util methods in
>>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>>> configuration logics at one place.
>>>>>
>>>>
>>>> *==> This is pretty much what I meant as well (maybe my description was
>>>> not very clear), so +1 for that mechanism*
>>>>
>>>>
>>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>>> the right approach here.
>>>>>>     For example: I started a relatively large JVM (large heap size of
>>>>>> 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>>> When you see the error that no memory could be allocated, you need to
>>>>>> understand the magic of how this is derived.
>>>>>>     I am trying to think about this from the perspective of using
>>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>>     When starting Flink out of a running process, we cannot assume
>>>>>> that we are the only users of that process and that we can mold the process
>>>>>> to our demands. I think a fixed value for managed memory and network memory
>>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>>> towards exclusive use of the process.
>>>>>
>>>>>
>>>>> +1 on having fixed values for managed / shuffle memory.
>>>>>
>>>>
>>>> *==> Cool, let's also see what Andrey and Till think here.*
>>>>
>>>>
>>>>
>>>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>>>> is counted into MaxDirectMemory while
>>>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>>
>>>>>
>>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>>>>> the max direct memory size limit. The confusion probably comes from that
>>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>>> That's actually because the framework off-heap memory is introduced later
>>>>> in a separate commit. We should fix that.
>>>>>
>>>>> For framework / task off-heap memory, we do not differentiate direct /
>>>>> native memory usage. That means the configured value for these two options
>>>>> could be a mixture of direct / native memory. Since we do not know the
>>>>> portion of direct memory out of the configured value, we have
>>>>> to conservatively account it all into the max direct memory size limit.
>>>>>
>>>>
>>>> *==>  In that case, I am a bit confused. For the total size
>>>> calculation, it is fine. But why do we then set MaxDirectMemory? It is a
>>>> difficult parameter, and the main reason to set it was (if I recall
>>>> correctly) to trigger GC based on direct memory allocation (to free heap
>>>> structures that then in turn release direct memory). If the limit is
>>>> anyways too high (because we also count native memory in there) such that
>>>> we can exceed the total process (container) memory, why do we set it then?*
>>>>
>>>>
>>>>   - The network memory keys are now called
>>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>>
>>>>>
>>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>>> other hand, one can also argue that this part of memory is used by
>>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>>> directly to the shuffle service components.
>>>>>
>>>>
>>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>>> incorrect, because it does more than just shuffles. The
>>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>>> This isn't super high priority for me, but if we want to adjust it, better
>>>> earlier than later.*
>>>>
>>>>
>>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>>> say that it also accounts for I/O direct memory, but the parameter is not
>>>> counted into the MaxDirectMemory parameter.
>>>>
>>>>>
>>>>> True. Since we already have framework off-heap memory accounted for ad
>>>>> hoc direct memory usages, accounting all of jvm-overhead also into max
>>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>>> direct memory" from the description, and explicitly mention that this
>>>>> option does not account for direct memory and will not be accounted into
>>>>> max direct memory limit. WDYT?
>>>>>
>>>>
>>>> *==> Sounds good. *
>>>>
>>>>
>>>>
>>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>>> That makes validation automatic and saves us from breaking the options
>>>>>> later.
>>>>>
>>>>> +1. Wasn't aware of the new memory type config options.
>>>>>
>>>>
>>>> *==> Thanks. Do you need help with adjusting this?*
>>>>
>>>>
>>>> Best,
>>>> Stephan
>>>>
>>>>
>>>>
>>>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Till Rohrmann <tr...@apache.org>.
Thanks for the feedback and good discussion everyone. I left some comments
inline.

On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen <se...@apache.org> wrote:

> +1 to not have more options for off-heap memory, that would get confusing
> fast. We can keep the name "off-heap" for both task and framework, as long
> as they mean the same thing: native plus direct, and fully counted into
> MaxDirectMemory. I would suggest to update the config descriptions then to
> reflect that.
>
>
>
> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com>
> wrote:
>
>> Regarding the framework/task direct/native memory options, I tend to
>> think about it differently. I'm in favor of keeping the "*.off-heap.size" for the
>> config option keys.
>>
>>    - It's not necessary IMO to expose the different concepts of direct
>>    / native memory to the users.
>>    - I would avoid introducing more options for native memory if
>>    possible. Taking fine grained resource management and dynamic slot
>>    allocation into consideration, that also means introducing more fields into
>>    ResourceSpec / ResourceProfile.
>>    - My gut feeling is that having a relatively loose MaxDirectMemory
>>    should not be a big problem.
>>    - In most cases, the task / framework off-heap memory should be
>>       mainly (if not all) direct memory, so the difference between derived
>>       MaxDirectMemory and the ideal direct memory limit should not be too much.
>>       - We do not have a good way to know the exact size needed for jvm
>>       overhead / metaspace and framework / task off-heap memory, thus having
>>       to conservatively reserve slightly more memory than what is actually needed.
>>       Such reserved but not used memory can cover for the small MaxDirectMemory
>>       error.
>>       - MaxDirectMemory is not the only way to trigger full gc. We still
>>       have heap activities that can also trigger the gc.
>>
>> Regarding the memory type config options, I've looked into the latest
>> ConfigOptions changes. I think it shouldn't be too complicated to change
>> the config options to use memory type, and I can handle it maybe during
>> your vacations.
>>
>>
>> Also agree with improving MemorySize logging and parsing. This should not
>> be a blocker that has to be done in 1.10. I would say we finish other works
>> (testability, documentation and those discussed in this thread) first, and
>> get to this only if we have time.
>>
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>> azagrebin.apache@gmail.com> wrote:
>>
>>> Hi Stephan and Xintong,
>>>
>>> Thanks for the further FLIP-49 feedbacks.
>>>
>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>> standalone setups. The old option did not subtract metaspace and other
>>>> overhead, while the new option does. That means that with the default
>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>> managed memory going off heap).
>>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>>     The effect would be that the container size on Yarn/Mesos
>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>> add overhead and metaspace to reach the total process size, rather than
>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>> active resource manager, getting full backwards compatibility on that part.
>>>>     Curious to hear more thoughts there.
>>>
>>>
>>> I believe you mean "taskmanager.heap.size".
>>>
>>> I think the problem here is that the legacy "taskmanager.heap.size" was
>>> used differently in standalone setups and active yarn / mesos setups, and
>>> such different calculation logics and behaviors are exactly what we want to
>>> avoid with FLIP-49. Therefore, I'm not in favor of treating
>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>> setups.
>>>
>>> I think what we really want is probably to map "taskmanager.heap.size"
>>> to different new config options in different setups. How about we
>>> mark "taskmanager.heap.size" as a deprecated key for neither
>>> "taskmanager.memory.total-process.size" nor
>>> "taskmanager.memory.total-flink.size"? Instead, we parse it (if explicitly
>>> configured) in the startup scripts / active resource managers, and set its
>>> value to "taskmanager.memory.total-flink.size" in the scripts and to
>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>> new config options are not configured). We can provide util methods in
>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>> configuration logic in one place.
>>>
>>>
>>> I agree that the problem is that the legacy option
>>> ‘taskmanager.heap.size’ has different semantics for standalone/container
>>> setups. We had it initially falling back to
>>> 'taskmanager.memory.total-flink.size’ but I changed that to align it with
>>> the container cut-off. Now I see that this changes the standalone setup.
>>> +1 for supporting its backwards compatibility differently for
>>> standalone/container setups.
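To make the proposed mapping concrete, here is a minimal sketch of such a
util method. The class and method names are hypothetical and only illustrate
the idea; the real logic would live in TaskExecutorResourceUtils and the
startup scripts as proposed above:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.MemorySize;

    final class LegacyHeapSizeMappingSketch {

        // Key names as discussed in this thread.
        private static final String LEGACY_KEY = "taskmanager.heap.size";
        private static final String TOTAL_FLINK_KEY = "taskmanager.memory.total-flink.size";
        private static final String TOTAL_PROCESS_KEY = "taskmanager.memory.total-process.size";

        /** Maps the legacy key onto the new options, depending on the deployment type. */
        static void mapLegacyHeapSize(Configuration conf, boolean activeResourceManager) {
            if (!conf.containsKey(LEGACY_KEY)
                    || conf.containsKey(TOTAL_FLINK_KEY)
                    || conf.containsKey(TOTAL_PROCESS_KEY)) {
                return; // nothing to map, or the new options take precedence
            }
            MemorySize legacy = MemorySize.parse(conf.getString(LEGACY_KEY, null));
            // Standalone (scripts): the legacy value meant "all memory used by Flink".
            // Active Yarn/Mesos RM: the legacy value meant the container / process size.
            String targetKey = activeResourceManager ? TOTAL_PROCESS_KEY : TOTAL_FLINK_KEY;
            conf.setString(targetKey, legacy.getBytes() + "b");
        }
    }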
>>>
>>
Which configuration option will be set in Flink's default flink-conf.yaml?
If we want to maintain the existing behaviour, it would have to be the
then-deprecated taskmanager.heap.size config option. If we are ok with Yarn
requesting slightly larger containers, then it could also
be taskmanager.memory.total-flink.size.
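For illustration, the two variants for the default flink-conf.yaml would
roughly look as follows (the 1024m value is only illustrative):

    # keep the existing behaviour via the (then deprecated) legacy key:
    taskmanager.heap.size: 1024m

    # or switch to the new key, accepting slightly larger Yarn/Mesos containers:
    taskmanager.memory.total-flink.size: 1024m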

>
>>>
>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>> the right approach here.
>>>>     For example: I started a relatively large JVM (large heap size of
>>>> 10 GB) as a test. With the current logic, the system tries to reserve an
>>>> additional 6GB for managed memory which is more than there is memory left.
>>>> When you see the error that no memory could be allocated, you need to
>>>> understand the magic of how this is derived.
>>>>     I am trying to think about this from the perspective of using
>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>     When starting Flink out of a running process, we cannot assume that
>>>> we are the only users of that process and that we can mold the process to
>>>> our demands. I think a fix value for managed memory and network memory
>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>> towards exclusive use of the process.
>>>
>>>
>>> +1 on having fixed values for managed / shuffle memory.
>>>
>>>
>>> Also +1 for that, if the user has not specified any of the main options
>>> from which memory is derived. We should also log this fixing of managed /
>>> shuffle memory. And just noticed: we could also sanity check the framework
>>> heap and, if explicitly configured, the task heap against the available
>>> JVM heap, and at least log inconsistencies.
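A minimal sketch of the "fixed values" idea for the MiniCluster, assuming
the config keys as currently discussed in this thread (they may still be
renamed) and purely illustrative sizes:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.minicluster.MiniCluster;
    import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

    public class MiniClusterFixedMemorySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Fixed, modest pools instead of deriving them from the JVM heap size.
            conf.setString("taskmanager.memory.managed.size", "128m");
            conf.setString("taskmanager.memory.shuffle.min", "64m");
            conf.setString("taskmanager.memory.shuffle.max", "64m");

            MiniClusterConfiguration miniClusterConf = new MiniClusterConfiguration.Builder()
                    .setConfiguration(conf)
                    .setNumTaskManagers(1)
                    .setNumSlotsPerTaskManager(4)
                    .build();

            try (MiniCluster miniCluster = new MiniCluster(miniClusterConf)) {
                miniCluster.start();
                // submit test jobs against the mini cluster ...
            }
        }
    }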
>>>
>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>> is counted into MaxDirectMemory while
>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>
>>>
>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>>> the max direct memory size limit. The confusion probably comes from that
>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>> That's actually because the framework off-heap memory is introduced later
>>> in a separate commit. We should fix that.
>>>
>>> For framework / task off-heap memory, we do not differentiate direct /
>>> native memory usage. That means the configure value for these two options
>>> could be a mixture of direct / native memory. Since we do not know the
>>> portion of direct memory out of the configured value, we have
>>> to conservatively account it all into the max direct memory size limit.
>>>
>>>
>>> *==>  In that case, I am a bit confused. For the total size calculation,
>>> it is fine. But why do we then set MaxDirectMemory? It is a difficult
>>> parameter, and the main reason to set it was (if I recall correctly) to
>>> trigger GC based on direct memory allocation (to free heap structures that
>>> then in turn release direct memory). If the limit is anyways too high
>>> (because we also count native memory in there) such that we can exceed the
>>> total process (container) memory, why do we set it then?*
>>>
>>>
>>> I always also thought about it as providing a safety net for direct
>>> allocations, but the GC aspect looks more important.
>>>
>>> +1 for fixing the docs for 'taskmanager.memory.task.off-heap.size’ and
>>> renaming it to ‘direct', as this is what really happens, if we want to
>>> support the direct limit more exactly than now.
>>>
>>> I also think that it is hard to separate direct / native memory unless
>>> we introduce even more options.
>>> If a user wants to keep the direct limit tight to a certain value but
>>> also use native memory outside of it, she would have to increase something
>>> else, like the JVM overhead, to account for it; there is no better way.
>>> Having more options to account for native memory outside of the direct
>>> limit complicates things, but they can be introduced later if needed.
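To illustrate why the derived limit is loose, a rough sketch of the
derivation (not the exact TaskExecutorResourceUtils code, sizes purely
illustrative):

    import org.apache.flink.configuration.MemorySize;

    public class MaxDirectMemorySketch {
        public static void main(String[] args) {
            MemorySize frameworkOffHeap = MemorySize.parse("128m");
            MemorySize taskOffHeap = MemorySize.parse("256m");
            MemorySize shuffleMemory = MemorySize.parse("64m");

            // Everything that *might* be direct memory is summed up and passed
            // to the JVM as -XX:MaxDirectMemorySize=<maxDirectBytes>.
            long maxDirectBytes = frameworkOffHeap.getBytes()
                    + taskOffHeap.getBytes()
                    + shuffleMemory.getBytes();

            // If part of the task off-heap budget is actually native memory,
            // the limit is larger than the real direct allocations, i.e. "loose".
            System.out.println("-XX:MaxDirectMemorySize=" + maxDirectBytes);
        }
    }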
>>>
>>>   - What do you think about renaming
>>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>>> and "taskmanager.memory.total-process.size" to
>>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>>> loss of expressiveness.
>>>
>>>
>>> +1 on this.
>>>
>>>
>>> +1 as well. Also an option: 'taskmanager.memory.total-process.size’ ->
>>> ‘taskmanager.memory.jvm.process.size’,
>>> although it can also be mentioned in the docs that we mean the JVM process.
>>>
>>
I'd be in favour of Stephan's proposal for the config keys as shorter is
usually better and they are still descriptive enough. Between
"taskmanager.memory.process.size" and "taskmanager.memory.jvm.total" I
would slightly favour the first variant.

>
>>>   - The network memory keys are now called
>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>> example, there are many discussions about "shuffle join versus broadcast
>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>
>>>
>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
>>> hand, one can also argue that this part of memory is used by
>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>> directly to the shuffle service components.
>>>
>>>
>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>> incorrect, because it does more than just shuffles. The ShuffleEnvironment
>>> is also more internal, so the name is not too critical. This isn't super
>>> high priority for me, but if we want to adjust it, better earlier than
>>> later.*
>>>
>>>
>>> This is also a bit of a controversial topic for me. Indeed, we have
>>> always used ’network’ for this concept of task data shuffling over the
>>> network, and this can confuse existing users.
>>>
>>> On the other hand, for new users and in the long term, ’network’ can
>>> mislead into the conclusion that all network memory is managed by this
>>> option. Also, other types of shuffle might not directly deal with the
>>> network at all.
>>>
>>> By calling it shuffle, we were somewhat biased by understanding it in
>>> terms of map/reduce. This is rather an inter-task data exchange.
>>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>>
>>
From a user's perspective, I believe "taskmanager.memory.network" would be
easier to understand, as not everyone knows exactly what the shuffle service
is. I see the point that it would be a bit imprecise, as we can have
different shuffle implementations, but I would go with the ease of
use/understanding here. Moreover, I think that we won't have many different
shuffle service implementations in the foreseeable future.

>
>>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>>> say that it also accounts for I/O direct memory, but the parameter is not
>>>> counted into the MaxDirectMemory parameter.
>>>
>>>
>>> True. Since we already have the framework off-heap memory to account for
>>> ad-hoc direct memory usage, accounting all of the JVM overhead into the
>>> max direct memory limit as well seems unnecessary. I would suggest
>>> removing "I/O direct memory" from the description, and explicitly
>>> mentioning that this option does not account for direct memory and will
>>> not be accounted into the max direct memory limit. WDYT?
>>>
>>>
>>> +1
>>>
>>>   - Can make the new ConfigOptions strongly typed with the new
>>>> configuration options. For example, directly use MemorySize typed options.
>>>> That makes validation automatic and saves us from breaking the options
>>>> later.
>>>
>>> +1. Wasn't aware of the new memory type config options.
>>>
>>>
>>> +1
>>>
>>> *==> Thanks. Do you need help with adjusting this?*
>>>
>>>
>>> I would appreciate it.
>>>
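For illustration, a memory-typed option with the new ConfigOptions builder
could look roughly like this (key name and default value only illustrative):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class MemoryTypedOptionSketch {
        // A MemorySize-typed option: parsing and validation happen in the
        // configuration framework instead of at every place reading a raw string.
        public static final ConfigOption<MemorySize> FRAMEWORK_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.framework.heap.size")
                        .memoryType()
                        .defaultValue(MemorySize.parse("128m"))
                        .withDescription("Framework heap memory size for the TaskExecutor.");
    }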
>>> Also, a small side note: we extensively use MemorySize in logs now, but
>>> it might not always be readable, as its string representation is currently
>>> only in bytes and does not reduce large byte values to kb/mb/etc. We could
>>> have at least some .prettyPrint function to use in logs.
>>> And .fromMegabytes/etc. factory methods would improve code readability
>>> compared to .parse(int + “m”).
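A rough sketch of what such helpers might look like; both methods are
hypothetical and not existing MemorySize API:

    import org.apache.flink.configuration.MemorySize;

    final class MemorySizeUtilsSketch {

        /** Hypothetical factory, sugar for MemorySize.parse(mb + "m"). */
        static MemorySize fromMegabytes(long mebiBytes) {
            return new MemorySize(mebiBytes << 20);
        }

        /** Hypothetical pretty-printer reducing big byte values to a readable unit. */
        static String prettyPrint(MemorySize size) {
            long bytes = size.getBytes();
            String[] units = {"bytes", "kb", "mb", "gb", "tb"};
            int unit = 0;
            double value = bytes;
            while (value >= 1024 && unit < units.length - 1) {
                value /= 1024;
                unit++;
            }
            return String.format("%.2f %s (%d bytes)", value, units[unit], bytes);
        }
    }

For example, prettyPrint(MemorySize.parse("1568m")) would render as
"1.53 gb (1644167168 bytes)" instead of the raw byte count.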
>>>
>>> Best,
>>> Andrey
>>>
>>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>>
>>> Hi Xintong!
>>>
>>> Please find my answers inline:
>>>
>>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>>> standalone setups. The old option did not subtract metaspace and other
>>>>> overhead, while the new option does. That means that with the default
>>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>>> managed memory going off heap).
>>>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>>>     The effect would be that the container size on Yarn/Mesos
>>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>>> add overhead and metaspace to reach the total process size, rather than
>>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>>> active resource manager, getting full backwards compatibility on that part.
>>>>>     Curious to hear more thoughts there.
>>>>
>>>>
>>>> I believe you mean "taskmanager.heap.size".
>>>>
>>>
>>> *==> Yes*
>>>
>>>
>>>
>>>> I think the problem here is that the legacy "taskmanager.heap.size" was
>>>> used differently in standalone setups and active yarn / mesos setups, and
>>>> such different calculation logics and behaviors are exactly what we want to
>>>> avoid with FLIP-49. Therefore, I'm not in favor of treating
>>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>>> setups.
>>>>
>>>> I think what we really want is probably mapping "taskmanager.heap.size"
>>>> to different new config options in different setups. How about we
>>>> mark "taskmanager.heap.size" as deprecated key for neither of
>>>> "taskmanager.memory.total-process.size" and
>>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>>> configured) in startup scripts / active resource managers, and set the
>>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>>> new config options are not configured). We can provide util methods in
>>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>>> configuration logics at one place.
>>>>
>>>
>>> *==> This is pretty much what I meant as well (maybe my description was
>>> not very clear), so +1 for that mechanism*
>>>
>>>
>>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>>> the right approach here.
>>>>>     For example: I started a relatively large JVM (large heap size of
>>>>> 10 GB) as a test. With the current logic, the system tries to reserve an
>>>>> additional 6GB for managed memory which is more than there is memory left.
>>>>> When you see the error that no memory could be allocated, you need to
>>>>> understand the magic of how this is derived.
>>>>>     I am trying to think about this from the perspective of using
>>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>>     When starting Flink out of a running process, we cannot assume
>>>>> that we are the only users of that process and that we can mold the process
>>>>> to our demands. I think a fix value for managed memory and network memory
>>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>>> towards exclusive use of the process.
>>>>
>>>>
>>>> +1 on having fixed values for managed / shuffle memory.
>>>>
>>>
>>> *==> Cool, let's also see what Andrey and Till think here.*
>>>
>>>
>>>
>>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>>> is counted into MaxDirectMemory while
>>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>>
>>>>
>>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>>>> the max direct memory size limit. The confusion probably comes from that
>>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>>> That's actually because the framework off-heap memory is introduced later
>>>> in a separate commit. We should fix that.
>>>>
>>>> For framework / task off-heap memory, we do not differentiate direct /
>>>> native memory usage. That means the configure value for these two options
>>>> could be a mixture of direct / native memory. Since we do not know the
>>>> portion of direct memory out of the configured value, we have
>>>> to conservatively account it all into the max direct memory size limit.
>>>>
>>>
>>> *==>  In that case, I am a bit confused. For the total size calculation,
>>> it is fine. But why do we then set MaxDirectMemory? It is a difficult
>>> parameter, and the main reason to set it was (if I recall correctly) to
>>> trigger GC based on direct memory allocation (to free heap structures that
>>> then in turn release direct memory). If the limit is anyways too high
>>> (because we also count native memory in there) such that we can exceed the
>>> total process (container) memory, why do we set it then?*
>>>
>>>
>>>   - The network memory keys are now called
>>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>>> example, there are many discussions about "shuffle join versus broadcast
>>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>>
>>>>
>>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the
>>>> other hand, one can also argue that this part of memory is used by
>>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>>> directly to the shuffle service components.
>>>>
>>>
>>> *==> In that case, the name "Shuffle Environment" may be a bit
>>> incorrect, because it is doing not only shuffles as well. The
>>> ShuffleEnvironment is also more internal, so the name is not too critical.
>>> This isn't super high priority for me, but if we want to adjust it, better
>>> earlier than later.*
>>>
>>>
>>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>>> say that it also accounts for I/O direct memory, but the parameter is not
>>> counted into the MaxDirectMemory parameter.
>>>
>>>>
>>>> True. Since we already have framework off-heap memory accounted for ad
>>>> hoc direct memory usages, accounting all of jvm-overhead also into max
>>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>>> direct memory" from the description, and explicitly mention that this
>>>> option does not account for direct memory and will not be accounted into
>>>> max direct memory limit. WDYT?
>>>>
>>>
>>> *==> Sounds good. *
>>>
>>>
>>>
>>>>   - Can make the new ConfigOptions strongly typed with the new
>>>>> configuration options. For example, directly use MemorySize typed options.
>>>>> That makes validation automatic and saves us from breaking the options
>>>>> later.
>>>>
>>>> +1. Wasn't aware of the new memory type config options.
>>>>
>>>
>>> *==> Thanks. Do you need help with adjusting this?*
>>>
>>>
>>> Best,
>>> Stephan
>>>
>>>
>>>
>>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Stephan Ewen <se...@apache.org>.
+1 to not have more options for off-heap memory, that would get confusing
fast. We can keep the name "off-heap" for both task and framework, as long
as they mean the same thing: native plus direct, and fully counted into
MaxDirectMemory. I would suggest to update the config descriptions then to
reflect that.



On Fri, Dec 20, 2019 at 1:03 PM Xintong Song <to...@gmail.com> wrote:

> Regarding the framework/task direct/native memory options, I tend to think
> it differently. I'm in favor of keep the "*.off-heap.size" for the config
> option keys.
>
>    - It's not necessary IMO to expose the difference concepts of direct /
>    native memory to the users.
>    - I would avoid introducing more options for native memory if
>    possible. Taking fine grained resource management and dynamic slot
>    allocation into consideration, that also means introduce more fields into
>    ResourceSpec / ResourceProfile.
>    - My gut feeling is that having a relative loose MaxDirectMemory
>    should not be a big problem.
>    - In most cases, the task / framework off-heap memory should be mainly
>       (if not all) direct memory, so the difference between derived
>       MaxDirectMemory and the ideal direct memory limit should not be too much.
>       - We do not have a good way to know the exact size needed for jvm
>       overhead / metaspace and framework / task off-heap memory, thus having
>       to conservatively reserve slightly more memory then what actually needed.
>       Such reserved but not used memory can cover for the small MaxDirectMemory
>       error.
>       -
>       - MaxDirectMemory is not the only way to trigger full gc. We still
>       heap activities that can also trigger the gc.
>
> Regarding the memory type config options, I've looked into the latest
> ConfigOptions changes. I think it shouldn't be too complicated to change
> the config options to use memory type, and I can handle it maybe during
> your vacations.
>
>
> Also agree with improving MemorySize logging and parsing. This should not
> be a blocker that has to be done in 1.10. I would say we finish other works
> (testability, documentation and those discussed in this thread) first, and
> get to this only if we have time.
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
> azagrebin.apache@gmail.com> wrote:
>
>> Hi Stephan and Xintong,
>>
>> Thanks for the further FLIP-49 feedbacks.
>>
>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>> standalone setups. The old option did not subtract metaspace and other
>>> overhead, while the new option does. That means that with the default
>>> config, standalone clusters get quite a bit less memory. (independent of
>>> managed memory going off heap).
>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>     The effect would be that the container size on Yarn/Mesos increases,
>>> because from "taskmanager.memory.total-flink.size", we need to add overhead
>>> and metaspace to reach the total process size, rather than cutting off
>>> memory. But if we want, we could even adjust for that in the active
>>> resource manager, getting full backwards compatibility on that part.
>>>     Curious to hear more thoughts there.
>>
>>
>> I believe you mean "taskmanager.heap.size".
>>
>> I think the problem here is that the legacy "taskmanager.heap.size" was
>> used differently in standalone setups and active yarn / mesos setups, and
>> such different calculation logics and behaviors are exactly what we want to
>> avoid with FLIP-49. Therefore, I'm not in favor of treating
>> "taskmanager.memory.total-flink.size" differently for standalone and active
>> setups.
>>
>> I think what we really want is probably mapping "taskmanager.heap.size"
>> to different new config options in different setups. How about we
>> mark "taskmanager.heap.size" as deprecated key for neither of
>> "taskmanager.memory.total-process.size" and
>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>> configured) in startup scripts / active resource managers, and set the
>> value to "taskmanager.memory.total-flink.size" in the scripts and
>> "taskmanager.memory.total-process.size" in active resource managers (if the
>> new config options are not configured). We can provide util methods in
>> TaskExecutorResourceUtils for such conversions, to keep all the
>> configuration logics at one place.
>>
>>
>> I agree that the problem is that the legacy option
>> ‘taskmanager.heap.size’ has different semantics for standalone/container.
>> We had it initially falling back to 'taskmanager.memory.total-flink.size’
>> but I changed that to align it with container cut-off. Now I see it changes
>> standalone setup then.
>> +1 for supporting its backwards compatibility differently for
>> standalone/container setups.
>>
>>
>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>> the right approach here.
>>>     For example: I started a relatively large JVM (large heap size of 10
>>> GB) as a test. With the current logic, the system tries to reserve an
>>> additional 6GB for managed memory which is more than there is memory left.
>>> When you see the error that no memory could be allocated, you need to
>>> understand the magic of how this is derived.
>>>     I am trying to think about this from the perspective of using "Flink
>>> as a Library", which the MiniCluster is close to.
>>>     When starting Flink out of a running process, we cannot assume that
>>> we are the only users of that process and that we can mold the process to
>>> our demands. I think a fix value for managed memory and network memory
>>> would feel more natural in such a setup than a mechanism that is tailored
>>> towards exclusive use of the process.
>>
>>
>> +1 on having fixed values for managed / shuffle memory.
>>
>>
>> also +1 for that, if user has not specified any main options to derive
>> memory. We should also log this fixing of managed / shuffle memory.
>> And just noticed, we could also sanity check framework and if explicitly
>> configured task heap against available JVM heap, and at least log
>> inconsistencies.
>>
>>   - Some off-heap memory goes into direct memory, some does not. This
>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>> is counted into MaxDirectMemory while
>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>> we should rename the keys to reflect that. There is no one "off heap"
>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>
>>
>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>> the max direct memory size limit. The confusion probably comes from that
>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>> That's actually because the framework off-heap memory is introduced later
>> in a separate commit. We should fix that.
>>
>> For framework / task off-heap memory, we do not differentiate direct /
>> native memory usage. That means the configure value for these two options
>> could be a mixture of direct / native memory. Since we do not know the
>> portion of direct memory out of the configured value, we have
>> to conservatively account it all into the max direct memory size limit.
>>
>>
>> *==>  In that case, I am a bit confused. For the total size calculation,
>> it is fine. But why do we then set MaxDirectMemory? It is a difficult
>> parameter, and the main reason to set it was (if I recall correctly) to
>> trigger GC based on direct memory allocation (to free heap structures that
>> then in turn release direct memory). If the limit is anyways too high
>> (because we also count native memory in there) such that we can exceed the
>> total process (container) memory, why do we set it then?*
>>
>>
>> I always also thought about it as providing more safety net for direct
>> allocations but GC thing looks more important.
>>
>> +1 for fixing docs for 'taskmanager.memory.task.off-heap.size’ and
>> renaming to ‘direct' as this is what really happens
>> if we want to support direct limit more exact than now.
>>
>> I also think that it is hard to separate direct / native memory unless we
>> introduce even more options.
>> If user wants to keep the direct limit tight to a certain value but also
>> use native memory outside of it,
>> she would have to increase something else, like JVM overhead to account
>> for it and there is no other better way.
>> Having more options to account for the native memory outside of direct
>> limit complicates things but can be introduced later if needed.
>>
>>   - What do you think about renaming
>>> "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size"
>>> and "taskmanager.memory.total-process.size" to
>>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>>> think these keys may be a bit less clumsy (dropping the "total-") without
>>> loss of expressiveness.
>>
>>
>> +1 on this.
>>
>>
>> +1 as well. Also an option: 'taskmanager.memory.total-process.size’ ->
>> ‘taskmanager.memory.jvm.process.size’,
>> although it can be also mentioned in docs that we mean JVM process.
>>
>>   - The network memory keys are now called
>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>> understood as a redistribution (random, or maybe by hash of key). As an
>>> example, there are many discussions about "shuffle join versus broadcast
>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>> memory however for all network operations, like forward pipes, broadcasts,
>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>> confusing, I am wondering if users would find that as well. So throwing in
>>> the suggestion to call the options "taskmanager.memory.network.*".
>>
>>
>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
>> hand, one can also argue that this part of memory is used by
>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>> directly to the shuffle service components.
>>
>>
>> *==> In that case, the name "Shuffle Environment" may be a bit incorrect,
>> because it is doing not only shuffles as well. The ShuffleEnvironment is
>> also more internal, so the name is not too critical. This isn't super high
>> priority for me, but if we want to adjust it, better earlier than later.*
>>
>>
>> This is also a bit controversial topic for me. Indeed, we have always
>> used ’network’ for this concept of task data shuffling over the network and
>> this can confuse existing users.
>>
>> On the other hand for the new users and in a long term, ’network’ can
>> delude into a conclusion that all network memory is managed by this option.
>> Also other types of shuffle might not directly deal with network at all.
>>
>> By calling it shuffle, we were somewhat biased by understanding it in
>> term of map/reduce. This is rather an inter-task data exchange.
>> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
>> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>>
>>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
>>> that it also accounts for I/O direct memory, but the parameter is not
>>> counted into the MaxDirectMemory parameter.
>>
>>
>> True. Since we already have framework off-heap memory accounted for ad
>> hoc direct memory usages, accounting all of jvm-overhead also into max
>> direct memory limit seems not necessary. I would suggest to remove "I/O
>> direct memory" from the description, and explicitly mention that this
>> option does not account for direct memory and will not be accounted into
>> max direct memory limit. WDYT?
>>
>>
>> +1
>>
>>   - Can make the new ConfigOptions strongly typed with the new
>>> configuration options. For example, directly use MemorySize typed options.
>>> That makes validation automatic and saves us from breaking the options
>>> later.
>>
>> +1. Wasn't aware of the new memory type config options.
>>
>>
>> +1
>>
>> *==> Thanks. Do you need help with adjusting this?*
>>
>>
>> I would appreciate it.
>>
>> Also small side note, we extensively use MemorySize in logs now but it
>> might be not always readable as its string representation is only in bytes
>> atm
>> and does not reduce it to kb/mb/etc in case of big bytes value. We could
>> have at least some .prettyPrint function to use in logs.
>> And .fromMegabytes/etc factory methods would improve code readability
>> instead of .parse(int + “m”).
>>
>> Best,
>> Andrey
>>
>> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>>
>> Hi Xintong!
>>
>> Please find my answers inline:
>>
>>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>>> standalone setups. The old option did not subtract metaspace and other
>>>> overhead, while the new option does. That means that with the default
>>>> config, standalone clusters get quite a bit less memory. (independent of
>>>> managed memory going off heap).
>>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>>     The effect would be that the container size on Yarn/Mesos
>>>> increases, because from "taskmanager.memory.total-flink.size", we need to
>>>> add overhead and metaspace to reach the total process size, rather than
>>>> cutting off memory. But if we want, we could even adjust for that in the
>>>> active resource manager, getting full backwards compatibility on that part.
>>>>     Curious to hear more thoughts there.
>>>
>>>
>>> I believe you mean "taskmanager.heap.size".
>>>
>>
>> *==> Yes*
>>
>>
>>
>>> I think the problem here is that the legacy "taskmanager.heap.size" was
>>> used differently in standalone setups and active yarn / mesos setups, and
>>> such different calculation logics and behaviors are exactly what we want to
>>> avoid with FLIP-49. Therefore, I'm not in favor of treating
>>> "taskmanager.memory.total-flink.size" differently for standalone and active
>>> setups.
>>>
>>> I think what we really want is probably mapping "taskmanager.heap.size"
>>> to different new config options in different setups. How about we
>>> mark "taskmanager.heap.size" as deprecated key for neither of
>>> "taskmanager.memory.total-process.size" and
>>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>>> configured) in startup scripts / active resource managers, and set the
>>> value to "taskmanager.memory.total-flink.size" in the scripts and
>>> "taskmanager.memory.total-process.size" in active resource managers (if the
>>> new config options are not configured). We can provide util methods in
>>> TaskExecutorResourceUtils for such conversions, to keep all the
>>> configuration logics at one place.
>>>
>>
>> *==> This is pretty much what I meant as well (maybe my description was
>> not very clear), so +1 for that mechanism*
>>
>>
>>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>>> the right approach here.
>>>>     For example: I started a relatively large JVM (large heap size of
>>>> 10 GB) as a test. With the current logic, the system tries to reserve an
>>>> additional 6GB for managed memory which is more than there is memory left.
>>>> When you see the error that no memory could be allocated, you need to
>>>> understand the magic of how this is derived.
>>>>     I am trying to think about this from the perspective of using
>>>> "Flink as a Library", which the MiniCluster is close to.
>>>>     When starting Flink out of a running process, we cannot assume that
>>>> we are the only users of that process and that we can mold the process to
>>>> our demands. I think a fix value for managed memory and network memory
>>>> would feel more natural in such a setup than a mechanism that is tailored
>>>> towards exclusive use of the process.
>>>
>>>
>>> +1 on having fixed values for managed / shuffle memory.
>>>
>>
>> *==> Cool, let's also see what Andrey and Till think here.*
>>
>>
>>
>>>   - Some off-heap memory goes into direct memory, some does not. This
>>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>>> is counted into MaxDirectMemory while
>>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>>> we should rename the keys to reflect that. There is no one "off heap"
>>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>>
>>>
>>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>>> the max direct memory size limit. The confusion probably comes from that
>>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>>> That's actually because the framework off-heap memory is introduced later
>>> in a separate commit. We should fix that.
>>>
>>> For framework / task off-heap memory, we do not differentiate direct /
>>> native memory usage. That means the configure value for these two options
>>> could be a mixture of direct / native memory. Since we do not know the
>>> portion of direct memory out of the configured value, we have
>>> to conservatively account it all into the max direct memory size limit.
>>>
>>
>> *==>  In that case, I am a bit confused. For the total size calculation,
>> it is fine. But why do we then set MaxDirectMemory? It is a difficult
>> parameter, and the main reason to set it was (if I recall correctly) to
>> trigger GC based on direct memory allocation (to free heap structures that
>> then in turn release direct memory). If the limit is anyways too high
>> (because we also count native memory in there) such that we can exceed the
>> total process (container) memory, why do we set it then?*
>>
>>
>>   - The network memory keys are now called
>>>> "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually
>>>> understood as a redistribution (random, or maybe by hash of key). As an
>>>> example, there are many discussions about "shuffle join versus broadcast
>>>> join", where "shuffle" is the synonym for "re-partitioning". We use that
>>>> memory however for all network operations, like forward pipes, broadcasts,
>>>> receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*"
>>>> confusing, I am wondering if users would find that as well. So throwing in
>>>> the suggestion to call the options "taskmanager.memory.network.*".
>>>
>>>
>>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
>>> hand, one can also argue that this part of memory is used by
>>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>>> directly to the shuffle service components.
>>>
>>
>> *==> In that case, the name "Shuffle Environment" may be a bit incorrect,
>> because it is doing not only shuffles as well. The ShuffleEnvironment is
>> also more internal, so the name is not too critical. This isn't super high
>> priority for me, but if we want to adjust it, better earlier than later.*
>>
>>
>>    - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys
>> say that it also accounts for I/O direct memory, but the parameter is not
>> counted into the MaxDirectMemory parameter.
>>
>>>
>>> True. Since we already have framework off-heap memory accounted for ad
>>> hoc direct memory usages, accounting all of jvm-overhead also into max
>>> direct memory limit seems not necessary. I would suggest to remove "I/O
>>> direct memory" from the description, and explicitly mention that this
>>> option does not account for direct memory and will not be accounted into
>>> max direct memory limit. WDYT?
>>>
>>
>> *==> Sounds good. *
>>
>>
>>
>>>   - Can make the new ConfigOptions strongly typed with the new
>>>> configuration options. For example, directly use MemorySize typed options.
>>>> That makes validation automatic and saves us from breaking the options
>>>> later.
>>>
>>> +1. Wasn't aware of the new memory type config options.
>>>
>>
>> *==> Thanks. Do you need help with adjusting this?*
>>
>>
>> Best,
>> Stephan
>>
>>
>>
>>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
Regarding the framework/task direct/native memory options, I tend to think
it differently. I'm in favor of keep the "*.off-heap.size" for the config
option keys.

   - It's not necessary IMO to expose the difference concepts of direct /
   native memory to the users.
   - I would avoid introducing more options for native memory if possible.
   Taking fine grained resource management and dynamic slot allocation into
   consideration, that also means introduce more fields into ResourceSpec /
   ResourceProfile.
   - My gut feeling is that having a relative loose MaxDirectMemory should
   not be a big problem.
   - In most cases, the task / framework off-heap memory should be mainly
      (if not all) direct memory, so the difference between derived
      MaxDirectMemory and the ideal direct memory limit should not be too much.
      - We do not have a good way to know the exact size needed for jvm
      overhead / metaspace and framework / task off-heap memory, thus having
      to conservatively reserve slightly more memory then what actually needed.
      Such reserved but not used memory can cover for the small MaxDirectMemory
      error.
      -
      - MaxDirectMemory is not the only way to trigger full gc. We still
      heap activities that can also trigger the gc.

Regarding the memory type config options, I've looked into the latest
ConfigOptions changes. I think it shouldn't be too complicated to change
the config options to use memory type, and I can handle it maybe during
your vacations.


Also agree with improving MemorySize logging and parsing. This should not
be a blocker that has to be done in 1.10. I would say we finish other works
(testability, documentation and those discussed in this thread) first, and
get to this only if we have time.


Thank you~

Xintong Song



On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <az...@gmail.com>
wrote:

> Hi Stephan and Xintong,
>
> Thanks for the further FLIP-49 feedbacks.
>
>   - "taskmanager.memory.size" (old main config option) is replaced by
>> "taskmanager.memory.total-process.size" which has a different meaning in
>> standalone setups. The old option did not subtract metaspace and other
>> overhead, while the new option does. That means that with the default
>> config, standalone clusters get quite a bit less memory. (independent of
>> managed memory going off heap).
>>     I am wondering if we could interpret "taskmanager.memory.size" as the
>> deprecated key for "taskmanager.memory.total-flink.size". That would be in
>> line with the old mechanism (assuming managed memory is set to off heap).
>>     The effect would be that the container size on Yarn/Mesos increases,
>> because from "taskmanager.memory.total-flink.size", we need to add overhead
>> and metaspace to reach the total process size, rather than cutting off
>> memory. But if we want, we could even adjust for that in the active
>> resource manager, getting full backwards compatibility on that part.
>>     Curious to hear more thoughts there.
>
>
> I believe you mean "taskmanager.heap.size".
>
> I think the problem here is that the legacy "taskmanager.heap.size" was
> used differently in standalone setups and active yarn / mesos setups, and
> such different calculation logics and behaviors are exactly what we want to
> avoid with FLIP-49. Therefore, I'm not in favor of treating
> "taskmanager.memory.total-flink.size" differently for standalone and active
> setups.
>
> I think what we really want is probably mapping "taskmanager.heap.size" to
> different new config options in different setups. How about we
> mark "taskmanager.heap.size" as deprecated key for neither of
> "taskmanager.memory.total-process.size" and
> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
> configured) in startup scripts / active resource managers, and set the
> value to "taskmanager.memory.total-flink.size" in the scripts and
> "taskmanager.memory.total-process.size" in active resource managers (if the
> new config options are not configured). We can provide util methods in
> TaskExecutorResourceUtils for such conversions, to keep all the
> configuration logics at one place.
>
>
> I agree that the problem is that the legacy option ‘taskmanager.heap.size’
> has different semantics for standalone/container. We had it initially
> falling back to 'taskmanager.memory.total-flink.size’ but I changed that to
> align it with container cut-off. Now I see it changes standalone setup then.
> +1 for supporting its backwards compatibility differently for
> standalone/container setups.
>
>
>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>> standalone setup. I get the idea behind that, but I am wondering if it is
>> the right approach here.
>>     For example: I started a relatively large JVM (large heap size of 10
>> GB) as a test. With the current logic, the system tries to reserve an
>> additional 6GB for managed memory which is more than there is memory left.
>> When you see the error that no memory could be allocated, you need to
>> understand the magic of how this is derived.
>>     I am trying to think about this from the perspective of using "Flink
>> as a Library", which the MiniCluster is close to.
>>     When starting Flink out of a running process, we cannot assume that
>> we are the only users of that process and that we can mold the process to
>> our demands. I think a fix value for managed memory and network memory
>> would feel more natural in such a setup than a mechanism that is tailored
>> towards exclusive use of the process.
>
>
> +1 on having fixed values for managed / shuffle memory.
>
>
> also +1 for that, if user has not specified any main options to derive
> memory. We should also log this fixing of managed / shuffle memory.
> And just noticed, we could also sanity check framework and if explicitly
> configured task heap against available JVM heap, and at least log
> inconsistencies.
>
>   - Some off-heap memory goes into direct memory, some does not. This
>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>> is counted into MaxDirectMemory while
>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>> we should rename the keys to reflect that. There is no one "off heap"
>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>
>
> I believe "taskmanager.memory.task.off-heap.size" is also accounted in the
> max direct memory size limit. The confusion probably comes from that
> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
> its description while "taskmanager.memory.task.off-heap.size" didn't.
> That's actually because the framework off-heap memory is introduced later
> in a separate commit. We should fix that.
>
> For framework / task off-heap memory, we do not differentiate direct /
> native memory usage. That means the configure value for these two options
> could be a mixture of direct / native memory. Since we do not know the
> portion of direct memory out of the configured value, we have
> to conservatively account it all into the max direct memory size limit.
>
>
> *==>  In that case, I am a bit confused. For the total size calculation,
> it is fine. But why do we then set MaxDirectMemory? It is a difficult
> parameter, and the main reason to set it was (if I recall correctly) to
> trigger GC based on direct memory allocation (to free heap structures that
> then in turn release direct memory). If the limit is anyways too high
> (because we also count native memory in there) such that we can exceed the
> total process (container) memory, why do we set it then?*
>
>
> I always also thought about it as providing more safety net for direct
> allocations but GC thing looks more important.
>
> +1 for fixing docs for 'taskmanager.memory.task.off-heap.size’ and
> renaming to ‘direct' as this is what really happens
> if we want to support direct limit more exact than now.
>
> I also think that it is hard to separate direct / native memory unless we
> introduce even more options.
> If user wants to keep the direct limit tight to a certain value but also
> use native memory outside of it,
> she would have to increase something else, like JVM overhead to account
> for it and there is no other better way.
> Having more options to account for the native memory outside of direct
> limit complicates things but can be introduced later if needed.
>
>   - What do you think about renaming "taskmanager.memory.total-flink.size"
>> to "taskmanager.memory.flink.size" and
>> "taskmanager.memory.total-process.size" to
>> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
>> think these keys may be a bit less clumsy (dropping the "total-") without
>> loss of expressiveness.
>
>
> +1 on this.
>
>
> +1 as well. Also an option: 'taskmanager.memory.total-process.size’ ->
> ‘taskmanager.memory.jvm.process.size’,
> although it can be also mentioned in docs that we mean JVM process.
>
>   - The network memory keys are now called "taskmanager.memory.shuffle.*".
>> To my knowledge, shuffle is usually understood as a redistribution (random,
>> or maybe by hash of key). As an example, there are many discussions about
>> "shuffle join versus broadcast join", where "shuffle" is the synonym for
>> "re-partitioning". We use that memory however for all network operations,
>> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
>> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
>> find that as well. So throwing in the suggestion to call the options
>> "taskmanager.memory.network.*".
>
>
> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
> hand, one can also argue that this part of memory is used by
> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
> directly to the shuffle service components.
>
>
> *==> In that case, the name "Shuffle Environment" may be a bit incorrect,
> because it is doing not only shuffles as well. The ShuffleEnvironment is
> also more internal, so the name is not too critical. This isn't super high
> priority for me, but if we want to adjust it, better earlier than later.*
>
>
> This is also a bit controversial topic for me. Indeed, we have always used
> ’network’ for this concept of task data shuffling over the network and this
> can confuse existing users.
>
> On the other hand for the new users and in a long term, ’network’ can
> delude into a conclusion that all network memory is managed by this option.
> Also other types of shuffle might not directly deal with network at all.
>
> By calling it shuffle, we were somewhat biased by understanding it in term
> of map/reduce. This is rather an inter-task data exchange.
> Maybe then 'taskmanager.memory.shuffle.communication.*’ or
> ‘taskmanager.memory.task.shuffle/communication/io/network.*’.
>
>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
>> that it also accounts for I/O direct memory, but the parameter is not
>> counted into the MaxDirectMemory parameter.
>
>
> True. Since we already have framework off-heap memory accounted for ad hoc
> direct memory usages, accounting all of jvm-overhead also into max direct
> memory limit seems not necessary. I would suggest to remove "I/O direct
> memory" from the description, and explicitly mention that this option does
> not account for direct memory and will not be accounted into max direct
> memory limit. WDYT?
>
>
> +1
>
>   - Can make the new ConfigOptions strongly typed with the new
>> configuration options. For example, directly use MemorySize typed options.
>> That makes validation automatic and saves us from breaking the options
>> later.
>
> +1. Wasn't aware of the new memory type config options.
>
>
> +1
>
> *==> Thanks. Do you need help with adjusting this?*
>
>
> I would appreciate it.
>
> Also small side note, we extensively use MemorySize in logs now but it
> might be not always readable as its string representation is only in bytes
> atm
> and does not reduce it to kb/mb/etc in case of big bytes value. We could
> have at least some .prettyPrint function to use in logs.
> And .fromMegabytes/etc factory methods would improve code readability
> instead of .parse(int + “m”).
>
> Best,
> Andrey
>
> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
>
> Hi Xintong!
>
> Please find my answers inline:
>
>>   - "taskmanager.memory.size" (old main config option) is replaced by
>>> "taskmanager.memory.total-process.size" which has a different meaning in
>>> standalone setups. The old option did not subtract metaspace and other
>>> overhead, while the new option does. That means that with the default
>>> config, standalone clusters get quite a bit less memory. (independent of
>>> managed memory going off heap).
>>>     I am wondering if we could interpret "taskmanager.memory.size" as
>>> the deprecated key for "taskmanager.memory.total-flink.size". That would be
>>> in line with the old mechanism (assuming managed memory is set to off heap).
>>>     The effect would be that the container size on Yarn/Mesos increases,
>>> because from "taskmanager.memory.total-flink.size", we need to add overhead
>>> and metaspace to reach the total process size, rather than cutting off
>>> memory. But if we want, we could even adjust for that in the active
>>> resource manager, getting full backwards compatibility on that part.
>>>     Curious to hear more thoughts there.
>>
>>
>> I believe you mean "taskmanager.heap.size".
>>
>
> *==> Yes*
>
>
>
>> I think the problem here is that the legacy "taskmanager.heap.size" was
>> used differently in standalone setups and active yarn / mesos setups, and
>> such different calculation logics and behaviors are exactly what we want to
>> avoid with FLIP-49. Therefore, I'm not in favor of treating
>> "taskmanager.memory.total-flink.size" differently for standalone and active
>> setups.
>>
>> I think what we really want is probably mapping "taskmanager.heap.size"
>> to different new config options in different setups. How about we
>> mark "taskmanager.heap.size" as deprecated key for neither of
>> "taskmanager.memory.total-process.size" and
>> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
>> configured) in startup scripts / active resource managers, and set the
>> value to "taskmanager.memory.total-flink.size" in the scripts and
>> "taskmanager.memory.total-process.size" in active resource managers (if the
>> new config options are not configured). We can provide util methods in
>> TaskExecutorResourceUtils for such conversions, to keep all the
>> configuration logics at one place.
>>
>
> *==> This is pretty much what I meant as well (maybe my description was
> not very clear), so +1 for that mechanism*
>
>
>   - Mini Cluster tries to imitate exact ratio of memory pools as a
>>> standalone setup. I get the idea behind that, but I am wondering if it is
>>> the right approach here.
>>>     For example: I started a relatively large JVM (large heap size of 10
>>> GB) as a test. With the current logic, the system tries to reserve an
>>> additional 6GB for managed memory which is more than there is memory left.
>>> When you see the error that no memory could be allocated, you need to
>>> understand the magic of how this is derived.
>>>     I am trying to think about this from the perspective of using "Flink
>>> as a Library", which the MiniCluster is close to.
>>>     When starting Flink out of a running process, we cannot assume that
>>> we are the only users of that process and that we can mold the process to
>>> our demands. I think a fix value for managed memory and network memory
>>> would feel more natural in such a setup than a mechanism that is tailored
>>> towards exclusive use of the process.
>>
>>
>> +1 on having fixed values for managed / shuffle memory.
>>
>
> *==> Cool, let's also see what Andrey and Till think here.*
>
>
>
>>   - Some off-heap memory goes into direct memory, some does not. This
>>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>>> is counted into MaxDirectMemory while
>>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>>> we should rename the keys to reflect that. There is no one "off heap"
>>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>>
>>
>> I believe "taskmanager.memory.task.off-heap.size" is also accounted in
>> the max direct memory size limit. The confusion probably comes from that
>> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
>> its description while "taskmanager.memory.task.off-heap.size" didn't.
>> That's actually because the framework off-heap memory is introduced later
>> in a separate commit. We should fix that.
>>
>> For framework / task off-heap memory, we do not differentiate direct /
>> native memory usage. That means the configured value for these two options
>> could be a mixture of direct / native memory. Since we do not know the
>> portion of direct memory out of the configured value, we have
>> to conservatively account it all into the max direct memory size limit.
>>
>
> *==>  In that case, I am a bit confused. For the total size calculation,
> it is fine. But why do we then set MaxDirectMemory? It is a difficult
> parameter, and the main reason to set it was (if I recall correctly) to
> trigger GC based on direct memory allocation (to free heap structures that
> then in turn release direct memory). If the limit is anyways too high
> (because we also count native memory in there) such that we can exceed the
> total process (container) memory, why do we set it then?*
>
>
>   - The network memory keys are now called "taskmanager.memory.shuffle.*".
>>> To my knowledge, shuffle is usually understood as a redistribution (random,
>>> or maybe by hash of key). As an example, there are many discussions about
>>> "shuffle join versus broadcast join", where "shuffle" is the synonym for
>>> "re-partitioning". We use that memory however for all network operations,
>>> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
>>> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
>>> find that as well. So throwing in the suggestion to call the options
>>> "taskmanager.memory.network.*".
>>
>>
>> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
>> hand, one can also argue that this part of memory is used by
>> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
>> directly to the shuffle service components.
>>
>
> *==> In that case, the name "Shuffle Environment" may be a bit incorrect,
> because it is doing more than just shuffles. The ShuffleEnvironment is
> also more internal, so the name is not too critical. This isn't super high
> priority for me, but if we want to adjust it, better earlier than later.*
>
>
>    - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
> that it also accounts for I/O direct memory, but the parameter is not
> counted into the MaxDirectMemory parameter.
>
>>
>> True. Since we already have framework off-heap memory accounted for ad
>> hoc direct memory usages, accounting all of jvm-overhead also into max
>> direct memory limit seems not necessary. I would suggest to remove "I/O
>> direct memory" from the description, and explicitly mention that this
>> option does not account for direct memory and will not be accounted into
>> max direct memory limit. WDYT?
>>
>
> *==> Sounds good. *
>
>
>
>>   - Can make the new ConfigOptions strongly typed with the new
>>> configuration options. For example, directly use MemorySize typed options.
>>> That makes validation automatic and saves us from breaking the options
>>> later.
>>
>> +1. Wasn't aware of the new memory type config options.
>>
>
> *==> Thanks. Do you need help with adjusting this?*
>
>
> Best,
> Stephan
>
>
>
>

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Andrey Zagrebin <az...@gmail.com>.
Hi Stephan and Xintong,

Thanks for the further FLIP-49 feedbacks.

>   - "taskmanager.memory.size" (old main config option) is replaced by "taskmanager.memory.total-process.size" which has a different meaning in standalone setups. The old option did not subtract metaspace and other overhead, while the new option does. That means that with the default config, standalone clusters get quite a bit less memory. (independent of managed memory going off heap).
>     I am wondering if we could interpret "taskmanager.memory.size" as the deprecated key for "taskmanager.memory.total-flink.size". That would be in line with the old mechanism (assuming managed memory is set to off heap).
>     The effect would be that the container size on Yarn/Mesos increases, because from "taskmanager.memory.total-flink.size", we need to add overhead and metaspace to reach the total process size, rather than cutting off memory. But if we want, we could even adjust for that in the active resource manager, getting full backwards compatibility on that part.
>     Curious to hear more thoughts there.
> 
> I believe you mean "taskmanager.heap.size".
> 
> I think the problem here is that the legacy "taskmanager.heap.size" was used differently in standalone setups and active yarn / mesos setups, and such different calculation logics and behaviors are exactly what we want to avoid with FLIP-49. Therefore, I'm not in favor of treating "taskmanager.memory.total-flink.size" differently for standalone and active setups.
> 
> I think what we really want is probably mapping "taskmanager.heap.size" to different new config options in different setups. How about we mark "taskmanager.heap.size" as deprecated key for neither of "taskmanager.memory.total-process.size" and "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly configured) in startup scripts / active resource managers, and set the value to "taskmanager.memory.total-flink.size" in the scripts and "taskmanager.memory.total-process.size" in active resource managers (if the new config options are not configured). We can provide util methods in TaskExecutorResourceUtils for such conversions, to keep all the configuration logics at one place.

I agree that the problem is that the legacy option 'taskmanager.heap.size' has different semantics for standalone/container setups. We initially had it falling back to 'taskmanager.memory.total-flink.size', but I changed that to align it with the container cut-off. Now I see that this changes the behavior of standalone setups.
+1 for supporting its backwards compatibility differently for standalone/container setups.
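
Roughly, such a mapping could look like the following (just a sketch to make sure we mean the same thing; the class and method names are made up and this is not the actual TaskExecutorResourceUtils code):

    import org.apache.flink.configuration.Configuration;

    public class LegacyHeapSizeMapping {

        private static final String LEGACY_KEY = "taskmanager.heap.size";
        private static final String TOTAL_FLINK_KEY = "taskmanager.memory.total-flink.size";
        private static final String TOTAL_PROCESS_KEY = "taskmanager.memory.total-process.size";

        // Map the legacy key onto one of the new keys, but only if the user has not
        // already configured either of the new total-memory options explicitly.
        public static void mapLegacyHeapSize(Configuration config, boolean activeResourceManager) {
            if (!config.containsKey(LEGACY_KEY)
                    || config.containsKey(TOTAL_FLINK_KEY)
                    || config.containsKey(TOTAL_PROCESS_KEY)) {
                return;
            }
            final String legacyValue = config.getString(LEGACY_KEY, null);
            if (activeResourceManager) {
                // Yarn / Mesos: the legacy value described the whole container.
                config.setString(TOTAL_PROCESS_KEY, legacyValue);
            } else {
                // Standalone scripts: the legacy value described the Flink memory
                // (with managed memory off-heap, as in the old mechanism).
                config.setString(TOTAL_FLINK_KEY, legacyValue);
            }
        }
    }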

> 
>   - Mini Cluster tries to imitate exact ratio of memory pools as a standalone setup. I get the idea behind that, but I am wondering if it is the right approach here.
>     For example: I started a relatively large JVM (large heap size of 10 GB) as a test. With the current logic, the system tries to reserve an additional 6GB for managed memory which is more than there is memory left. When you see the error that no memory could be allocated, you need to understand the magic of how this is derived.
>     I am trying to think about this from the perspective of using "Flink as a Library", which the MiniCluster is close to.
>     When starting Flink out of a running process, we cannot assume that we are the only users of that process and that we can mold the process to our demands. I think a fix value for managed memory and network memory would feel more natural in such a setup than a mechanism that is tailored towards exclusive use of the process.
> 
> +1 on having fixed values for managed / shuffle memory.

Also +1 for that, if the user has not specified any of the main options from which memory is derived. We should also log the fixed managed / shuffle memory values.
And I just noticed that we could also sanity-check the framework heap and, if explicitly configured, the task heap against the available JVM heap, and at least log inconsistencies.
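
Just to sketch what the fixed values and the sanity check could look like (a rough illustration only; the concrete sizes, helper names and logging are placeholders, and I assume the shuffle min/max keys as currently named):

    import org.apache.flink.configuration.Configuration;

    public class MiniClusterMemoryDefaults {

        // Placeholder fixed defaults; the concrete sizes would need discussion.
        private static final String FIXED_MANAGED_SIZE = "128m";
        private static final String FIXED_SHUFFLE_SIZE = "64m";

        public static void applyFixedDefaults(Configuration config) {
            setIfAbsent(config, "taskmanager.memory.managed.size", FIXED_MANAGED_SIZE);
            setIfAbsent(config, "taskmanager.memory.shuffle.min", FIXED_SHUFFLE_SIZE);
            setIfAbsent(config, "taskmanager.memory.shuffle.max", FIXED_SHUFFLE_SIZE);
        }

        private static void setIfAbsent(Configuration config, String key, String value) {
            if (!config.containsKey(key)) {
                // Log the fixed value so users can see how the MiniCluster memory was derived.
                System.out.println("MiniCluster: fixing " + key + " to " + value);
                config.setString(key, value);
            }
        }

        // Sanity check: compare a configured heap size against the JVM heap that is
        // actually available in this process, and log inconsistencies.
        public static void checkAgainstJvmHeap(String optionKey, long configuredBytes) {
            final long jvmMaxHeap = Runtime.getRuntime().maxMemory();
            if (configuredBytes > jvmMaxHeap) {
                System.err.println(optionKey + " (" + configuredBytes
                        + " bytes) exceeds the available JVM heap (" + jvmMaxHeap + " bytes).");
            }
        }
    }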

>   - Some off-heap memory goes into direct memory, some does not. This confused me a bit. For example "taskmanager.memory.framework.off-heap.size" is counted into MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe we should rename the keys to reflect that. There is no one "off heap" memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb" and "taskmanager.memory.framework.direct: XXXmb" instead?
> 
> I believe "taskmanager.memory.task.off-heap.size" is also accounted in the max direct memory size limit. The confusion probably comes from that "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in its description while "taskmanager.memory.task.off-heap.size" didn't. That's actually because the framework off-heap memory is introduced later in a separate commit. We should fix that.
> 
> For framework / task off-heap memory, we do not differentiate direct / native memory usage. That means the configured value for these two options could be a mixture of direct / native memory. Since we do not know the portion of direct memory out of the configured value, we have to conservatively account it all into the max direct memory size limit.

> ==>  In that case, I am a bit confused. For the total size calculation, it is fine. But why do we then set MaxDirectMemory? It is a difficult parameter, and the main reason to set it was (if I recall correctly) to trigger GC based on direct memory allocation (to free heap structures that then in turn release direct memory). If the limit is anyways too high (because we also count native memory in there) such that we can exceed the total process (container) memory, why do we set it then?

I always thought of it as also providing a safety net for direct allocations, but the GC aspect looks more important.

+1 for fixing the docs for 'taskmanager.memory.task.off-heap.size' and for renaming it to 'direct', as this is what really happens,
if we want to enforce the direct memory limit more precisely than we do now.

I also think that it is hard to separate direct / native memory unless we introduce even more options.
If a user wants to keep the direct limit tight to a certain value but also use native memory outside of it,
she would have to increase something else, like the JVM overhead, to account for it; there is no better way right now.
Having more options to account for native memory outside of the direct limit complicates things, but they can be introduced later if needed.
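
To make the trade-off concrete, this is roughly how the conservative limit is derived (a simplified sketch of the idea, not the exact FLIP-49 code):

    import org.apache.flink.configuration.MemorySize;

    public class MaxDirectMemorySketch {

        // Conservative derivation: because the framework / task off-heap options may
        // contain a mixture of direct and native memory, all of it is counted into
        // the -XX:MaxDirectMemorySize limit.
        public static String maxDirectMemoryJvmArg(
                MemorySize frameworkOffHeap,
                MemorySize taskOffHeap,
                MemorySize shuffleMemory) {

            final long maxDirectBytes =
                    frameworkOffHeap.getBytes() + taskOffHeap.getBytes() + shuffleMemory.getBytes();

            return "-XX:MaxDirectMemorySize=" + maxDirectBytes;
        }
    }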

>   - What do you think about renaming "taskmanager.memory.total-flink.size" to "taskmanager.memory.flink.size" and "taskmanager.memory.total-process.size" to "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I think these keys may be a bit less clumsy (dropping the "total-") without loss of expressiveness.
> 
> +1 on this.

+1 as well. Another option: 'taskmanager.memory.total-process.size' -> 'taskmanager.memory.jvm.process.size',
although it could also just be mentioned in the docs that we mean the JVM process.

>   - The network memory keys are now called "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually understood as a redistribution (random, or maybe by hash of key). As an example, there are many discussions about "shuffle join versus broadcast join", where "shuffle" is the synonym for "re-partitioning". We use that memory however for all network operations, like forward pipes, broadcasts, receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*" confusing, I am wondering if users would find that as well. So throwing in the suggestion to call the options "taskmanager.memory.network.*".
> 
> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other hand, one can also argue that this part of memory is used by ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more directly to the shuffle service components.

> ==> In that case, the name "Shuffle Environment" may be a bit incorrect, because it is doing more than just shuffles. The ShuffleEnvironment is also more internal, so the name is not too critical. This isn't super high priority for me, but if we want to adjust it, better earlier than later.

This is also a bit of a controversial topic for me. Indeed, we have always used 'network' for this concept of shuffling task data over the network, and this can confuse existing users.

On the other hand, for new users and in the long term, 'network' can mislead them into concluding that all network memory is managed by this option.
Also, other types of shuffle might not deal with the network directly at all.

By calling it shuffle, we were somewhat biased towards understanding it in terms of map/reduce. This is rather an inter-task data exchange.
Maybe then 'taskmanager.memory.shuffle.communication.*' or 'taskmanager.memory.task.shuffle/communication/io/network.*'.

>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say that it also accounts for I/O direct memory, but the parameter is not counted into the MaxDirectMemory parameter.
> 
> True. Since we already have framework off-heap memory accounted for ad hoc direct memory usages, accounting all of jvm-overhead also into max direct memory limit seems not necessary. I would suggest to remove "I/O direct memory" from the description, and explicitly mention that this option does not account for direct memory and will not be accounted into max direct memory limit. WDYT?

+1

>   - Can make the new ConfigOptions strongly typed with the new configuration options. For example, directly use MemorySize typed options. That makes validation automatic and saves us from breaking the options later.
> +1. Wasn't aware of the new memory type config options.

+1
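
For reference, a typed option could then look roughly like this (a sketch, assuming the new typed ConfigOptions builder with a memoryType() step; the key and description are just examples):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.configuration.MemorySize;

    public class TypedMemoryOptionExample {

        // A MemorySize-typed option: parsing and validation of the memory value
        // happen in the configuration framework instead of in ad-hoc string handling.
        public static final ConfigOption<MemorySize> TASK_HEAP_MEMORY =
                ConfigOptions.key("taskmanager.memory.task.heap.size")
                        .memoryType()
                        .noDefaultValue()
                        .withDescription("Task heap memory size for the TaskExecutors.");
    }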

> ==> Thanks. Do you need help with adjusting this?


I would appreciate it.

Also a small side note: we now use MemorySize extensively in logs, but it is not always readable, since its string representation is only in bytes at the moment
and does not reduce large values to kb/mb/etc. We could have at least some .prettyPrint function to use in logs.
And .fromMegabytes/etc factory methods would improve code readability compared to .parse(int + "m").
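
Just to illustrate the idea (the names prettyPrint / fromMegabytes are only suggestions from this mail and do not exist on MemorySize yet; a sketch of what they could do):

    import org.apache.flink.configuration.MemorySize;

    public final class MemorySizeHelpers {

        private static final String[] UNITS = {"bytes", "kb", "mb", "gb", "tb"};

        // Suggested pretty-printer: reduce the byte count to the largest unit that
        // still yields a whole number, e.g. 134217728 bytes -> "128 mb".
        public static String prettyPrint(MemorySize size) {
            long value = size.getBytes();
            int unit = 0;
            while (unit < UNITS.length - 1 && value >= 1024 && value % 1024 == 0) {
                value /= 1024;
                unit++;
            }
            return value + " " + UNITS[unit];
        }

        // Suggested factory method, instead of MemorySize.parse(megabytes + "m").
        public static MemorySize fromMegabytes(long megabytes) {
            return new MemorySize(megabytes << 20);
        }
    }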

Best,
Andrey

> On 20 Dec 2019, at 12:13, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi Xintong!
> 
> Please find my answers inline:
>   - "taskmanager.memory.size" (old main config option) is replaced by "taskmanager.memory.total-process.size" which has a different meaning in standalone setups. The old option did not subtract metaspace and other overhead, while the new option does. That means that with the default config, standalone clusters get quite a bit less memory. (independent of managed memory going off heap).
>     I am wondering if we could interpret "taskmanager.memory.size" as the deprecated key for "taskmanager.memory.total-flink.size". That would be in line with the old mechanism (assuming managed memory is set to off heap).
>     The effect would be that the container size on Yarn/Mesos increases, because from "taskmanager.memory.total-flink.size", we need to add overhead and metaspace to reach the total process size, rather than cutting off memory. But if we want, we could even adjust for that in the active resource manager, getting full backwards compatibility on that part.
>     Curious to hear more thoughts there.
> 
> I believe you mean "taskmanager.heap.size".
> 
> ==> Yes
> 
>  
> I think the problem here is that the legacy "taskmanager.heap.size" was used differently in standalone setups and active yarn / mesos setups, and such different calculation logics and behaviors are exactly what we want to avoid with FLIP-49. Therefore, I'm not in favor of treating "taskmanager.memory.total-flink.size" differently for standalone and active setups.
> 
> I think what we really want is probably mapping "taskmanager.heap.size" to different new config options in different setups. How about we mark "taskmanager.heap.size" as deprecated key for neither of "taskmanager.memory.total-process.size" and "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly configured) in startup scripts / active resource managers, and set the value to "taskmanager.memory.total-flink.size" in the scripts and "taskmanager.memory.total-process.size" in active resource managers (if the new config options are not configured). We can provide util methods in TaskExecutorResourceUtils for such conversions, to keep all the configuration logics at one place.
>  
> ==> This is pretty much what I meant as well (maybe my description was not very clear), so +1 for that mechanism
> 
> 
>   - Mini Cluster tries to imitate exact ratio of memory pools as a standalone setup. I get the idea behind that, but I am wondering if it is the right approach here.
>     For example: I started a relatively large JVM (large heap size of 10 GB) as a test. With the current logic, the system tries to reserve an additional 6GB for managed memory which is more than there is memory left. When you see the error that no memory could be allocated, you need to understand the magic of how this is derived.
>     I am trying to think about this from the perspective of using "Flink as a Library", which the MiniCluster is close to.
>     When starting Flink out of a running process, we cannot assume that we are the only users of that process and that we can mold the process to our demands. I think a fix value for managed memory and network memory would feel more natural in such a setup than a mechanism that is tailored towards exclusive use of the process.
> 
> +1 on having fixed values for managed / shuffle memory.
> 
> ==> Cool, let's also see what Andrey and Till think here.
> 
>  
>   - Some off-heap memory goes into direct memory, some does not. This confused me a bit. For example "taskmanager.memory.framework.off-heap.size" is counted into MaxDirectMemory while "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe we should rename the keys to reflect that. There is no one "off heap" memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb" and "taskmanager.memory.framework.direct: XXXmb" instead?
> 
> I believe "taskmanager.memory.task.off-heap.size" is also accounted in the max direct memory size limit. The confusion probably comes from that "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in its description while "taskmanager.memory.task.off-heap.size" didn't. That's actually because the framework off-heap memory is introduced later in a separate commit. We should fix that.
> 
> For framework / task off-heap memory, we do not differentiate direct / native memory usage. That means the configured value for these two options could be a mixture of direct / native memory. Since we do not know the portion of direct memory out of the configured value, we have to conservatively account it all into the max direct memory size limit.
> 
> ==>  In that case, I am a bit confused. For the total size calculation, it is fine. But why do we then set MaxDirectMemory? It is a difficult parameter, and the main reason to set it was (if I recall correctly) to trigger GC based on direct memory allocation (to free heap structures that then in turn release direct memory). If the limit is anyways too high (because we also count native memory in there) such that we can exceed the total process (container) memory, why do we set it then?
> 
> 
>   - The network memory keys are now called "taskmanager.memory.shuffle.*". To my knowledge, shuffle is usually understood as a redistribution (random, or maybe by hash of key). As an example, there are many discussions about "shuffle join versus broadcast join", where "shuffle" is the synonym for "re-partitioning". We use that memory however for all network operations, like forward pipes, broadcasts, receiver-side buffering on checkpoints, etc. I find the name "*.shuffle.*" confusing, I am wondering if users would find that as well. So throwing in the suggestion to call the options "taskmanager.memory.network.*".
> 
> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other hand, one can also argue that this part of memory is used by ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more directly to the shuffle service components.
> 
> ==> In that case, the name "Shuffle Environment" may be a bit incorrect, because it is doing more than just shuffles. The ShuffleEnvironment is also more internal, so the name is not too critical. This isn't super high priority for me, but if we want to adjust it, better earlier than later.
> 
>  
>    - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say that it also accounts for I/O direct memory, but the parameter is not counted into the MaxDirectMemory parameter.
> 
> True. Since we already have framework off-heap memory accounted for ad hoc direct memory usages, accounting all of jvm-overhead also into max direct memory limit seems not necessary. I would suggest to remove "I/O direct memory" from the description, and explicitly mention that this option does not account for direct memory and will not be accounted into max direct memory limit. WDYT?
> 
> ==> Sounds good. 
> 
>  
>   - Can make the new ConfigOptions strongly typed with the new configuration options. For example, directly use MemorySize typed options. That makes validation automatic and saves us from breaking the options later.
> +1. Wasn't aware of the new memory type config options.
> 
> ==> Thanks. Do you need help with adjusting this?
> 
>  
> Best,
> Stephan
> 
> 


Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Stephan Ewen <se...@apache.org>.
Hi Xintong!

Please find my answers inline:

>   - "taskmanager.memory.size" (old main config option) is replaced by
>> "taskmanager.memory.total-process.size" which has a different meaning in
>> standalone setups. The old option did not subtract metaspace and other
>> overhead, while the new option does. That means that with the default
>> config, standalone clusters get quite a bit less memory. (independent of
>> managed memory going off heap).
>>     I am wondering if we could interpret "taskmanager.memory.size" as the
>> deprecated key for "taskmanager.memory.total-flink.size". That would be in
>> line with the old mechanism (assuming managed memory is set to off heap).
>>     The effect would be that the container size on Yarn/Mesos increases,
>> because from "taskmanager.memory.total-flink.size", we need to add overhead
>> and metaspace to reach the total process size, rather than cutting off
>> memory. But if we want, we could even adjust for that in the active
>> resource manager, getting full backwards compatibility on that part.
>>     Curious to hear more thoughts there.
>
>
> I believe you mean "taskmanager.heap.size".
>

*==> Yes*



> I think the problem here is that the legacy "taskmanager.heap.size" was
> used differently in standalone setups and active yarn / mesos setups, and
> such different calculation logics and behaviors are exactly what we want to
> avoid with FLIP-49. Therefore, I'm not in favor of treating
> "taskmanager.memory.total-flink.size" differently for standalone and active
> setups.
>
> I think what we really want is probably mapping "taskmanager.heap.size" to
> different new config options in different setups. How about we
> mark "taskmanager.heap.size" as deprecated key for neither of
> "taskmanager.memory.total-process.size" and
> "taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
> configured) in startup scripts / active resource managers, and set the
> value to "taskmanager.memory.total-flink.size" in the scripts and
> "taskmanager.memory.total-process.size" in active resource managers (if the
> new config options are not configured). We can provide util methods in
> TaskExecutorResourceUtils for such conversions, to keep all the
> configuration logics at one place.
>

*==> This is pretty much what I meant as well (maybe my description was not
very clear), so +1 for that mechanism*


  - Mini Cluster tries to imitate exact ratio of memory pools as a
>> standalone setup. I get the idea behind that, but I am wondering if it is
>> the right approach here.
>>     For example: I started a relatively large JVM (large heap size of 10
>> GB) as a test. With the current logic, the system tries to reserve an
>> additional 6GB for managed memory which is more than there is memory left.
>> When you see the error that no memory could be allocated, you need to
>> understand the magic of how this is derived.
>>     I am trying to think about this from the perspective of using "Flink
>> as a Library", which the MiniCluster is close to.
>>     When starting Flink out of a running process, we cannot assume that
>> we are the only users of that process and that we can mold the process to
>> our demands. I think a fix value for managed memory and network memory
>> would feel more natural in such a setup than a mechanism that is tailored
>> towards exclusive use of the process.
>
>
> +1 on having fixed values for managed / shuffle memory.
>

*==> Cool, let's also see what Andrey and Till think here.*



>   - Some off-heap memory goes into direct memory, some does not. This
>> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
>> is counted into MaxDirectMemory while
>> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
>> we should rename the keys to reflect that. There is no one "off heap"
>> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
>> and "taskmanager.memory.framework.direct: XXXmb" instead?
>
>
> I believe "taskmanager.memory.task.off-heap.size" is also accounted in the
> max direct memory size limit. The confusion probably comes from that
> "taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
> its description while "taskmanager.memory.task.off-heap.size" didn't.
> That's actually because the framework off-heap memory is introduced later
> in a separate commit. We should fix that.
>
> For framework / task off-heap memory, we do not differentiate direct /
> native memory usage. That means the configured value for these two options
> could be a mixture of direct / native memory. Since we do not know the
> portion of direct memory out of the configured value, we have
> to conservatively account it all into the max direct memory size limit.
>

*==>  In that case, I am a bit confused. For the total size calculation, it
is fine. But why do we then set MaxDirectMemory? It is a difficult
parameter, and the main reason to set it was (if I recall correctly) to
trigger GC based on direct memory allocation (to free heap structures that
then in turn release direct memory). If the limit is anyways too high
(because we also count native memory in there) such that we can exceed the
total process (container) memory, why do we set it then?*


  - The network memory keys are now called "taskmanager.memory.shuffle.*".
>> To my knowledge, shuffle is usually understood as a redistribution (random,
>> or maybe by hash of key). As an example, there are many discussions about
>> "shuffle join versus broadcast join", where "shuffle" is the synonym for
>> "re-partitioning". We use that memory however for all network operations,
>> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
>> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
>> find that as well. So throwing in the suggestion to call the options
>> "taskmanager.memory.network.*".
>
>
> +0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
> hand, one can also argue that this part of memory is used by
> ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
> directly to the shuffle service components.
>

*==> In that case, the name "Shuffle Environment" may be a bit incorrect,
because it is doing more than just shuffles. The ShuffleEnvironment is
also more internal, so the name is not too critical. This isn't super high
priority for me, but if we want to adjust it, better earlier than later.*


   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
that it also accounts for I/O direct memory, but the parameter is not
counted into the MaxDirectMemory parameter.

>
> True. Since we already have framework off-heap memory accounted for ad hoc
> direct memory usages, accounting all of jvm-overhead also into max direct
> memory limit seems not necessary. I would suggest to remove "I/O direct
> memory" from the description, and explicitly mention that this option does
> not account for direct memory and will not be accounted into max direct
> memory limit. WDYT?
>

*==> Sounds good. *



>   - Can make the new ConfigOptions strongly typed with the new
>> configuration options. For example, directly use MemorySize typed options.
>> That makes validation automatic and saves us from breaking the options
>> later.
>
> +1. Wasn't aware of the new memory type config options.
>

*==> Thanks. Do you need help with adjusting this?*


Best,
Stephan

Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

Posted by Xintong Song <to...@gmail.com>.
Hi Stephan,

Thank you for kicking off this discussion and providing the suggestions.

  - "taskmanager.memory.size" (old main config option) is replaced by
> "taskmanager.memory.total-process.size" which has a different meaning in
> standalone setups. The old option did not subtract metaspace and other
> overhead, while the new option does. That means that with the default
> config, standalone clusters get quite a bit less memory. (independent of
> managed memory going off heap).
>     I am wondering if we could interpret "taskmanager.memory.size" as the
> deprecated key for "taskmanager.memory.total-flink.size". That would be in
> line with the old mechanism (assuming managed memory is set to off heap).
>     The effect would be that the container size on Yarn/Mesos increases,
> because from "taskmanager.memory.total-flink.size", we need to add overhead
> and metaspace to reach the total process size, rather than cutting off
> memory. But if we want, we could even adjust for that in the active
> resource manager, getting full backwards compatibility on that part.
>     Curious to hear more thoughts there.


I believe you mean "taskmanager.heap.size".

I think the problem here is that the legacy "taskmanager.heap.size" was
used differently in standalone setups and active yarn / mesos setups, and
such different calculation logics and behaviors are exactly what we want to
avoid with FLIP-49. Therefore, I'm not in favor of treating
"taskmanager.memory.total-flink.size" differently for standalone and active
setups.

I think what we really want is probably mapping "taskmanager.heap.size" to
different new config options in different setups. How about we
mark "taskmanager.heap.size" as deprecated key for neither of
"taskmanager.memory.total-process.size" and
"taskmanager.memory.total-flink.size". Instead, we parse it (if explicitly
configured) in startup scripts / active resource managers, and set the
value to "taskmanager.memory.total-flink.size" in the scripts and
"taskmanager.memory.total-process.size" in active resource managers (if the
new config options are not configured). We can provide util methods in
TaskExecutorResourceUtils for such conversions, to keep all the
configuration logics at one place.

  - Mini Cluster tries to imitate exact ratio of memory pools as a
> standalone setup. I get the idea behind that, but I am wondering if it is
> the right approach here.
>     For example: I started a relatively large JVM (large heap size of 10
> GB) as a test. With the current logic, the system tries to reserve an
> additional 6GB for managed memory which is more than there is memory left.
> When you see the error that no memory could be allocated, you need to
> understand the magic of how this is derived.
>     I am trying to think about this from the perspective of using "Flink
> as a Library", which the MiniCluster is close to.
>     When starting Flink out of a running process, we cannot assume that we
> are the only users of that process and that we can mold the process to our
> demands. I think a fix value for managed memory and network memory would
> feel more natural in such a setup than a mechanism that is tailored towards
> exclusive use of the process.


+1 on having fixed values for managed / shuffle memory.

  - Some off-heap memory goes into direct memory, some does not. This
> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
> is counted into MaxDirectMemory while
> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
> we should rename the keys to reflect that. There is no one "off heap"
> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
> and "taskmanager.memory.framework.direct: XXXmb" instead?


I believe "taskmanager.memory.task.off-heap.size" is also accounted in the
max direct memory size limit. The confusion probably comes from that
"taskmanager.memory.framework.off-heap.size" explicitly mentioned that in
its description while "taskmanager.memory.task.off-heap.size" didn't.
That's actually because the framework off-heap memory is introduced later
in a separate commit. We should fix that.

For framework / task off-heap memory, we do not differentiate direct /
native memory usage. That means the configured value for these two options
could be a mixture of direct / native memory. Since we do not know the
portion of direct memory out of the configured value, we have
to conservatively account it all into the max direct memory size limit.

  - What do you think about renaming "taskmanager.memory.total-flink.size"
> to "taskmanager.memory.flink.size" and
> "taskmanager.memory.total-process.size" to
> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
> think these keys may be a bit less clumsy (dropping the "total-") without
> loss of expressiveness.


+1 on this.

  - The network memory keys are now called "taskmanager.memory.shuffle.*".
> To my knowledge, shuffle is usually understood as a redistribution (random,
> or maybe by hash of key). As an example, there are many discussions about
> "shuffle join versus broadcast join", where "shuffle" is the synonym for
> "re-partitioning". We use that memory however for all network operations,
> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
> find that as well. So throwing in the suggestion to call the options
> "taskmanager.memory.network.*".


+0 on this one. I'm ok with "taskmanager.memory.network.*". On the other
hand, one can also argue that this part of memory is used by
ShuffleEnvironment, and the key "taskmanager.memory.shuffle.*" points more
directly to the shuffle service components.

  - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
> that it also accounts for I/O direct memory, but the parameter is not
> counted into the MaxDirectMemory parameter.


True. Since we already have framework off-heap memory accounted for ad hoc
direct memory usages, accounting all of jvm-overhead also into max direct
memory limit seems not necessary. I would suggest to remove "I/O direct
memory" from the description, and explicitly mention that this option does
not account for direct memory and will not be accounted into max direct
memory limit. WDYT?

  - Can make the new ConfigOptions strongly typed with the new
> configuration options. For example, directly use MemorySize typed options.
> That makes validation automatic and saves us from breaking the options
> later.

+1. Wasn't aware of the new memory type config options.

Thank you~

Xintong Song



On Thu, Dec 19, 2019 at 6:03 PM Stephan Ewen <se...@apache.org> wrote:

> Hi all!
>
> I looked into using Flink with the new memory configurations and played
> around a bit.
> All in all, I think we made a really good step there in removing implicit
> assumptions and rather making the configuration of memory sizes more
> explicit. My experience was mostly very positive.
>
> There were a few issues I stumbled across that were a bit un-intuitive,
> where we could think about some alternatives.
>
> ==========
> Main points
> ==========
>
>   - "taskmanager.memory.size" (old main config option) is replaced by
> "taskmanager.memory.total-process.size" which has a different meaning in
> standalone setups. The old option did not subtract metaspace and other
> overhead, while the new option does. That means that with the default
> config, standalone clusters get quite a bit less memory. (independent of
> managed memory going off heap).
>
>     I am wondering if we could interpret "taskmanager.memory.size" as the
> deprecated key for "taskmanager.memory.total-flink.size". That would be in
> line with the old mechanism (assuming managed memory is set to off heap).
>
>     The effect would be that the container size on Yarn/Mesos increases,
> because from "taskmanager.memory.total-flink.size", we need to add overhead
> and metaspace to reach the total process size, rather than cutting off
> memory. But if we want, we could even adjust for that in the active
> resource manager, getting full backwards compatibility on that part.
>
>     Curious to hear more thoughts there.
>
>   - Mini Cluster tries to imitate exact ratio of memory pools as a
> standalone setup. I get the idea behind that, but I am wondering if it is
> the right approach here.
>
>     For example: I started a relatively large JVM (large heap size of 10
> GB) as a test. With the current logic, the system tries to reserve an
> additional 6GB for managed memory which is more than there is memory left.
> When you see the error that no memory could be allocated, you need to
> understand the magic of how this is derived.
>
>     I am trying to think about this from the perspective of using "Flink
> as a Library", which the MiniCluster is close to.
>     When starting Flink out of a running process, we cannot assume that we
> are the only users of that process and that we can mold the process to our
> demands. I think a fix value for managed memory and network memory would
> feel more natural in such a setup than a mechanism that is tailored towards
> exclusive use of the process.
>
>
> ===========================
> Config key names and descriptions
> ===========================
>
>   - Some off-heap memory goes into direct memory, some does not. This
> confused me a bit. For example "taskmanager.memory.framework.off-heap.size"
> is counted into MaxDirectMemory while
> "taskmanager.memory.task.off-heap.size" is counted as native memory. Maybe
> we should rename the keys to reflect that. There is no one "off heap"
> memory type after all. Maybe use "taskmanager.memory.task.native: XXXmb"
> and "taskmanager.memory.framework.direct: XXXmb" instead?
>
>   - What do you think about renaming "taskmanager.memory.total-flink.size"
> to "taskmanager.memory.flink.size" and
> "taskmanager.memory.total-process.size" to
> "taskmanager.memory.process.size" (or "taskmanager.memory.jvm.total"). I
> think these keys may be a bit less clumsy (dropping the "total-") without
> loss of expressiveness.
>
>   - The network memory keys are now called "taskmanager.memory.shuffle.*".
> To my knowledge, shuffle is usually understood as a redistribution (random,
> or maybe by hash of key). As an example, there are many discussions about
> "shuffle join versus broadcast join", where "shuffle" is the synonym for
> "re-partitioning". We use that memory however for all network operations,
> like forward pipes, broadcasts, receiver-side buffering on checkpoints,
> etc. I find the name "*.shuffle.*" confusing, I am wondering if users would
> find that as well. So throwing in the suggestion to call the options
> "taskmanager.memory.network.*".
>
>   - The descriptions for the "taskmanager.memory.jvm-overhead.*" keys say
> that it also accounts for I/O direct memory, but the parameter is not
> counted into the MaxDirectMemory parameter.
>
>
> Some minor comments
> ==================
>
>   - Can make the new ConfigOptions strongly typed with the new
> configuration options. For example, directly use MemorySize typed options.
> That makes validation automatic and saves us from breaking the options
> later.
>
>
> Best,
> Stephan
>
>