Posted to dev@geode.apache.org by Nick Reich <nr...@pivotal.io> on 2018/03/08 19:25:50 UTC

[PROPOSAL]: concurrent bucket moves during rebalance

Team,

The time required to rebalance a Geode cluster has often been noted by
users as an area for improvement. Currently, buckets are moved one at a
time, and we propose that a system that moves buckets in parallel could
greatly improve the performance of this feature.

Previously, parallelization was implemented for adding redundant copies of
buckets to restore redundancy. However, moving buckets is a more
complicated matter and requires a different approach than restoring
redundancy. The reason is that a member could potentially be gaining
buckets and giving away buckets at the same time. While giving away a
bucket, that member still holds all of the data for that bucket until the
receiving member has fully received it and it can safely be removed from
the original owner. This means that unless the member has the memory
headroom to store all of the buckets it will receive in addition to all of
the buckets it started with, moving buckets in parallel could cause the
member to run out of memory.

For this reason, we propose a system that performs (potentially) several
rounds of concurrent bucket moves (a rough sketch in Java follows below):
1) A set of moves that improves balance is calculated, subject to the
requirement that no member both receives and gives away a bucket (so no
member carries the memory overhead of a new bucket on top of an existing
bucket it will ultimately give up).
2) All calculated bucket moves are conducted in parallel. Parameters to
throttle this process (to avoid consuming too many cluster resources and
impacting performance) should be added, such as only allowing each member
to either receive or send a maximum number of buckets concurrently.
3) If the cluster is not yet balanced, perform additional iterations of
calculating and conducting bucket moves, until balance is achieved or a
configured maximum number of iterations is reached.
Note: in both the existing and the proposed system, regions are rebalanced
one at a time.
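
To make the loop concrete, here is a minimal Java sketch of the round-based
planner described above. It is only an illustration of the constraint and
the iteration, not Geode's actual rebalancing code: BucketMove,
computeDesiredMoves, moveBucket, and isBalanced are hypothetical
placeholders, and MAX_ROUNDS stands in for the proposed maximum-iterations
parameter.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RoundBasedRebalancer {

  static final int MAX_ROUNDS = 10; // stands in for the proposed max-iterations parameter

  /** One bucket move from a source member to a target member (hypothetical type). */
  record BucketMove(String source, String target, int bucketId) {}

  void rebalanceRegion(String region) {
    for (int round = 0; round < MAX_ROUNDS && !isBalanced(region); round++) {
      // 1) Plan a round: accept a move only if its source has not already been
      //    chosen as a receiver and its target has not already been chosen as a
      //    sender, so no member both gives away and receives a bucket this round.
      Set<String> senders = new HashSet<>();
      Set<String> receivers = new HashSet<>();
      List<BucketMove> plannedMoves = new ArrayList<>();
      for (BucketMove move : computeDesiredMoves(region)) {
        if (!receivers.contains(move.source()) && !senders.contains(move.target())) {
          senders.add(move.source());
          receivers.add(move.target());
          plannedMoves.add(move);
        }
      }
      if (plannedMoves.isEmpty()) {
        break; // no further improvement is possible under the constraint
      }
      // 2) Execute the whole round in parallel and wait for every move to finish.
      plannedMoves.parallelStream().forEach(this::moveBucket);
      // 3) Loop back: re-evaluate balance and plan the next round if needed.
    }
  }

  // Placeholders for the real balance model and bucket transfer machinery.
  boolean isBalanced(String region) { return false; }
  List<BucketMove> computeDesiredMoves(String region) { return List.of(); }
  void moveBucket(BucketMove move) {}
}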

Please let us know if you have feedback on this approach or additional
ideas that should be considered.

Re: [PROPOSAL]: concurrent bucket moves during rebalance

Posted by Nick Reich <nr...@pivotal.io>.
Swapnil,
Interesting points, here are my thoughts:


> Given that there is already support for parallel rebalancing among regions,
> I do not see the value in supporting parallel rebalancing of buckets.

I think there are some advantages to parallel rebalancing within a region
over rebalancing regions in parallel. The first is that the existing system
does not help users with a single region that is significantly larger than
the others. The second is that the existing system could cause members to
exceed available memory if several regions are rebalanced in parallel,
since the regions are not coordinated and would suffer from the issue
discussed in the proposal about members receiving and giving up buckets at
the same time. The third is that specifying the number of regions to
rebalance in parallel does not spread the load evenly across the cluster,
but can create hotspots on specific members (for the reason discussed in
the previous point). By specifying the maximum number of bucket transfers a
member may be involved in concurrently, you can tune the overhead that
rebalance causes while still maximizing the number of transfers occurring
at a time, especially on larger clusters.
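
To illustrate what that per-member cap could look like (a sketch of mine,
not an existing Geode class; MemberTransferThrottle and the member ids are
hypothetical), each member gets a small pool of transfer permits, and a
bucket move must obtain a permit from both its source and its target before
it starts:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Caps how many concurrent bucket transfers each member may take part in. */
public class MemberTransferThrottle {

  private final int maxTransfersPerMember;
  private final Map<String, Semaphore> permitsByMember = new ConcurrentHashMap<>();

  public MemberTransferThrottle(int maxTransfersPerMember) {
    this.maxTransfersPerMember = maxTransfersPerMember;
  }

  private Semaphore permitsFor(String member) {
    return permitsByMember.computeIfAbsent(member,
        m -> new Semaphore(maxTransfersPerMember));
  }

  /**
   * Blocks until both the source and the target member have a free slot.
   * Under the proposal's per-round constraint (no member both sends and
   * receives), the two acquisitions cannot deadlock.
   */
  public void acquire(String source, String target) throws InterruptedException {
    permitsFor(source).acquire();
    try {
      permitsFor(target).acquire();
    } catch (InterruptedException e) {
      permitsFor(source).release(); // do not leak the source permit
      throw e;
    }
  }

  /** Releases both slots once the bucket move has completed (or failed). */
  public void release(String source, String target) {
    permitsFor(target).release();
    permitsFor(source).release();
  }
}

With a cap of 1 on a ten-member cluster, up to five transfers can run at
once, since each transfer occupies one sender slot and one receiver slot;
larger clusters therefore get more total parallelism for the same
per-member overhead.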

> If we end up doing this anyway, I would suggest to not rely on "parameters"
> for throttling, as these parameters would have to be configured in advance
> without knowing what the actual load looks like when rebalance is in
> progress and hence could be difficult to get right. It would be ideal if we
> can handle this using back pressure.
>
Any back pressure mechanism still requires parameterization of what the
maximum resources are. For example, back pressure on adding tasks to a
thread pool executor is based on the number of allowed threads and the
maximum backlog. Based on Mike's anecdote, users will likely have their own
definition of too many resources, such as the hit they are willing to take
to throughput or latency during the rebalance. I think the right approach
is to provide a good (but conservative) default and let users increase the
resource usage until they reach their optimal situation. This means we
should allow the level of parallelization to be configurable at runtime,
and likely also per individual rebalance (when kicked off manually). I am
interested in ideas on how best to parameterize the throttling and make it
as easy as possible for users to understand, configure, and predict the
performance tradeoffs.
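
One way to make the level of parallelization adjustable while a rebalance
is running (again just a sketch, not anything that exists in Geode today)
is a permit pool whose size can be changed at runtime.
java.util.concurrent.Semaphore already supports this if its protected
reducePermits method is exposed in a small subclass:

import java.util.concurrent.Semaphore;

/** A permit pool whose maximum can be changed while transfers are in flight. */
public class AdjustableTransferLimiter extends Semaphore {

  private volatile int maxPermits;

  public AdjustableTransferLimiter(int initialPermits) {
    super(initialPermits);
    this.maxPermits = initialPermits;
  }

  /**
   * Raises or lowers the maximum number of concurrent transfers. Lowering the
   * limit does not interrupt transfers already running; it simply removes
   * permits so that fewer new transfers can start.
   */
  public synchronized void setMaxPermits(int newMax) {
    int delta = newMax - maxPermits;
    if (delta > 0) {
      release(delta);          // grow the pool
    } else if (delta < 0) {
      reducePermits(-delta);   // shrink the pool (protected in Semaphore)
    }
    maxPermits = newMax;
  }
}

A manually kicked-off rebalance could then be given its own limit,
overriding the cluster-wide default for just that run.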


Re: [PROPOSAL]: concurrent bucket moves during rebalance

Posted by Anthony Baker <ab...@pivotal.io>.
It would be interesting to have a way to model the behavior of the current algorithm and compare it to the proposal under various conditions like membership changes, data imbalance, etc. That would let us understand the optimality of the change in a concrete way. 

Anthony


Re: [PROPOSAL]: concurrent bucket moves during rebalance

Posted by Swapnil Bawaskar <sb...@pivotal.io>.
Given that there is already support for parallel rebalancing among regions,
I do not see the value in supporting parallel rebalancing of buckets.

If we end up doing this anyway, I would suggest not relying on "parameters"
for throttling, as these parameters would have to be configured in advance,
without knowing what the actual load looks like while the rebalance is in
progress, and hence could be difficult to get right. It would be ideal if we
can handle this using back pressure.
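
To make the back pressure idea concrete (purely a sketch; this is not how
the current rebalancer is built), a bounded executor queue with a
caller-runs rejection policy makes the thread that schedules bucket moves
slow down automatically once the workers and the queue are full:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BackPressureExample {

  public static void main(String[] args) {
    // Small pool and small queue: once both are full, CallerRunsPolicy makes
    // the scheduling thread execute the move itself, which naturally pauses
    // the submission of further moves until the backlog drains.
    ThreadPoolExecutor mover = new ThreadPoolExecutor(
        4, 4, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(8),
        new ThreadPoolExecutor.CallerRunsPolicy());

    for (int bucketId = 0; bucketId < 100; bucketId++) {
      final int id = bucketId;
      mover.execute(() -> moveBucket(id)); // slows down (indirectly) when saturated
    }
    mover.shutdown();
  }

  // Placeholder for the actual bucket transfer; hypothetical, for illustration only.
  static void moveBucket(int bucketId) {
    System.out.println("moving bucket " + bucketId);
  }
}

Even so, the pool size and queue length still have to be chosen up front,
which is the point Nick raises in his reply about back pressure still
needing parameterization.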


Re: [PROPOSAL]: concurrent bucket moves during rebalance

Posted by Nick Reich <nr...@pivotal.io>.
Mike,

I think having a good default value for the maximum number of parallel
operations will play a role in not consuming too many resources. Perhaps
defaulting to a single parallel action per member at a time (or another
small number based on testing), and allowing users who want better
performance to increase that number, would be a good start. That should
result in performance improvements without placing an increased burden on
any specific member. Especially when bootstrapping new members, rebalance
speed may be more valuable than usual, so making this configurable on a
per-rebalance basis would be preferred.

One clarification from my original proposal: regions can already be
rebalanced in parallel, depending on the value of resource.manager.threads
(which defaults to 1, so no parallelization of regions in the default case).
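
For anyone who wants to try that out: it is a JVM system property, so
(assuming it is read with the usual "gemfire." prefix, which is worth
verifying against the source for the release you are on) starting a server
along these lines should allow up to four regions to be rebalanced in
parallel:

gfsh> start server --name=server1 --J=-Dgemfire.resource.manager.threads=4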


Re: [PROPOSAL]: concurrent bucket moves during rebalance

Posted by Michael Stolz <ms...@pivotal.io>.
We should be very careful about how much resource we dedicate to
rebalancing.

One of our competitors rebalances *much* faster than we do, but in doing so
they consume all available resources.

At one bank, that caused significant loss of incoming market data arriving
on a multicast feed, which had a severe adverse effect on the pricing and
risk management functions for a period of time. That bank removed the
competitor's product, and for several years the chief architect there
allowed no distributed caching at all; they didn't use any distributed
caching products until he left and a new chief architect was named. When
they DID go back to using them, it pre-dated Geode, so they used GemFire
largely because GemFire does not consume all available resources while
rebalancing.

I do think we need to improve our rebalancing such that it iterates until
it achieves balance, but not in a way that will consume all available
resources.

--
Mike Stolz


>