Posted to dev@ignite.apache.org by Ilya Lantukh <il...@gridgain.com> on 2018/03/22 15:38:33 UTC

Rebalancing - how to make it faster

Igniters,

I've spent some time analyzing the performance of the rebalancing process. The
initial goal was to understand what limits its throughput, because it is
significantly slower than what the network and storage devices can theoretically
handle.

It turns out that our current implementation has a number of issues caused by a
single fundamental problem.

During rebalancing, data is sent in batches called
GridDhtPartitionSupplyMessages. The batch size is configurable; the default value
is 512KB, which can mean thousands of key-value pairs. However, we don't
take any advantage of this fact and process each entry independently (see the
sketch after this list):
- checkpointReadLock is acquired multiple times for every entry, leading to
unnecessary contention - this is clearly a bug;
- for each entry we write (and fsync, if the configuration requires it) a
separate WAL record - so, if a batch contains N entries, we might end up
doing N fsyncs;
- adding every entry into CacheDataStore also happens completely
independently. This means we will traverse and modify each index tree N
times, allocate space in the FreeList N times, and additionally have to
store O(N*log(N)) page delta records in the WAL.
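To make the contrast concrete, here is a minimal illustrative sketch of the two
shapes of the supply-message processing loop. All names below (Entry, walAppend,
indexUpdateAll, ...) are invented for this example and are not Ignite's actual
internal API.

    import java.util.List;

    class RebalanceBatchingSketch {
        interface Entry { /* a key-value pair received in a supply message */ }

        // Current shape: lock, WAL append and index update happen once per entry.
        void applyPerEntry(List<Entry> batch) {
            for (Entry e : batch) {
                checkpointReadLock();
                try {
                    walAppend(e);   // potentially one fsync per entry
                    indexUpdate(e); // one tree traversal and FreeList allocation per entry
                }
                finally {
                    checkpointReadUnlock();
                }
            }
        }

        // Proposed shape: amortize the lock, the WAL append and the tree work over the batch.
        void applyBatched(List<Entry> batch) {
            checkpointReadLock();
            try {
                walAppendAll(batch);   // one record (and at most one fsync) for the whole batch
                indexUpdateAll(batch); // single pass over the index trees and FreeList
            }
            finally {
                checkpointReadUnlock();
            }
        }

        // Stubs so the sketch is self-contained.
        void checkpointReadLock() {}
        void checkpointReadUnlock() {}
        void walAppend(Entry e) {}
        void walAppendAll(List<Entry> batch) {}
        void indexUpdate(Entry e) {}
        void indexUpdateAll(List<Entry> batch) {}
    }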

I've created a few tickets in JIRA with very different levels of scale and
complexity.

Ways to reduce the impact of independent processing:
- https://issues.apache.org/jira/browse/IGNITE-8019 - the aforementioned bug
causing contention on checkpointReadLock;
- https://issues.apache.org/jira/browse/IGNITE-8018 - an inefficiency in the
GridCacheMapEntry implementation;
- https://issues.apache.org/jira/browse/IGNITE-8017 - automatically disable
WAL during preloading (a configuration sketch follows this list).
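For context, the coarse-grained manual equivalent today is switching the WAL mode
statically in the storage configuration. A minimal sketch using the public
DataStorageConfiguration / WALMode API (IGNITE-8017 aims to do the equivalent
automatically, and only while a node preloads):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.WALMode;

    public class WalModeSketch {
        public static void main(String[] args) {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();

            // Enable native persistence for the default data region.
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            // WALMode.NONE removes WAL overhead entirely but sacrifices crash recovery;
            // this is the manual LOG_ONLY -> NONE experiment mentioned later in the thread.
            storageCfg.setWalMode(WALMode.LOG_ONLY);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg);

            Ignition.start(cfg);
        }
    }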

Ways to solve the problem on a more global level:
- https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to introduce
batch modification;
- https://issues.apache.org/jira/browse/IGNITE-8020 - a complete redesign of
the rebalancing process for persistent caches, based on file transfer.

Everyone is welcome to criticize the above ideas, suggest new ones, or
participate in the implementation.

-- 
Best regards,
Ilya

Re: Rebalancing - how to make it faster

Posted by Anton Vinogradov <av...@apache.org>.
+1 for IEP creation

2018-03-23 22:40 GMT+03:00 Dmitry Pavlov <dp...@gmail.com>:

> Denis, as I understood, there is an idea to exclude only the rebalanced
> partition(s) data. All other data will go to the WAL.
>
> Ilya, please correct me if I'm wrong.
>
> Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
>
> > Ilya,
> >
> > That's a decent boost (5-20%) even with WAL enabled. Not sure that we
> > should stake on the WAL "off" mode here, because if the whole cluster goes
> > down, then the data consistency is questionable. As an architect, I
> > wouldn't disable WAL for the sake of rebalancing; it's too risky.
> >
> > If you agree, then let's create the IEP. This way it will be easier to
> > track this endeavor. BTW, are you ready to release any of the
> > optimizations in 2.5, which is being discussed in a separate thread?
> >
> > --
> > Denis
> >
> >
> >
> > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <il...@gridgain.com>
> > wrote:
> >
> > > Denis,
> > >
> > > > - Don't you want to aggregate the tickets under an IEP?
> > > Yes, I think so.
> > >
> > > > - Does it mean we're going to update our B+Tree implementation? Any
> > > > ideas how risky it is?
> > > One of the tickets that I created
> > > (https://issues.apache.org/jira/browse/IGNITE-7935) involves B+Tree
> > > modification, but I am not planning to do it in the near future. It
> > > shouldn't affect existing tree operations, only introduce new ones
> > > (putAll, invokeAll, removeAll).
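A rough illustration of the kind of API IGNITE-7935 could add (a sketch, not a
committed design; the interface below is invented for this example):

    import java.util.Collection;
    import java.util.Map;
    import java.util.function.BiFunction;

    // Hypothetical batch extension of a B+tree-like structure: existing single-key
    // operations stay untouched, while the new bulk operations amortize traversal
    // and page modifications over a sorted collection of keys.
    interface BatchTreeOps<K extends Comparable<K>, V> {
        void put(K key, V val);                      // existing per-key operation
        void putAll(Map<K, V> sortedEntries);        // insert many keys in one pass
        void invokeAll(Collection<K> sortedKeys,
                       BiFunction<K, V, V> closure); // read-modify-write in one pass
        void removeAll(Collection<K> sortedKeys);    // bulk removal in one pass
    }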
> > >
> > > > - Any chance you had a prototype that shows performance optimizations of
> > > > the approach you are suggesting to take?
> > > I have a prototype for the simplest improvements
> > > (https://issues.apache.org/jira/browse/IGNITE-8019 &
> > > https://issues.apache.org/jira/browse/IGNITE-8018) - together they increase
> > > throughput by 5-20%, depending on configuration and environment. Also, I've
> > > tested different WAL modes - switching from LOG_ONLY to NONE gives over a
> > > 100% boost - this is what I expect from
> > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > >
> > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org>
> wrote:
> > >
> > > > Ilya,
> > > >
> > > > That's outstanding research and summary. Thanks for spending your
> > > > time on this.
> > > >
> > > > Not sure I have enough expertise to challenge your approach, but it
> > > > sounds 100% reasonable to me. As side notes:
> > > >
> > > >    - Don't you want to aggregate the tickets under an IEP?
> > > >    - Does it mean we're going to update our B+Tree implementation? Any
> > > >    ideas how risky it is?
> > > >    - Any chance you had a prototype that shows performance optimizations
> > > >    of the approach you are suggesting to take?
> > > >
> > > > --
> > > > Denis
> > > >

Re: Rebalancing - how to make it faster

Posted by Denis Magda <dm...@apache.org>.
The system property is a good choice, especially for experimental features.
Anton, thanks for reminding about that.

Once the feature is about to graduate from the experimental state, we
might not need any API parameter at all if the functionality is enabled by
default.
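For illustration, this is the usual shape of such a toggle. The property name
below is invented for this example:

    import org.apache.ignite.IgniteSystemProperties;

    class WalRebalanceToggle {
        // Hypothetical property name, used only to illustrate the pattern discussed above.
        private static final String DISABLE_WAL_DURING_REBALANCING_PROP =
            "IGNITE_DISABLE_WAL_DURING_REBALANCING";

        // Read once at startup; the experimental behavior stays off unless explicitly
        // enabled with -DIGNITE_DISABLE_WAL_DURING_REBALANCING=true.
        static boolean disableWalDuringRebalancing() {
            return IgniteSystemProperties.getBoolean(DISABLE_WAL_DURING_REBALANCING_PROP, false);
        }
    }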

--
Denis

On Mon, Apr 9, 2018 at 12:28 PM, Anton Vinogradov <av...@apache.org> wrote:

> Dmitry, I see no problem with using a system property for the cases you specified.
> The possibility of migration is not a reason to change the API.
>
> Mon, 9 Apr 2018, 20:18 Dmitry Pavlov <dp...@gmail.com>:
>
> > Not sure I agree. I think a feature toggle is best practice.
> >
> > Eventually we can remove the related code, but it is always good to have
> > migration stages:
> > 1) old and new provided, but new is disabled,
> > 2) both provided and new is enabled,
> > 3) new is enabled, old is not available.
> >
> > It is common practice for enterprise software.
> >
> > Mon, 9 Apr 2018 at 20:08, Anton Vinogradov <av...@apache.org>:
> >
> > > >> I would enable this option only after we confirm it's stable.
> > > We would have to add a system property in that case.
> > >
> > > It's a bad idea to have such options in the API because they can be unstable.
> > > It's like adding allowMoreThan1024Node(bool) just because we have issues with that.
> > >
> > > 2018-04-09 20:03 GMT+03:00 Yakov Zhdanov <yz...@apache.org>:
> > >
> > > > How about skipWalOnRebalancing or disableWalOnRebalancing? I like the
> > > first
> > > > one better and both are shorter.
> > > >
> > > > --Yakov
> > > >
> > > > 2018-04-09 19:55 GMT+03:00 Denis Magda <dm...@apache.org>:
> > > >
> > > > > I would enable this option only after we confirm it's stable. Until
> > > > > then, it should be treated as an experimental feature that is turned on
> > > > > manually with a parameter like Ilya is suggesting.
> > > > >
> > > > > Ilya, the name sounds good to me.
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > > On Mon, Apr 9, 2018 at 9:16 AM, Anton Vinogradov <av...@apache.org>
> > > wrote:
> > > > >
> > > > > > Ilya,
> > > > > >
> > > > > > WAL should be automatically disabled during initial rebalancing.
> > > > > > It can't be disabled during regular rebalancing, since you are never
> > > > > > ready to lose data protected by the WAL.
> > > > > > Is there any need to have a special parameter in that case?
> > > > > >
> > > > > > 2018-04-09 16:41 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> > > > > >
> > > > > > > Igniters,
> > > > > > >
> > > > > > > I am currently at the finish line of the
> > > > > > > https://issues.apache.org/jira/browse/IGNITE-8017 ("Disable WAL during
> > > > > > > initial preloading") implementation, and such behavior needs to be
> > > > > > > configurable. In my intermediate implementation I have a parameter
> > > > > > > called "disableWalDuringRebalancing" in IgniteConfiguration. Do you think
> > > > > > > this name is meaningful and self-explanatory? Do we need to ensure that
> > > > > > > it has the same value on every node? Should I make it configurable per
> > > > > > > cache rather than globally?
> > > > > > >
> > > > > > > Please share your thoughts.
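A sketch of the proposal above, modeled as a hypothetical extension of
IgniteConfiguration (the setter does not exist; it is exactly the parameter under
discussion in this message):

    import org.apache.ignite.configuration.IgniteConfiguration;

    // Hypothetical: shows only the shape of the proposed global switch.
    class RebalancingAwareConfiguration extends IgniteConfiguration {
        private boolean disableWalDuringRebalancing;

        public RebalancingAwareConfiguration setDisableWalDuringRebalancing(boolean disable) {
            this.disableWalDuringRebalancing = disable;
            return this;
        }

        public boolean isDisableWalDuringRebalancing() {
            return disableWalDuringRebalancing;
        }
    }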
> > > > > > >
> > > > > > > On Mon, Apr 9, 2018 at 4:32 PM, Ilya Lantukh <
> > > ilantukh@gridgain.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Denis,
> > > > > > > >
> > > > > > > > Those tickets are rather complex, so I don't know when I'll be
> > > > > > > > able to start working on them.
> > > > > > > >
> > > > > > > > On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <
> > dmagda@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Ilya,
> > > > > > > >>
> > > > > > > >> Just came across the IEP put together by you:
> > > > > > > >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%3A+Optimization+of+rebalancing
> > > > > > > >>
> > > > > > > >> Excellent explanation, thanks for aggregating everything
> > there.
> > > > > > > >>
> > > > > > > >> Two tickets below don't have a fixed version assigned:
> > > > > > > >> https://issues.apache.org/jira/browse/IGNITE-8020
> > > > > > > >> https://issues.apache.org/jira/browse/IGNITE-7935
> > > > > > > >>
> > > > > > > >> You plan to work on them in the 2.6 time frame, right?
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Denis
> > > > > > > >>
> > > > > > > >> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <
> > dmagda@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >>
> > > > > > > >> > Ilya, I've granted you all the required permissions. Please let me
> > > > > > > >> > know if you still have trouble with the wiki.
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> > Denis
> > > > > > > >> >
> > > > > > > >> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <
> > > > > > ilantukh@gridgain.com>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> >> Unfortunately, I don't have permission to create a page for the
> > > > > > > >> >> IEP on the wiki.
> > > > > > > >> >> Denis, can you grant it? My username is ilantukh.
> > > > > > > >> >>
> > > > > > > >> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <
> > > > av@apache.org
> > > > > >
> > > > > > > >> wrote:
> > > > > > > >> >>
> > > > > > > >> >> > >> It is impossible to disable WAL only for certain partitions
> > > > > > > >> >> > >> without completely overhauling the design of the Ignite storage
> > > > > > > >> >> > >> mechanism. Right now we can only afford to change the WAL mode
> > > > > > > >> >> > >> per cache group.
> > > > > > > >> >> >
> > > > > > > >> >> > Cache group rebalancing is effectively single-cache rebalancing,
> > > > > > > >> >> > and this cache ("cache group") can be presented as a set of virtual
> > > > > > > >> >> > caches. So there are no issues for initial rebalancing.
> > > > > > > >> >> > Let's disable WAL during initial rebalancing.
> > > > > > > >> >> >
> > > > > > > >> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <
> > > > ilantukh@gridgain.com
> > > > > >:
> > > > > > > >> >> >
> > > > > > > >> >> > > Dmitry,
> > > > > > > >> >> > > It is impossible to disable WAL only for certain partitions
> > > > > > > >> >> > > without completely overhauling the design of the Ignite storage
> > > > > > > >> >> > > mechanism. Right now we can only afford to change the WAL mode
> > > > > > > >> >> > > per cache group.
> > > > > > > >> >> > >
> > > > > > > >> >> > > The idea is to disable WAL when a node doesn't have any
> > > > > > > >> >> > > partition in the OWNING state, which means it doesn't have any
> > > > > > > >> >> > > consistent data and won't be able to restore from the WAL anyway.
> > > > > > > >> >> > > I don't see any potential use for WAL on such a node, but we can
> > > > > > > >> >> > > keep a configurable parameter indicating whether we can
> > > > > > > >> >> > > automatically disable WAL in such a case or not.
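A compact sketch of that condition, with illustrative types (not Ignite's actual
internals):

    import java.util.Collection;

    class WalDisableCondition {
        enum PartitionState { MOVING, OWNING, RENTING, EVICTED }

        interface Partition { PartitionState state(); }

        // WAL can be switched off for a cache group on this node only while the node
        // owns no partition of that group: there is no consistent data to recover.
        static boolean canDisableWal(Collection<Partition> localPartitions) {
            for (Partition p : localPartitions) {
                if (p.state() == PartitionState.OWNING)
                    return false; // consistent data present -> keep WAL on
            }
            return true; // nothing could be restored from the WAL anyway
        }
    }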

Re: Rebalancing - how to make it faster

Posted by Anton Vinogradov <av...@apache.org>.
Dmitry, I see no problem with using a system property for the cases you specified.
The possibility of migration is not a reason to change the API.

Re: Rebalancing - how to make it faster

Posted by Dmitry Pavlov <dp...@gmail.com>.
Not sure I agree. I think a feature toggle is best practice.

Eventually we can remove the related code, but it is always good to have migration
stages:
1) old and new provided, but new is disabled,
2) both provided and new is enabled,
3) new is enabled, old is not available.

It is common practice for enterprise software.

> > > > > we
> > > > > >> >> will
> > > > > >> >> > > have
> > > > > >> >> > > > to
> > > > > >> >> > > > > > > > additionally store in WAL O(N*log(N)) page
> delta
> > > > > records.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > I've created a few tickets in JIRA with very
> > > > different
> > > > > >> >> levels
> > > > > >> >> > of
> > > > > >> >> > > > > scale
> > > > > >> >> > > > > > > and
> > > > > >> >> > > > > > > > complexity.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Ways to reduce impact of independent
> processing:
> > > > > >> >> > > > > > > > - https://issues.apache.org/
> > > jira/browse/IGNITE-8019
> > > > -
> > > > > >> >> > > > aforementioned
> > > > > >> >> > > > > > > bug,
> > > > > >> >> > > > > > > > causing contention on checkpointReadLock;
> > > > > >> >> > > > > > > > - https://issues.apache.org/
> > > jira/browse/IGNITE-8018
> > > > -
> > > > > >> >> > > inefficiency
> > > > > >> >> > > > > in
> > > > > >> >> > > > > > > > GridCacheMapEntry implementation;
> > > > > >> >> > > > > > > > - https://issues.apache.org/
> > > jira/browse/IGNITE-8017
> > > > -
> > > > > >> >> > > > automatically
> > > > > >> >> > > > > > > > disable
> > > > > >> >> > > > > > > > WAL during preloading.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Ways to solve problem on more global level:
> > > > > >> >> > > > > > > > - https://issues.apache.org/
> > > jira/browse/IGNITE-7935
> > > > -
> > > > > a
> > > > > >> >> ticket
> > > > > >> >> > > to
> > > > > >> >> > > > > > > > introduce
> > > > > >> >> > > > > > > > batch modification;
> > > > > >> >> > > > > > > > - https://issues.apache.org/
> > > jira/browse/IGNITE-8020
> > > > -
> > > > > >> >> complete
> > > > > >> >> > > > > > redesign
> > > > > >> >> > > > > > > of
> > > > > >> >> > > > > > > > rebalancing process for persistent caches,
> based
> > on
> > > > > file
> > > > > >> >> > > transfer.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Everyone is welcome to criticize above ideas,
> > > suggest
> > > > > new
> > > > > >> >> ones
> > > > > >> >> > or
> > > > > >> >> > > > > > > > participate in implementation.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > --
> > > > > >> >> > > > > > > > Best regards,
> > > > > >> >> > > > > > > > Ilya
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > --
> > > > > >> >> > > > > > Best regards,
> > > > > >> >> > > > > > Ilya
> > > > > >> >> > > > > >
> > > > > >> >> > > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > > --
> > > > > >> >> > > Best regards,
> > > > > >> >> > > Ilya
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> --
> > > > > >> >> Best regards,
> > > > > >> >> Ilya
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ilya
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ilya
> > > > >
> > > >
> > >
> >
>
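To make the per-entry processing problem quoted above more tangible, here is a
minimal, self-contained sketch of the batching idea in plain Java. The lock,
the string "WAL" and the map are only stand-ins for the real checkpointReadLock,
WAL and CacheDataStore; the point is one lock acquisition, one log record and
one pass over the store per supply batch instead of per entry.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model of batched preloading; names mirror the discussion, not real Ignite classes. */
class BatchedPreloader {
    private final ReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    private final StringBuilder wal = new StringBuilder();           // stand-in for the real WAL
    private final Map<String, String> store = new LinkedHashMap<>(); // stand-in for CacheDataStore

    /** Applies one rebalance batch: one lock, one WAL record, one pass over the store. */
    void applyBatch(List<Map.Entry<String, String>> batch) {
        checkpointLock.readLock().lock();                 // taken once per batch, not per entry
        try {
            wal.append("BATCH ").append(batch.size()).append('\n'); // one record (and fsync) per batch

            for (Map.Entry<String, String> e : batch)     // would become a bulk putAll on the index tree
                store.put(e.getKey(), e.getValue());
        }
        finally {
            checkpointLock.readLock().unlock();
        }
    }
}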

Re: Rebalancing - how to make it faster

Posted by Anton Vinogradov <av...@apache.org>.
>> I would enable this option only after we confirm it's stable.
We have to add a system property in that case.

It's a bad idea to expose such an option in the public API while it can be unstable.
It would be like adding allowMoreThan1024Node(bool) just because we have issues with that.
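
For illustration only, guarding the experimental behavior with a JVM system
property instead of a public configuration setter could look like the sketch
below. The property name is made up for this example and is not a documented
Ignite property.

/** Sketch of an internal switch read from a JVM system property. */
class WalPreloadingSwitch {
    /** Hypothetical name, passed as -DIGNITE_DISABLE_WAL_DURING_REBALANCING=true. */
    static final String PROP_NAME = "IGNITE_DISABLE_WAL_DURING_REBALANCING";

    /** @return true if WAL-off preloading was explicitly requested for this JVM. */
    static boolean walOffPreloadingEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROP_NAME, "false"));
    }
}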

2018-04-09 20:03 GMT+03:00 Yakov Zhdanov <yz...@apache.org>:

> How about skipWalOnRebalancing or disableWalOnRebalancing? I like the first
> one better and both are shorter.
>
> --Yakov
>
> 2018-04-09 19:55 GMT+03:00 Denis Magda <dm...@apache.org>:
>
> > I would enable this option only after we confirm it's stable. Until it
> > happens it should be treated as an experimental feature that is turned on
> > manually with a parameter like Ilya is suggesting.
> >
> > Ilya, the name sounds good to me.
> >
> > --
> > Denis
> >
> > On Mon, Apr 9, 2018 at 9:16 AM, Anton Vinogradov <av...@apache.org> wrote:
> >
> > > Ilya,
> > >
> > > WAL should be automatically disabled at initial rebalancing.
> > > It can't be disabled at regilar rebalancing, since you never ready to
> > lose
> > > data you protected by WAL.
> > > Is there any need to have special param in that case?
> > >
> > > 2018-04-09 16:41 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> > >
> > > > Igniters,
> > > >
> > > > I am currently at the finish line of
> > > > https://issues.apache.org/jira/browse/IGNITE-8017 ("Disable WAL
> during
> > > > initial preloading") implementation. And I need that such behavior
> > should
> > > > be configurable. In my intermediate implementation I have parameter
> > > called
> > > > "disableWalDuringRebalancing" in IgniteConfiguration. Do you thing
> such
> > > > name is meaningful and self-explanatory? Do we need to ensure that it
> > has
> > > > the same value on every node? Should I make it configurable per cache
> > > > rather than globally?
> > > >
> > > > Please share your thoughts.
> > > >
> > > > On Mon, Apr 9, 2018 at 4:32 PM, Ilya Lantukh <il...@gridgain.com>
> > > > wrote:
> > > >
> > > > > Denis,
> > > > >
> > > > > Those ticket are rather complex, and so I don't know when I'll be
> > able
> > > to
> > > > > start working on them.
> > > > >
> > > > > On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <dm...@apache.org>
> > > wrote:
> > > > >
> > > > >> Ilya,
> > > > >>
> > > > >> Just came across the IEP put together by you:
> > > > >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%
> > > > >> 3A+Optimization+of+rebalancing
> > > > >>
> > > > >> Excellent explanation, thanks for aggregating everything there.
> > > > >>
> > > > >> Two tickets below don't have a fixed version assigned:
> > > > >> https://issues.apache.org/jira/browse/IGNITE-8020
> > > > >> https://issues.apache.org/jira/browse/IGNITE-7935
> > > > >>
> > > > >> Do you plan to work on them in 2.6 time frame, right?
> > > > >>
> > > > >> --
> > > > >> Denis
> > > > >>
> > > > >> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org>
> > > wrote:
> > > > >>
> > > > >> > Ilya, granted you all the required permissions. Please let me
> know
> > > if
> > > > >> you
> > > > >> > still have troubles with the wiki.
> > > > >> >
> > > > >> > --
> > > > >> > Denis
> > > > >> >
> > > > >> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <
> > > ilantukh@gridgain.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Unfortunately, I don't have permission to create page for IEP
> on
> > > > wiki.
> > > > >> >> Denis, can you grant it? My username is ilantukh.
> > > > >> >>
> > > > >> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <
> av@apache.org
> > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >> > >> It is impossible to disable WAL only for certain
> partitions
> > > > >> without
> > > > >> >> > >> completely overhauling design of Ignite storage mechanism.
> > > Right
> > > > >> now
> > > > >> >> we
> > > > >> >> > can
> > > > >> >> > >> afford only to change WAL mode per cache group.
> > > > >> >> >
> > > > >> >> > Cache group rebalancing is a one cache rebalancing, and then
> > this
> > > > >> cache
> > > > >> >> > ("cache group") can be presented as a set of virtual caches.
> > > > >> >> > So, there is no issues for initial rebalancing.
> > > > >> >> > Lets disable WAL on initial rebalancing.
> > > > >> >> >
> > > > >> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <
> ilantukh@gridgain.com
> > >:
> > > > >> >> >
> > > > >> >> > > Dmitry,
> > > > >> >> > > It is impossible to disable WAL only for certain partitions
> > > > without
> > > > >> >> > > completely overhauling design of Ignite storage mechanism.
> > > Right
> > > > >> now
> > > > >> >> we
> > > > >> >> > can
> > > > >> >> > > afford only to change WAL mode per cache group.
> > > > >> >> > >
> > > > >> >> > > The idea is to disable WAL when node doesn't have any
> > partition
> > > > in
> > > > >> >> OWNING
> > > > >> >> > > state, which means it doesn't have any consistent data and
> > > won't
> > > > be
> > > > >> >> able
> > > > >> >> > to
> > > > >> >> > > restore from WAL anyway. I don't see any potential use for
> > WAL
> > > on
> > > > >> such
> > > > >> >> > > node, but we can keep a configurable parameter indicating
> can
> > > we
> > > > >> >> > > automatically disable WAL in such case or not.
> > > > >> >> > >
> > > > >> >> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
> > > > >> >> dpavlov.spb@gmail.com>
> > > > >> >> > > wrote:
> > > > >> >> > >
> > > > >> >> > > > Denis, as I understood, there is and idea to exclude only
> > > > >> rebalanced
> > > > >> >> > > > partition(s) data. All other data will go to the WAL.
> > > > >> >> > > >
> > > > >> >> > > > Ilya, please correct me if I'm wrong.
> > > > >> >> > > >
> > > > >> >> > > > пт, 23 мар. 2018 г. в 22:15, Denis Magda <
> > dmagda@apache.org
> > > >:
> > > > >> >> > > >
> > > > >> >> > > > > Ilya,
> > > > >> >> > > > >
> > > > >> >> > > > > That's a decent boost (5-20%) even having WAL enabled.
> > Not
> > > > sure
> > > > >> >> that
> > > > >> >> > we
> > > > >> >> > > > > should stake on the WAL "off" mode here because if the
> > > whole
> > > > >> >> cluster
> > > > >> >> > > goes
> > > > >> >> > > > > down, it's then the data consistency is questionable.
> As
> > an
> > > > >> >> > architect,
> > > > >> >> > > I
> > > > >> >> > > > > wouldn't disable WAL for the sake of rebalancing; it's
> > too
> > > > >> risky.
> > > > >> >> > > > >
> > > > >> >> > > > > If you agree, then let's create the IEP. This way it
> will
> > > be
> > > > >> >> easier
> > > > >> >> > to
> > > > >> >> > > > > track this endeavor. BTW, are you already ready to
> > release
> > > > any
> > > > >> >> > > > > optimizations in 2.5 that is being discussed in a
> > separate
> > > > >> thread?
> > > > >> >> > > > >
> > > > >> >> > > > > --
> > > > >> >> > > > > Denis
> > > > >> >> > > > >
> > > > >> >> > > > >
> > > > >> >> > > > >
> > > > >> >> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
> > > > >> >> ilantukh@gridgain.com
> > > > >> >> > >
> > > > >> >> > > > > wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Denis,
> > > > >> >> > > > > >
> > > > >> >> > > > > > > - Don't you want to aggregate the tickets under an
> > IEP?
> > > > >> >> > > > > > Yes, I think so.
> > > > >> >> > > > > >
> > > > >> >> > > > > > > - Does it mean we're going to update our B+Tree
> > > > >> >> implementation?
> > > > >> >> > Any
> > > > >> >> > > > > ideas
> > > > >> >> > > > > > how risky it is?
> > > > >> >> > > > > > One of tickets that I created (
> > > > >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935)
> > > > involves
> > > > >> >> B+Tree
> > > > >> >> > > > > > modification, but I am not planning to do it in the
> > > nearest
> > > > >> >> future.
> > > > >> >> > > It
> > > > >> >> > > > > > shouldn't affect existing tree operations, only
> > introduce
> > > > new
> > > > >> >> ones
> > > > >> >> > > > > (putAll,
> > > > >> >> > > > > > invokeAll, removeAll).
> > > > >> >> > > > > >
> > > > >> >> > > > > > > - Any chance you had a prototype that shows
> > performance
> > > > >> >> > > optimizations
> > > > >> >> > > > > the
> > > > >> >> > > > > > approach you are suggesting to take?
> > > > >> >> > > > > > I have a prototype for simplest improvements (
> > > > >> >> > > > https://issues.apache.org/
> > > > >> >> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > > > >> >> > > > > > jira/browse/IGNITE-8018)
> > > > >> >> > > > > > - together they increase throughput by 5-20%,
> depending
> > > on
> > > > >> >> > > > configuration
> > > > >> >> > > > > > and environment. Also, I've tested different WAL
> modes
> > -
> > > > >> >> switching
> > > > >> >> > > from
> > > > >> >> > > > > > LOG_ONLY to NONE gives over 100% boost - this is
> what I
> > > > >> expect
> > > > >> >> from
> > > > >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > > > >> >> > > > > >
> > > > >> >> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <
> > > > >> dmagda@apache.org
> > > > >> >> >
> > > > >> >> > > > wrote:
> > > > >> >> > > > > >
> > > > >> >> > > > > > > Ilya,
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > That's outstanding research and summary. Thanks for
> > > > >> spending
> > > > >> >> your
> > > > >> >> > > > time
> > > > >> >> > > > > on
> > > > >> >> > > > > > > this.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Not sure I have enough expertise to challenge your
> > > > >> approach,
> > > > >> >> but
> > > > >> >> > it
> > > > >> >> > > > > > sounds
> > > > >> >> > > > > > > 100% reasonable to me. As side notes:
> > > > >> >> > > > > > >
> > > > >> >> > > > > > >    - Don't you want to aggregate the tickets under
> an
> > > > IEP?
> > > > >> >> > > > > > >    - Does it mean we're going to update our B+Tree
> > > > >> >> > implementation?
> > > > >> >> > > > Any
> > > > >> >> > > > > > >    ideas how risky it is?
> > > > >> >> > > > > > >    - Any chance you had a prototype that shows
> > > > performance
> > > > >> >> > > > > optimizations
> > > > >> >> > > > > > of
> > > > >> >> > > > > > >    the approach you are suggesting to take?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > --
> > > > >> >> > > > > > > Denis
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> > > > >> >> > > ilantukh@gridgain.com
> > > > >> >> > > > >
> > > > >> >> > > > > > > wrote:
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > Igniters,
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > I've spent some time analyzing performance of
> > > > rebalancing
> > > > >> >> > > process.
> > > > >> >> > > > > The
> > > > >> >> > > > > > > > initial goal was to understand, what limits it's
> > > > >> throughput,
> > > > >> >> > > > because
> > > > >> >> > > > > it
> > > > >> >> > > > > > > is
> > > > >> >> > > > > > > > significantly slower than network and storage
> > device
> > > > can
> > > > >> >> > > > > theoretically
> > > > >> >> > > > > > > > handle.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Turns out, our current implementation has a
> number
> > of
> > > > >> issues
> > > > >> >> > > caused
> > > > >> >> > > > > by
> > > > >> >> > > > > > a
> > > > >> >> > > > > > > > single fundamental problem.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > During rebalance data is sent in batches called
> > > > >> >> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is
> > > > >> configurable,
> > > > >> >> > > > default
> > > > >> >> > > > > > > value
> > > > >> >> > > > > > > > is 512KB, which could mean thousands of key-value
> > > > pairs.
> > > > >> >> > However,
> > > > >> >> > > > we
> > > > >> >> > > > > > > don't
> > > > >> >> > > > > > > > take any advantage over this fact and process
> each
> > > > entry
> > > > >> >> > > > > independently:
> > > > >> >> > > > > > > > - checkpointReadLock is acquired multiple times
> for
> > > > every
> > > > >> >> > entry,
> > > > >> >> > > > > > leading
> > > > >> >> > > > > > > to
> > > > >> >> > > > > > > > unnecessary contention - this is clearly a bug;
> > > > >> >> > > > > > > > - for each entry we write (and fsync, if
> > > configuration
> > > > >> >> assumes
> > > > >> >> > > it)
> > > > >> >> > > > a
> > > > >> >> > > > > > > > separate WAL record - so, if batch contains N
> > > entries,
> > > > we
> > > > >> >> might
> > > > >> >> > > end
> > > > >> >> > > > > up
> > > > >> >> > > > > > > > doing N fsyncs;
> > > > >> >> > > > > > > > - adding every entry into CacheDataStore also
> > happens
> > > > >> >> > completely
> > > > >> >> > > > > > > > independently. It means, we will traverse and
> > modify
> > > > each
> > > > >> >> index
> > > > >> >> > > > tree
> > > > >> >> > > > > N
> > > > >> >> > > > > > > > times, we will allocate space in FreeList N times
> > and
> > > > we
> > > > >> >> will
> > > > >> >> > > have
> > > > >> >> > > > to
> > > > >> >> > > > > > > > additionally store in WAL O(N*log(N)) page delta
> > > > records.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > I've created a few tickets in JIRA with very
> > > different
> > > > >> >> levels
> > > > >> >> > of
> > > > >> >> > > > > scale
> > > > >> >> > > > > > > and
> > > > >> >> > > > > > > > complexity.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Ways to reduce impact of independent processing:
> > > > >> >> > > > > > > > - https://issues.apache.org/
> > jira/browse/IGNITE-8019
> > > -
> > > > >> >> > > > aforementioned
> > > > >> >> > > > > > > bug,
> > > > >> >> > > > > > > > causing contention on checkpointReadLock;
> > > > >> >> > > > > > > > - https://issues.apache.org/
> > jira/browse/IGNITE-8018
> > > -
> > > > >> >> > > inefficiency
> > > > >> >> > > > > in
> > > > >> >> > > > > > > > GridCacheMapEntry implementation;
> > > > >> >> > > > > > > > - https://issues.apache.org/
> > jira/browse/IGNITE-8017
> > > -
> > > > >> >> > > > automatically
> > > > >> >> > > > > > > > disable
> > > > >> >> > > > > > > > WAL during preloading.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Ways to solve problem on more global level:
> > > > >> >> > > > > > > > - https://issues.apache.org/
> > jira/browse/IGNITE-7935
> > > -
> > > > a
> > > > >> >> ticket
> > > > >> >> > > to
> > > > >> >> > > > > > > > introduce
> > > > >> >> > > > > > > > batch modification;
> > > > >> >> > > > > > > > - https://issues.apache.org/
> > jira/browse/IGNITE-8020
> > > -
> > > > >> >> complete
> > > > >> >> > > > > > redesign
> > > > >> >> > > > > > > of
> > > > >> >> > > > > > > > rebalancing process for persistent caches, based
> on
> > > > file
> > > > >> >> > > transfer.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Everyone is welcome to criticize above ideas,
> > suggest
> > > > new
> > > > >> >> ones
> > > > >> >> > or
> > > > >> >> > > > > > > > participate in implementation.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > --
> > > > >> >> > > > > > > > Best regards,
> > > > >> >> > > > > > > > Ilya
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > > --
> > > > >> >> > > > > > Best regards,
> > > > >> >> > > > > > Ilya
> > > > >> >> > > > > >
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > --
> > > > >> >> > > Best regards,
> > > > >> >> > > Ilya
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Best regards,
> > > > >> >> Ilya
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ilya
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Ilya
> > > >
> > >
> >
>
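
The LOG_ONLY vs NONE comparison mentioned in the quoted benchmark above can be
reproduced with a small test configuration. A sketch, assuming the Ignite 2.x
data storage API; switch the mode value between the two runs to compare
preloading throughput:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalModeBenchmark {
    public static void main(String[] args) {
        // Use WALMode.LOG_ONLY for one run and WALMode.NONE for the other.
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setWalMode(WALMode.LOG_ONLY)
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                    .setPersistenceEnabled(true)));

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().active(true); // persistent clusters must be activated first
            // ... preload data here and measure the time taken ...
        }
    }
}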

Re: Rebalancing - how to make it faster

Posted by Yakov Zhdanov <yz...@apache.org>.
How about skipWalOnRebalancing or disableWalOnRebalancing? I like the first
one better, and both are shorter than disableWalDuringRebalancing.

--Yakov
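
Purely to illustrate the naming question, the flag could surface as a fluent
setter in the usual Ignite configuration style. The class and setter below are
hypothetical, not an existing API:

/** Sketch of the shorter name as a configuration flag; not an actual IgniteConfiguration member. */
class RebalanceWalSettings {
    private boolean skipWalOnRebalancing;

    public boolean isSkipWalOnRebalancing() {
        return skipWalOnRebalancing;
    }

    /** Fluent setter, matching the style of other Ignite configuration properties. */
    public RebalanceWalSettings setSkipWalOnRebalancing(boolean skip) {
        this.skipWalOnRebalancing = skip;
        return this;
    }
}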

2018-04-09 19:55 GMT+03:00 Denis Magda <dm...@apache.org>:

> I would enable this option only after we confirm it's stable. Until it
> happens it should be treated as an experimental feature that is turned on
> manually with a parameter like Ilya is suggesting.
>
> Ilya, the name sounds good to me.
>
> --
> Denis
>
> On Mon, Apr 9, 2018 at 9:16 AM, Anton Vinogradov <av...@apache.org> wrote:
>
> > Ilya,
> >
> > WAL should be automatically disabled at initial rebalancing.
> > It can't be disabled at regilar rebalancing, since you never ready to
> lose
> > data you protected by WAL.
> > Is there any need to have special param in that case?
> >
> > 2018-04-09 16:41 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> >
> > > Igniters,
> > >
> > > I am currently at the finish line of
> > > https://issues.apache.org/jira/browse/IGNITE-8017 ("Disable WAL during
> > > initial preloading") implementation. And I need that such behavior
> should
> > > be configurable. In my intermediate implementation I have parameter
> > called
> > > "disableWalDuringRebalancing" in IgniteConfiguration. Do you thing such
> > > name is meaningful and self-explanatory? Do we need to ensure that it
> has
> > > the same value on every node? Should I make it configurable per cache
> > > rather than globally?
> > >
> > > Please share your thoughts.
> > >
> > > On Mon, Apr 9, 2018 at 4:32 PM, Ilya Lantukh <il...@gridgain.com>
> > > wrote:
> > >
> > > > Denis,
> > > >
> > > > Those ticket are rather complex, and so I don't know when I'll be
> able
> > to
> > > > start working on them.
> > > >
> > > > On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <dm...@apache.org>
> > wrote:
> > > >
> > > >> Ilya,
> > > >>
> > > >> Just came across the IEP put together by you:
> > > >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%
> > > >> 3A+Optimization+of+rebalancing
> > > >>
> > > >> Excellent explanation, thanks for aggregating everything there.
> > > >>
> > > >> Two tickets below don't have a fixed version assigned:
> > > >> https://issues.apache.org/jira/browse/IGNITE-8020
> > > >> https://issues.apache.org/jira/browse/IGNITE-7935
> > > >>
> > > >> Do you plan to work on them in 2.6 time frame, right?
> > > >>
> > > >> --
> > > >> Denis
> > > >>
> > > >> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org>
> > wrote:
> > > >>
> > > >> > Ilya, granted you all the required permissions. Please let me know
> > if
> > > >> you
> > > >> > still have troubles with the wiki.
> > > >> >
> > > >> > --
> > > >> > Denis
> > > >> >
> > > >> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <
> > ilantukh@gridgain.com>
> > > >> > wrote:
> > > >> >
> > > >> >> Unfortunately, I don't have permission to create page for IEP on
> > > wiki.
> > > >> >> Denis, can you grant it? My username is ilantukh.
> > > >> >>
> > > >> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av@apache.org
> >
> > > >> wrote:
> > > >> >>
> > > >> >> > >> It is impossible to disable WAL only for certain partitions
> > > >> without
> > > >> >> > >> completely overhauling design of Ignite storage mechanism.
> > Right
> > > >> now
> > > >> >> we
> > > >> >> > can
> > > >> >> > >> afford only to change WAL mode per cache group.
> > > >> >> >
> > > >> >> > Cache group rebalancing is a one cache rebalancing, and then
> this
> > > >> cache
> > > >> >> > ("cache group") can be presented as a set of virtual caches.
> > > >> >> > So, there is no issues for initial rebalancing.
> > > >> >> > Lets disable WAL on initial rebalancing.
> > > >> >> >
> > > >> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <ilantukh@gridgain.com
> >:
> > > >> >> >
> > > >> >> > > Dmitry,
> > > >> >> > > It is impossible to disable WAL only for certain partitions
> > > without
> > > >> >> > > completely overhauling design of Ignite storage mechanism.
> > Right
> > > >> now
> > > >> >> we
> > > >> >> > can
> > > >> >> > > afford only to change WAL mode per cache group.
> > > >> >> > >
> > > >> >> > > The idea is to disable WAL when node doesn't have any
> partition
> > > in
> > > >> >> OWNING
> > > >> >> > > state, which means it doesn't have any consistent data and
> > won't
> > > be
> > > >> >> able
> > > >> >> > to
> > > >> >> > > restore from WAL anyway. I don't see any potential use for
> WAL
> > on
> > > >> such
> > > >> >> > > node, but we can keep a configurable parameter indicating can
> > we
> > > >> >> > > automatically disable WAL in such case or not.
> > > >> >> > >
> > > >> >> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
> > > >> >> dpavlov.spb@gmail.com>
> > > >> >> > > wrote:
> > > >> >> > >
> > > >> >> > > > Denis, as I understood, there is and idea to exclude only
> > > >> rebalanced
> > > >> >> > > > partition(s) data. All other data will go to the WAL.
> > > >> >> > > >
> > > >> >> > > > Ilya, please correct me if I'm wrong.
> > > >> >> > > >
> > > >> >> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <
> dmagda@apache.org
> > >:
> > > >> >> > > >
> > > >> >> > > > > Ilya,
> > > >> >> > > > >
> > > >> >> > > > > That's a decent boost (5-20%) even having WAL enabled.
> Not
> > > sure
> > > >> >> that
> > > >> >> > we
> > > >> >> > > > > should stake on the WAL "off" mode here because if the
> > whole
> > > >> >> cluster
> > > >> >> > > goes
> > > >> >> > > > > down, it's then the data consistency is questionable. As
> an
> > > >> >> > architect,
> > > >> >> > > I
> > > >> >> > > > > wouldn't disable WAL for the sake of rebalancing; it's
> too
> > > >> risky.
> > > >> >> > > > >
> > > >> >> > > > > If you agree, then let's create the IEP. This way it will
> > be
> > > >> >> easier
> > > >> >> > to
> > > >> >> > > > > track this endeavor. BTW, are you already ready to
> release
> > > any
> > > >> >> > > > > optimizations in 2.5 that is being discussed in a
> separate
> > > >> thread?
> > > >> >> > > > >
> > > >> >> > > > > --
> > > >> >> > > > > Denis
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
> > > >> >> ilantukh@gridgain.com
> > > >> >> > >
> > > >> >> > > > > wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Denis,
> > > >> >> > > > > >
> > > >> >> > > > > > > - Don't you want to aggregate the tickets under an
> IEP?
> > > >> >> > > > > > Yes, I think so.
> > > >> >> > > > > >
> > > >> >> > > > > > > - Does it mean we're going to update our B+Tree
> > > >> >> implementation?
> > > >> >> > Any
> > > >> >> > > > > ideas
> > > >> >> > > > > > how risky it is?
> > > >> >> > > > > > One of tickets that I created (
> > > >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935)
> > > involves
> > > >> >> B+Tree
> > > >> >> > > > > > modification, but I am not planning to do it in the
> > nearest
> > > >> >> future.
> > > >> >> > > It
> > > >> >> > > > > > shouldn't affect existing tree operations, only
> introduce
> > > new
> > > >> >> ones
> > > >> >> > > > > (putAll,
> > > >> >> > > > > > invokeAll, removeAll).
> > > >> >> > > > > >
> > > >> >> > > > > > > - Any chance you had a prototype that shows
> performance
> > > >> >> > > optimizations
> > > >> >> > > > > the
> > > >> >> > > > > > approach you are suggesting to take?
> > > >> >> > > > > > I have a prototype for simplest improvements (
> > > >> >> > > > https://issues.apache.org/
> > > >> >> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > > >> >> > > > > > jira/browse/IGNITE-8018)
> > > >> >> > > > > > - together they increase throughput by 5-20%, depending
> > on
> > > >> >> > > > configuration
> > > >> >> > > > > > and environment. Also, I've tested different WAL modes
> -
> > > >> >> switching
> > > >> >> > > from
> > > >> >> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I
> > > >> expect
> > > >> >> from
> > > >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > > >> >> > > > > >
> > > >> >> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <
> > > >> dmagda@apache.org
> > > >> >> >
> > > >> >> > > > wrote:
> > > >> >> > > > > >
> > > >> >> > > > > > > Ilya,
> > > >> >> > > > > > >
> > > >> >> > > > > > > That's outstanding research and summary. Thanks for
> > > >> spending
> > > >> >> your
> > > >> >> > > > time
> > > >> >> > > > > on
> > > >> >> > > > > > > this.
> > > >> >> > > > > > >
> > > >> >> > > > > > > Not sure I have enough expertise to challenge your
> > > >> approach,
> > > >> >> but
> > > >> >> > it
> > > >> >> > > > > > sounds
> > > >> >> > > > > > > 100% reasonable to me. As side notes:
> > > >> >> > > > > > >
> > > >> >> > > > > > >    - Don't you want to aggregate the tickets under an
> > > IEP?
> > > >> >> > > > > > >    - Does it mean we're going to update our B+Tree
> > > >> >> > implementation?
> > > >> >> > > > Any
> > > >> >> > > > > > >    ideas how risky it is?
> > > >> >> > > > > > >    - Any chance you had a prototype that shows
> > > performance
> > > >> >> > > > > optimizations
> > > >> >> > > > > > of
> > > >> >> > > > > > >    the approach you are suggesting to take?
> > > >> >> > > > > > >
> > > >> >> > > > > > > --
> > > >> >> > > > > > > Denis
> > > >> >> > > > > > >
> > > >> >> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> > > >> >> > > ilantukh@gridgain.com
> > > >> >> > > > >
> > > >> >> > > > > > > wrote:
> > > >> >> > > > > > >
> > > >> >> > > > > > > > Igniters,
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > I've spent some time analyzing performance of
> > > rebalancing
> > > >> >> > > process.
> > > >> >> > > > > The
> > > >> >> > > > > > > > initial goal was to understand, what limits it's
> > > >> throughput,
> > > >> >> > > > because
> > > >> >> > > > > it
> > > >> >> > > > > > > is
> > > >> >> > > > > > > > significantly slower than network and storage
> device
> > > can
> > > >> >> > > > > theoretically
> > > >> >> > > > > > > > handle.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Turns out, our current implementation has a number
> of
> > > >> issues
> > > >> >> > > caused
> > > >> >> > > > > by
> > > >> >> > > > > > a
> > > >> >> > > > > > > > single fundamental problem.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > During rebalance data is sent in batches called
> > > >> >> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is
> > > >> configurable,
> > > >> >> > > > default
> > > >> >> > > > > > > value
> > > >> >> > > > > > > > is 512KB, which could mean thousands of key-value
> > > pairs.
> > > >> >> > However,
> > > >> >> > > > we
> > > >> >> > > > > > > don't
> > > >> >> > > > > > > > take any advantage over this fact and process each
> > > entry
> > > >> >> > > > > independently:
> > > >> >> > > > > > > > - checkpointReadLock is acquired multiple times for
> > > every
> > > >> >> > entry,
> > > >> >> > > > > > leading
> > > >> >> > > > > > > to
> > > >> >> > > > > > > > unnecessary contention - this is clearly a bug;
> > > >> >> > > > > > > > - for each entry we write (and fsync, if
> > configuration
> > > >> >> assumes
> > > >> >> > > it)
> > > >> >> > > > a
> > > >> >> > > > > > > > separate WAL record - so, if batch contains N
> > entries,
> > > we
> > > >> >> might
> > > >> >> > > end
> > > >> >> > > > > up
> > > >> >> > > > > > > > doing N fsyncs;
> > > >> >> > > > > > > > - adding every entry into CacheDataStore also
> happens
> > > >> >> > completely
> > > >> >> > > > > > > > independently. It means, we will traverse and
> modify
> > > each
> > > >> >> index
> > > >> >> > > > tree
> > > >> >> > > > > N
> > > >> >> > > > > > > > times, we will allocate space in FreeList N times
> and
> > > we
> > > >> >> will
> > > >> >> > > have
> > > >> >> > > > to
> > > >> >> > > > > > > > additionally store in WAL O(N*log(N)) page delta
> > > records.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > I've created a few tickets in JIRA with very
> > different
> > > >> >> levels
> > > >> >> > of
> > > >> >> > > > > scale
> > > >> >> > > > > > > and
> > > >> >> > > > > > > > complexity.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Ways to reduce impact of independent processing:
> > > >> >> > > > > > > > - https://issues.apache.org/
> jira/browse/IGNITE-8019
> > -
> > > >> >> > > > aforementioned
> > > >> >> > > > > > > bug,
> > > >> >> > > > > > > > causing contention on checkpointReadLock;
> > > >> >> > > > > > > > - https://issues.apache.org/
> jira/browse/IGNITE-8018
> > -
> > > >> >> > > inefficiency
> > > >> >> > > > > in
> > > >> >> > > > > > > > GridCacheMapEntry implementation;
> > > >> >> > > > > > > > - https://issues.apache.org/
> jira/browse/IGNITE-8017
> > -
> > > >> >> > > > automatically
> > > >> >> > > > > > > > disable
> > > >> >> > > > > > > > WAL during preloading.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Ways to solve problem on more global level:
> > > >> >> > > > > > > > - https://issues.apache.org/
> jira/browse/IGNITE-7935
> > -
> > > a
> > > >> >> ticket
> > > >> >> > > to
> > > >> >> > > > > > > > introduce
> > > >> >> > > > > > > > batch modification;
> > > >> >> > > > > > > > - https://issues.apache.org/
> jira/browse/IGNITE-8020
> > -
> > > >> >> complete
> > > >> >> > > > > > redesign
> > > >> >> > > > > > > of
> > > >> >> > > > > > > > rebalancing process for persistent caches, based on
> > > file
> > > >> >> > > transfer.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Everyone is welcome to criticize above ideas,
> suggest
> > > new
> > > >> >> ones
> > > >> >> > or
> > > >> >> > > > > > > > participate in implementation.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > --
> > > >> >> > > > > > > > Best regards,
> > > >> >> > > > > > > > Ilya
> > > >> >> > > > > > > >
> > > >> >> > > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > > --
> > > >> >> > > > > > Best regards,
> > > >> >> > > > > > Ilya
> > > >> >> > > > > >
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > --
> > > >> >> > > Best regards,
> > > >> >> > > Ilya
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Best regards,
> > > >> >> Ilya
> > > >> >>
> > > >> >
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Ilya
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ilya
> > >
> >
>

Re: Rebalancing - how to make it faster

Posted by Denis Magda <dm...@apache.org>.
I would enable this option by default only after we confirm it's stable. Until
that happens, it should be treated as an experimental feature that is turned on
manually with a parameter, as Ilya is suggesting.

Ilya, the name sounds good to me.

--
Denis
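
From the user's side, the manual opt-in could look roughly like this. The
commented-out setter uses the name Ilya proposed purely as a placeholder; it is
not a finalized API:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ExperimentalRebalanceStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Hypothetical experimental switch, off by default; the user enables it explicitly:
        // cfg.setDisableWalDuringRebalancing(true);

        try (Ignite ignite = Ignition.start(cfg)) {
            // the node would run with WAL-off preloading requested
        }
    }
}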

On Mon, Apr 9, 2018 at 9:16 AM, Anton Vinogradov <av...@apache.org> wrote:

> Ilya,
>
> WAL should be automatically disabled at initial rebalancing.
> It can't be disabled at regilar rebalancing, since you never ready to lose
> data you protected by WAL.
> Is there any need to have special param in that case?
>
> 2018-04-09 16:41 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
>
> > Igniters,
> >
> > I am currently at the finish line of
> > https://issues.apache.org/jira/browse/IGNITE-8017 ("Disable WAL during
> > initial preloading") implementation. And I need that such behavior should
> > be configurable. In my intermediate implementation I have parameter
> called
> > "disableWalDuringRebalancing" in IgniteConfiguration. Do you thing such
> > name is meaningful and self-explanatory? Do we need to ensure that it has
> > the same value on every node? Should I make it configurable per cache
> > rather than globally?
> >
> > Please share your thoughts.
> >
> > On Mon, Apr 9, 2018 at 4:32 PM, Ilya Lantukh <il...@gridgain.com>
> > wrote:
> >
> > > Denis,
> > >
> > > Those ticket are rather complex, and so I don't know when I'll be able
> to
> > > start working on them.
> > >
> > > On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <dm...@apache.org>
> wrote:
> > >
> > >> Ilya,
> > >>
> > >> Just came across the IEP put together by you:
> > >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%
> > >> 3A+Optimization+of+rebalancing
> > >>
> > >> Excellent explanation, thanks for aggregating everything there.
> > >>
> > >> Two tickets below don't have a fixed version assigned:
> > >> https://issues.apache.org/jira/browse/IGNITE-8020
> > >> https://issues.apache.org/jira/browse/IGNITE-7935
> > >>
> > >> Do you plan to work on them in 2.6 time frame, right?
> > >>
> > >> --
> > >> Denis
> > >>
> > >> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org>
> wrote:
> > >>
> > >> > Ilya, granted you all the required permissions. Please let me know
> if
> > >> you
> > >> > still have troubles with the wiki.
> > >> >
> > >> > --
> > >> > Denis
> > >> >
> > >> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <
> ilantukh@gridgain.com>
> > >> > wrote:
> > >> >
> > >> >> Unfortunately, I don't have permission to create page for IEP on
> > wiki.
> > >> >> Denis, can you grant it? My username is ilantukh.
> > >> >>
> > >> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org>
> > >> wrote:
> > >> >>
> > >> >> > >> It is impossible to disable WAL only for certain partitions
> > >> without
> > >> >> > >> completely overhauling design of Ignite storage mechanism.
> Right
> > >> now
> > >> >> we
> > >> >> > can
> > >> >> > >> afford only to change WAL mode per cache group.
> > >> >> >
> > >> >> > Cache group rebalancing is a one cache rebalancing, and then this
> > >> cache
> > >> >> > ("cache group") can be presented as a set of virtual caches.
> > >> >> > So, there is no issues for initial rebalancing.
> > >> >> > Lets disable WAL on initial rebalancing.
> > >> >> >
> > >> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> > >> >> >
> > >> >> > > Dmitry,
> > >> >> > > It is impossible to disable WAL only for certain partitions
> > without
> > >> >> > > completely overhauling design of Ignite storage mechanism.
> Right
> > >> now
> > >> >> we
> > >> >> > can
> > >> >> > > afford only to change WAL mode per cache group.
> > >> >> > >
> > >> >> > > The idea is to disable WAL when node doesn't have any partition
> > in
> > >> >> OWNING
> > >> >> > > state, which means it doesn't have any consistent data and
> won't
> > be
> > >> >> able
> > >> >> > to
> > >> >> > > restore from WAL anyway. I don't see any potential use for WAL
> on
> > >> such
> > >> >> > > node, but we can keep a configurable parameter indicating can
> we
> > >> >> > > automatically disable WAL in such case or not.
> > >> >> > >
> > >> >> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
> > >> >> dpavlov.spb@gmail.com>
> > >> >> > > wrote:
> > >> >> > >
> > >> >> > > > Denis, as I understood, there is and idea to exclude only
> > >> rebalanced
> > >> >> > > > partition(s) data. All other data will go to the WAL.
> > >> >> > > >
> > >> >> > > > Ilya, please correct me if I'm wrong.
> > >> >> > > >
> > >> >> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dmagda@apache.org
> >:
> > >> >> > > >
> > >> >> > > > > Ilya,
> > >> >> > > > >
> > >> >> > > > > That's a decent boost (5-20%) even having WAL enabled. Not
> > sure
> > >> >> that
> > >> >> > we
> > >> >> > > > > should stake on the WAL "off" mode here because if the
> whole
> > >> >> cluster
> > >> >> > > goes
> > >> >> > > > > down, it's then the data consistency is questionable. As an
> > >> >> > architect,
> > >> >> > > I
> > >> >> > > > > wouldn't disable WAL for the sake of rebalancing; it's too
> > >> risky.
> > >> >> > > > >
> > >> >> > > > > If you agree, then let's create the IEP. This way it will
> be
> > >> >> easier
> > >> >> > to
> > >> >> > > > > track this endeavor. BTW, are you already ready to release
> > any
> > >> >> > > > > optimizations in 2.5 that is being discussed in a separate
> > >> thread?
> > >> >> > > > >
> > >> >> > > > > --
> > >> >> > > > > Denis
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
> > >> >> ilantukh@gridgain.com
> > >> >> > >
> > >> >> > > > > wrote:
> > >> >> > > > >
> > >> >> > > > > > Denis,
> > >> >> > > > > >
> > >> >> > > > > > > - Don't you want to aggregate the tickets under an IEP?
> > >> >> > > > > > Yes, I think so.
> > >> >> > > > > >
> > >> >> > > > > > > - Does it mean we're going to update our B+Tree
> > >> >> implementation?
> > >> >> > Any
> > >> >> > > > > ideas
> > >> >> > > > > > how risky it is?
> > >> >> > > > > > One of tickets that I created (
> > >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935)
> > involves
> > >> >> B+Tree
> > >> >> > > > > > modification, but I am not planning to do it in the
> nearest
> > >> >> future.
> > >> >> > > It
> > >> >> > > > > > shouldn't affect existing tree operations, only introduce
> > new
> > >> >> ones
> > >> >> > > > > (putAll,
> > >> >> > > > > > invokeAll, removeAll).
> > >> >> > > > > >
> > >> >> > > > > > > - Any chance you had a prototype that shows performance
> > >> >> > > optimizations
> > >> >> > > > > the
> > >> >> > > > > > approach you are suggesting to take?
> > >> >> > > > > > I have a prototype for simplest improvements (
> > >> >> > > > https://issues.apache.org/
> > >> >> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > >> >> > > > > > jira/browse/IGNITE-8018)
> > >> >> > > > > > - together they increase throughput by 5-20%, depending
> on
> > >> >> > > > configuration
> > >> >> > > > > > and environment. Also, I've tested different WAL modes -
> > >> >> switching
> > >> >> > > from
> > >> >> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I
> > >> expect
> > >> >> from
> > >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > >> >> > > > > >
> > >> >> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <
> > >> dmagda@apache.org
> > >> >> >
> > >> >> > > > wrote:
> > >> >> > > > > >
> > >> >> > > > > > > Ilya,
> > >> >> > > > > > >
> > >> >> > > > > > > That's outstanding research and summary. Thanks for
> > >> spending
> > >> >> your
> > >> >> > > > time
> > >> >> > > > > on
> > >> >> > > > > > > this.
> > >> >> > > > > > >
> > >> >> > > > > > > Not sure I have enough expertise to challenge your
> > >> approach,
> > >> >> but
> > >> >> > it
> > >> >> > > > > > sounds
> > >> >> > > > > > > 100% reasonable to me. As side notes:
> > >> >> > > > > > >
> > >> >> > > > > > >    - Don't you want to aggregate the tickets under an
> > IEP?
> > >> >> > > > > > >    - Does it mean we're going to update our B+Tree
> > >> >> > implementation?
> > >> >> > > > Any
> > >> >> > > > > > >    ideas how risky it is?
> > >> >> > > > > > >    - Any chance you had a prototype that shows
> > performance
> > >> >> > > > > optimizations
> > >> >> > > > > > of
> > >> >> > > > > > >    the approach you are suggesting to take?
> > >> >> > > > > > >
> > >> >> > > > > > > --
> > >> >> > > > > > > Denis
> > >> >> > > > > > >
> > >> >> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> > >> >> > > ilantukh@gridgain.com
> > >> >> > > > >
> > >> >> > > > > > > wrote:
> > >> >> > > > > > >
> > >> >> > > > > > > > Igniters,
> > >> >> > > > > > > >
> > >> >> > > > > > > > I've spent some time analyzing performance of
> > rebalancing
> > >> >> > > process.
> > >> >> > > > > The
> > >> >> > > > > > > > initial goal was to understand, what limits it's
> > >> throughput,
> > >> >> > > > because
> > >> >> > > > > it
> > >> >> > > > > > > is
> > >> >> > > > > > > > significantly slower than network and storage device
> > can
> > >> >> > > > > theoretically
> > >> >> > > > > > > > handle.
> > >> >> > > > > > > >
> > >> >> > > > > > > > Turns out, our current implementation has a number of
> > >> issues
> > >> >> > > caused
> > >> >> > > > > by
> > >> >> > > > > > a
> > >> >> > > > > > > > single fundamental problem.
> > >> >> > > > > > > >
> > >> >> > > > > > > > During rebalance data is sent in batches called
> > >> >> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is
> > >> configurable,
> > >> >> > > > default
> > >> >> > > > > > > value
> > >> >> > > > > > > > is 512KB, which could mean thousands of key-value
> > pairs.
> > >> >> > However,
> > >> >> > > > we
> > >> >> > > > > > > don't
> > >> >> > > > > > > > take any advantage over this fact and process each
> > entry
> > >> >> > > > > independently:
> > >> >> > > > > > > > - checkpointReadLock is acquired multiple times for
> > every
> > >> >> > entry,
> > >> >> > > > > > leading
> > >> >> > > > > > > to
> > >> >> > > > > > > > unnecessary contention - this is clearly a bug;
> > >> >> > > > > > > > - for each entry we write (and fsync, if
> configuration
> > >> >> assumes
> > >> >> > > it)
> > >> >> > > > a
> > >> >> > > > > > > > separate WAL record - so, if batch contains N
> entries,
> > we
> > >> >> might
> > >> >> > > end
> > >> >> > > > > up
> > >> >> > > > > > > > doing N fsyncs;
> > >> >> > > > > > > > - adding every entry into CacheDataStore also happens
> > >> >> > completely
> > >> >> > > > > > > > independently. It means, we will traverse and modify
> > each
> > >> >> index
> > >> >> > > > tree
> > >> >> > > > > N
> > >> >> > > > > > > > times, we will allocate space in FreeList N times and
> > we
> > >> >> will
> > >> >> > > have
> > >> >> > > > to
> > >> >> > > > > > > > additionally store in WAL O(N*log(N)) page delta
> > records.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I've created a few tickets in JIRA with very
> different
> > >> >> levels
> > >> >> > of
> > >> >> > > > > scale
> > >> >> > > > > > > and
> > >> >> > > > > > > > complexity.
> > >> >> > > > > > > >
> > >> >> > > > > > > > Ways to reduce impact of independent processing:
> > >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019
> -
> > >> >> > > > aforementioned
> > >> >> > > > > > > bug,
> > >> >> > > > > > > > causing contention on checkpointReadLock;
> > >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018
> -
> > >> >> > > inefficiency
> > >> >> > > > > in
> > >> >> > > > > > > > GridCacheMapEntry implementation;
> > >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017
> -
> > >> >> > > > automatically
> > >> >> > > > > > > > disable
> > >> >> > > > > > > > WAL during preloading.
> > >> >> > > > > > > >
> > >> >> > > > > > > > Ways to solve problem on more global level:
> > >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935
> -
> > a
> > >> >> ticket
> > >> >> > > to
> > >> >> > > > > > > > introduce
> > >> >> > > > > > > > batch modification;
> > >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020
> -
> > >> >> complete
> > >> >> > > > > > redesign
> > >> >> > > > > > > of
> > >> >> > > > > > > > rebalancing process for persistent caches, based on
> > file
> > >> >> > > transfer.
> > >> >> > > > > > > >
> > >> >> > > > > > > > Everyone is welcome to criticize above ideas, suggest
> > new
> > >> >> ones
> > >> >> > or
> > >> >> > > > > > > > participate in implementation.
> > >> >> > > > > > > >
> > >> >> > > > > > > > --
> > >> >> > > > > > > > Best regards,
> > >> >> > > > > > > > Ilya
> > >> >> > > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > >
> > >> >> > > > > >
> > >> >> > > > > >
> > >> >> > > > > > --
> > >> >> > > > > > Best regards,
> > >> >> > > > > > Ilya
> > >> >> > > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > >
> > >> >> > >
> > >> >> > >
> > >> >> > > --
> > >> >> > > Best regards,
> > >> >> > > Ilya
> > >> >> > >
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Best regards,
> > >> >> Ilya
> > >> >>
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ilya
> > >
> >
> >
> >
> > --
> > Best regards,
> > Ilya
> >
>

Re: Rebalancing - how to make it faster

Posted by Anton Vinogradov <av...@apache.org>.
Ilya,

WAL should be automatically disabled at initial rebalancing.
It can't be disabled at regular rebalancing, since you are never ready to lose
data that is protected by WAL.
Is there any need to have a special param in that case?
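
A minimal sketch of that rule (only a model with an assumed partition-state
enum, not the actual Ignite implementation): WAL for a cache group may be
switched off only while the local node owns nothing yet, i.e. during initial
rebalancing, and must stay on as soon as any partition is OWNING.

import java.util.Collection;

/** Toy model of the "initial rebalancing only" condition discussed above. */
class InitialRebalanceWalPolicy {
    enum PartState { MOVING, OWNING, RENTING, EVICTED }

    /** @return true if WAL may be disabled locally for the cache group. */
    static boolean canDisableWal(Collection<PartState> localParts) {
        // Any OWNING partition means there is consistent data that WAL must protect,
        // so the optimization is limited to the initial (full) preloading case.
        for (PartState s : localParts) {
            if (s == PartState.OWNING)
                return false;
        }
        return true;
    }
}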

2018-04-09 16:41 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:

> Igniters,
>
> I am currently at the finish line of
> https://issues.apache.org/jira/browse/IGNITE-8017 ("Disable WAL during
> initial preloading") implementation. And I need that such behavior should
> be configurable. In my intermediate implementation I have parameter called
> "disableWalDuringRebalancing" in IgniteConfiguration. Do you thing such
> name is meaningful and self-explanatory? Do we need to ensure that it has
> the same value on every node? Should I make it configurable per cache
> rather than globally?
>
> Please share your thoughts.
>
> On Mon, Apr 9, 2018 at 4:32 PM, Ilya Lantukh <il...@gridgain.com>
> wrote:
>
> > Denis,
> >
> > Those ticket are rather complex, and so I don't know when I'll be able to
> > start working on them.
> >
> > On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <dm...@apache.org> wrote:
> >
> >> Ilya,
> >>
> >> Just came across the IEP put together by you:
> >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%
> >> 3A+Optimization+of+rebalancing
> >>
> >> Excellent explanation, thanks for aggregating everything there.
> >>
> >> Two tickets below don't have a fixed version assigned:
> >> https://issues.apache.org/jira/browse/IGNITE-8020
> >> https://issues.apache.org/jira/browse/IGNITE-7935
> >>
> >> Do you plan to work on them in 2.6 time frame, right?
> >>
> >> --
> >> Denis
> >>
> >> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org> wrote:
> >>
> >> > Ilya, granted you all the required permissions. Please let me know if
> >> you
> >> > still have troubles with the wiki.
> >> >
> >> > --
> >> > Denis
> >> >
> >> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <il...@gridgain.com>
> >> > wrote:
> >> >
> >> >> Unfortunately, I don't have permission to create page for IEP on
> wiki.
> >> >> Denis, can you grant it? My username is ilantukh.
> >> >>
> >> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org>
> >> wrote:
> >> >>
> >> >> > >> It is impossible to disable WAL only for certain partitions
> >> without
> >> >> > >> completely overhauling design of Ignite storage mechanism. Right
> >> now
> >> >> we
> >> >> > can
> >> >> > >> afford only to change WAL mode per cache group.
> >> >> >
> >> >> > Cache group rebalancing is a one cache rebalancing, and then this
> >> cache
> >> >> > ("cache group") can be presented as a set of virtual caches.
> >> >> > So, there is no issues for initial rebalancing.
> >> >> > Lets disable WAL on initial rebalancing.
> >> >> >
> >> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> >> >> >
> >> >> > > Dmitry,
> >> >> > > It is impossible to disable WAL only for certain partitions
> without
> >> >> > > completely overhauling design of Ignite storage mechanism. Right
> >> now
> >> >> we
> >> >> > can
> >> >> > > afford only to change WAL mode per cache group.
> >> >> > >
> >> >> > > The idea is to disable WAL when node doesn't have any partition
> in
> >> >> OWNING
> >> >> > > state, which means it doesn't have any consistent data and won't
> be
> >> >> able
> >> >> > to
> >> >> > > restore from WAL anyway. I don't see any potential use for WAL on
> >> such
> >> >> > > node, but we can keep a configurable parameter indicating can we
> >> >> > > automatically disable WAL in such case or not.
> >> >> > >
> >> >> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
> >> >> dpavlov.spb@gmail.com>
> >> >> > > wrote:
> >> >> > >
> >> >> > > > Denis, as I understood, there is and idea to exclude only
> >> rebalanced
> >> >> > > > partition(s) data. All other data will go to the WAL.
> >> >> > > >
> >> >> > > > Ilya, please correct me if I'm wrong.
> >> >> > > >
> >> >> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
> >> >> > > >
> >> >> > > > > Ilya,
> >> >> > > > >
> >> >> > > > > That's a decent boost (5-20%) even having WAL enabled. Not
> sure
> >> >> that
> >> >> > we
> >> >> > > > > should stake on the WAL "off" mode here because if the whole
> >> >> cluster
> >> >> > > goes
> >> >> > > > > down, it's then the data consistency is questionable. As an
> >> >> > architect,
> >> >> > > I
> >> >> > > > > wouldn't disable WAL for the sake of rebalancing; it's too
> >> risky.
> >> >> > > > >
> >> >> > > > > If you agree, then let's create the IEP. This way it will be
> >> >> easier
> >> >> > to
> >> >> > > > > track this endeavor. BTW, are you already ready to release
> any
> >> >> > > > > optimizations in 2.5 that is being discussed in a separate
> >> thread?
> >> >> > > > >
> >> >> > > > > --
> >> >> > > > > Denis
> >> >> > > > >
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
> >> >> ilantukh@gridgain.com
> >> >> > >
> >> >> > > > > wrote:
> >> >> > > > >
> >> >> > > > > > Denis,
> >> >> > > > > >
> >> >> > > > > > > - Don't you want to aggregate the tickets under an IEP?
> >> >> > > > > > Yes, I think so.
> >> >> > > > > >
> >> >> > > > > > > - Does it mean we're going to update our B+Tree
> >> >> implementation?
> >> >> > Any
> >> >> > > > > ideas
> >> >> > > > > > how risky it is?
> >> >> > > > > > One of tickets that I created (
> >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935)
> involves
> >> >> B+Tree
> >> >> > > > > > modification, but I am not planning to do it in the nearest
> >> >> future.
> >> >> > > It
> >> >> > > > > > shouldn't affect existing tree operations, only introduce
> new
> >> >> ones
> >> >> > > > > (putAll,
> >> >> > > > > > invokeAll, removeAll).
> >> >> > > > > >
> >> >> > > > > > > - Any chance you had a prototype that shows performance
> >> >> > > optimizations
> >> >> > > > > the
> >> >> > > > > > approach you are suggesting to take?
> >> >> > > > > > I have a prototype for simplest improvements (
> >> >> > > > https://issues.apache.org/
> >> >> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> >> >> > > > > > jira/browse/IGNITE-8018)
> >> >> > > > > > - together they increase throughput by 5-20%, depending on
> >> >> > > > configuration
> >> >> > > > > > and environment. Also, I've tested different WAL modes -
> >> >> switching
> >> >> > > from
> >> >> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I
> >> expect
> >> >> from
> >> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> >> >> > > > > >
> >> >> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <
> >> dmagda@apache.org
> >> >> >
> >> >> > > > wrote:
> >> >> > > > > >
> >> >> > > > > > > Ilya,
> >> >> > > > > > >
> >> >> > > > > > > That's outstanding research and summary. Thanks for
> >> spending
> >> >> your
> >> >> > > > time
> >> >> > > > > on
> >> >> > > > > > > this.
> >> >> > > > > > >
> >> >> > > > > > > Not sure I have enough expertise to challenge your
> >> approach,
> >> >> but
> >> >> > it
> >> >> > > > > > sounds
> >> >> > > > > > > 100% reasonable to me. As side notes:
> >> >> > > > > > >
> >> >> > > > > > >    - Don't you want to aggregate the tickets under an
> IEP?
> >> >> > > > > > >    - Does it mean we're going to update our B+Tree
> >> >> > implementation?
> >> >> > > > Any
> >> >> > > > > > >    ideas how risky it is?
> >> >> > > > > > >    - Any chance you had a prototype that shows
> performance
> >> >> > > > > optimizations
> >> >> > > > > > of
> >> >> > > > > > >    the approach you are suggesting to take?
> >> >> > > > > > >
> >> >> > > > > > > --
> >> >> > > > > > > Denis
> >> >> > > > > > >
> >> >> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> >> >> > > ilantukh@gridgain.com
> >> >> > > > >
> >> >> > > > > > > wrote:
> >> >> > > > > > >
> >> >> > > > > > > > Igniters,
> >> >> > > > > > > >
> >> >> > > > > > > > I've spent some time analyzing performance of
> rebalancing
> >> >> > > process.
> >> >> > > > > The
> >> >> > > > > > > > initial goal was to understand, what limits it's
> >> throughput,
> >> >> > > > because
> >> >> > > > > it
> >> >> > > > > > > is
> >> >> > > > > > > > significantly slower than network and storage device
> can
> >> >> > > > > theoretically
> >> >> > > > > > > > handle.
> >> >> > > > > > > >
> >> >> > > > > > > > Turns out, our current implementation has a number of
> >> issues
> >> >> > > caused
> >> >> > > > > by
> >> >> > > > > > a
> >> >> > > > > > > > single fundamental problem.
> >> >> > > > > > > >
> >> >> > > > > > > > During rebalance data is sent in batches called
> >> >> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is
> >> configurable,
> >> >> > > > default
> >> >> > > > > > > value
> >> >> > > > > > > > is 512KB, which could mean thousands of key-value
> pairs.
> >> >> > However,
> >> >> > > > we
> >> >> > > > > > > don't
> >> >> > > > > > > > take any advantage over this fact and process each
> entry
> >> >> > > > > independently:
> >> >> > > > > > > > - checkpointReadLock is acquired multiple times for
> every
> >> >> > entry,
> >> >> > > > > > leading
> >> >> > > > > > > to
> >> >> > > > > > > > unnecessary contention - this is clearly a bug;
> >> >> > > > > > > > - for each entry we write (and fsync, if configuration
> >> >> assumes
> >> >> > > it)
> >> >> > > > a
> >> >> > > > > > > > separate WAL record - so, if batch contains N entries,
> we
> >> >> might
> >> >> > > end
> >> >> > > > > up
> >> >> > > > > > > > doing N fsyncs;
> >> >> > > > > > > > - adding every entry into CacheDataStore also happens
> >> >> > completely
> >> >> > > > > > > > independently. It means, we will traverse and modify
> each
> >> >> index
> >> >> > > > tree
> >> >> > > > > N
> >> >> > > > > > > > times, we will allocate space in FreeList N times and
> we
> >> >> will
> >> >> > > have
> >> >> > > > to
> >> >> > > > > > > > additionally store in WAL O(N*log(N)) page delta
> records.
> >> >> > > > > > > >
> >> >> > > > > > > > I've created a few tickets in JIRA with very different
> >> >> levels
> >> >> > of
> >> >> > > > > scale
> >> >> > > > > > > and
> >> >> > > > > > > > complexity.
> >> >> > > > > > > >
> >> >> > > > > > > > Ways to reduce impact of independent processing:
> >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
> >> >> > > > aforementioned
> >> >> > > > > > > bug,
> >> >> > > > > > > > causing contention on checkpointReadLock;
> >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
> >> >> > > inefficiency
> >> >> > > > > in
> >> >> > > > > > > > GridCacheMapEntry implementation;
> >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
> >> >> > > > automatically
> >> >> > > > > > > > disable
> >> >> > > > > > > > WAL during preloading.
> >> >> > > > > > > >
> >> >> > > > > > > > Ways to solve problem on more global level:
> >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 -
> a
> >> >> ticket
> >> >> > > to
> >> >> > > > > > > > introduce
> >> >> > > > > > > > batch modification;
> >> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 -
> >> >> complete
> >> >> > > > > > redesign
> >> >> > > > > > > of
> >> >> > > > > > > > rebalancing process for persistent caches, based on
> file
> >> >> > > transfer.
> >> >> > > > > > > >
> >> >> > > > > > > > Everyone is welcome to criticize above ideas, suggest
> new
> >> >> ones
> >> >> > or
> >> >> > > > > > > > participate in implementation.
> >> >> > > > > > > >
> >> >> > > > > > > > --
> >> >> > > > > > > > Best regards,
> >> >> > > > > > > > Ilya
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > > --
> >> >> > > > > > Best regards,
> >> >> > > > > > Ilya
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Best regards,
> >> >> > > Ilya
> >> >> > >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best regards,
> >> >> Ilya
> >> >>
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Best regards,
> > Ilya
> >
>
>
>
> --
> Best regards,
> Ilya
>

Re: Rebalancing - how to make it faster

Posted by Ilya Lantukh <il...@gridgain.com>.
Igniters,

I am currently at the finish line of
https://issues.apache.org/jira/browse/IGNITE-8017 ("Disable WAL during
initial preloading") implementation, and such behavior should be
configurable. In my intermediate implementation I have a parameter called
"disableWalDuringRebalancing" in IgniteConfiguration. Do you think such a
name is meaningful and self-explanatory? Do we need to ensure that it has
the same value on every node? Should I make it configurable per cache
rather than globally?
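
To make the question concrete, this is roughly how the flag looks from the
user's side in my working copy (a sketch only: the setter below exists in
the unmerged prototype, it is not a released Ignite API and may be renamed
or moved):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class WalDuringRebalancingExample {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // WAL behavior only matters when native persistence is enabled.
            cfg.setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(
                    new DataRegionConfiguration().setPersistenceEnabled(true)));

            // Prototype-only flag discussed above; its name and placement
            // (global vs. per cache) are exactly the open questions here.
            cfg.setDisableWalDuringRebalancing(true);

            Ignite ignite = Ignition.start(cfg);
        }
    }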

Please share your thoughts.

On Mon, Apr 9, 2018 at 4:32 PM, Ilya Lantukh <il...@gridgain.com> wrote:

> Denis,
>
> Those ticket are rather complex, and so I don't know when I'll be able to
> start working on them.
>
> On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <dm...@apache.org> wrote:
>
>> Ilya,
>>
>> Just came across the IEP put together by you:
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%
>> 3A+Optimization+of+rebalancing
>>
>> Excellent explanation, thanks for aggregating everything there.
>>
>> Two tickets below don't have a fixed version assigned:
>> https://issues.apache.org/jira/browse/IGNITE-8020
>> https://issues.apache.org/jira/browse/IGNITE-7935
>>
>> Do you plan to work on them in 2.6 time frame, right?
>>
>> --
>> Denis
>>
>> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org> wrote:
>>
>> > Ilya, granted you all the required permissions. Please let me know if
>> you
>> > still have troubles with the wiki.
>> >
>> > --
>> > Denis
>> >
>> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <il...@gridgain.com>
>> > wrote:
>> >
>> >> Unfortunately, I don't have permission to create page for IEP on wiki.
>> >> Denis, can you grant it? My username is ilantukh.
>> >>
>> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org>
>> wrote:
>> >>
>> >> > >> It is impossible to disable WAL only for certain partitions
>> without
>> >> > >> completely overhauling design of Ignite storage mechanism. Right
>> now
>> >> we
>> >> > can
>> >> > >> afford only to change WAL mode per cache group.
>> >> >
>> >> > Cache group rebalancing is a one cache rebalancing, and then this
>> cache
>> >> > ("cache group") can be presented as a set of virtual caches.
>> >> > So, there is no issues for initial rebalancing.
>> >> > Lets disable WAL on initial rebalancing.
>> >> >
>> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
>> >> >
>> >> > > Dmitry,
>> >> > > It is impossible to disable WAL only for certain partitions without
>> >> > > completely overhauling design of Ignite storage mechanism. Right
>> now
>> >> we
>> >> > can
>> >> > > afford only to change WAL mode per cache group.
>> >> > >
>> >> > > The idea is to disable WAL when node doesn't have any partition in
>> >> OWNING
>> >> > > state, which means it doesn't have any consistent data and won't be
>> >> able
>> >> > to
>> >> > > restore from WAL anyway. I don't see any potential use for WAL on
>> such
>> >> > > node, but we can keep a configurable parameter indicating can we
>> >> > > automatically disable WAL in such case or not.
>> >> > >
>> >> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
>> >> dpavlov.spb@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > Denis, as I understood, there is and idea to exclude only
>> rebalanced
>> >> > > > partition(s) data. All other data will go to the WAL.
>> >> > > >
>> >> > > > Ilya, please correct me if I'm wrong.
>> >> > > >
>> >> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
>> >> > > >
>> >> > > > > Ilya,
>> >> > > > >
>> >> > > > > That's a decent boost (5-20%) even having WAL enabled. Not sure
>> >> that
>> >> > we
>> >> > > > > should stake on the WAL "off" mode here because if the whole
>> >> cluster
>> >> > > goes
>> >> > > > > down, it's then the data consistency is questionable. As an
>> >> > architect,
>> >> > > I
>> >> > > > > wouldn't disable WAL for the sake of rebalancing; it's too
>> risky.
>> >> > > > >
>> >> > > > > If you agree, then let's create the IEP. This way it will be
>> >> easier
>> >> > to
>> >> > > > > track this endeavor. BTW, are you already ready to release any
>> >> > > > > optimizations in 2.5 that is being discussed in a separate
>> thread?
>> >> > > > >
>> >> > > > > --
>> >> > > > > Denis
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
>> >> ilantukh@gridgain.com
>> >> > >
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Denis,
>> >> > > > > >
>> >> > > > > > > - Don't you want to aggregate the tickets under an IEP?
>> >> > > > > > Yes, I think so.
>> >> > > > > >
>> >> > > > > > > - Does it mean we're going to update our B+Tree
>> >> implementation?
>> >> > Any
>> >> > > > > ideas
>> >> > > > > > how risky it is?
>> >> > > > > > One of tickets that I created (
>> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935) involves
>> >> B+Tree
>> >> > > > > > modification, but I am not planning to do it in the nearest
>> >> future.
>> >> > > It
>> >> > > > > > shouldn't affect existing tree operations, only introduce new
>> >> ones
>> >> > > > > (putAll,
>> >> > > > > > invokeAll, removeAll).
>> >> > > > > >
>> >> > > > > > > - Any chance you had a prototype that shows performance
>> >> > > optimizations
>> >> > > > > the
>> >> > > > > > approach you are suggesting to take?
>> >> > > > > > I have a prototype for simplest improvements (
>> >> > > > https://issues.apache.org/
>> >> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
>> >> > > > > > jira/browse/IGNITE-8018)
>> >> > > > > > - together they increase throughput by 5-20%, depending on
>> >> > > > configuration
>> >> > > > > > and environment. Also, I've tested different WAL modes -
>> >> switching
>> >> > > from
>> >> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I
>> expect
>> >> from
>> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
>> >> > > > > >
>> >> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <
>> dmagda@apache.org
>> >> >
>> >> > > > wrote:
>> >> > > > > >
>> >> > > > > > > Ilya,
>> >> > > > > > >
>> >> > > > > > > That's outstanding research and summary. Thanks for
>> spending
>> >> your
>> >> > > > time
>> >> > > > > on
>> >> > > > > > > this.
>> >> > > > > > >
>> >> > > > > > > Not sure I have enough expertise to challenge your
>> approach,
>> >> but
>> >> > it
>> >> > > > > > sounds
>> >> > > > > > > 100% reasonable to me. As side notes:
>> >> > > > > > >
>> >> > > > > > >    - Don't you want to aggregate the tickets under an IEP?
>> >> > > > > > >    - Does it mean we're going to update our B+Tree
>> >> > implementation?
>> >> > > > Any
>> >> > > > > > >    ideas how risky it is?
>> >> > > > > > >    - Any chance you had a prototype that shows performance
>> >> > > > > optimizations
>> >> > > > > > of
>> >> > > > > > >    the approach you are suggesting to take?
>> >> > > > > > >
>> >> > > > > > > --
>> >> > > > > > > Denis
>> >> > > > > > >
>> >> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
>> >> > > ilantukh@gridgain.com
>> >> > > > >
>> >> > > > > > > wrote:
>> >> > > > > > >
>> >> > > > > > > > Igniters,
>> >> > > > > > > >
>> >> > > > > > > > I've spent some time analyzing performance of rebalancing
>> >> > > process.
>> >> > > > > The
>> >> > > > > > > > initial goal was to understand, what limits it's
>> throughput,
>> >> > > > because
>> >> > > > > it
>> >> > > > > > > is
>> >> > > > > > > > significantly slower than network and storage device can
>> >> > > > > theoretically
>> >> > > > > > > > handle.
>> >> > > > > > > >
>> >> > > > > > > > Turns out, our current implementation has a number of
>> issues
>> >> > > caused
>> >> > > > > by
>> >> > > > > > a
>> >> > > > > > > > single fundamental problem.
>> >> > > > > > > >
>> >> > > > > > > > During rebalance data is sent in batches called
>> >> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is
>> configurable,
>> >> > > > default
>> >> > > > > > > value
>> >> > > > > > > > is 512KB, which could mean thousands of key-value pairs.
>> >> > However,
>> >> > > > we
>> >> > > > > > > don't
>> >> > > > > > > > take any advantage over this fact and process each entry
>> >> > > > > independently:
>> >> > > > > > > > - checkpointReadLock is acquired multiple times for every
>> >> > entry,
>> >> > > > > > leading
>> >> > > > > > > to
>> >> > > > > > > > unnecessary contention - this is clearly a bug;
>> >> > > > > > > > - for each entry we write (and fsync, if configuration
>> >> assumes
>> >> > > it)
>> >> > > > a
>> >> > > > > > > > separate WAL record - so, if batch contains N entries, we
>> >> might
>> >> > > end
>> >> > > > > up
>> >> > > > > > > > doing N fsyncs;
>> >> > > > > > > > - adding every entry into CacheDataStore also happens
>> >> > completely
>> >> > > > > > > > independently. It means, we will traverse and modify each
>> >> index
>> >> > > > tree
>> >> > > > > N
>> >> > > > > > > > times, we will allocate space in FreeList N times and we
>> >> will
>> >> > > have
>> >> > > > to
>> >> > > > > > > > additionally store in WAL O(N*log(N)) page delta records.
>> >> > > > > > > >
>> >> > > > > > > > I've created a few tickets in JIRA with very different
>> >> levels
>> >> > of
>> >> > > > > scale
>> >> > > > > > > and
>> >> > > > > > > > complexity.
>> >> > > > > > > >
>> >> > > > > > > > Ways to reduce impact of independent processing:
>> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
>> >> > > > aforementioned
>> >> > > > > > > bug,
>> >> > > > > > > > causing contention on checkpointReadLock;
>> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
>> >> > > inefficiency
>> >> > > > > in
>> >> > > > > > > > GridCacheMapEntry implementation;
>> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
>> >> > > > automatically
>> >> > > > > > > > disable
>> >> > > > > > > > WAL during preloading.
>> >> > > > > > > >
>> >> > > > > > > > Ways to solve problem on more global level:
>> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a
>> >> ticket
>> >> > > to
>> >> > > > > > > > introduce
>> >> > > > > > > > batch modification;
>> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 -
>> >> complete
>> >> > > > > > redesign
>> >> > > > > > > of
>> >> > > > > > > > rebalancing process for persistent caches, based on file
>> >> > > transfer.
>> >> > > > > > > >
>> >> > > > > > > > Everyone is welcome to criticize above ideas, suggest new
>> >> ones
>> >> > or
>> >> > > > > > > > participate in implementation.
>> >> > > > > > > >
>> >> > > > > > > > --
>> >> > > > > > > > Best regards,
>> >> > > > > > > > Ilya
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > --
>> >> > > > > > Best regards,
>> >> > > > > > Ilya
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > > Ilya
>> >> > >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Best regards,
>> >> Ilya
>> >>
>> >
>> >
>>
>
>
>
> --
> Best regards,
> Ilya
>



-- 
Best regards,
Ilya

Re: Rebalancing - how to make it faster

Posted by Ilya Lantukh <il...@gridgain.com>.
Denis,

Those tickets are rather complex, so I don't know when I'll be able to
start working on them.

On Fri, Mar 30, 2018 at 11:45 PM, Denis Magda <dm...@apache.org> wrote:

> Ilya,
>
> Just came across the IEP put together by you:
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> 16%3A+Optimization+of+rebalancing
>
> Excellent explanation, thanks for aggregating everything there.
>
> Two tickets below don't have a fixed version assigned:
> https://issues.apache.org/jira/browse/IGNITE-8020
> https://issues.apache.org/jira/browse/IGNITE-7935
>
> Do you plan to work on them in 2.6 time frame, right?
>
> --
> Denis
>
> On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org> wrote:
>
> > Ilya, granted you all the required permissions. Please let me know if you
> > still have troubles with the wiki.
> >
> > --
> > Denis
> >
> > On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <il...@gridgain.com>
> > wrote:
> >
> >> Unfortunately, I don't have permission to create page for IEP on wiki.
> >> Denis, can you grant it? My username is ilantukh.
> >>
> >> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org>
> wrote:
> >>
> >> > >> It is impossible to disable WAL only for certain partitions without
> >> > >> completely overhauling design of Ignite storage mechanism. Right
> now
> >> we
> >> > can
> >> > >> afford only to change WAL mode per cache group.
> >> >
> >> > Cache group rebalancing is a one cache rebalancing, and then this
> cache
> >> > ("cache group") can be presented as a set of virtual caches.
> >> > So, there is no issues for initial rebalancing.
> >> > Lets disable WAL on initial rebalancing.
> >> >
> >> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> >> >
> >> > > Dmitry,
> >> > > It is impossible to disable WAL only for certain partitions without
> >> > > completely overhauling design of Ignite storage mechanism. Right now
> >> we
> >> > can
> >> > > afford only to change WAL mode per cache group.
> >> > >
> >> > > The idea is to disable WAL when node doesn't have any partition in
> >> OWNING
> >> > > state, which means it doesn't have any consistent data and won't be
> >> able
> >> > to
> >> > > restore from WAL anyway. I don't see any potential use for WAL on
> such
> >> > > node, but we can keep a configurable parameter indicating can we
> >> > > automatically disable WAL in such case or not.
> >> > >
> >> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
> >> dpavlov.spb@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Denis, as I understood, there is and idea to exclude only
> rebalanced
> >> > > > partition(s) data. All other data will go to the WAL.
> >> > > >
> >> > > > Ilya, please correct me if I'm wrong.
> >> > > >
> >> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
> >> > > >
> >> > > > > Ilya,
> >> > > > >
> >> > > > > That's a decent boost (5-20%) even having WAL enabled. Not sure
> >> that
> >> > we
> >> > > > > should stake on the WAL "off" mode here because if the whole
> >> cluster
> >> > > goes
> >> > > > > down, it's then the data consistency is questionable. As an
> >> > architect,
> >> > > I
> >> > > > > wouldn't disable WAL for the sake of rebalancing; it's too
> risky.
> >> > > > >
> >> > > > > If you agree, then let's create the IEP. This way it will be
> >> easier
> >> > to
> >> > > > > track this endeavor. BTW, are you already ready to release any
> >> > > > > optimizations in 2.5 that is being discussed in a separate
> thread?
> >> > > > >
> >> > > > > --
> >> > > > > Denis
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
> >> ilantukh@gridgain.com
> >> > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Denis,
> >> > > > > >
> >> > > > > > > - Don't you want to aggregate the tickets under an IEP?
> >> > > > > > Yes, I think so.
> >> > > > > >
> >> > > > > > > - Does it mean we're going to update our B+Tree
> >> implementation?
> >> > Any
> >> > > > > ideas
> >> > > > > > how risky it is?
> >> > > > > > One of tickets that I created (
> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935) involves
> >> B+Tree
> >> > > > > > modification, but I am not planning to do it in the nearest
> >> future.
> >> > > It
> >> > > > > > shouldn't affect existing tree operations, only introduce new
> >> ones
> >> > > > > (putAll,
> >> > > > > > invokeAll, removeAll).
> >> > > > > >
> >> > > > > > > - Any chance you had a prototype that shows performance
> >> > > optimizations
> >> > > > > the
> >> > > > > > approach you are suggesting to take?
> >> > > > > > I have a prototype for simplest improvements (
> >> > > > https://issues.apache.org/
> >> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> >> > > > > > jira/browse/IGNITE-8018)
> >> > > > > > - together they increase throughput by 5-20%, depending on
> >> > > > configuration
> >> > > > > > and environment. Also, I've tested different WAL modes -
> >> switching
> >> > > from
> >> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I expect
> >> from
> >> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> >> > > > > >
> >> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <
> dmagda@apache.org
> >> >
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > Ilya,
> >> > > > > > >
> >> > > > > > > That's outstanding research and summary. Thanks for spending
> >> your
> >> > > > time
> >> > > > > on
> >> > > > > > > this.
> >> > > > > > >
> >> > > > > > > Not sure I have enough expertise to challenge your approach,
> >> but
> >> > it
> >> > > > > > sounds
> >> > > > > > > 100% reasonable to me. As side notes:
> >> > > > > > >
> >> > > > > > >    - Don't you want to aggregate the tickets under an IEP?
> >> > > > > > >    - Does it mean we're going to update our B+Tree
> >> > implementation?
> >> > > > Any
> >> > > > > > >    ideas how risky it is?
> >> > > > > > >    - Any chance you had a prototype that shows performance
> >> > > > > optimizations
> >> > > > > > of
> >> > > > > > >    the approach you are suggesting to take?
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > > Denis
> >> > > > > > >
> >> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> >> > > ilantukh@gridgain.com
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Igniters,
> >> > > > > > > >
> >> > > > > > > > I've spent some time analyzing performance of rebalancing
> >> > > process.
> >> > > > > The
> >> > > > > > > > initial goal was to understand, what limits it's
> throughput,
> >> > > > because
> >> > > > > it
> >> > > > > > > is
> >> > > > > > > > significantly slower than network and storage device can
> >> > > > > theoretically
> >> > > > > > > > handle.
> >> > > > > > > >
> >> > > > > > > > Turns out, our current implementation has a number of
> issues
> >> > > caused
> >> > > > > by
> >> > > > > > a
> >> > > > > > > > single fundamental problem.
> >> > > > > > > >
> >> > > > > > > > During rebalance data is sent in batches called
> >> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is
> configurable,
> >> > > > default
> >> > > > > > > value
> >> > > > > > > > is 512KB, which could mean thousands of key-value pairs.
> >> > However,
> >> > > > we
> >> > > > > > > don't
> >> > > > > > > > take any advantage over this fact and process each entry
> >> > > > > independently:
> >> > > > > > > > - checkpointReadLock is acquired multiple times for every
> >> > entry,
> >> > > > > > leading
> >> > > > > > > to
> >> > > > > > > > unnecessary contention - this is clearly a bug;
> >> > > > > > > > - for each entry we write (and fsync, if configuration
> >> assumes
> >> > > it)
> >> > > > a
> >> > > > > > > > separate WAL record - so, if batch contains N entries, we
> >> might
> >> > > end
> >> > > > > up
> >> > > > > > > > doing N fsyncs;
> >> > > > > > > > - adding every entry into CacheDataStore also happens
> >> > completely
> >> > > > > > > > independently. It means, we will traverse and modify each
> >> index
> >> > > > tree
> >> > > > > N
> >> > > > > > > > times, we will allocate space in FreeList N times and we
> >> will
> >> > > have
> >> > > > to
> >> > > > > > > > additionally store in WAL O(N*log(N)) page delta records.
> >> > > > > > > >
> >> > > > > > > > I've created a few tickets in JIRA with very different
> >> levels
> >> > of
> >> > > > > scale
> >> > > > > > > and
> >> > > > > > > > complexity.
> >> > > > > > > >
> >> > > > > > > > Ways to reduce impact of independent processing:
> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
> >> > > > aforementioned
> >> > > > > > > bug,
> >> > > > > > > > causing contention on checkpointReadLock;
> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
> >> > > inefficiency
> >> > > > > in
> >> > > > > > > > GridCacheMapEntry implementation;
> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
> >> > > > automatically
> >> > > > > > > > disable
> >> > > > > > > > WAL during preloading.
> >> > > > > > > >
> >> > > > > > > > Ways to solve problem on more global level:
> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a
> >> ticket
> >> > > to
> >> > > > > > > > introduce
> >> > > > > > > > batch modification;
> >> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 -
> >> complete
> >> > > > > > redesign
> >> > > > > > > of
> >> > > > > > > > rebalancing process for persistent caches, based on file
> >> > > transfer.
> >> > > > > > > >
> >> > > > > > > > Everyone is welcome to criticize above ideas, suggest new
> >> ones
> >> > or
> >> > > > > > > > participate in implementation.
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Best regards,
> >> > > > > > > > Ilya
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Best regards,
> >> > > > > > Ilya
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Ilya
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards,
> >> Ilya
> >>
> >
> >
>



-- 
Best regards,
Ilya

Re: Rebalancing - how to make it faster

Posted by Denis Magda <dm...@apache.org>.
Ilya,

Just came across the IEP put together by you:
https://cwiki.apache.org/confluence/display/IGNITE/IEP-16%3A+Optimization+of+rebalancing

Excellent explanation, thanks for aggregating everything there.

Two tickets below don't have a fixed version assigned:
https://issues.apache.org/jira/browse/IGNITE-8020
https://issues.apache.org/jira/browse/IGNITE-7935

You plan to work on them in the 2.6 time frame, right?

--
Denis

On Tue, Mar 27, 2018 at 9:29 AM, Denis Magda <dm...@apache.org> wrote:

> Ilya, granted you all the required permissions. Please let me know if you
> still have troubles with the wiki.
>
> --
> Denis
>
> On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <il...@gridgain.com>
> wrote:
>
>> Unfortunately, I don't have permission to create page for IEP on wiki.
>> Denis, can you grant it? My username is ilantukh.
>>
>> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org> wrote:
>>
>> > >> It is impossible to disable WAL only for certain partitions without
>> > >> completely overhauling design of Ignite storage mechanism. Right now
>> we
>> > can
>> > >> afford only to change WAL mode per cache group.
>> >
>> > Cache group rebalancing is a one cache rebalancing, and then this cache
>> > ("cache group") can be presented as a set of virtual caches.
>> > So, there is no issues for initial rebalancing.
>> > Lets disable WAL on initial rebalancing.
>> >
>> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
>> >
>> > > Dmitry,
>> > > It is impossible to disable WAL only for certain partitions without
>> > > completely overhauling design of Ignite storage mechanism. Right now
>> we
>> > can
>> > > afford only to change WAL mode per cache group.
>> > >
>> > > The idea is to disable WAL when node doesn't have any partition in
>> OWNING
>> > > state, which means it doesn't have any consistent data and won't be
>> able
>> > to
>> > > restore from WAL anyway. I don't see any potential use for WAL on such
>> > > node, but we can keep a configurable parameter indicating can we
>> > > automatically disable WAL in such case or not.
>> > >
>> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <
>> dpavlov.spb@gmail.com>
>> > > wrote:
>> > >
>> > > > Denis, as I understood, there is and idea to exclude only rebalanced
>> > > > partition(s) data. All other data will go to the WAL.
>> > > >
>> > > > Ilya, please correct me if I'm wrong.
>> > > >
>> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
>> > > >
>> > > > > Ilya,
>> > > > >
>> > > > > That's a decent boost (5-20%) even having WAL enabled. Not sure
>> that
>> > we
>> > > > > should stake on the WAL "off" mode here because if the whole
>> cluster
>> > > goes
>> > > > > down, it's then the data consistency is questionable. As an
>> > architect,
>> > > I
>> > > > > wouldn't disable WAL for the sake of rebalancing; it's too risky.
>> > > > >
>> > > > > If you agree, then let's create the IEP. This way it will be
>> easier
>> > to
>> > > > > track this endeavor. BTW, are you already ready to release any
>> > > > > optimizations in 2.5 that is being discussed in a separate thread?
>> > > > >
>> > > > > --
>> > > > > Denis
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
>> ilantukh@gridgain.com
>> > >
>> > > > > wrote:
>> > > > >
>> > > > > > Denis,
>> > > > > >
>> > > > > > > - Don't you want to aggregate the tickets under an IEP?
>> > > > > > Yes, I think so.
>> > > > > >
>> > > > > > > - Does it mean we're going to update our B+Tree
>> implementation?
>> > Any
>> > > > > ideas
>> > > > > > how risky it is?
>> > > > > > One of tickets that I created (
>> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935) involves
>> B+Tree
>> > > > > > modification, but I am not planning to do it in the nearest
>> future.
>> > > It
>> > > > > > shouldn't affect existing tree operations, only introduce new
>> ones
>> > > > > (putAll,
>> > > > > > invokeAll, removeAll).
>> > > > > >
>> > > > > > > - Any chance you had a prototype that shows performance
>> > > optimizations
>> > > > > the
>> > > > > > approach you are suggesting to take?
>> > > > > > I have a prototype for simplest improvements (
>> > > > https://issues.apache.org/
>> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
>> > > > > > jira/browse/IGNITE-8018)
>> > > > > > - together they increase throughput by 5-20%, depending on
>> > > > configuration
>> > > > > > and environment. Also, I've tested different WAL modes -
>> switching
>> > > from
>> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I expect
>> from
>> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
>> > > > > >
>> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dmagda@apache.org
>> >
>> > > > wrote:
>> > > > > >
>> > > > > > > Ilya,
>> > > > > > >
>> > > > > > > That's outstanding research and summary. Thanks for spending
>> your
>> > > > time
>> > > > > on
>> > > > > > > this.
>> > > > > > >
>> > > > > > > Not sure I have enough expertise to challenge your approach,
>> but
>> > it
>> > > > > > sounds
>> > > > > > > 100% reasonable to me. As side notes:
>> > > > > > >
>> > > > > > >    - Don't you want to aggregate the tickets under an IEP?
>> > > > > > >    - Does it mean we're going to update our B+Tree
>> > implementation?
>> > > > Any
>> > > > > > >    ideas how risky it is?
>> > > > > > >    - Any chance you had a prototype that shows performance
>> > > > > optimizations
>> > > > > > of
>> > > > > > >    the approach you are suggesting to take?
>> > > > > > >
>> > > > > > > --
>> > > > > > > Denis
>> > > > > > >
>> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
>> > > ilantukh@gridgain.com
>> > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Igniters,
>> > > > > > > >
>> > > > > > > > I've spent some time analyzing performance of rebalancing
>> > > process.
>> > > > > The
>> > > > > > > > initial goal was to understand, what limits it's throughput,
>> > > > because
>> > > > > it
>> > > > > > > is
>> > > > > > > > significantly slower than network and storage device can
>> > > > > theoretically
>> > > > > > > > handle.
>> > > > > > > >
>> > > > > > > > Turns out, our current implementation has a number of issues
>> > > caused
>> > > > > by
>> > > > > > a
>> > > > > > > > single fundamental problem.
>> > > > > > > >
>> > > > > > > > During rebalance data is sent in batches called
>> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is configurable,
>> > > > default
>> > > > > > > value
>> > > > > > > > is 512KB, which could mean thousands of key-value pairs.
>> > However,
>> > > > we
>> > > > > > > don't
>> > > > > > > > take any advantage over this fact and process each entry
>> > > > > independently:
>> > > > > > > > - checkpointReadLock is acquired multiple times for every
>> > entry,
>> > > > > > leading
>> > > > > > > to
>> > > > > > > > unnecessary contention - this is clearly a bug;
>> > > > > > > > - for each entry we write (and fsync, if configuration
>> assumes
>> > > it)
>> > > > a
>> > > > > > > > separate WAL record - so, if batch contains N entries, we
>> might
>> > > end
>> > > > > up
>> > > > > > > > doing N fsyncs;
>> > > > > > > > - adding every entry into CacheDataStore also happens
>> > completely
>> > > > > > > > independently. It means, we will traverse and modify each
>> index
>> > > > tree
>> > > > > N
>> > > > > > > > times, we will allocate space in FreeList N times and we
>> will
>> > > have
>> > > > to
>> > > > > > > > additionally store in WAL O(N*log(N)) page delta records.
>> > > > > > > >
>> > > > > > > > I've created a few tickets in JIRA with very different
>> levels
>> > of
>> > > > > scale
>> > > > > > > and
>> > > > > > > > complexity.
>> > > > > > > >
>> > > > > > > > Ways to reduce impact of independent processing:
>> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
>> > > > aforementioned
>> > > > > > > bug,
>> > > > > > > > causing contention on checkpointReadLock;
>> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
>> > > inefficiency
>> > > > > in
>> > > > > > > > GridCacheMapEntry implementation;
>> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
>> > > > automatically
>> > > > > > > > disable
>> > > > > > > > WAL during preloading.
>> > > > > > > >
>> > > > > > > > Ways to solve problem on more global level:
>> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a
>> ticket
>> > > to
>> > > > > > > > introduce
>> > > > > > > > batch modification;
>> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 -
>> complete
>> > > > > > redesign
>> > > > > > > of
>> > > > > > > > rebalancing process for persistent caches, based on file
>> > > transfer.
>> > > > > > > >
>> > > > > > > > Everyone is welcome to criticize above ideas, suggest new
>> ones
>> > or
>> > > > > > > > participate in implementation.
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Best regards,
>> > > > > > > > Ilya
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Best regards,
>> > > > > > Ilya
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > > Ilya
>> > >
>> >
>>
>>
>>
>> --
>> Best regards,
>> Ilya
>>
>
>

Re: Rebalancing - how to make it faster

Posted by Denis Magda <dm...@apache.org>.
Ilya, I granted you all the required permissions. Please let me know if you
still have trouble with the wiki.

--
Denis

On Tue, Mar 27, 2018 at 8:56 AM, Ilya Lantukh <il...@gridgain.com> wrote:

> Unfortunately, I don't have permission to create page for IEP on wiki.
> Denis, can you grant it? My username is ilantukh.
>
> On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org> wrote:
>
> > >> It is impossible to disable WAL only for certain partitions without
> > >> completely overhauling design of Ignite storage mechanism. Right now
> we
> > can
> > >> afford only to change WAL mode per cache group.
> >
> > Cache group rebalancing is a one cache rebalancing, and then this cache
> > ("cache group") can be presented as a set of virtual caches.
> > So, there is no issues for initial rebalancing.
> > Lets disable WAL on initial rebalancing.
> >
> > 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
> >
> > > Dmitry,
> > > It is impossible to disable WAL only for certain partitions without
> > > completely overhauling design of Ignite storage mechanism. Right now we
> > can
> > > afford only to change WAL mode per cache group.
> > >
> > > The idea is to disable WAL when node doesn't have any partition in
> OWNING
> > > state, which means it doesn't have any consistent data and won't be
> able
> > to
> > > restore from WAL anyway. I don't see any potential use for WAL on such
> > > node, but we can keep a configurable parameter indicating can we
> > > automatically disable WAL in such case or not.
> > >
> > > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <dpavlov.spb@gmail.com
> >
> > > wrote:
> > >
> > > > Denis, as I understood, there is and idea to exclude only rebalanced
> > > > partition(s) data. All other data will go to the WAL.
> > > >
> > > > Ilya, please correct me if I'm wrong.
> > > >
> > > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
> > > >
> > > > > Ilya,
> > > > >
> > > > > That's a decent boost (5-20%) even having WAL enabled. Not sure
> that
> > we
> > > > > should stake on the WAL "off" mode here because if the whole
> cluster
> > > goes
> > > > > down, it's then the data consistency is questionable. As an
> > architect,
> > > I
> > > > > wouldn't disable WAL for the sake of rebalancing; it's too risky.
> > > > >
> > > > > If you agree, then let's create the IEP. This way it will be easier
> > to
> > > > > track this endeavor. BTW, are you already ready to release any
> > > > > optimizations in 2.5 that is being discussed in a separate thread?
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <
> ilantukh@gridgain.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Denis,
> > > > > >
> > > > > > > - Don't you want to aggregate the tickets under an IEP?
> > > > > > Yes, I think so.
> > > > > >
> > > > > > > - Does it mean we're going to update our B+Tree implementation?
> > Any
> > > > > ideas
> > > > > > how risky it is?
> > > > > > One of tickets that I created (
> > > > > > https://issues.apache.org/jira/browse/IGNITE-7935) involves
> B+Tree
> > > > > > modification, but I am not planning to do it in the nearest
> future.
> > > It
> > > > > > shouldn't affect existing tree operations, only introduce new
> ones
> > > > > (putAll,
> > > > > > invokeAll, removeAll).
> > > > > >
> > > > > > > - Any chance you had a prototype that shows performance
> > > optimizations
> > > > > the
> > > > > > approach you are suggesting to take?
> > > > > > I have a prototype for simplest improvements (
> > > > https://issues.apache.org/
> > > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > > > > > jira/browse/IGNITE-8018)
> > > > > > - together they increase throughput by 5-20%, depending on
> > > > configuration
> > > > > > and environment. Also, I've tested different WAL modes -
> switching
> > > from
> > > > > > LOG_ONLY to NONE gives over 100% boost - this is what I expect
> from
> > > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > > > > >
> > > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > > Ilya,
> > > > > > >
> > > > > > > That's outstanding research and summary. Thanks for spending
> your
> > > > time
> > > > > on
> > > > > > > this.
> > > > > > >
> > > > > > > Not sure I have enough expertise to challenge your approach,
> but
> > it
> > > > > > sounds
> > > > > > > 100% reasonable to me. As side notes:
> > > > > > >
> > > > > > >    - Don't you want to aggregate the tickets under an IEP?
> > > > > > >    - Does it mean we're going to update our B+Tree
> > implementation?
> > > > Any
> > > > > > >    ideas how risky it is?
> > > > > > >    - Any chance you had a prototype that shows performance
> > > > > optimizations
> > > > > > of
> > > > > > >    the approach you are suggesting to take?
> > > > > > >
> > > > > > > --
> > > > > > > Denis
> > > > > > >
> > > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> > > ilantukh@gridgain.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Igniters,
> > > > > > > >
> > > > > > > > I've spent some time analyzing performance of rebalancing
> > > process.
> > > > > The
> > > > > > > > initial goal was to understand, what limits it's throughput,
> > > > because
> > > > > it
> > > > > > > is
> > > > > > > > significantly slower than network and storage device can
> > > > > theoretically
> > > > > > > > handle.
> > > > > > > >
> > > > > > > > Turns out, our current implementation has a number of issues
> > > caused
> > > > > by
> > > > > > a
> > > > > > > > single fundamental problem.
> > > > > > > >
> > > > > > > > During rebalance data is sent in batches called
> > > > > > > > GridDhtPartitionSupplyMessages. Batch size is configurable,
> > > > default
> > > > > > > value
> > > > > > > > is 512KB, which could mean thousands of key-value pairs.
> > However,
> > > > we
> > > > > > > don't
> > > > > > > > take any advantage over this fact and process each entry
> > > > > independently:
> > > > > > > > - checkpointReadLock is acquired multiple times for every
> > entry,
> > > > > > leading
> > > > > > > to
> > > > > > > > unnecessary contention - this is clearly a bug;
> > > > > > > > - for each entry we write (and fsync, if configuration
> assumes
> > > it)
> > > > a
> > > > > > > > separate WAL record - so, if batch contains N entries, we
> might
> > > end
> > > > > up
> > > > > > > > doing N fsyncs;
> > > > > > > > - adding every entry into CacheDataStore also happens
> > completely
> > > > > > > > independently. It means, we will traverse and modify each
> index
> > > > tree
> > > > > N
> > > > > > > > times, we will allocate space in FreeList N times and we will
> > > have
> > > > to
> > > > > > > > additionally store in WAL O(N*log(N)) page delta records.
> > > > > > > >
> > > > > > > > I've created a few tickets in JIRA with very different levels
> > of
> > > > > scale
> > > > > > > and
> > > > > > > > complexity.
> > > > > > > >
> > > > > > > > Ways to reduce impact of independent processing:
> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
> > > > aforementioned
> > > > > > > bug,
> > > > > > > > causing contention on checkpointReadLock;
> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
> > > inefficiency
> > > > > in
> > > > > > > > GridCacheMapEntry implementation;
> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
> > > > automatically
> > > > > > > > disable
> > > > > > > > WAL during preloading.
> > > > > > > >
> > > > > > > > Ways to solve problem on more global level:
> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a
> ticket
> > > to
> > > > > > > > introduce
> > > > > > > > batch modification;
> > > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 -
> complete
> > > > > > redesign
> > > > > > > of
> > > > > > > > rebalancing process for persistent caches, based on file
> > > transfer.
> > > > > > > >
> > > > > > > > Everyone is welcome to criticize above ideas, suggest new
> ones
> > or
> > > > > > > > participate in implementation.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Ilya
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ilya
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ilya
> > >
> >
>
>
>
> --
> Best regards,
> Ilya
>

Re: Rebalancing - how to make it faster

Posted by Ilya Lantukh <il...@gridgain.com>.
Unfortunately, I don't have permission to create a page for the IEP on the
wiki. Denis, can you grant it? My username is ilantukh.

On Mon, Mar 26, 2018 at 8:04 PM, Anton Vinogradov <av...@apache.org> wrote:

> >> It is impossible to disable WAL only for certain partitions without
> >> completely overhauling design of Ignite storage mechanism. Right now we
> can
> >> afford only to change WAL mode per cache group.
>
> Cache group rebalancing is a one cache rebalancing, and then this cache
> ("cache group") can be presented as a set of virtual caches.
> So, there is no issues for initial rebalancing.
> Lets disable WAL on initial rebalancing.
>
> 2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:
>
> > Dmitry,
> > It is impossible to disable WAL only for certain partitions without
> > completely overhauling design of Ignite storage mechanism. Right now we
> can
> > afford only to change WAL mode per cache group.
> >
> > The idea is to disable WAL when node doesn't have any partition in OWNING
> > state, which means it doesn't have any consistent data and won't be able
> to
> > restore from WAL anyway. I don't see any potential use for WAL on such
> > node, but we can keep a configurable parameter indicating can we
> > automatically disable WAL in such case or not.
> >
> > On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <dp...@gmail.com>
> > wrote:
> >
> > > Denis, as I understood, there is and idea to exclude only rebalanced
> > > partition(s) data. All other data will go to the WAL.
> > >
> > > Ilya, please correct me if I'm wrong.
> > >
> > > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
> > >
> > > > Ilya,
> > > >
> > > > That's a decent boost (5-20%) even having WAL enabled. Not sure that
> we
> > > > should stake on the WAL "off" mode here because if the whole cluster
> > goes
> > > > down, it's then the data consistency is questionable. As an
> architect,
> > I
> > > > wouldn't disable WAL for the sake of rebalancing; it's too risky.
> > > >
> > > > If you agree, then let's create the IEP. This way it will be easier
> to
> > > > track this endeavor. BTW, are you already ready to release any
> > > > optimizations in 2.5 that is being discussed in a separate thread?
> > > >
> > > > --
> > > > Denis
> > > >
> > > >
> > > >
> > > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <ilantukh@gridgain.com
> >
> > > > wrote:
> > > >
> > > > > Denis,
> > > > >
> > > > > > - Don't you want to aggregate the tickets under an IEP?
> > > > > Yes, I think so.
> > > > >
> > > > > > - Does it mean we're going to update our B+Tree implementation?
> Any
> > > > ideas
> > > > > how risky it is?
> > > > > One of tickets that I created (
> > > > > https://issues.apache.org/jira/browse/IGNITE-7935) involves B+Tree
> > > > > modification, but I am not planning to do it in the nearest future.
> > It
> > > > > shouldn't affect existing tree operations, only introduce new ones
> > > > (putAll,
> > > > > invokeAll, removeAll).
> > > > >
> > > > > > - Any chance you had a prototype that shows performance
> > optimizations
> > > > the
> > > > > approach you are suggesting to take?
> > > > > I have a prototype for simplest improvements (
> > > https://issues.apache.org/
> > > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > > > > jira/browse/IGNITE-8018)
> > > > > - together they increase throughput by 5-20%, depending on
> > > configuration
> > > > > and environment. Also, I've tested different WAL modes - switching
> > from
> > > > > LOG_ONLY to NONE gives over 100% boost - this is what I expect from
> > > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > > > >
> > > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org>
> > > wrote:
> > > > >
> > > > > > Ilya,
> > > > > >
> > > > > > That's outstanding research and summary. Thanks for spending your
> > > time
> > > > on
> > > > > > this.
> > > > > >
> > > > > > Not sure I have enough expertise to challenge your approach, but
> it
> > > > > sounds
> > > > > > 100% reasonable to me. As side notes:
> > > > > >
> > > > > >    - Don't you want to aggregate the tickets under an IEP?
> > > > > >    - Does it mean we're going to update our B+Tree
> implementation?
> > > Any
> > > > > >    ideas how risky it is?
> > > > > >    - Any chance you had a prototype that shows performance
> > > > optimizations
> > > > > of
> > > > > >    the approach you are suggesting to take?
> > > > > >
> > > > > > --
> > > > > > Denis
> > > > > >
> > > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> > ilantukh@gridgain.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Igniters,
> > > > > > >
> > > > > > > I've spent some time analyzing performance of rebalancing
> > process.
> > > > The
> > > > > > > initial goal was to understand, what limits it's throughput,
> > > because
> > > > it
> > > > > > is
> > > > > > > significantly slower than network and storage device can
> > > > theoretically
> > > > > > > handle.
> > > > > > >
> > > > > > > Turns out, our current implementation has a number of issues
> > caused
> > > > by
> > > > > a
> > > > > > > single fundamental problem.
> > > > > > >
> > > > > > > During rebalance data is sent in batches called
> > > > > > > GridDhtPartitionSupplyMessages. Batch size is configurable,
> > > default
> > > > > > value
> > > > > > > is 512KB, which could mean thousands of key-value pairs.
> However,
> > > we
> > > > > > don't
> > > > > > > take any advantage over this fact and process each entry
> > > > independently:
> > > > > > > - checkpointReadLock is acquired multiple times for every
> entry,
> > > > > leading
> > > > > > to
> > > > > > > unnecessary contention - this is clearly a bug;
> > > > > > > - for each entry we write (and fsync, if configuration assumes
> > it)
> > > a
> > > > > > > separate WAL record - so, if batch contains N entries, we might
> > end
> > > > up
> > > > > > > doing N fsyncs;
> > > > > > > - adding every entry into CacheDataStore also happens
> completely
> > > > > > > independently. It means, we will traverse and modify each index
> > > tree
> > > > N
> > > > > > > times, we will allocate space in FreeList N times and we will
> > have
> > > to
> > > > > > > additionally store in WAL O(N*log(N)) page delta records.
> > > > > > >
> > > > > > > I've created a few tickets in JIRA with very different levels
> of
> > > > scale
> > > > > > and
> > > > > > > complexity.
> > > > > > >
> > > > > > > Ways to reduce impact of independent processing:
> > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
> > > aforementioned
> > > > > > bug,
> > > > > > > causing contention on checkpointReadLock;
> > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
> > inefficiency
> > > > in
> > > > > > > GridCacheMapEntry implementation;
> > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
> > > automatically
> > > > > > > disable
> > > > > > > WAL during preloading.
> > > > > > >
> > > > > > > Ways to solve problem on more global level:
> > > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket
> > to
> > > > > > > introduce
> > > > > > > batch modification;
> > > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 - complete
> > > > > redesign
> > > > > > of
> > > > > > > rebalancing process for persistent caches, based on file
> > transfer.
> > > > > > >
> > > > > > > Everyone is welcome to criticize above ideas, suggest new ones
> or
> > > > > > > participate in implementation.
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Ilya
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ilya
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> > Ilya
> >
>



-- 
Best regards,
Ilya

Re: Rebalancing - how to make it faster

Posted by Anton Vinogradov <av...@apache.org>.
>> It is impossible to disable WAL only for certain partitions without
>> completely overhauling design of Ignite storage mechanism. Right now we
can
>> afford only to change WAL mode per cache group.

Cache group rebalancing is effectively rebalancing of a single cache, and this
cache ("cache group") can be presented as a set of virtual caches.
So, there are no issues for initial rebalancing.
Let's disable WAL on initial rebalancing.
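
To make the proposal concrete, the per-group guard Ilya describes could be as
simple as the following sketch (the types here are illustrative stand-ins, not
Ignite's internal classes):

    // Illustrative only: Partition/PartitionState are stand-ins for the
    // corresponding Ignite internals, not real Ignite classes.
    import java.util.List;

    public class WalDisableCheckSketch {
        enum PartitionState { MOVING, OWNING }

        static final class Partition {
            final PartitionState state;
            Partition(PartitionState state) { this.state = state; }
        }

        static boolean walCanBeDisabled(List<Partition> localParts) {
            for (Partition p : localParts) {
                if (p.state == PartitionState.OWNING)
                    return false; // node already owns consistent data -> keep WAL
            }
            // No OWNING partitions: after a crash there is nothing this node
            // could recover from WAL anyway, so logging during preloading only
            // costs extra I/O and fsyncs.
            return true;
        }
    }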

2018-03-26 16:46 GMT+03:00 Ilya Lantukh <il...@gridgain.com>:

> Dmitry,
> It is impossible to disable WAL only for certain partitions without
> completely overhauling design of Ignite storage mechanism. Right now we can
> afford only to change WAL mode per cache group.
>
> The idea is to disable WAL when node doesn't have any partition in OWNING
> state, which means it doesn't have any consistent data and won't be able to
> restore from WAL anyway. I don't see any potential use for WAL on such
> node, but we can keep a configurable parameter indicating can we
> automatically disable WAL in such case or not.
>
> On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <dp...@gmail.com>
> wrote:
>
> > Denis, as I understood, there is and idea to exclude only rebalanced
> > partition(s) data. All other data will go to the WAL.
> >
> > Ilya, please correct me if I'm wrong.
> >
> > Fri, 23 Mar 2018 at 22:15, Denis Magda <dm...@apache.org>:
> >
> > > Ilya,
> > >
> > > That's a decent boost (5-20%) even having WAL enabled. Not sure that we
> > > should stake on the WAL "off" mode here because if the whole cluster
> goes
> > > down, it's then the data consistency is questionable. As an architect,
> I
> > > wouldn't disable WAL for the sake of rebalancing; it's too risky.
> > >
> > > If you agree, then let's create the IEP. This way it will be easier to
> > > track this endeavor. BTW, are you already ready to release any
> > > optimizations in 2.5 that is being discussed in a separate thread?
> > >
> > > --
> > > Denis
> > >
> > >
> > >
> > > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <il...@gridgain.com>
> > > wrote:
> > >
> > > > Denis,
> > > >
> > > > > - Don't you want to aggregate the tickets under an IEP?
> > > > Yes, I think so.
> > > >
> > > > > - Does it mean we're going to update our B+Tree implementation? Any
> > > ideas
> > > > how risky it is?
> > > > One of tickets that I created (
> > > > https://issues.apache.org/jira/browse/IGNITE-7935) involves B+Tree
> > > > modification, but I am not planning to do it in the nearest future.
> It
> > > > shouldn't affect existing tree operations, only introduce new ones
> > > (putAll,
> > > > invokeAll, removeAll).
> > > >
> > > > > - Any chance you had a prototype that shows performance
> optimizations
> > > the
> > > > approach you are suggesting to take?
> > > > I have a prototype for simplest improvements (
> > https://issues.apache.org/
> > > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > > > jira/browse/IGNITE-8018)
> > > > - together they increase throughput by 5-20%, depending on
> > configuration
> > > > and environment. Also, I've tested different WAL modes - switching
> from
> > > > LOG_ONLY to NONE gives over 100% boost - this is what I expect from
> > > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > > >
> > > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org>
> > wrote:
> > > >
> > > > > Ilya,
> > > > >
> > > > > That's outstanding research and summary. Thanks for spending your
> > time
> > > on
> > > > > this.
> > > > >
> > > > > Not sure I have enough expertise to challenge your approach, but it
> > > > sounds
> > > > > 100% reasonable to me. As side notes:
> > > > >
> > > > >    - Don't you want to aggregate the tickets under an IEP?
> > > > >    - Does it mean we're going to update our B+Tree implementation?
> > Any
> > > > >    ideas how risky it is?
> > > > >    - Any chance you had a prototype that shows performance
> > > optimizations
> > > > of
> > > > >    the approach you are suggesting to take?
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <
> ilantukh@gridgain.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Igniters,
> > > > > >
> > > > > > I've spent some time analyzing performance of rebalancing
> process.
> > > The
> > > > > > initial goal was to understand, what limits it's throughput,
> > because
> > > it
> > > > > is
> > > > > > significantly slower than network and storage device can
> > > theoretically
> > > > > > handle.
> > > > > >
> > > > > > Turns out, our current implementation has a number of issues
> caused
> > > by
> > > > a
> > > > > > single fundamental problem.
> > > > > >
> > > > > > During rebalance data is sent in batches called
> > > > > > GridDhtPartitionSupplyMessages. Batch size is configurable,
> > default
> > > > > value
> > > > > > is 512KB, which could mean thousands of key-value pairs. However,
> > we
> > > > > don't
> > > > > > take any advantage over this fact and process each entry
> > > independently:
> > > > > > - checkpointReadLock is acquired multiple times for every entry,
> > > > leading
> > > > > to
> > > > > > unnecessary contention - this is clearly a bug;
> > > > > > - for each entry we write (and fsync, if configuration assumes
> it)
> > a
> > > > > > separate WAL record - so, if batch contains N entries, we might
> end
> > > up
> > > > > > doing N fsyncs;
> > > > > > - adding every entry into CacheDataStore also happens completely
> > > > > > independently. It means, we will traverse and modify each index
> > tree
> > > N
> > > > > > times, we will allocate space in FreeList N times and we will
> have
> > to
> > > > > > additionally store in WAL O(N*log(N)) page delta records.
> > > > > >
> > > > > > I've created a few tickets in JIRA with very different levels of
> > > scale
> > > > > and
> > > > > > complexity.
> > > > > >
> > > > > > Ways to reduce impact of independent processing:
> > > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
> > aforementioned
> > > > > bug,
> > > > > > causing contention on checkpointReadLock;
> > > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 -
> inefficiency
> > > in
> > > > > > GridCacheMapEntry implementation;
> > > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
> > automatically
> > > > > > disable
> > > > > > WAL during preloading.
> > > > > >
> > > > > > Ways to solve problem on more global level:
> > > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket
> to
> > > > > > introduce
> > > > > > batch modification;
> > > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 - complete
> > > > redesign
> > > > > of
> > > > > > rebalancing process for persistent caches, based on file
> transfer.
> > > > > >
> > > > > > Everyone is welcome to criticize above ideas, suggest new ones or
> > > > > > participate in implementation.
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ilya
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Ilya
> > > >
> > >
> >
>
>
>
> --
> Best regards,
> Ilya
>

Re: Rebalancing - how to make it faster

Posted by Ilya Lantukh <il...@gridgain.com>.
Dmitry,
It is impossible to disable WAL only for certain partitions without
completely overhauling the design of the Ignite storage mechanism. Right now
we can only afford to change the WAL mode per cache group.

The idea is to disable WAL when a node doesn't have any partition in OWNING
state, which means it doesn't have any consistent data and won't be able to
restore from WAL anyway. I don't see any potential use for WAL on such a
node, but we can keep a configurable parameter indicating whether we can
automatically disable WAL in such a case or not.
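
Until IGNITE-8017 automates this, the same effect can already be achieved
manually through the public cluster API; below is a minimal sketch (the
cache name and configuration path are placeholders, and the preloading wait
is elided) of roughly what the automatic mode could do for a cache group:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCluster;
    import org.apache.ignite.Ignition;

    public class WalToggleDuringPreload {
        public static void main(String[] args) {
            // Start or connect a node; configuration details are omitted here.
            try (Ignite ignite = Ignition.start("config/example-ignite.xml")) {
                IgniteCluster cluster = ignite.cluster();

                String cacheName = "myCache"; // placeholder cache (group) name

                // Disable WAL for the whole cache group before heavy preloading.
                // Data written while WAL is disabled is not crash-recoverable
                // until the next checkpoint - which is exactly why an automatic
                // switch is only safe while the node has no OWNING partitions.
                cluster.disableWal(cacheName);

                try {
                    // ... wait for rebalancing / preloading of the cache to finish ...
                }
                finally {
                    // Re-enable WAL once the node holds consistent data again.
                    cluster.enableWal(cacheName);
                }
            }
        }
    }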

On Fri, Mar 23, 2018 at 10:40 PM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Denis, as I understood, there is and idea to exclude only rebalanced
> partition(s) data. All other data will go to the WAL.
>
> Ilya, please correct me if I'm wrong.
>
> пт, 23 мар. 2018 г. в 22:15, Denis Magda <dm...@apache.org>:
>
> > Ilya,
> >
> > That's a decent boost (5-20%) even having WAL enabled. Not sure that we
> > should stake on the WAL "off" mode here because if the whole cluster goes
> > down, it's then the data consistency is questionable. As an architect, I
> > wouldn't disable WAL for the sake of rebalancing; it's too risky.
> >
> > If you agree, then let's create the IEP. This way it will be easier to
> > track this endeavor. BTW, are you already ready to release any
> > optimizations in 2.5 that is being discussed in a separate thread?
> >
> > --
> > Denis
> >
> >
> >
> > On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <il...@gridgain.com>
> > wrote:
> >
> > > Denis,
> > >
> > > > - Don't you want to aggregate the tickets under an IEP?
> > > Yes, I think so.
> > >
> > > > - Does it mean we're going to update our B+Tree implementation? Any
> > ideas
> > > how risky it is?
> > > One of tickets that I created (
> > > https://issues.apache.org/jira/browse/IGNITE-7935) involves B+Tree
> > > modification, but I am not planning to do it in the nearest future. It
> > > shouldn't affect existing tree operations, only introduce new ones
> > (putAll,
> > > invokeAll, removeAll).
> > >
> > > > - Any chance you had a prototype that shows performance optimizations
> > the
> > > approach you are suggesting to take?
> > > I have a prototype for simplest improvements (
> https://issues.apache.org/
> > > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > > jira/browse/IGNITE-8018)
> > > - together they increase throughput by 5-20%, depending on
> configuration
> > > and environment. Also, I've tested different WAL modes - switching from
> > > LOG_ONLY to NONE gives over 100% boost - this is what I expect from
> > > https://issues.apache.org/jira/browse/IGNITE-8017.
> > >
> > > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org>
> wrote:
> > >
> > > > Ilya,
> > > >
> > > > That's outstanding research and summary. Thanks for spending your
> time
> > on
> > > > this.
> > > >
> > > > Not sure I have enough expertise to challenge your approach, but it
> > > sounds
> > > > 100% reasonable to me. As side notes:
> > > >
> > > >    - Don't you want to aggregate the tickets under an IEP?
> > > >    - Does it mean we're going to update our B+Tree implementation?
> Any
> > > >    ideas how risky it is?
> > > >    - Any chance you had a prototype that shows performance
> > optimizations
> > > of
> > > >    the approach you are suggesting to take?
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <ilantukh@gridgain.com
> >
> > > > wrote:
> > > >
> > > > > Igniters,
> > > > >
> > > > > I've spent some time analyzing performance of rebalancing process.
> > The
> > > > > initial goal was to understand, what limits it's throughput,
> because
> > it
> > > > is
> > > > > significantly slower than network and storage device can
> > theoretically
> > > > > handle.
> > > > >
> > > > > Turns out, our current implementation has a number of issues caused
> > by
> > > a
> > > > > single fundamental problem.
> > > > >
> > > > > During rebalance data is sent in batches called
> > > > > GridDhtPartitionSupplyMessages. Batch size is configurable,
> default
> > > > value
> > > > > is 512KB, which could mean thousands of key-value pairs. However,
> we
> > > > don't
> > > > > take any advantage over this fact and process each entry
> > independently:
> > > > > - checkpointReadLock is acquired multiple times for every entry,
> > > leading
> > > > to
> > > > > unnecessary contention - this is clearly a bug;
> > > > > - for each entry we write (and fsync, if configuration assumes it)
> a
> > > > > separate WAL record - so, if batch contains N entries, we might end
> > up
> > > > > doing N fsyncs;
> > > > > - adding every entry into CacheDataStore also happens completely
> > > > > independently. It means, we will traverse and modify each index
> tree
> > N
> > > > > times, we will allocate space in FreeList N times and we will have
> to
> > > > > additionally store in WAL O(N*log(N)) page delta records.
> > > > >
> > > > > I've created a few tickets in JIRA with very different levels of
> > scale
> > > > and
> > > > > complexity.
> > > > >
> > > > > Ways to reduce impact of independent processing:
> > > > > - https://issues.apache.org/jira/browse/IGNITE-8019 -
> aforementioned
> > > > bug,
> > > > > causing contention on checkpointReadLock;
> > > > > - https://issues.apache.org/jira/browse/IGNITE-8018 - inefficiency
> > in
> > > > > GridCacheMapEntry implementation;
> > > > > - https://issues.apache.org/jira/browse/IGNITE-8017 -
> automatically
> > > > > disable
> > > > > WAL during preloading.
> > > > >
> > > > > Ways to solve problem on more global level:
> > > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to
> > > > > introduce
> > > > > batch modification;
> > > > > - https://issues.apache.org/jira/browse/IGNITE-8020 - complete
> > > redesign
> > > > of
> > > > > rebalancing process for persistent caches, based on file transfer.
> > > > >
> > > > > Everyone is welcome to criticize above ideas, suggest new ones or
> > > > > participate in implementation.
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ilya
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ilya
> > >
> >
>



-- 
Best regards,
Ilya

Re: Rebalancing - how to make it faster

Posted by Dmitry Pavlov <dp...@gmail.com>.
Denis, as I understood, there is an idea to exclude only the rebalanced
partition(s) data. All other data will go to the WAL.

Ilya, please correct me if I'm wrong.

On Fri, Mar 23, 2018 at 10:15 PM, Denis Magda <dm...@apache.org> wrote:

> Ilya,
>
> That's a decent boost (5-20%) even having WAL enabled. Not sure that we
> should stake on the WAL "off" mode here because if the whole cluster goes
> down, it's then the data consistency is questionable. As an architect, I
> wouldn't disable WAL for the sake of rebalancing; it's too risky.
>
> If you agree, then let's create the IEP. This way it will be easier to
> track this endeavor. BTW, are you already ready to release any
> optimizations in 2.5 that is being discussed in a separate thread?
>
> --
> Denis
>
>
>
> On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <il...@gridgain.com>
> wrote:
>
> > Denis,
> >
> > > - Don't you want to aggregate the tickets under an IEP?
> > Yes, I think so.
> >
> > > - Does it mean we're going to update our B+Tree implementation? Any
> ideas
> > how risky it is?
> > One of tickets that I created (
> > https://issues.apache.org/jira/browse/IGNITE-7935) involves B+Tree
> > modification, but I am not planning to do it in the nearest future. It
> > shouldn't affect existing tree operations, only introduce new ones
> (putAll,
> > invokeAll, removeAll).
> >
> > > - Any chance you had a prototype that shows performance optimizations
> the
> > approach you are suggesting to take?
> > I have a prototype for simplest improvements (https://issues.apache.org/
> > jira/browse/IGNITE-8019 & https://issues.apache.org/
> > jira/browse/IGNITE-8018)
> > - together they increase throughput by 5-20%, depending on configuration
> > and environment. Also, I've tested different WAL modes - switching from
> > LOG_ONLY to NONE gives over 100% boost - this is what I expect from
> > https://issues.apache.org/jira/browse/IGNITE-8017.
> >
> > On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org> wrote:
> >
> > > Ilya,
> > >
> > > That's outstanding research and summary. Thanks for spending your time
> on
> > > this.
> > >
> > > Not sure I have enough expertise to challenge your approach, but it
> > sounds
> > > 100% reasonable to me. As side notes:
> > >
> > >    - Don't you want to aggregate the tickets under an IEP?
> > >    - Does it mean we're going to update our B+Tree implementation? Any
> > >    ideas how risky it is?
> > >    - Any chance you had a prototype that shows performance
> optimizations
> > of
> > >    the approach you are suggesting to take?
> > >
> > > --
> > > Denis
> > >
> > > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <il...@gridgain.com>
> > > wrote:
> > >
> > > > Igniters,
> > > >
> > > > I've spent some time analyzing performance of rebalancing process.
> The
> > > > initial goal was to understand, what limits it's throughput, because
> it
> > > is
> > > > significantly slower than network and storage device can
> theoretically
> > > > handle.
> > > >
> > > > Turns out, our current implementation has a number of issues caused
> by
> > a
> > > > single fundamental problem.
> > > >
> > > > During rebalance data is sent in batches called
> > > > GridDhtPartitionSupplyMessages. Batch size is configurable, default
> > > value
> > > > is 512KB, which could mean thousands of key-value pairs. However, we
> > > don't
> > > > take any advantage over this fact and process each entry
> independently:
> > > > - checkpointReadLock is acquired multiple times for every entry,
> > leading
> > > to
> > > > unnecessary contention - this is clearly a bug;
> > > > - for each entry we write (and fsync, if configuration assumes it) a
> > > > separate WAL record - so, if batch contains N entries, we might end
> up
> > > > doing N fsyncs;
> > > > - adding every entry into CacheDataStore also happens completely
> > > > independently. It means, we will traverse and modify each index tree
> N
> > > > times, we will allocate space in FreeList N times and we will have to
> > > > additionally store in WAL O(N*log(N)) page delta records.
> > > >
> > > > I've created a few tickets in JIRA with very different levels of
> scale
> > > and
> > > > complexity.
> > > >
> > > > Ways to reduce impact of independent processing:
> > > > - https://issues.apache.org/jira/browse/IGNITE-8019 - aforementioned
> > > bug,
> > > > causing contention on checkpointReadLock;
> > > > - https://issues.apache.org/jira/browse/IGNITE-8018 - inefficiency
> in
> > > > GridCacheMapEntry implementation;
> > > > - https://issues.apache.org/jira/browse/IGNITE-8017 - automatically
> > > > disable
> > > > WAL during preloading.
> > > >
> > > > Ways to solve problem on more global level:
> > > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to
> > > > introduce
> > > > batch modification;
> > > > - https://issues.apache.org/jira/browse/IGNITE-8020 - complete
> > redesign
> > > of
> > > > rebalancing process for persistent caches, based on file transfer.
> > > >
> > > > Everyone is welcome to criticize above ideas, suggest new ones or
> > > > participate in implementation.
> > > >
> > > > --
> > > > Best regards,
> > > > Ilya
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> > Ilya
> >
>

Re: Rebalancing - how to make it faster

Posted by Denis Magda <dm...@apache.org>.
Ilya,

That's a decent boost (5-20%) even with WAL enabled. Not sure that we
should stake on the WAL "off" mode here, because if the whole cluster goes
down, then the data consistency is questionable. As an architect, I
wouldn't disable WAL for the sake of rebalancing; it's too risky.

If you agree, then let's create the IEP. This way it will be easier to
track this endeavor. BTW, are you ready to release any of the
optimizations in 2.5, which is being discussed in a separate thread?

--
Denis



On Fri, Mar 23, 2018 at 6:37 AM, Ilya Lantukh <il...@gridgain.com> wrote:

> Denis,
>
> > - Don't you want to aggregate the tickets under an IEP?
> Yes, I think so.
>
> > - Does it mean we're going to update our B+Tree implementation? Any ideas
> how risky it is?
> One of tickets that I created (
> https://issues.apache.org/jira/browse/IGNITE-7935) involves B+Tree
> modification, but I am not planning to do it in the nearest future. It
> shouldn't affect existing tree operations, only introduce new ones (putAll,
> invokeAll, removeAll).
>
> > - Any chance you had a prototype that shows performance optimizations the
> approach you are suggesting to take?
> I have a prototype for simplest improvements (https://issues.apache.org/
> jira/browse/IGNITE-8019 & https://issues.apache.org/
> jira/browse/IGNITE-8018)
> - together they increase throughput by 5-20%, depending on configuration
> and environment. Also, I've tested different WAL modes - switching from
> LOG_ONLY to NONE gives over 100% boost - this is what I expect from
> https://issues.apache.org/jira/browse/IGNITE-8017.
>
> On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org> wrote:
>
> > Ilya,
> >
> > That's outstanding research and summary. Thanks for spending your time on
> > this.
> >
> > Not sure I have enough expertise to challenge your approach, but it
> sounds
> > 100% reasonable to me. As side notes:
> >
> >    - Don't you want to aggregate the tickets under an IEP?
> >    - Does it mean we're going to update our B+Tree implementation? Any
> >    ideas how risky it is?
> >    - Any chance you had a prototype that shows performance optimizations
> of
> >    the approach you are suggesting to take?
> >
> > --
> > Denis
> >
> > On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <il...@gridgain.com>
> > wrote:
> >
> > > Igniters,
> > >
> > > I've spent some time analyzing performance of rebalancing process. The
> > > initial goal was to understand, what limits it's throughput, because it
> > is
> > > significantly slower than network and storage device can theoretically
> > > handle.
> > >
> > > Turns out, our current implementation has a number of issues caused by
> a
> > > single fundamental problem.
> > >
> > > During rebalance data is sent in batches called
> > > GridDhtPartitionSupplyMessages. Batch size is configurable, default
> > value
> > > is 512KB, which could mean thousands of key-value pairs. However, we
> > don't
> > > take any advantage over this fact and process each entry independently:
> > > - checkpointReadLock is acquired multiple times for every entry,
> leading
> > to
> > > unnecessary contention - this is clearly a bug;
> > > - for each entry we write (and fsync, if configuration assumes it) a
> > > separate WAL record - so, if batch contains N entries, we might end up
> > > doing N fsyncs;
> > > - adding every entry into CacheDataStore also happens completely
> > > independently. It means, we will traverse and modify each index tree N
> > > times, we will allocate space in FreeList N times and we will have to
> > > additionally store in WAL O(N*log(N)) page delta records.
> > >
> > > I've created a few tickets in JIRA with very different levels of scale
> > and
> > > complexity.
> > >
> > > Ways to reduce impact of independent processing:
> > > - https://issues.apache.org/jira/browse/IGNITE-8019 - aforementioned
> > bug,
> > > causing contention on checkpointReadLock;
> > > - https://issues.apache.org/jira/browse/IGNITE-8018 - inefficiency in
> > > GridCacheMapEntry implementation;
> > > - https://issues.apache.org/jira/browse/IGNITE-8017 - automatically
> > > disable
> > > WAL during preloading.
> > >
> > > Ways to solve problem on more global level:
> > > - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to
> > > introduce
> > > batch modification;
> > > - https://issues.apache.org/jira/browse/IGNITE-8020 - complete
> redesign
> > of
> > > rebalancing process for persistent caches, based on file transfer.
> > >
> > > Everyone is welcome to criticize above ideas, suggest new ones or
> > > participate in implementation.
> > >
> > > --
> > > Best regards,
> > > Ilya
> > >
> >
>
>
>
> --
> Best regards,
> Ilya
>

Re: Rebalancing - how to make it faster

Posted by Ilya Lantukh <il...@gridgain.com>.
Denis,

> - Don't you want to aggregate the tickets under an IEP?
Yes, I think so.

> - Does it mean we're going to update our B+Tree implementation? Any ideas
how risky it is?
One of the tickets that I created
(https://issues.apache.org/jira/browse/IGNITE-7935) involves a B+Tree
modification, but I am not planning to do it in the near future. It
shouldn't affect existing tree operations, only introduce new ones (putAll,
invokeAll, removeAll).
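
For illustration only, the batch operations could look roughly like the
hypothetical interface below (a sketch, not the actual internal BPlusTree
API - IGNITE-7935 will define the real signatures); the key idea is that
rows arrive pre-sorted, so the tree can be descended once per leaf page
rather than once per row:

    import java.util.Collection;
    import java.util.function.Consumer;

    /** Purely illustrative batch counterpart of single-row tree operations. */
    interface BatchTreeOps<T> {
        /** Inserts all rows; rows are expected to be sorted by the tree comparator. */
        void putAll(Collection<T> rows);

        /** Applies a closure to every matching row in a single traversal. */
        void invokeAll(Collection<T> rows, Consumer<T> closure);

        /** Removes all given rows in one pass. */
        void removeAll(Collection<T> rows);
    }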

> - Any chance you had a prototype that shows performance optimizations the
approach you are suggesting to take?
I have a prototype for the simplest improvements
(https://issues.apache.org/jira/browse/IGNITE-8019 &
https://issues.apache.org/jira/browse/IGNITE-8018) - together they increase
throughput by 5-20%, depending on configuration and environment. Also, I've
tested different WAL modes - switching from LOG_ONLY to NONE gives over a
100% boost - this is what I expect from
https://issues.apache.org/jira/browse/IGNITE-8017.
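
For reference, the LOG_ONLY vs. NONE comparison is a one-line configuration
change; a minimal setup sketch (data region and activation details reduced
to the essentials, second node and measurement code omitted) might look
like this:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.WALMode;

    public class WalModeComparisonSetup {
        public static void main(String[] args) {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();

            // Enable native persistence for the default data region.
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            // Baseline run: LOG_ONLY (the default). For the comparison run,
            // switch to WALMode.NONE - no logging at all, so a crash during
            // the test means the persistence files must be cleaned up, which
            // is why NONE is only acceptable for experiments like this one.
            storageCfg.setWalMode(WALMode.LOG_ONLY);
            // storageCfg.setWalMode(WALMode.NONE);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                ignite.cluster().active(true);

                // ... start a supplier node, load data and measure rebalancing here ...
            }
        }
    }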

On Thu, Mar 22, 2018 at 9:48 PM, Denis Magda <dm...@apache.org> wrote:

> Ilya,
>
> That's outstanding research and summary. Thanks for spending your time on
> this.
>
> Not sure I have enough expertise to challenge your approach, but it sounds
> 100% reasonable to me. As side notes:
>
>    - Don't you want to aggregate the tickets under an IEP?
>    - Does it mean we're going to update our B+Tree implementation? Any
>    ideas how risky it is?
>    - Any chance you had a prototype that shows performance optimizations of
>    the approach you are suggesting to take?
>
> --
> Denis
>
> On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <il...@gridgain.com>
> wrote:
>
> > Igniters,
> >
> > I've spent some time analyzing performance of rebalancing process. The
> > initial goal was to understand, what limits it's throughput, because it
> is
> > significantly slower than network and storage device can theoretically
> > handle.
> >
> > Turns out, our current implementation has a number of issues caused by a
> > single fundamental problem.
> >
> > During rebalance data is sent in batches called
> > GridDhtPartitionSupplyMessages. Batch size is configurable, default
> value
> > is 512KB, which could mean thousands of key-value pairs. However, we
> don't
> > take any advantage over this fact and process each entry independently:
> > - checkpointReadLock is acquired multiple times for every entry, leading
> to
> > unnecessary contention - this is clearly a bug;
> > - for each entry we write (and fsync, if configuration assumes it) a
> > separate WAL record - so, if batch contains N entries, we might end up
> > doing N fsyncs;
> > - adding every entry into CacheDataStore also happens completely
> > independently. It means, we will traverse and modify each index tree N
> > times, we will allocate space in FreeList N times and we will have to
> > additionally store in WAL O(N*log(N)) page delta records.
> >
> > I've created a few tickets in JIRA with very different levels of scale
> and
> > complexity.
> >
> > Ways to reduce impact of independent processing:
> > - https://issues.apache.org/jira/browse/IGNITE-8019 - aforementioned
> bug,
> > causing contention on checkpointReadLock;
> > - https://issues.apache.org/jira/browse/IGNITE-8018 - inefficiency in
> > GridCacheMapEntry implementation;
> > - https://issues.apache.org/jira/browse/IGNITE-8017 - automatically
> > disable
> > WAL during preloading.
> >
> > Ways to solve problem on more global level:
> > - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to
> > introduce
> > batch modification;
> > - https://issues.apache.org/jira/browse/IGNITE-8020 - complete redesign
> of
> > rebalancing process for persistent caches, based on file transfer.
> >
> > Everyone is welcome to criticize above ideas, suggest new ones or
> > participate in implementation.
> >
> > --
> > Best regards,
> > Ilya
> >
>



-- 
Best regards,
Ilya

Re: Rebalancing - how to make it faster

Posted by Denis Magda <dm...@apache.org>.
Ilya,

That's outstanding research and summary. Thanks for spending your time on
this.

Not sure I have enough expertise to challenge your approach, but it sounds
100% reasonable to me. As side notes:

   - Don't you want to aggregate the tickets under an IEP?
   - Does it mean we're going to update our B+Tree implementation? Any
   ideas how risky it is?
   - Any chance you had a prototype that shows performance optimizations of
   the approach you are suggesting to take?

--
Denis

On Thu, Mar 22, 2018 at 8:38 AM, Ilya Lantukh <il...@gridgain.com> wrote:

> Igniters,
>
> I've spent some time analyzing performance of rebalancing process. The
> initial goal was to understand, what limits it's throughput, because it is
> significantly slower than network and storage device can theoretically
> handle.
>
> Turns out, our current implementation has a number of issues caused by a
> single fundamental problem.
>
> During rebalance data is sent in batches called
> GridDhtPartitionSupplyMessages. Batch size is configurable, default value
> is 512KB, which could mean thousands of key-value pairs. However, we don't
> take any advantage over this fact and process each entry independently:
> - checkpointReadLock is acquired multiple times for every entry, leading to
> unnecessary contention - this is clearly a bug;
> - for each entry we write (and fsync, if configuration assumes it) a
> separate WAL record - so, if batch contains N entries, we might end up
> doing N fsyncs;
> - adding every entry into CacheDataStore also happens completely
> independently. It means, we will traverse and modify each index tree N
> times, we will allocate space in FreeList N times and we will have to
> additionally store in WAL O(N*log(N)) page delta records.
>
> I've created a few tickets in JIRA with very different levels of scale and
> complexity.
>
> Ways to reduce impact of independent processing:
> - https://issues.apache.org/jira/browse/IGNITE-8019 - aforementioned bug,
> causing contention on checkpointReadLock;
> - https://issues.apache.org/jira/browse/IGNITE-8018 - inefficiency in
> GridCacheMapEntry implementation;
> - https://issues.apache.org/jira/browse/IGNITE-8017 - automatically
> disable
> WAL during preloading.
>
> Ways to solve problem on more global level:
> - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to
> introduce
> batch modification;
> - https://issues.apache.org/jira/browse/IGNITE-8020 - complete redesign of
> rebalancing process for persistent caches, based on file transfer.
>
> Everyone is welcome to criticize above ideas, suggest new ones or
> participate in implementation.
>
> --
> Best regards,
> Ilya
>