Posted to dev@ignite.apache.org by Anton Vinogradov <av...@apache.org> on 2019/06/20 11:15:18 UTC

Read Repair (ex. Consistency Check) - review request #2

Igniters,
I'm glad to introduce the Read Repair feature [0], which provides an
additional consistency guarantee for Ignite.

1) Why do we need it?
The detailed explanation can be found in IEP-31 [1].
In short, because of bugs, it's possible to end up with an inconsistent state.
We need additional features to handle this case.

Currently, we are able to check the cluster using the Idle_verify [2] feature,
but it will not fix the data and will not even tell which entries are broken.
Read Repair is a feature that identifies which entries are broken and fixes
them.

2) How does it work?
IgniteCache is now able to provide a special proxy [3] via withReadRepair().
This proxy guarantees that data will be obtained from all owners and compared.
In the case of a consistency violation, the data will be recovered and a
special event recorded.
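
For illustration, a minimal usage sketch (not part of the PR; the cache name
and values are made up, a started node and a transactional cache with backups
are assumed):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheAtomicityMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class ReadRepairExample {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Read Repair compares primary and backup copies, so the
                // cache needs backups; tx caches are the ones that can also
                // be repaired (atomics can only be checked).
                CacheConfiguration<Integer, String> cfg =
                    new CacheConfiguration<Integer, String>("tx-cache")
                        .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
                        .setBackups(2);

                IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);

                cache.put(1, "value");

                // The proxy reads the key from all owners, compares the
                // copies and, on a mismatch, repairs the entry and records
                // a special event.
                String val = cache.withReadRepair().get(1);

                System.out.println(val);
            }
        }
    }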

3) Naming?
The feature name is based on Cassandra's Read Repair feature [4], which is
pretty similar.

4) Limitations which can be fixed in the future?
  * MVCC and Near caches are not supported.
  * Atomic caches can be checked (a false positive is possible during this
check), but can't be recovered.
  * Partial entry removal can't be recovered.
  * Entries streamed using the data streamer (with an updater not based on
"cache.put") and entries loaded by cache.load are perceived as inconsistent,
since they may have different versions for the same keys.
  * Only explicit get operations are supported (getAndReplace, getAndPut,
etc. can be supported in the future).

5) What's left?
  * SQL/ThinClient/etc support.
  * Metrics (found/repaired).
  * A simple per-partition recovery feature able to work in the background,
in addition to the per-entry recovery feature.

6) Is code checked?
  * Pull Request #5656 [5] (feature) - has a green TC run.
  * Pull Request #6575 [6] (RunAll with the feature enabled for every get()
request) - has a limited number of failures (because of the data streamer,
cache.load, etc.).

Thoughts?

[0] https://issues.apache.org/jira/browse/IGNITE-10663
[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
[2]
https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
[3]
https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
[4]
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
[5] https://github.com/apache/ignite/pull/5656
[6] https://github.com/apache/ignite/pull/6575

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Nikolay,

The initial idea is to count fixes per cache.
In other words, event recording should cause a metric increment.
Could you help me with the RR metrics implementation as a part of your metrics
journey?
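
To make the idea concrete, a rough sketch (not the proposed implementation;
it assumes the EVT_CONSISTENCY_VIOLATION event type is the one recorded by RR
and that this type is enabled via IgniteConfiguration#setIncludeEventTypes):

    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.events.CacheConsistencyViolationEvent;
    import org.apache.ignite.events.Event;
    import org.apache.ignite.events.EventType;
    import org.apache.ignite.lang.IgnitePredicate;

    public class RepairCounter {
        private final AtomicLong repairs = new AtomicLong();

        public void register(Ignite ignite) {
            // Local listener: called on the node that records the event.
            IgnitePredicate<Event> lsnr = evt -> {
                if (evt instanceof CacheConsistencyViolationEvent)
                    repairs.incrementAndGet();

                return true; // Keep listening.
            };

            ignite.events().localListen(lsnr, EventType.EVT_CONSISTENCY_VIOLATION);
        }

        public long repaired() {
            return repairs.get();
        }
    }

A real implementation would, of course, live in the cache metrics framework
and keep one counter per cache rather than a single AtomicLong.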

On Thu, Jul 11, 2019 at 10:18 AM Nikolay Izhikov <ni...@apache.org>
wrote:

> Anton,
>
> > - metrics are better for monitoring, but the Event is enough for my wish
> to cover AI with consistency check,
>
> Can you clarify, do you have plans to add metrics of RR events?
>
> I think it should be count of incosistency events per cache(maybe per
> partition).
>
>
> В Чт, 11/07/2019 в 09:53 +0300, Anton Vinogradov пишет:
> > Slava,
> >
> > Thanks for your review first!
> >
> > > > Anyway, I left some comments in your pull-request at github. Please
> take
> >
> > a
> > > > look. The most annoying thing is poorly documented code :(
> >
> > Since your comments mostly about Javadoc (does this mean that my solution
> > is so great that you ask me only to fix Javadocs :) ?),
> > I'd like to propose you to provide fixes as a PR since you have a vision
> of
> > how it should be made. I'll review them and merge shortly.
> >
> > > > By the way, is it required to add test related to fail-over
> scenarios?
> >
> > The best check is to use RR at real code.
> > For example, I'm injecting RR now to the test with concurrent
> modifications
> > and restarts [1].
> >
> > > > I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > > should cover existing limitations/improvements.
> > > > I would suggest creating the following tasks, at least:
> >
> > Mostly agree with you, but
> > - MVCC is not production ready,
> > - not sure near support really required,
> > - metrics are better for monitoring, but the Event is enough for my wish
> to
> > cover AI with consistency check,
> > - do we really need Platforms and Thin Client support?
> > Also, we should not produce stillborn issue.
> > All limitations listed at proxy creation method and they definitely are
> not
> > showstoppers and may be fixed later if someone interested.
> > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> almost
> > impossible (require much more time that feature's cost) to support them
> all.
> > I will be pretty happy in case someone will do this and provide help if
> > necessary!
> >
> > > > Perhaps, it would be a good idea to think about the recovery too
> >
> > Do you mean per partition check and recovery?
> > That's a good idea, but I found it's not easy to imagine API to for such
> > tool.
> > In case you ready to assist with proper API/design this will definitely
> > help.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> >
> > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> slava.koptilin@gmail.com>
> > wrote:
> >
> > > Perhaps, it would be a good idea to think about the recovery tool/
> > > control-utility command that will allow achieving the same goal.
> > > If I am not mistaken it was already proposed in the email thread.
> > >
> > > Thanks,
> > > S.
> > >
> > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> slava.koptilin@gmail.com>:
> > >
> > > > Hi Anton,
> > > >
> > > > Well, the ReadRepair feature is finally merged and that is good news
> :)
> > > >
> > > > Unfortunately, I cannot find a consensus about the whole
> functionality in
> > > > any of these topics:
> > > >  -
> > > >
> > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > >  -
> > > >
> > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > > > especially when a feature is huge, not obvious and involves changing
> > >
> > > public
> > > > API (and that is the case, I think).
> > > >
> > > > Anyway, I left some comments in your pull-request at github. Please
> take
> > >
> > > a
> > > > look. The most annoying thing is poorly documented code :(
> > > > By the way, is it required to add test related to fail-over
> scenarios?
> > > >
> > > > I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > > should cover existing limitations/improvements.
> > > > I would suggest creating the following tasks, at least:
> > > >  - MVCC support
> > > >  - Near caches
> > > >  - Additional metrics (number of violations, number of repaired
> entries
> > > > etc)
> > > >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> > > >  - Thin clients support
> > > >  - Perhaps, it would be useful to support different strategies to
> resolve
> > > > inconsistencies
> > > >
> > > > What do you think?
> > > >
> > > > Thanks,
> > > > S.
> > > >
> > > >
> > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> > > >
> > > > > Folks,
> > > > >
> > > > > Thanks to everyone for tips and reviews.
> > > > > Yardstick checked, no performance drop found.
> > > > > Additional measurement: RR get() is just up to 7% slower than
> regular
> > >
> > > get
> > > > > on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > Code merged to the master.
> > > > > "Must have" tasks created and attached to the IEP.
> > > > >
> > > > > On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org>
> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > Just a minor update.
> > > > > >
> > > > > > RunAll [1] with enabled ReadRepair proxy is almost green now (~10
> > >
> > > tests
> > > > > > left, started with 6k :)).
> > > > > > During the analisys, I've found some tests with
> > > > > > - unexpected repairs at tx caches
> > > > > > - inconsistent state after the test finished (different entries
> across
> > > > >
> > > > > the
> > > > > > topology)
> > > > > > For example,
> > > > > > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> obsolete
> > > > > > versions on backups in case of retry, fixed [2]
> > > > > > - initial cache load generates not equal versions on backups,
> fixed
> > >
> > > [3]
> > > > > > - testAccountTxNodeRestart causes unexpected repairs (entries
> have
> > > > > > different versions), to be investigated.
> > > > > >
> > > > > > What's next?
> > > > > >
> > > > > > 1) Going to merge the solution once "RunAll with ReadRepair
> enabled"
> > > > > > becomes fully green.
> > > > > > 2) Going to add special check after each test which will ensure
> caches
> > > > > > content after the test is consistent.
> > > > > > 2.1) The Same check can (should?) be injected to
> > > > > > awaitPartitionMapExchange() and similar methods.
> > > > > > 3) Update Jepsen tests with RR checks.
> > > > > > 4) Introduce per partition RR.
> > > > > >
> > > > > > So, the final goal is to be sure that Ignite produces only
> consistent
> > > > >
> > > > > data
> > > > > > and to have a feature to solve consistency in case we gain
> > >
> > > inconsistent
> > > > > > state somehow.
> > > > > >
> > > > > > Limitations?
> > > > > >
> > > > > > Currently, RR has some limitations, but they are not related to
> real
> > > > > > production cases.
> > > > > > In case someone interested to support, for example, MVCC or near
> > >
> > > caches,
> > > > > > please, feel free to contribute.
> > > > > >
> > > > > > [1]
> > > > > >
> > >
> > >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > > [2]
> > > > > >
> > >
> > >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > > [3]
> > > > > >
> > >
> > >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > >
> > > > > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org>
> > >
> > > wrote:
> > > > > >
> > > > > > > Slava,
> > > > > > >
> > > > > > > > > I will take a look at your pull request if you don't mind.
> > > > > > >
> > > > > > > Great news!
> > > > > > >
> > > > > > > > > In any way, could you please update the IEP page with the
> list of
> > > > > > > > > constraints/limitations of the proposed approach, TODOs,
> etc?
> > > > > > >
> > > > > > > Not sure we should keep this at IEP until list (#4 from
> original
> > > > >
> > > > > letter)
> > > > > > > is not confirmed.
> > > > > > >
> > > > > > > Currently, I'm checking RunAll with RR enabled to almost each
> get
> > > > >
> > > > > request.
> > > > > > > "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > > ctx.config().getBackups() > 0 && !ctx.isNear() &&
> !ctx.mvccEnabled()
> > > > > > > For now I have 60 failed tests and amount decreasing.
> > > > > > >
> > > > > > > > > For instance, I would like to see all these limitations on
> the IEP
> > > > > > >
> > > > > > > page as
> > > > > > > > > JIRA tickets. Perhaps, it would be good to create an
> epic/umbrella
> > > > > > >
> > > > > > > ticket
> > > > > > > > > in order to track all activities related to `Read Repair`
> feature.
> > > > > > >
> > > > > > > Let's do this at merge day to be sure useless issues will not
> be
> > > > >
> > > > > created.
> > > > > > >
> > > > > > > On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > > > slava.koptilin@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Anton,
> > > > > > > >
> > > > > > > > I will take a look at your pull request if you don't mind.
> > > > > > > >
> > > > > > > > In any way, could you please update the IEP page with the
> list of
> > > > > > > > constraints/limitations of the proposed approach, TODOs, etc?
> > > > > > > > For instance, I would like to see all these limitations on
> the IEP
> > > > >
> > > > > page
> > > > > > > > as
> > > > > > > > JIRA tickets. Perhaps, it would be good to create an
> epic/umbrella
> > > > >
> > > > > ticket
> > > > > > > > in order to track all activities related to `Read Repair`
> feature.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > S.
> > > > > > > >

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Nikolay Izhikov <ni...@apache.org>.
Anton,

> - metrics are better for monitoring, but the Event is enough for my wish to cover AI with consistency check,

Can you clarify: do you have plans to add metrics for RR events?

I think it should be a count of inconsistency events per cache (maybe per
partition).


Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Павлухин Иван <vo...@gmail.com>.
Anton,

No doubts, just a lack of attention from my side. Thank you for clarifying.

Regarding MVCC, its Beta state is mentioned in the documentation [1] (there is
a warning badge at the beginning).

And I think it is a really good idea to employ some means of delivering
experimental stuff. It can be somewhat similar to the feature toggling widely
used today. I can imagine the following process (a rough sketch of such a mark
follows below):
1. Development of a non-trivial feature is started and all related stuff is
marked as experimental in a way transparent to everybody.
2. The development continues and the feature can be merged to master in chunks
as parts become ready.
3. Such a feature can be shipped in releases in a beta/experimental state to
gather feedback from users.
4. Once the feature is completely ready (including documentation), the
experimental mark is removed.

[1] https://apacheignite.readme.io/docs/multiversion-concurrency-control
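
As a purely hypothetical illustration of point 1, such a mark could be as
simple as a project-local annotation (jdk.jfr.Experimental, mentioned below,
is JDK 9+ only, so Ignite would need its own):

    import java.lang.annotation.Documented;
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    /**
     * Marks an API as experimental: it may change or disappear in a future
     * release and is not yet covered by compatibility guarantees.
     */
    @Documented
    @Retention(RetentionPolicy.CLASS)
    @Target({ElementType.TYPE, ElementType.METHOD, ElementType.FIELD})
    public @interface Experimental {
        // Documentation marker only; intentionally empty.
    }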

пт, 19 июл. 2019 г. в 08:31, Anton Vinogradov <av...@apache.org>:
>
> Ivan,
>
> My code is always production-ready, any doubts? :)
>
> Anyway, that's a good question.
> It's a good case to mark feature experimental somehow while it's roadmap
> not finished.
> Was MVCC marked as experimental?
> API can be marked as Experimental [1] since java 9, but we're not able to
> use this annotation for now.
>
> Anyway again, seems it depends on the release date will RR be marked as
> experimental or not.
> In case we'll finish roadmap before 2.8 release there will be no need for
> such mark.
> So, let's assume it's production-ready until another state is not confirmed
> :)
>
> [1] https://docs.oracle.com/javase/9/docs/api/jdk/jfr/Experimental.html
>
> On Thu, Jul 18, 2019 at 4:56 PM Павлухин Иван <vo...@gmail.com> wrote:
>
> > Folks,
> >
> > Sorry, I was not very attentive. Could you please clarify whether (or
> > not) this feature is currently in an experiment state? And a couple of
> > obvious things:
> > 1. If it is experimental it should be clear for a user (e.g. from javadoc).
> > 2. If not all limitations should be described in javadocs and
> > documentation.
> >
> > ср, 17 июл. 2019 г. в 14:59, Вячеслав Коптилин <sl...@gmail.com>:
> > >
> > > Hello Anton,
> > >
> > > It's not possible, currently, to fix atomic caches.
> > > > You may only check the consistency. *And it's better than nothing*, I
> > > > think.
> > >
> > > Fair enough.
> > >
> > > >> 3. IgniteConsistencyViolationException is absolutely useless. It does
> > not
> > > > >> provide any information about the issue and possible way to fix it.
> > > > It means that some keys from your get operation are broken.
> > > > IgniteConsistencyViolationException CAN be extended with a list of
> > broken
> > > > keys in the future.
> > >
> > > I think it SHOULD be extended with additional fields/methods in the same
> > > way as CacheConsistencyViolationEvent
> > >
> > > >> Well, near caches are widely used and fully transactional, so I think
> > it
> > > > >> makes sense to support the feature for near caches too.
> > > > As I told before, it will be nice to implement this in the future, but
> > we
> > > > have more important tasks for now.
> > >
> > > I do not insist that it must be done right now.
> > >
> > >
> > > > >> For instance, I would like to see all these limitations on the IEP
> > page
> > > > as
> > > > >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> > > > ticket
> > > > >> in order to track all activities related to `Read Repair` feature.
> > > > Let's do this at merge day to be sure useless issues will not be
> > created.
> > > >
> > >
> > > I am just trying to say that it is a good time to create tickets in order
> > > to track all of that otherwise the chance that all these
> > > limitations/improvements will not be addressed is very high.
> > >
> > > Thanks,
> > > S.
> > >
> > > вт, 16 июл. 2019 г. в 09:07, Anton Vinogradov <av...@apache.org>:
> > >
> > > > Svala,
> > > >
> > > > >> Could you please take a look at PR:
> > > > Going to review today, thanks for attaching the bot visa.
> > > >
> > > > >> 1. Should I consider that my cluster is broken? There is no answer!
> > The
> > > > >> false-positive result is possible.
> > > > That's a question about atomic nature.
> > > > It's not impossible to lock atomic entry to perform the check.
> > > > You should perform some attempts, it's your decision how many.
> > > > By default, atomic RR performs 3 attempts, you may increase this by
> > setting
> > > > IGNITE_NEAR_GET_MAX_REMAPS or by just performing additional gets.
> > > >
> > > > >> 2. What should be done here in order to check/resolve the issue?
> > > > Perhaps, I
> > > > >> should restart a node (which one?), restart the whole cluster, put
> > a new
> > > > >> value...
> > > > It's not possible, currently, to fix atomic caches.
> > > > You may only check the consistency. And it's better than nothing, I
> > think.
> > > > We should find a way how to fix atomic consistency first.
> > > > A possible strategy is to use ЕntryProcessor which will replace all
> > owner's
> > > > values with "latest" and do nothing in case newest (than latest) value
> > > > found (opposite to preloading approach).
> > > >
> > > > >> 3. IgniteConsistencyViolationException is absolutely useless. It
> > does
> > > > not
> > > > >> provide any information about the issue and possible way to fix it.
> > > > It means that some keys from your get operation are broken.
> > > > IgniteConsistencyViolationException CAN be extended with a list of
> > broken
> > > > keys in the future.
> > > >
> > > > >> It seems that transactional caches are covered much better.
> > > > Correct.
> > > > Tx caches consistency is more important that atomic consistency,
> > that's why
> > > > it was implemented first.
> > > > BTW, AFAIK, atomics also were not fixed at 10078 [1].
> > > >
> > > > >> Well, near caches are widely used and fully transactional, so I
> > think it
> > > > >> makes sense to support the feature for near caches too.
> > > > As I told before, it will be nice to implement this in the future, but
> > we
> > > > have more important tasks for now.
> > > > The main goal was to cover tx caches, to be able to fix them in case
> > of the
> > > > real problem at production.
> > > >
> > > > Summarizing the roadmap,
> > > > My goal now is to finish the tx case, now we have an issue with false
> > > > positive consistency violation [2].
> > > > Also, we're going to update Jepsen tests [3] with RR to ensure tx
> > caches
> > > > fixed.
> > > > Next main goal is to use RR at TC checks [4], help with this issue are
> > > > appreciated.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > > > [2] https://issues.apache.org/jira/browse/IGNITE-11973
> > > > [3] https://issues.apache.org/jira/browse/IGNITE-11972
> > > > [4] https://issues.apache.org/jira/browse/IGNITE-11971
> > > >
> > > >
> > > > On Mon, Jul 15, 2019 at 4:51 PM Dmitriy Pavlov <dp...@apache.org>
> > wrote:
> > > >
> > > > > Ok,  thank you
> > > > >
> > > > > пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <ni...@apache.org>:
> > > > >
> > > > > > I did the review.
> > > > > >
> > > > > > пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:
> > > > > >
> > > > > > > Igniters, who did a review of
> > > > > > > https://issues.apache.org/jira/browse/IGNITE-10663 before the
> > merge?
> > > > > > I've
> > > > > > > checked both PR   https://github.com/apache/ignite/pull/5656
> > and
> > > > > Issue,
> > > > > > > and dev.list thread and didn't find any LGTM.
> > > > > > >
> > > > > > > Anton, since you've rejected lazy consensus in our process, we
> > have
> > > > RTC
> > > > > > in
> > > > > > > that (core) module. So I'd like to know if the fix was covered
> > by the
> > > > > > > review.
> > > > > > >
> > > > > > > Because you're a committer, a reviewer can be non-committer. So,
> > who
> > > > > was
> > > > > > a
> > > > > > > reviewer? Or was process ignored?
> > > > > > >
> > > > > > > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <
> > > > > slava.koptilin@gmail.com
> > > > > > >:
> > > > > > >
> > > > > > > > Hello Anton,
> > > > > > > >
> > > > > > > > > I'd like to propose you to provide fixes as a PR since you
> > have a
> > > > > > > vision
> > > > > > > > of how it should be made. I'll review them and merge shortly.
> > > > > > > > Could you please take a look at PR:
> > > > > > > > https://github.com/apache/ignite/pull/6689
> > > > > > > >
> > > > > > > > > Since your comments mostly about Javadoc (does this mean
> > that my
> > > > > > > solution
> > > > > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > > > > In my humble opinion, I would consider this feature as
> > experimental
> > > > > one
> > > > > > > (It
> > > > > > > > does not seem production-ready).
> > > > > > > > Let me clarify this with the following simple example:
> > > > > > > >
> > > > > > > >     try {
> > > > > > > >         atomicCache.withReadRepair().getAll(keys);
> > > > > > > >     }
> > > > > > > >     catch (CacheException e) {
> > > > > > > >         // What should be done here from the end-user point of
> > > > view?
> > > > > > > >     }
> > > > > > > >
> > > > > > > > 1. Should I consider that my cluster is broken? There is no
> > answer!
> > > > > The
> > > > > > > > false-positive result is possible.
> > > > > > > > 2. What should be done here in order to check/resolve the
> > issue?
> > > > > > > Perhaps, I
> > > > > > > > should restart a node (which one?), restart the whole cluster,
> > put
> > > > a
> > > > > > new
> > > > > > > > value...
> > > > > > > > 3. IgniteConsistencyViolationException is absolutely useless.
> > It
> > > > does
> > > > > > not
> > > > > > > > provide any information about the issue and possible way to
> > fix it.
> > > > > > > >
> > > > > > > > It seems that transactional caches are covered much better.
> > > > > > > >
> > > > > > > > > Mostly agree with you, but
> > > > > > > > > - MVCC is not production ready,
> > > > > > > > > - not sure near support really required,
> > > > > > > > > - metrics are better for monitoring, but the Event is enough
> > for
> > > > my
> > > > > > > wish
> > > > > > > > to
> > > > > > > > > cover AI with consistency check,
> > > > > > > > > - do we really need Platforms and Thin Client support?
> > > > > > > > Well, near caches are widely used and fully transactional, so I
> > > > think
> > > > > > it
> > > > > > > > makes sense to support the feature for near caches too.
> > > > > > > > .Net is already aware of 'ReadRepair'. It seems to me, that it
> > can
> > > > be
> > > > > > > > easily supported for C++. I don't see a reason why it should
> > not be
> > > > > > done
> > > > > > > :)
> > > > > > > >
> > > > > > > > > Do you mean per partition check and recovery? That's a good
> > idea,
> > > > > > but I
> > > > > > > > found it's not easy to imagine API to for such tool.
> > > > > > > > Yep, perhaps it can be done on the idle cluster via
> > `idle-verify`
> > > > > > command
> > > > > > > > with additional flag. Agreed, that this approach is not the
> > best
> > > > one
> > > > > > > > definitely.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > S.
> > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > > > > > > >> >>> > [3]
> > > > > > > > > > >> >>> >
> > > > > > > > > > >> >>> >
> > > > > > > > > > >> >>>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > > > > > > >> >>> > [4]
> > > > > > > > > > >> >>> >
> > > > > > > > > > >> >>> >
> > > > > > > > > > >> >>>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > > > > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > > > > > > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >



-- 
Best regards,
Ivan Pavlukhin

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Ivan,

My code is always production-ready, any doubts? :)

Anyway, that's a good question.
It's a good case to mark the feature experimental somehow while its roadmap
is not finished.
Was MVCC marked as experimental?
An API can be marked as Experimental [1] since Java 9, but we're not able
to use this annotation for now.

Anyway again, it seems the release date will decide whether RR is marked
as experimental or not.
If we finish the roadmap before the 2.8 release, there will be no need for
such a mark.
So, let's assume it's production-ready until another state is confirmed :)

[1] https://docs.oracle.com/javase/9/docs/api/jdk/jfr/Experimental.html
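
Just to illustrate, a minimal sketch of how such a mark could look if we
introduced our own marker annotation (the annotation name below is
hypothetical, nothing like it exists in the codebase; [1] itself requires
Java 9+):

    import java.lang.annotation.Documented;
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    /**
     * Hypothetical marker: flags public API (e.g. withReadRepair()) that may
     * change or be removed while the Read Repair roadmap is not finished.
     */
    @Documented
    @Retention(RetentionPolicy.CLASS)
    @Target({ElementType.METHOD, ElementType.TYPE})
    public @interface IgniteExperimental {
        // No members: the presence of the annotation is the only signal.
    }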

On Thu, Jul 18, 2019 at 4:56 PM Павлухин Иван <vo...@gmail.com> wrote:

> Folks,
>
> Sorry, I was not very attentive. Could you please clarify whether (or
> not) this feature is currently in an experiment state? And a couple of
> obvious things:
> 1. If it is experimental it should be clear for a user (e.g. from javadoc).
> 2. If not all limitations should be described in javadocs and
> documentation.
>
> Wed, Jul 17, 2019 at 14:59, Вячеслав Коптилин <sl...@gmail.com>:
> >
> > Hello Anton,
> >
> > It's not possible, currently, to fix atomic caches.
> > > You may only check the consistency. *And it's better than nothing*, I
> > > think.
> >
> > Fair enough.
> >
> > >> 3. IgniteConsistencyViolationException is absolutely useless. It does
> not
> > > >> provide any information about the issue and possible way to fix it.
> > > It means that some keys from your get operation are broken.
> > > IgniteConsistencyViolationException CAN be extended with a list of
> broken
> > > keys in the future.
> >
> > I think it SHOULD be extended with additional fields/methods in the same
> > way as CacheConsistencyViolationEvent
> >
> > >> Well, near caches are widely used and fully transactional, so I think
> it
> > > >> makes sense to support the feature for near caches too.
> > > As I told before, it will be nice to implement this in the future, but
> we
> > > have more important tasks for now.
> >
> > I do not insist that it must be done right now.
> >
> >
> > > >> For instance, I would like to see all these limitations on the IEP
> page
> > > as
> > > >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> > > ticket
> > > >> in order to track all activities related to `Read Repair` feature.
> > > Let's do this at merge day to be sure useless issues will not be
> created.
> > >
> >
> > I am just trying to say that it is a good time to create tickets in order
> > to track all of that otherwise the chance that all these
> > limitations/improvements will not be addressed is very high.
> >
> > Thanks,
> > S.
> >
> > вт, 16 июл. 2019 г. в 09:07, Anton Vinogradov <av...@apache.org>:
> >
> > > Svala,
> > >
> > > >> Could you please take a look at PR:
> > > Going to review today, thanks for attaching the bot visa.
> > >
> > > >> 1. Should I consider that my cluster is broken? There is no answer!
> The
> > > >> false-positive result is possible.
> > > That's a question about atomic nature.
> > > It's not impossible to lock atomic entry to perform the check.
> > > You should perform some attempts, it's your decision how many.
> > > By default, atomic RR performs 3 attempts, you may increase this by
> setting
> > > IGNITE_NEAR_GET_MAX_REMAPS or by just performing additional gets.
> > >
> > > >> 2. What should be done here in order to check/resolve the issue?
> > > Perhaps, I
> > > >> should restart a node (which one?), restart the whole cluster, put
> a new
> > > >> value...
> > > It's not possible, currently, to fix atomic caches.
> > > You may only check the consistency. And it's better than nothing, I
> think.
> > > We should find a way how to fix atomic consistency first.
> > > A possible strategy is to use EntryProcessor which will replace all
> owner's
> > > values with "latest" and do nothing in case newest (than latest) value
> > > found (opposite to preloading approach).
> > >
> > > >> 3. IgniteConsistencyViolationException is absolutely useless. It
> does
> > > not
> > > >> provide any information about the issue and possible way to fix it.
> > > It means that some keys from your get operation are broken.
> > > IgniteConsistencyViolationException CAN be extended with a list of
> broken
> > > keys in the future.
> > >
> > > >> It seems that transactional caches are covered much better.
> > > Correct.
> > > Tx caches consistency is more important that atomic consistency,
> that's why
> > > it was implemented first.
> > > BTW, AFAIK, atomics also were not fixed at 10078 [1].
> > >
> > > >> Well, near caches are widely used and fully transactional, so I
> think it
> > > >> makes sense to support the feature for near caches too.
> > > As I told before, it will be nice to implement this in the future, but
> we
> > > have more important tasks for now.
> > > The main goal was to cover tx caches, to be able to fix them in case
> of the
> > > real problem at production.
> > >
> > > Summarizing the roadmap,
> > > My goal now is to finish the tx case, now we have an issue with false
> > > positive consistency violation [2].
> > > Also, we're going to update Jepsen tests [3] with RR to ensure tx
> caches
> > > fixed.
> > > Next main goal is to use RR at TC checks [4], help with this issue are
> > > appreciated.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > > [2] https://issues.apache.org/jira/browse/IGNITE-11973
> > > [3] https://issues.apache.org/jira/browse/IGNITE-11972
> > > [4] https://issues.apache.org/jira/browse/IGNITE-11971
> > >
> > >
> > > On Mon, Jul 15, 2019 at 4:51 PM Dmitriy Pavlov <dp...@apache.org>
> wrote:
> > >
> > > > Ok,  thank you
> > > >
> > > > пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <ni...@apache.org>:
> > > >
> > > > > I did the review.
> > > > >
> > > > > пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:
> > > > >
> > > > > > Igniters, who did a review of
> > > > > > https://issues.apache.org/jira/browse/IGNITE-10663 before the
> merge?
> > > > > I've
> > > > > > checked both PR   https://github.com/apache/ignite/pull/5656
> and
> > > > Issue,
> > > > > > and dev.list thread and didn't find any LGTM.
> > > > > >
> > > > > > Anton, since you've rejected lazy consensus in our process, we
> have
> > > RTC
> > > > > in
> > > > > > that (core) module. So I'd like to know if the fix was covered
> by the
> > > > > > review.
> > > > > >
> > > > > > Because you're a committer, a reviewer can be non-committer. So,
> who
> > > > was
> > > > > a
> > > > > > reviewer? Or was process ignored?
> > > > > >
> > > > > > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <
> > > > slava.koptilin@gmail.com
> > > > > >:
> > > > > >
> > > > > > > Hello Anton,
> > > > > > >
> > > > > > > > I'd like to propose you to provide fixes as a PR since you
> have a
> > > > > > vision
> > > > > > > of how it should be made. I'll review them and merge shortly.
> > > > > > > Could you please take a look at PR:
> > > > > > > https://github.com/apache/ignite/pull/6689
> > > > > > >
> > > > > > > > Since your comments mostly about Javadoc (does this mean
> that my
> > > > > > solution
> > > > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > > > In my humble opinion, I would consider this feature as
> experimental
> > > > one
> > > > > > (It
> > > > > > > does not seem production-ready).
> > > > > > > Let me clarify this with the following simple example:
> > > > > > >
> > > > > > >     try {
> > > > > > >         atomicCache.withReadRepair().getAll(keys);
> > > > > > >     }
> > > > > > >     catch (CacheException e) {
> > > > > > >         // What should be done here from the end-user point of
> > > view?
> > > > > > >     }
> > > > > > >
> > > > > > > 1. Should I consider that my cluster is broken? There is no
> answer!
> > > > The
> > > > > > > false-positive result is possible.
> > > > > > > 2. What should be done here in order to check/resolve the
> issue?
> > > > > > Perhaps, I
> > > > > > > should restart a node (which one?), restart the whole cluster,
> put
> > > a
> > > > > new
> > > > > > > value...
> > > > > > > 3. IgniteConsistencyViolationException is absolutely useless.
> It
> > > does
> > > > > not
> > > > > > > provide any information about the issue and possible way to
> fix it.
> > > > > > >
> > > > > > > It seems that transactional caches are covered much better.
> > > > > > >
> > > > > > > > Mostly agree with you, but
> > > > > > > > - MVCC is not production ready,
> > > > > > > > - not sure near support really required,
> > > > > > > > - metrics are better for monitoring, but the Event is enough
> for
> > > my
> > > > > > wish
> > > > > > > to
> > > > > > > > cover AI with consistency check,
> > > > > > > > - do we really need Platforms and Thin Client support?
> > > > > > > Well, near caches are widely used and fully transactional, so I
> > > think
> > > > > it
> > > > > > > makes sense to support the feature for near caches too.
> > > > > > > .Net is already aware of 'ReadRepair'. It seems to me, that it
> can
> > > be
> > > > > > > easily supported for C++. I don't see a reason why it should
> not be
> > > > > done
> > > > > > :)
> > > > > > >
> > > > > > > > Do you mean per partition check and recovery? That's a good
> idea,
> > > > > but I
> > > > > > > found it's not easy to imagine API to for such tool.
> > > > > > > Yep, perhaps it can be done on the idle cluster via
> `idle-verify`
> > > > > command
> > > > > > > with additional flag. Agreed, that this approach is not the
> best
> > > one
> > > > > > > definitely.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > S.
> > > > > > >
> > > > > > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
> > > > > > >
> > > > > > > > Slava,
> > > > > > > >
> > > > > > > > Thanks for your review first!
> > > > > > > >
> > > > > > > > >> Anyway, I left some comments in your pull-request at
> github.
> > > > > Please
> > > > > > > take
> > > > > > > > a
> > > > > > > > >> look. The most annoying thing is poorly documented code :(
> > > > > > > > Since your comments mostly about Javadoc (does this mean
> that my
> > > > > > solution
> > > > > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > > > > I'd like to propose you to provide fixes as a PR since you
> have a
> > > > > > vision
> > > > > > > of
> > > > > > > > how it should be made. I'll review them and merge shortly.
> > > > > > > >
> > > > > > > > >> By the way, is it required to add test related to
> fail-over
> > > > > > scenarios?
> > > > > > > > The best check is to use RR at real code.
> > > > > > > > For example, I'm injecting RR now to the test with concurrent
> > > > > > > modifications
> > > > > > > > and restarts [1].
> > > > > > > >
> > > > > > > > >> I just checked, the IEP page and I still cannot find Jira
> > > > tickets
> > > > > > that
> > > > > > > > >> should cover existing limitations/improvements.
> > > > > > > > >> I would suggest creating the following tasks, at least:
> > > > > > > > Mostly agree with you, but
> > > > > > > > - MVCC is not production ready,
> > > > > > > > - not sure near support really required,
> > > > > > > > - metrics are better for monitoring, but the Event is enough
> for
> > > my
> > > > > > wish
> > > > > > > to
> > > > > > > > cover AI with consistency check,
> > > > > > > > - do we really need Platforms and Thin Client support?
> > > > > > > > Also, we should not produce stillborn issue.
> > > > > > > > All limitations listed at proxy creation method and they
> > > definitely
> > > > > are
> > > > > > > not
> > > > > > > > showstoppers and may be fixed later if someone interested.
> > > > > > > > Сoming back to AI 3.0 discussion, we have A LOT of features
> and
> > > > it's
> > > > > > > almost
> > > > > > > > impossible (require much more time that feature's cost) to
> > > support
> > > > > them
> > > > > > > > all.
> > > > > > > > I will be pretty happy in case someone will do this and
> provide
> > > > help
> > > > > if
> > > > > > > > necessary!
> > > > > > > >
> > > > > > > > >> Perhaps, it would be a good idea to think about the
> recovery
> > > too
> > > > > > > > Do you mean per partition check and recovery?
> > > > > > > > That's a good idea, but I found it's not easy to imagine API
> to
> > > for
> > > > > > such
> > > > > > > > tool.
> > > > > > > > In case you ready to assist with proper API/design this will
> > > > > definitely
> > > > > > > > help.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > > > > > > >
> > > > > > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > > > > > > slava.koptilin@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Perhaps, it would be a good idea to think about the
> recovery
> > > > tool/
> > > > > > > > > control-utility command that will allow achieving the same
> > > goal.
> > > > > > > > > If I am not mistaken it was already proposed in the email
> > > thread.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > S.
> > > > > > > > >
> > > > > > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> > > > > > > slava.koptilin@gmail.com
> > > > > > > > >:
> > > > > > > > >
> > > > > > > > > > Hi Anton,
> > > > > > > > > >
> > > > > > > > > > Well, the ReadRepair feature is finally merged and that
> is
> > > good
> > > > > > news
> > > > > > > :)
> > > > > > > > > >
> > > > > > > > > > Unfortunately, I cannot find a consensus about the whole
> > > > > > > functionality
> > > > > > > > in
> > > > > > > > > > any of these topics:
> > > > > > > > > >  -
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > > > > > > >  -
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > > > > > > Also, there are no comments/discussion in JIRA. That
> makes me
> > > > sad
> > > > > > :(
> > > > > > > > > > especially when a feature is huge, not obvious and
> involves
> > > > > > changing
> > > > > > > > > public
> > > > > > > > > > API (and that is the case, I think).
> > > > > > > > > >
> > > > > > > > > > Anyway, I left some comments in your pull-request at
> github.
> > > > > Please
> > > > > > > > take
> > > > > > > > > a
> > > > > > > > > > look. The most annoying thing is poorly documented code
> :(
> > > > > > > > > > By the way, is it required to add test related to
> fail-over
> > > > > > > scenarios?
> > > > > > > > > >
> > > > > > > > > > I just checked, the IEP page and I still cannot find Jira
> > > > tickets
> > > > > > > that
> > > > > > > > > > should cover existing limitations/improvements.
> > > > > > > > > > I would suggest creating the following tasks, at least:
> > > > > > > > > >  - MVCC support
> > > > > > > > > >  - Near caches
> > > > > > > > > >  - Additional metrics (number of violations, number of
> > > repaired
> > > > > > > entries
> > > > > > > > > > etc)
> > > > > > > > > >  - Ignite C++ (It looks like, .Net is already aware of
> that
> > > > > > feature)
> > > > > > > > > >  - Thin clients support
> > > > > > > > > >  - Perhaps, it would be useful to support different
> > > strategies
> > > > to
> > > > > > > > resolve
> > > > > > > > > > inconsistencies
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > S.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <
> av@apache.org
> > > >:
> > > > > > > > > >
> > > > > > > > > >> Folks,
> > > > > > > > > >>
> > > > > > > > > >> Thanks to everyone for tips and reviews.
> > > > > > > > > >> Yardstick checked, no performance drop found.
> > > > > > > > > >> Additional measurement: RR get() is just up to 7% slower
> > > than
> > > > > > > regular
> > > > > > > > > get
> > > > > > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > > > > > >> Code merged to the master.
> > > > > > > > > >> "Must have" tasks created and attached to the IEP.
> > > > > > > > > >>
> > > > > > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <
> > > > av@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Folks,
> > > > > > > > > >> >
> > > > > > > > > >> > Just a minor update.
> > > > > > > > > >> >
> > > > > > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost
> green
> > > now
> > > > > > (~10
> > > > > > > > > tests
> > > > > > > > > >> > left, started with 6k :)).
> > > > > > > > > >> > During the analisys, I've found some tests with
> > > > > > > > > >> > - unexpected repairs at tx caches
> > > > > > > > > >> > - inconsistent state after the test finished
> (different
> > > > > entries
> > > > > > > > across
> > > > > > > > > >> the
> > > > > > > > > >> > topology)
> > > > > > > > > >> > For example,
> > > > > > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration
> > > generates
> > > > > > > > obsolete
> > > > > > > > > >> > versions on backups in case of retry, fixed [2]
> > > > > > > > > >> > - initial cache load generates not equal versions on
> > > > backups,
> > > > > > > fixed
> > > > > > > > > [3]
> > > > > > > > > >> > - testAccountTxNodeRestart causes unexpected repairs
> > > > (entries
> > > > > > have
> > > > > > > > > >> > different versions), to be investigated.
> > > > > > > > > >> >
> > > > > > > > > >> > What's next?
> > > > > > > > > >> >
> > > > > > > > > >> > 1) Going to merge the solution once "RunAll with
> > > ReadRepair
> > > > > > > enabled"
> > > > > > > > > >> > becomes fully green.
> > > > > > > > > >> > 2) Going to add special check after each test which
> will
> > > > > ensure
> > > > > > > > caches
> > > > > > > > > >> > content after the test is consistent.
> > > > > > > > > >> > 2.1) The Same check can (should?) be injected to
> > > > > > > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > > > > > > >> > 3) Update Jepsen tests with RR checks.
> > > > > > > > > >> > 4) Introduce per partition RR.
> > > > > > > > > >> >
> > > > > > > > > >> > So, the final goal is to be sure that Ignite produces
> only
> > > > > > > > consistent
> > > > > > > > > >> data
> > > > > > > > > >> > and to have a feature to solve consistency in case we
> gain
> > > > > > > > > inconsistent
> > > > > > > > > >> > state somehow.
> > > > > > > > > >> >
> > > > > > > > > >> > Limitations?
> > > > > > > > > >> >
> > > > > > > > > >> > Currently, RR has some limitations, but they are not
> > > related
> > > > > to
> > > > > > > real
> > > > > > > > > >> > production cases.
> > > > > > > > > >> > In case someone interested to support, for example,
> MVCC
> > > or
> > > > > near
> > > > > > > > > caches,
> > > > > > > > > >> > please, feel free to contribute.
> > > > > > > > > >> >
> > > > > > > > > >> > [1]
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > > > > > >> > [2]
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > > > > > >> > [3]
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <
> > > > > av@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> >> Slava,
> > > > > > > > > >> >>
> > > > > > > > > >> >> >> I will take a look at your pull request if you
> don't
> > > > mind.
> > > > > > > > > >> >> Great news!
> > > > > > > > > >> >>
> > > > > > > > > >> >> >> In any way, could you please update the IEP page
> with
> > > > the
> > > > > > list
> > > > > > > > of
> > > > > > > > > >> >> >> constraints/limitations of the proposed approach,
> > > TODOs,
> > > > > > etc?
> > > > > > > > > >> >> Not sure we should keep this at IEP until list (#4
> from
> > > > > > original
> > > > > > > > > >> letter)
> > > > > > > > > >> >> is not confirmed.
> > > > > > > > > >> >>
> > > > > > > > > >> >> Currently, I'm checking RunAll with RR enabled to
> almost
> > > > each
> > > > > > get
> > > > > > > > > >> request.
> > > > > > > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > > > > > > > !ctx.mvccEnabled()
> > > > > > > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > > > > > > >> >>
> > > > > > > > > >> >> >> For instance, I would like to see all these
> > > limitations
> > > > on
> > > > > > the
> > > > > > > > IEP
> > > > > > > > > >> >> page as
> > > > > > > > > >> >> >> JIRA tickets. Perhaps, it would be good to create
> an
> > > > > > > > epic/umbrella
> > > > > > > > > >> >> ticket
> > > > > > > > > >> >> >> in order to track all activities related to `Read
> > > > Repair`
> > > > > > > > feature.
> > > > > > > > > >> >> Let's do this at merge day to be sure useless issues
> will
> > > > not
> > > > > > be
> > > > > > > > > >> created.
> > > > > > > > > >> >>
> > > > > > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > > > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > > > > > > >> >>
> > > > > > > > > >> >>> Hi Anton,
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> I will take a look at your pull request if you don't
> > > mind.
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> In any way, could you please update the IEP page
> with
> > > the
> > > > > list
> > > > > > > of
> > > > > > > > > >> >>> constraints/limitations of the proposed approach,
> TODOs,
> > > > > etc?
> > > > > > > > > >> >>> For instance, I would like to see all these
> limitations
> > > on
> > > > > the
> > > > > > > IEP
> > > > > > > > > >> page
> > > > > > > > > >> >>> as
> > > > > > > > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> > > > > > > epic/umbrella
> > > > > > > > > >> ticket
> > > > > > > > > >> >>> in order to track all activities related to `Read
> > > Repair`
> > > > > > > feature.
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> Thanks,
> > > > > > > > > >> >>> S.
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <
> > > > > av@apache.org
> > > > > > >:
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> > Igniters,
> > > > > > > > > >> >>> > I'm glad to introduce Read Repair feature [0]
> provides
> > > > > > > > additional
> > > > > > > > > >> >>> > consistency guarantee for Ignite.
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > 1) Why we need it?
> > > > > > > > > >> >>> > The detailed explanation can be found at IEP-31
> [1].
> > > > > > > > > >> >>> > In short, because of bugs, it's possible to gain
> an
> > > > > > > inconsistent
> > > > > > > > > >> state.
> > > > > > > > > >> >>> > We need additional features to handle this case.
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > Currently we able to check cluster using
> Idle_verify
> > > [2]
> > > > > > > > feature,
> > > > > > > > > >> but
> > > > > > > > > >> >>> it
> > > > > > > > > >> >>> > will not fix the data, will not even tell which
> > > entries
> > > > > are
> > > > > > > > > broken.
> > > > > > > > > >> >>> > Read Repair is a feature to understand which
> entries
> > > are
> > > > > > > broken
> > > > > > > > > and
> > > > > > > > > >> to
> > > > > > > > > >> >>> fix
> > > > > > > > > >> >>> > them.
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > 1) How it works?
> > > > > > > > > >> >>> > IgniteCache now able to provide special proxy [3]
> > > > > > > > > withReadRepair().
> > > > > > > > > >> >>> > This proxy guarantee that data will be gained
> from all
> > > > > > owners
> > > > > > > > and
> > > > > > > > > >> >>> compared.
> > > > > > > > > >> >>> > In the case of consistency violation situation,
> data
> > > > will
> > > > > be
> > > > > > > > > >> recovered
> > > > > > > > > >> >>> and
> > > > > > > > > >> >>> > a special event recorded.
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > 3) Naming?
> > > > > > > > > >> >>> > Feature name based on Cassandra's Read Repair
> feature
> > > > [4],
> > > > > > > which
> > > > > > > > > is
> > > > > > > > > >> >>> pretty
> > > > > > > > > >> >>> > similar.
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > 4) Limitations which can be fixed in the future?
> > > > > > > > > >> >>> >   * MVCC and Near caches are not supported.
> > > > > > > > > >> >>> >   * Atomic caches can be checked (false positive
> case
> > > is
> > > > > > > > possible
> > > > > > > > > on
> > > > > > > > > >> >>> this
> > > > > > > > > >> >>> > check), but can't be recovered.
> > > > > > > > > >> >>> >   * Partial entry removal can't be recovered.
> > > > > > > > > >> >>> >   * Entries streamed using data streamer (using
> not a
> > > > > > > > "cache.put"
> > > > > > > > > >> based
> > > > > > > > > >> >>> > updater) and loaded by cache.load
> > > > > > > > > >> >>> >   are perceived as inconsistent since they may
> have
> > > > > > different
> > > > > > > > > >> versions
> > > > > > > > > >> >>> for
> > > > > > > > > >> >>> > same keys.
> > > > > > > > > >> >>> >   * Only explicit get operations are supported
> > > > > > (getAndReplace,
> > > > > > > > > >> >>> getAndPut,
> > > > > > > > > >> >>> > etc can be supported in future).
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > 5) What's left?
> > > > > > > > > >> >>> >   * SQL/ThinClient/etc support.
> > > > > > > > > >> >>> >   * Metrics (found/repaired).
> > > > > > > > > >> >>> >   * Simple per-partition recovery feature able to
> work
> > > > in
> > > > > > the
> > > > > > > > > >> >>> background in
> > > > > > > > > >> >>> > addition to per-entry recovery feature.
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > 6) Is code checked?
> > > > > > > > > >> >>> >   * Pull Request #5656 [5] (feature) - has green
> TC.
> > > > > > > > > >> >>> >   * Pull Request #6575 [6] (RunAll with the
> feature
> > > > > enabled
> > > > > > > for
> > > > > > > > > >> every
> > > > > > > > > >> >>> get()
> > > > > > > > > >> >>> > request) - has a limited amount of failures
> (because
> > > of
> > > > > data
> > > > > > > > > >> streamer,
> > > > > > > > > >> >>> > cache.load, etc).
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > Thoughts?
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> > [0]
> > > https://issues.apache.org/jira/browse/IGNITE-10663
> > > > > > > > > >> >>> > [1]
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > > > > > > >> >>> > [2]
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > > > > > >> >>> > [3]
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > > > > > >> >>> > [4]
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>> >
> > > > > > > > > >> >>>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > > > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > > > > > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Павлухин Иван <vo...@gmail.com>.
Folks,

Sorry, I was not very attentive. Could you please clarify whether (or
not) this feature is currently in an experimental state? And a couple of
obvious things:
1. If it is experimental, it should be clear to a user (e.g. from the javadoc).
2. If not, all limitations should be described in the javadocs and documentation.
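
Something like the following sketch would be enough, I think (an
illustrative sketch only, not the actual Ignite source; the limitations are
the ones already listed earlier in this thread):

    import org.apache.ignite.IgniteCache;

    /** Illustrative sketch of the javadoc, not the real IgniteCache interface. */
    public interface ReadRepairJavadocSketch<K, V> {
        /**
         * Returns a cache proxy that reads each entry from all owners and
         * compares the copies, repairing inconsistent ones where possible.
         *
         * Limitations (may be lifted in future releases):
         * - MVCC and near caches are not supported.
         * - Atomic caches can only be checked, not repaired; a false positive
         *   result is possible.
         * - Only explicit get operations (get, getAll) are covered.
         *
         * @return Cache proxy that performs the consistency check on reads.
         */
        public IgniteCache<K, V> withReadRepair();
    }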

Wed, Jul 17, 2019 at 14:59, Вячеслав Коптилин <sl...@gmail.com>:
>
> Hello Anton,
>
> It's not possible, currently, to fix atomic caches.
> > You may only check the consistency. *And it's better than nothing*, I
> > think.
>
> Fair enough.
>
> >> 3. IgniteConsistencyViolationException is absolutely useless. It does not
> > >> provide any information about the issue and possible way to fix it.
> > It means that some keys from your get operation are broken.
> > IgniteConsistencyViolationException CAN be extended with a list of broken
> > keys in the future.
>
> I think it SHOULD be extended with additional fields/methods in the same
> way as CacheConsistencyViolationEvent
>
> >> Well, near caches are widely used and fully transactional, so I think it
> > >> makes sense to support the feature for near caches too.
> > As I told before, it will be nice to implement this in the future, but we
> > have more important tasks for now.
>
> I do not insist that it must be done right now.
>
>
> > >> For instance, I would like to see all these limitations on the IEP page
> > as
> > >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> > ticket
> > >> in order to track all activities related to `Read Repair` feature.
> > Let's do this at merge day to be sure useless issues will not be created.
> >
>
> I am just trying to say that it is a good time to create tickets in order
> to track all of that otherwise the chance that all these
> limitations/improvements will not be addressed is very high.
>
> Thanks,
> S.
>
> Tue, Jul 16, 2019 at 09:07, Anton Vinogradov <av...@apache.org>:
>
> > Svala,
> >
> > >> Could you please take a look at PR:
> > Going to review today, thanks for attaching the bot visa.
> >
> > >> 1. Should I consider that my cluster is broken? There is no answer! The
> > >> false-positive result is possible.
> > That's a question about atomic nature.
> > It's not impossible to lock atomic entry to perform the check.
> > You should perform some attempts, it's your decision how many.
> > By default, atomic RR performs 3 attempts, you may increase this by setting
> > IGNITE_NEAR_GET_MAX_REMAPS or by just performing additional gets.
> >
> > >> 2. What should be done here in order to check/resolve the issue?
> > Perhaps, I
> > >> should restart a node (which one?), restart the whole cluster, put a new
> > >> value...
> > It's not possible, currently, to fix atomic caches.
> > You may only check the consistency. And it's better than nothing, I think.
> > We should find a way how to fix atomic consistency first.
> > A possible strategy is to use EntryProcessor which will replace all owner's
> > values with "latest" and do nothing in case newest (than latest) value
> > found (opposite to preloading approach).
> >
> > >> 3. IgniteConsistencyViolationException is absolutely useless. It does
> > not
> > >> provide any information about the issue and possible way to fix it.
> > It means that some keys from your get operation are broken.
> > IgniteConsistencyViolationException CAN be extended with a list of broken
> > keys in the future.
> >
> > >> It seems that transactional caches are covered much better.
> > Correct.
> > Tx caches consistency is more important that atomic consistency, that's why
> > it was implemented first.
> > BTW, AFAIK, atomics also were not fixed at 10078 [1].
> >
> > >> Well, near caches are widely used and fully transactional, so I think it
> > >> makes sense to support the feature for near caches too.
> > As I told before, it will be nice to implement this in the future, but we
> > have more important tasks for now.
> > The main goal was to cover tx caches, to be able to fix them in case of the
> > real problem at production.
> >
> > Summarizing the roadmap,
> > My goal now is to finish the tx case, now we have an issue with false
> > positive consistency violation [2].
> > Also, we're going to update Jepsen tests [3] with RR to ensure tx caches
> > fixed.
> > Next main goal is to use RR at TC checks [4], help with this issue are
> > appreciated.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > [2] https://issues.apache.org/jira/browse/IGNITE-11973
> > [3] https://issues.apache.org/jira/browse/IGNITE-11972
> > [4] https://issues.apache.org/jira/browse/IGNITE-11971
> >
> >
> > On Mon, Jul 15, 2019 at 4:51 PM Dmitriy Pavlov <dp...@apache.org> wrote:
> >
> > > Ok,  thank you
> > >
> > > пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <ni...@apache.org>:
> > >
> > > > I did the review.
> > > >
> > > > пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:
> > > >
> > > > > Igniters, who did a review of
> > > > > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge?
> > > > I've
> > > > > checked both PR   https://github.com/apache/ignite/pull/5656  and
> > > Issue,
> > > > > and dev.list thread and didn't find any LGTM.
> > > > >
> > > > > Anton, since you've rejected lazy consensus in our process, we have
> > RTC
> > > > in
> > > > > that (core) module. So I'd like to know if the fix was covered by the
> > > > > review.
> > > > >
> > > > > Because you're a committer, a reviewer can be non-committer. So, who
> > > was
> > > > a
> > > > > reviewer? Or was process ignored?
> > > > >
> > > > > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <
> > > slava.koptilin@gmail.com
> > > > >:
> > > > >
> > > > > > Hello Anton,
> > > > > >
> > > > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > > > vision
> > > > > > of how it should be made. I'll review them and merge shortly.
> > > > > > Could you please take a look at PR:
> > > > > > https://github.com/apache/ignite/pull/6689
> > > > > >
> > > > > > > Since your comments mostly about Javadoc (does this mean that my
> > > > > solution
> > > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > > In my humble opinion, I would consider this feature as experimental
> > > one
> > > > > (It
> > > > > > does not seem production-ready).
> > > > > > Let me clarify this with the following simple example:
> > > > > >
> > > > > >     try {
> > > > > >         atomicCache.withReadRepair().getAll(keys);
> > > > > >     }
> > > > > >     catch (CacheException e) {
> > > > > >         // What should be done here from the end-user point of
> > view?
> > > > > >     }
> > > > > >
> > > > > > 1. Should I consider that my cluster is broken? There is no answer!
> > > The
> > > > > > false-positive result is possible.
> > > > > > 2. What should be done here in order to check/resolve the issue?
> > > > > Perhaps, I
> > > > > > should restart a node (which one?), restart the whole cluster, put
> > a
> > > > new
> > > > > > value...
> > > > > > 3. IgniteConsistencyViolationException is absolutely useless. It
> > does
> > > > not
> > > > > > provide any information about the issue and possible way to fix it.
> > > > > >
> > > > > > It seems that transactional caches are covered much better.
> > > > > >
> > > > > > > Mostly agree with you, but
> > > > > > > - MVCC is not production ready,
> > > > > > > - not sure near support really required,
> > > > > > > - metrics are better for monitoring, but the Event is enough for
> > my
> > > > > wish
> > > > > > to
> > > > > > > cover AI with consistency check,
> > > > > > > - do we really need Platforms and Thin Client support?
> > > > > > Well, near caches are widely used and fully transactional, so I
> > think
> > > > it
> > > > > > makes sense to support the feature for near caches too.
> > > > > > .Net is already aware of 'ReadRepair'. It seems to me, that it can
> > be
> > > > > > easily supported for C++. I don't see a reason why it should not be
> > > > done
> > > > > :)
> > > > > >
> > > > > > > Do you mean per partition check and recovery? That's a good idea,
> > > > but I
> > > > > > found it's not easy to imagine API to for such tool.
> > > > > > Yep, perhaps it can be done on the idle cluster via `idle-verify`
> > > > command
> > > > > > with additional flag. Agreed, that this approach is not the best
> > one
> > > > > > definitely.
> > > > > >
> > > > > > Thanks,
> > > > > > S.
> > > > > >
> > > > > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
> > > > > >
> > > > > > > Slava,
> > > > > > >
> > > > > > > Thanks for your review first!
> > > > > > >
> > > > > > > >> Anyway, I left some comments in your pull-request at github.
> > > > Please
> > > > > > take
> > > > > > > a
> > > > > > > >> look. The most annoying thing is poorly documented code :(
> > > > > > > Since your comments mostly about Javadoc (does this mean that my
> > > > > solution
> > > > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > > > vision
> > > > > > of
> > > > > > > how it should be made. I'll review them and merge shortly.
> > > > > > >
> > > > > > > >> By the way, is it required to add test related to fail-over
> > > > > scenarios?
> > > > > > > The best check is to use RR at real code.
> > > > > > > For example, I'm injecting RR now to the test with concurrent
> > > > > > modifications
> > > > > > > and restarts [1].
> > > > > > >
> > > > > > > >> I just checked, the IEP page and I still cannot find Jira
> > > tickets
> > > > > that
> > > > > > > >> should cover existing limitations/improvements.
> > > > > > > >> I would suggest creating the following tasks, at least:
> > > > > > > Mostly agree with you, but
> > > > > > > - MVCC is not production ready,
> > > > > > > - not sure near support really required,
> > > > > > > - metrics are better for monitoring, but the Event is enough for
> > my
> > > > > wish
> > > > > > to
> > > > > > > cover AI with consistency check,
> > > > > > > - do we really need Platforms and Thin Client support?
> > > > > > > Also, we should not produce stillborn issue.
> > > > > > > All limitations listed at proxy creation method and they
> > definitely
> > > > are
> > > > > > not
> > > > > > > showstoppers and may be fixed later if someone interested.
> > > > > > > Сoming back to AI 3.0 discussion, we have A LOT of features and
> > > it's
> > > > > > almost
> > > > > > > impossible (require much more time that feature's cost) to
> > support
> > > > them
> > > > > > > all.
> > > > > > > I will be pretty happy in case someone will do this and provide
> > > help
> > > > if
> > > > > > > necessary!
> > > > > > >
> > > > > > > >> Perhaps, it would be a good idea to think about the recovery
> > too
> > > > > > > Do you mean per partition check and recovery?
> > > > > > > That's a good idea, but I found it's not easy to imagine API to
> > for
> > > > > such
> > > > > > > tool.
> > > > > > > In case you ready to assist with proper API/design this will
> > > > definitely
> > > > > > > help.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > > > > > slava.koptilin@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Perhaps, it would be a good idea to think about the recovery
> > > tool/
> > > > > > > > control-utility command that will allow achieving the same
> > goal.
> > > > > > > > If I am not mistaken it was already proposed in the email
> > thread.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > S.
> > > > > > > >
> > > > > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> > > > > > slava.koptilin@gmail.com
> > > > > > > >:
> > > > > > > >
> > > > > > > > > Hi Anton,
> > > > > > > > >
> > > > > > > > > Well, the ReadRepair feature is finally merged and that is
> > good
> > > > > news
> > > > > > :)
> > > > > > > > >
> > > > > > > > > Unfortunately, I cannot find a consensus about the whole
> > > > > > functionality
> > > > > > > in
> > > > > > > > > any of these topics:
> > > > > > > > >  -
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > > > > > >  -
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > > > > > Also, there are no comments/discussion in JIRA. That makes me
> > > sad
> > > > > :(
> > > > > > > > > especially when a feature is huge, not obvious and involves
> > > > > changing
> > > > > > > > public
> > > > > > > > > API (and that is the case, I think).
> > > > > > > > >
> > > > > > > > > Anyway, I left some comments in your pull-request at github.
> > > > Please
> > > > > > > take
> > > > > > > > a
> > > > > > > > > look. The most annoying thing is poorly documented code :(
> > > > > > > > > By the way, is it required to add test related to fail-over
> > > > > > scenarios?
> > > > > > > > >
> > > > > > > > > I just checked, the IEP page and I still cannot find Jira
> > > tickets
> > > > > > that
> > > > > > > > > should cover existing limitations/improvements.
> > > > > > > > > I would suggest creating the following tasks, at least:
> > > > > > > > >  - MVCC support
> > > > > > > > >  - Near caches
> > > > > > > > >  - Additional metrics (number of violations, number of
> > repaired
> > > > > > entries
> > > > > > > > > etc)
> > > > > > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> > > > > feature)
> > > > > > > > >  - Thin clients support
> > > > > > > > >  - Perhaps, it would be useful to support different
> > strategies
> > > to
> > > > > > > resolve
> > > > > > > > > inconsistencies
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > S.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org
> > >:
> > > > > > > > >
> > > > > > > > >> Folks,
> > > > > > > > >>
> > > > > > > > >> Thanks to everyone for tips and reviews.
> > > > > > > > >> Yardstick checked, no performance drop found.
> > > > > > > > >> Additional measurement: RR get() is just up to 7% slower
> > than
> > > > > > regular
> > > > > > > > get
> > > > > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > > > > >> Code merged to the master.
> > > > > > > > >> "Must have" tasks created and attached to the IEP.
> > > > > > > > >>
> > > > > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <
> > > av@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> > Folks,
> > > > > > > > >> >
> > > > > > > > >> > Just a minor update.
> > > > > > > > >> >
> > > > > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green
> > now
> > > > > (~10
> > > > > > > > tests
> > > > > > > > >> > left, started with 6k :)).
> > > > > > > > >> > During the analisys, I've found some tests with
> > > > > > > > >> > - unexpected repairs at tx caches
> > > > > > > > >> > - inconsistent state after the test finished (different
> > > > entries
> > > > > > > across
> > > > > > > > >> the
> > > > > > > > >> > topology)
> > > > > > > > >> > For example,
> > > > > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration
> > generates
> > > > > > > obsolete
> > > > > > > > >> > versions on backups in case of retry, fixed [2]
> > > > > > > > >> > - initial cache load generates not equal versions on
> > > backups,
> > > > > > fixed
> > > > > > > > [3]
> > > > > > > > >> > - testAccountTxNodeRestart causes unexpected repairs
> > > (entries
> > > > > have
> > > > > > > > >> > different versions), to be investigated.
> > > > > > > > >> >
> > > > > > > > >> > What's next?
> > > > > > > > >> >
> > > > > > > > >> > 1) Going to merge the solution once "RunAll with
> > ReadRepair
> > > > > > enabled"
> > > > > > > > >> > becomes fully green.
> > > > > > > > >> > 2) Going to add special check after each test which will
> > > > ensure
> > > > > > > caches
> > > > > > > > >> > content after the test is consistent.
> > > > > > > > >> > 2.1) The Same check can (should?) be injected to
> > > > > > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > > > > > >> > 3) Update Jepsen tests with RR checks.
> > > > > > > > >> > 4) Introduce per partition RR.
> > > > > > > > >> >
> > > > > > > > >> > So, the final goal is to be sure that Ignite produces only
> > > > > > > consistent
> > > > > > > > >> data
> > > > > > > > >> > and to have a feature to solve consistency in case we gain
> > > > > > > > inconsistent
> > > > > > > > >> > state somehow.
> > > > > > > > >> >
> > > > > > > > >> > Limitations?
> > > > > > > > >> >
> > > > > > > > >> > Currently, RR has some limitations, but they are not
> > related
> > > > to
> > > > > > real
> > > > > > > > >> > production cases.
> > > > > > > > >> > In case someone interested to support, for example, MVCC
> > or
> > > > near
> > > > > > > > caches,
> > > > > > > > >> > please, feel free to contribute.
> > > > > > > > >> >
> > > > > > > > >> > [1]
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > > > > >> > [2]
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > > > > >> > [3]
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <
> > > > av@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> >> Slava,
> > > > > > > > >> >>
> > > > > > > > >> >> >> I will take a look at your pull request if you don't
> > > mind.
> > > > > > > > >> >> Great news!
> > > > > > > > >> >>
> > > > > > > > >> >> >> In any way, could you please update the IEP page with
> > > the
> > > > > list
> > > > > > > of
> > > > > > > > >> >> >> constraints/limitations of the proposed approach,
> > TODOs,
> > > > > etc?
> > > > > > > > >> >> Not sure we should keep this at IEP until list (#4 from
> > > > > original
> > > > > > > > >> letter)
> > > > > > > > >> >> is not confirmed.
> > > > > > > > >> >>
> > > > > > > > >> >> Currently, I'm checking RunAll with RR enabled to almost
> > > each
> > > > > get
> > > > > > > > >> request.
> > > > > > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > > > > > > !ctx.mvccEnabled()
> > > > > > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > > > > > >> >>
> > > > > > > > >> >> >> For instance, I would like to see all these
> > limitations
> > > on
> > > > > the
> > > > > > > IEP
> > > > > > > > >> >> page as
> > > > > > > > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> > > > > > > epic/umbrella
> > > > > > > > >> >> ticket
> > > > > > > > >> >> >> in order to track all activities related to `Read
> > > Repair`
> > > > > > > feature.
> > > > > > > > >> >> Let's do this at merge day to be sure useless issues will
> > > not
> > > > > be
> > > > > > > > >> created.
> > > > > > > > >> >>
> > > > > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > > > > > >> >>
> > > > > > > > >> >>> Hi Anton,
> > > > > > > > >> >>>
> > > > > > > > >> >>> I will take a look at your pull request if you don't
> > mind.
> > > > > > > > >> >>>
> > > > > > > > >> >>> In any way, could you please update the IEP page with
> > the
> > > > list
> > > > > > of
> > > > > > > > >> >>> constraints/limitations of the proposed approach, TODOs,
> > > > etc?
> > > > > > > > >> >>> For instance, I would like to see all these limitations
> > on
> > > > the
> > > > > > IEP
> > > > > > > > >> page
> > > > > > > > >> >>> as
> > > > > > > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> > > > > > epic/umbrella
> > > > > > > > >> ticket
> > > > > > > > >> >>> in order to track all activities related to `Read
> > Repair`
> > > > > > feature.
> > > > > > > > >> >>>
> > > > > > > > >> >>> Thanks,
> > > > > > > > >> >>> S.
> > > > > > > > >> >>>



-- 
Best regards,
Ivan Pavlukhin

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Hello Anton,

> It's not possible, currently, to fix atomic caches.
> You may only check the consistency. *And it's better than nothing*, I think.

Fair enough.

>> 3. IgniteConsistencyViolationException is absolutely useless. It does not
> >> provide any information about the issue and possible way to fix it.
> It means that some keys from your get operation are broken.
> IgniteConsistencyViolationException CAN be extended with a list of broken
> keys in the future.

I think it SHOULD be extended with additional fields/methods, in the same
way as CacheConsistencyViolationEvent.
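
For illustration, a rough sketch of the shape such an extension could take (the
accessor below is an assumption for the discussion, not the actual API):

    import java.util.Collection;

    import org.apache.ignite.IgniteException;

    /** Sketch of a possible extension, for discussion only. */
    public class IgniteConsistencyViolationException extends IgniteException {
        /** Keys whose copies diverged across primary/backup nodes. */
        private final Collection<Object> inconsistentKeys;

        public IgniteConsistencyViolationException(String msg,
            Collection<Object> inconsistentKeys) {
            super(msg);

            this.inconsistentKeys = inconsistentKeys;
        }

        /** @return Keys detected as inconsistent, so the caller can log or retry them. */
        public Collection<Object> inconsistentKeys() {
            return inconsistentKeys;
        }
    }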

>> Well, near caches are widely used and fully transactional, so I think it
> >> makes sense to support the feature for near caches too.
> As I told before, it will be nice to implement this in the future, but we
> have more important tasks for now.

I do not insist that it must be done right now.


> >> For instance, I would like to see all these limitations on the IEP page as
> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
> >> in order to track all activities related to `Read Repair` feature.
> Let's do this at merge day to be sure useless issues will not be created.
>

I am just trying to say that now is a good time to create tickets in order
to track all of that; otherwise, the chance that all these
limitations/improvements will not be addressed is very high.

Thanks,
S.

вт, 16 июл. 2019 г. в 09:07, Anton Vinogradov <av...@apache.org>:

> Svala,
>
> >> Could you please take a look at PR:
> Going to review today, thanks for attaching the bot visa.
>
> >> 1. Should I consider that my cluster is broken? There is no answer! The
> >> false-positive result is possible.
> That's a question about atomic nature.
> It's not impossible to lock atomic entry to perform the check.
> You should perform some attempts, it's your decision how many.
> By default, atomic RR performs 3 attempts, you may increase this by setting
> IGNITE_NEAR_GET_MAX_REMAPS or by just performing additional gets.
>
> >> 2. What should be done here in order to check/resolve the issue?
> Perhaps, I
> >> should restart a node (which one?), restart the whole cluster, put a new
> >> value...
> It's not possible, currently, to fix atomic caches.
> You may only check the consistency. And it's better than nothing, I think.
> We should find a way how to fix atomic consistency first.
> A possible strategy is to use ЕntryProcessor which will replace all owner's
> values with "latest" and do nothing in case newest (than latest) value
> found (opposite to preloading approach).
>
> >> 3. IgniteConsistencyViolationException is absolutely useless. It does
> not
> >> provide any information about the issue and possible way to fix it.
> It means that some keys from your get operation are broken.
> IgniteConsistencyViolationException CAN be extended with a list of broken
> keys in the future.
>
> >> It seems that transactional caches are covered much better.
> Correct.
> Tx caches consistency is more important that atomic consistency, that's why
> it was implemented first.
> BTW, AFAIK, atomics also were not fixed at 10078 [1].
>
> >> Well, near caches are widely used and fully transactional, so I think it
> >> makes sense to support the feature for near caches too.
> As I told before, it will be nice to implement this in the future, but we
> have more important tasks for now.
> The main goal was to cover tx caches, to be able to fix them in case of the
> real problem at production.
>
> Summarizing the roadmap,
> My goal now is to finish the tx case, now we have an issue with false
> positive consistency violation [2].
> Also, we're going to update Jepsen tests [3] with RR to ensure tx caches
> fixed.
> Next main goal is to use RR at TC checks [4], help with this issue are
> appreciated.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-10078
> [2] https://issues.apache.org/jira/browse/IGNITE-11973
> [3] https://issues.apache.org/jira/browse/IGNITE-11972
> [4] https://issues.apache.org/jira/browse/IGNITE-11971
>
>
> On Mon, Jul 15, 2019 at 4:51 PM Dmitriy Pavlov <dp...@apache.org> wrote:
>
> > Ok,  thank you
> >
> > пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <ni...@apache.org>:
> >
> > > I did the review.
> > >
> > > пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:
> > >
> > > > Igniters, who did a review of
> > > > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge?
> > > I've
> > > > checked both PR   https://github.com/apache/ignite/pull/5656  and
> > Issue,
> > > > and dev.list thread and didn't find any LGTM.
> > > >
> > > > Anton, since you've rejected lazy consensus in our process, we have
> RTC
> > > in
> > > > that (core) module. So I'd like to know if the fix was covered by the
> > > > review.
> > > >
> > > > Because you're a committer, a reviewer can be non-committer. So, who
> > was
> > > a
> > > > reviewer? Or was process ignored?
> > > >
> > > > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <
> > slava.koptilin@gmail.com
> > > >:
> > > >
> > > > > Hello Anton,
> > > > >
> > > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > > vision
> > > > > of how it should be made. I'll review them and merge shortly.
> > > > > Could you please take a look at PR:
> > > > > https://github.com/apache/ignite/pull/6689
> > > > >
> > > > > > Since your comments mostly about Javadoc (does this mean that my
> > > > solution
> > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > In my humble opinion, I would consider this feature as experimental
> > one
> > > > (It
> > > > > does not seem production-ready).
> > > > > Let me clarify this with the following simple example:
> > > > >
> > > > >     try {
> > > > >         atomicCache.withReadRepair().getAll(keys);
> > > > >     }
> > > > >     catch (CacheException e) {
> > > > >         // What should be done here from the end-user point of
> view?
> > > > >     }
> > > > >
> > > > > 1. Should I consider that my cluster is broken? There is no answer!
> > The
> > > > > false-positive result is possible.
> > > > > 2. What should be done here in order to check/resolve the issue?
> > > > Perhaps, I
> > > > > should restart a node (which one?), restart the whole cluster, put
> a
> > > new
> > > > > value...
> > > > > 3. IgniteConsistencyViolationException is absolutely useless. It
> does
> > > not
> > > > > provide any information about the issue and possible way to fix it.
> > > > >
> > > > > It seems that transactional caches are covered much better.
> > > > >
> > > > > > Mostly agree with you, but
> > > > > > - MVCC is not production ready,
> > > > > > - not sure near support really required,
> > > > > > - metrics are better for monitoring, but the Event is enough for
> my
> > > > wish
> > > > > to
> > > > > > cover AI with consistency check,
> > > > > > - do we really need Platforms and Thin Client support?
> > > > > Well, near caches are widely used and fully transactional, so I
> think
> > > it
> > > > > makes sense to support the feature for near caches too.
> > > > > .Net is already aware of 'ReadRepair'. It seems to me, that it can
> be
> > > > > easily supported for C++. I don't see a reason why it should not be
> > > done
> > > > :)
> > > > >
> > > > > > Do you mean per partition check and recovery? That's a good idea,
> > > but I
> > > > > found it's not easy to imagine API to for such tool.
> > > > > Yep, perhaps it can be done on the idle cluster via `idle-verify`
> > > command
> > > > > with additional flag. Agreed, that this approach is not the best
> one
> > > > > definitely.
> > > > >
> > > > > Thanks,
> > > > > S.
> > > > >
> > > > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
> > > > >
> > > > > > Slava,
> > > > > >
> > > > > > Thanks for your review first!
> > > > > >
> > > > > > >> Anyway, I left some comments in your pull-request at github.
> > > Please
> > > > > take
> > > > > > a
> > > > > > >> look. The most annoying thing is poorly documented code :(
> > > > > > Since your comments mostly about Javadoc (does this mean that my
> > > > solution
> > > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > > vision
> > > > > of
> > > > > > how it should be made. I'll review them and merge shortly.
> > > > > >
> > > > > > >> By the way, is it required to add test related to fail-over
> > > > scenarios?
> > > > > > The best check is to use RR at real code.
> > > > > > For example, I'm injecting RR now to the test with concurrent
> > > > > modifications
> > > > > > and restarts [1].
> > > > > >
> > > > > > >> I just checked, the IEP page and I still cannot find Jira
> > tickets
> > > > that
> > > > > > >> should cover existing limitations/improvements.
> > > > > > >> I would suggest creating the following tasks, at least:
> > > > > > Mostly agree with you, but
> > > > > > - MVCC is not production ready,
> > > > > > - not sure near support really required,
> > > > > > - metrics are better for monitoring, but the Event is enough for
> my
> > > > wish
> > > > > to
> > > > > > cover AI with consistency check,
> > > > > > - do we really need Platforms and Thin Client support?
> > > > > > Also, we should not produce stillborn issue.
> > > > > > All limitations listed at proxy creation method and they
> definitely
> > > are
> > > > > not
> > > > > > showstoppers and may be fixed later if someone interested.
> > > > > > Сoming back to AI 3.0 discussion, we have A LOT of features and
> > it's
> > > > > almost
> > > > > > impossible (require much more time that feature's cost) to
> support
> > > them
> > > > > > all.
> > > > > > I will be pretty happy in case someone will do this and provide
> > help
> > > if
> > > > > > necessary!
> > > > > >
> > > > > > >> Perhaps, it would be a good idea to think about the recovery
> too
> > > > > > Do you mean per partition check and recovery?
> > > > > > That's a good idea, but I found it's not easy to imagine API to
> for
> > > > such
> > > > > > tool.
> > > > > > In case you ready to assist with proper API/design this will
> > > definitely
> > > > > > help.
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > > > > slava.koptilin@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Perhaps, it would be a good idea to think about the recovery
> > tool/
> > > > > > > control-utility command that will allow achieving the same
> goal.
> > > > > > > If I am not mistaken it was already proposed in the email
> thread.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > S.
> > > > > > >
> > > > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> > > > > slava.koptilin@gmail.com
> > > > > > >:
> > > > > > >
> > > > > > > > Hi Anton,
> > > > > > > >
> > > > > > > > Well, the ReadRepair feature is finally merged and that is
> good
> > > > news
> > > > > :)
> > > > > > > >
> > > > > > > > Unfortunately, I cannot find a consensus about the whole
> > > > > functionality
> > > > > > in
> > > > > > > > any of these topics:
> > > > > > > >  -
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > > > > >  -
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > > > > Also, there are no comments/discussion in JIRA. That makes me
> > sad
> > > > :(
> > > > > > > > especially when a feature is huge, not obvious and involves
> > > > changing
> > > > > > > public
> > > > > > > > API (and that is the case, I think).
> > > > > > > >
> > > > > > > > Anyway, I left some comments in your pull-request at github.
> > > Please
> > > > > > take
> > > > > > > a
> > > > > > > > look. The most annoying thing is poorly documented code :(
> > > > > > > > By the way, is it required to add test related to fail-over
> > > > > scenarios?
> > > > > > > >
> > > > > > > > I just checked, the IEP page and I still cannot find Jira
> > tickets
> > > > > that
> > > > > > > > should cover existing limitations/improvements.
> > > > > > > > I would suggest creating the following tasks, at least:
> > > > > > > >  - MVCC support
> > > > > > > >  - Near caches
> > > > > > > >  - Additional metrics (number of violations, number of
> repaired
> > > > > entries
> > > > > > > > etc)
> > > > > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> > > > feature)
> > > > > > > >  - Thin clients support
> > > > > > > >  - Perhaps, it would be useful to support different
> strategies
> > to
> > > > > > resolve
> > > > > > > > inconsistencies
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > S.
> > > > > > > >
> > > > > > > >
> > > > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org
> >:
> > > > > > > >
> > > > > > > >> Folks,
> > > > > > > >>
> > > > > > > >> Thanks to everyone for tips and reviews.
> > > > > > > >> Yardstick checked, no performance drop found.
> > > > > > > >> Additional measurement: RR get() is just up to 7% slower
> than
> > > > > regular
> > > > > > > get
> > > > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > > > >> Code merged to the master.
> > > > > > > >> "Must have" tasks created and attached to the IEP.
> > > > > > > >>
> > > > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <
> > av@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >>
> > > > > > > >> > Folks,
> > > > > > > >> >
> > > > > > > >> > Just a minor update.
> > > > > > > >> >
> > > > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green
> now
> > > > (~10
> > > > > > > tests
> > > > > > > >> > left, started with 6k :)).
> > > > > > > >> > During the analisys, I've found some tests with
> > > > > > > >> > - unexpected repairs at tx caches
> > > > > > > >> > - inconsistent state after the test finished (different
> > > entries
> > > > > > across
> > > > > > > >> the
> > > > > > > >> > topology)
> > > > > > > >> > For example,
> > > > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration
> generates
> > > > > > obsolete
> > > > > > > >> > versions on backups in case of retry, fixed [2]
> > > > > > > >> > - initial cache load generates not equal versions on
> > backups,
> > > > > fixed
> > > > > > > [3]
> > > > > > > >> > - testAccountTxNodeRestart causes unexpected repairs
> > (entries
> > > > have
> > > > > > > >> > different versions), to be investigated.
> > > > > > > >> >
> > > > > > > >> > What's next?
> > > > > > > >> >
> > > > > > > >> > 1) Going to merge the solution once "RunAll with
> ReadRepair
> > > > > enabled"
> > > > > > > >> > becomes fully green.
> > > > > > > >> > 2) Going to add special check after each test which will
> > > ensure
> > > > > > caches
> > > > > > > >> > content after the test is consistent.
> > > > > > > >> > 2.1) The Same check can (should?) be injected to
> > > > > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > > > > >> > 3) Update Jepsen tests with RR checks.
> > > > > > > >> > 4) Introduce per partition RR.
> > > > > > > >> >
> > > > > > > >> > So, the final goal is to be sure that Ignite produces only
> > > > > > consistent
> > > > > > > >> data
> > > > > > > >> > and to have a feature to solve consistency in case we gain
> > > > > > > inconsistent
> > > > > > > >> > state somehow.
> > > > > > > >> >
> > > > > > > >> > Limitations?
> > > > > > > >> >
> > > > > > > >> > Currently, RR has some limitations, but they are not
> related
> > > to
> > > > > real
> > > > > > > >> > production cases.
> > > > > > > >> > In case someone interested to support, for example, MVCC
> or
> > > near
> > > > > > > caches,
> > > > > > > >> > please, feel free to contribute.
> > > > > > > >> >
> > > > > > > >> > [1]
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > > > >> > [2]
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > > > >> > [3]
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > > > >> >
> > > > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <
> > > av@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> >> Slava,
> > > > > > > >> >>
> > > > > > > >> >> >> I will take a look at your pull request if you don't
> > mind.
> > > > > > > >> >> Great news!
> > > > > > > >> >>
> > > > > > > >> >> >> In any way, could you please update the IEP page with
> > the
> > > > list
> > > > > > of
> > > > > > > >> >> >> constraints/limitations of the proposed approach,
> TODOs,
> > > > etc?
> > > > > > > >> >> Not sure we should keep this at IEP until list (#4 from
> > > > original
> > > > > > > >> letter)
> > > > > > > >> >> is not confirmed.
> > > > > > > >> >>
> > > > > > > >> >> Currently, I'm checking RunAll with RR enabled to almost
> > each
> > > > get
> > > > > > > >> request.
> > > > > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > > > > > !ctx.mvccEnabled()
> > > > > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > > > > >> >>
> > > > > > > >> >> >> For instance, I would like to see all these
> limitations
> > on
> > > > the
> > > > > > IEP
> > > > > > > >> >> page as
> > > > > > > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> > > > > > epic/umbrella
> > > > > > > >> >> ticket
> > > > > > > >> >> >> in order to track all activities related to `Read
> > Repair`
> > > > > > feature.
> > > > > > > >> >> Let's do this at merge day to be sure useless issues will
> > not
> > > > be
> > > > > > > >> created.
> > > > > > > >> >>
> > > > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > > > > >> >>
> > > > > > > >> >>> Hi Anton,
> > > > > > > >> >>>
> > > > > > > >> >>> I will take a look at your pull request if you don't
> mind.
> > > > > > > >> >>>
> > > > > > > >> >>> In any way, could you please update the IEP page with
> the
> > > list
> > > > > of
> > > > > > > >> >>> constraints/limitations of the proposed approach, TODOs,
> > > etc?
> > > > > > > >> >>> For instance, I would like to see all these limitations
> on
> > > the
> > > > > IEP
> > > > > > > >> page
> > > > > > > >> >>> as
> > > > > > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> > > > > epic/umbrella
> > > > > > > >> ticket
> > > > > > > >> >>> in order to track all activities related to `Read
> Repair`
> > > > > feature.
> > > > > > > >> >>>
> > > > > > > >> >>> Thanks,
> > > > > > > >> >>> S.
> > > > > > > >> >>>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Slava,

>> Could you please take a look at PR:
Going to review today, thanks for attaching the bot visa.

>> 1. Should I consider that my cluster is broken? There is no answer! The
>> false-positive result is possible.
That's a question about the nature of atomic caches.
It's not possible to lock an atomic entry to perform the check.
You should perform several attempts; it's your decision how many.
By default, atomic RR performs 3 attempts; you may increase this by setting
IGNITE_NEAR_GET_MAX_REMAPS or by just performing additional gets.
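
A minimal sketch of that retry pattern on the caller side ("ignite" and "keys"
are assumed to be defined elsewhere, the cache name and attempt count are
arbitrary, and the violation is assumed to surface as a CacheException, as in
your example):

    IgniteCache<Integer, String> cache =
        ignite.<Integer, String>cache("atomicCache").withReadRepair();

    int attempts = 5; // Arbitrary; the proxy itself already remaps up to 3 times.

    for (int i = 0; i < attempts; i++) {
        try {
            Map<Integer, String> res = cache.getAll(keys);

            break; // Consistent (or already repaired) result obtained.
        }
        catch (CacheException e) {
            // Possibly a false positive caused by a concurrent update:
            // retry the get a few more times before raising an alarm.
            if (i == attempts - 1)
                throw e;
        }
    }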

>> 2. What should be done here in order to check/resolve the issue? Perhaps,
>> I should restart a node (which one?), restart the whole cluster, put a new
>> value...
It's not possible, currently, to fix atomic caches.
You may only check the consistency. And it's better than nothing, I think.
We should first find a way to fix atomic consistency.
A possible strategy is to use an EntryProcessor which will replace all owners'
values with the "latest" one and do nothing in case a newer (than the latest)
value is found (the opposite of the preloading approach).
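
To make that idea concrete, a very rough sketch of such a processor (the class
name is hypothetical, and since the public EntryProcessor API does not expose
entry versions, plain value comparison is used here only for illustration):

    import javax.cache.processor.MutableEntry;

    import org.apache.ignite.cache.CacheEntryProcessor;

    /** Hypothetical repair processor, illustration only. */
    class AtomicRepairProcessor<K, V> implements CacheEntryProcessor<K, V, Void> {
        @Override public Void process(MutableEntry<K, V> entry, Object... args) {
            V latest = (V)args[0]; // Value chosen as the "latest" by the reading side.

            // Overwrite only when the stored value differs from the chosen one;
            // a newer concurrent update is expected to win on a later check.
            if (!latest.equals(entry.getValue()))
                entry.setValue(latest);

            return null;
        }
    }

    // Usage: cache.invoke(key, new AtomicRepairProcessor<>(), latestVal);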

>> 3. IgniteConsistencyViolationException is absolutely useless. It does not
>> provide any information about the issue and possible way to fix it.
It means that some keys from your get operation are broken.
IgniteConsistencyViolationException CAN be extended with a list of broken
keys in the future.

>> It seems that transactional caches are covered much better.
Correct.
Tx cache consistency is more important than atomic consistency; that's why
it was implemented first.
BTW, AFAIK, atomics also were not fixed at 10078 [1].

>> Well, near caches are widely used and fully transactional, so I think it
>> makes sense to support the feature for near caches too.
As I said before, it would be nice to implement this in the future, but we
have more important tasks for now.
The main goal was to cover tx caches, to be able to fix them in case of a
real problem in production.
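
For reference, a read on the tx path is just a call through the proxy (the
cache name and key here are arbitrary, "ignite" is assumed to be defined):

    IgniteCache<Integer, String> txCache = ignite.<Integer, String>cache("txCache");

    // Values are fetched from all owners and compared; if the copies diverge,
    // they are repaired and a consistency violation event is recorded.
    String val = txCache.withReadRepair().get(42);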

Summarizing the roadmap:
my goal now is to finish the tx case; currently we have an issue with a false
positive consistency violation [2].
Also, we're going to update the Jepsen tests [3] with RR to ensure tx caches
are fixed.
The next main goal is to use RR in TC checks [4]; help with this issue is
appreciated.

[1] https://issues.apache.org/jira/browse/IGNITE-10078
[2] https://issues.apache.org/jira/browse/IGNITE-11973
[3] https://issues.apache.org/jira/browse/IGNITE-11972
[4] https://issues.apache.org/jira/browse/IGNITE-11971


On Mon, Jul 15, 2019 at 4:51 PM Dmitriy Pavlov <dp...@apache.org> wrote:

> Ok,  thank you
>
> пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <ni...@apache.org>:
>
> > I did the review.
> >
> > пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:
> >
> > > Igniters, who did a review of
> > > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge?
> > I've
> > > checked both PR   https://github.com/apache/ignite/pull/5656  and
> Issue,
> > > and dev.list thread and didn't find any LGTM.
> > >
> > > Anton, since you've rejected lazy consensus in our process, we have RTC
> > in
> > > that (core) module. So I'd like to know if the fix was covered by the
> > > review.
> > >
> > > Because you're a committer, a reviewer can be non-committer. So, who
> was
> > a
> > > reviewer? Or was process ignored?
> > >
> > > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <
> slava.koptilin@gmail.com
> > >:
> > >
> > > > Hello Anton,
> > > >
> > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > vision
> > > > of how it should be made. I'll review them and merge shortly.
> > > > Could you please take a look at PR:
> > > > https://github.com/apache/ignite/pull/6689
> > > >
> > > > > Since your comments mostly about Javadoc (does this mean that my
> > > solution
> > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > In my humble opinion, I would consider this feature as experimental
> one
> > > (It
> > > > does not seem production-ready).
> > > > Let me clarify this with the following simple example:
> > > >
> > > >     try {
> > > >         atomicCache.withReadRepair().getAll(keys);
> > > >     }
> > > >     catch (CacheException e) {
> > > >         // What should be done here from the end-user point of view?
> > > >     }
> > > >
> > > > 1. Should I consider that my cluster is broken? There is no answer!
> The
> > > > false-positive result is possible.
> > > > 2. What should be done here in order to check/resolve the issue?
> > > Perhaps, I
> > > > should restart a node (which one?), restart the whole cluster, put a
> > new
> > > > value...
> > > > 3. IgniteConsistencyViolationException is absolutely useless. It does
> > not
> > > > provide any information about the issue and possible way to fix it.
> > > >
> > > > It seems that transactional caches are covered much better.
> > > >
> > > > > Mostly agree with you, but
> > > > > - MVCC is not production ready,
> > > > > - not sure near support really required,
> > > > > - metrics are better for monitoring, but the Event is enough for my
> > > wish
> > > > to
> > > > > cover AI with consistency check,
> > > > > - do we really need Platforms and Thin Client support?
> > > > Well, near caches are widely used and fully transactional, so I think
> > it
> > > > makes sense to support the feature for near caches too.
> > > > .Net is already aware of 'ReadRepair'. It seems to me, that it can be
> > > > easily supported for C++. I don't see a reason why it should not be
> > done
> > > :)
> > > >
> > > > > Do you mean per partition check and recovery? That's a good idea,
> > but I
> > > > found it's not easy to imagine API to for such tool.
> > > > Yep, perhaps it can be done on the idle cluster via `idle-verify`
> > command
> > > > with additional flag. Agreed, that this approach is not the best one
> > > > definitely.
> > > >
> > > > Thanks,
> > > > S.
> > > >
> > > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
> > > >
> > > > > Slava,
> > > > >
> > > > > Thanks for your review first!
> > > > >
> > > > > >> Anyway, I left some comments in your pull-request at github.
> > Please
> > > > take
> > > > > a
> > > > > >> look. The most annoying thing is poorly documented code :(
> > > > > Since your comments mostly about Javadoc (does this mean that my
> > > solution
> > > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > vision
> > > > of
> > > > > how it should be made. I'll review them and merge shortly.
> > > > >
> > > > > >> By the way, is it required to add test related to fail-over
> > > scenarios?
> > > > > The best check is to use RR at real code.
> > > > > For example, I'm injecting RR now to the test with concurrent
> > > > modifications
> > > > > and restarts [1].
> > > > >
> > > > > >> I just checked, the IEP page and I still cannot find Jira
> tickets
> > > that
> > > > > >> should cover existing limitations/improvements.
> > > > > >> I would suggest creating the following tasks, at least:
> > > > > Mostly agree with you, but
> > > > > - MVCC is not production ready,
> > > > > - not sure near support really required,
> > > > > - metrics are better for monitoring, but the Event is enough for my
> > > wish
> > > > to
> > > > > cover AI with consistency check,
> > > > > - do we really need Platforms and Thin Client support?
> > > > > Also, we should not produce stillborn issue.
> > > > > All limitations listed at proxy creation method and they definitely
> > are
> > > > not
> > > > > showstoppers and may be fixed later if someone interested.
> > > > > Сoming back to AI 3.0 discussion, we have A LOT of features and
> it's
> > > > almost
> > > > > impossible (require much more time that feature's cost) to support
> > them
> > > > > all.
> > > > > I will be pretty happy in case someone will do this and provide
> help
> > if
> > > > > necessary!
> > > > >
> > > > > >> Perhaps, it would be a good idea to think about the recovery too
> > > > > Do you mean per partition check and recovery?
> > > > > That's a good idea, but I found it's not easy to imagine API to for
> > > such
> > > > > tool.
> > > > > In case you ready to assist with proper API/design this will
> > definitely
> > > > > help.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > > > >
> > > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > > > slava.koptilin@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Perhaps, it would be a good idea to think about the recovery
> tool/
> > > > > > control-utility command that will allow achieving the same goal.
> > > > > > If I am not mistaken it was already proposed in the email thread.
> > > > > >
> > > > > > Thanks,
> > > > > > S.
> > > > > >
> > > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> > > > slava.koptilin@gmail.com
> > > > > >:
> > > > > >
> > > > > > > Hi Anton,
> > > > > > >
> > > > > > > Well, the ReadRepair feature is finally merged and that is good
> > > news
> > > > :)
> > > > > > >
> > > > > > > Unfortunately, I cannot find a consensus about the whole
> > > > functionality
> > > > > in
> > > > > > > any of these topics:
> > > > > > >  -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > > > >  -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > > > Also, there are no comments/discussion in JIRA. That makes me
> sad
> > > :(
> > > > > > > especially when a feature is huge, not obvious and involves
> > > changing
> > > > > > public
> > > > > > > API (and that is the case, I think).
> > > > > > >
> > > > > > > Anyway, I left some comments in your pull-request at github.
> > Please
> > > > > take
> > > > > > a
> > > > > > > look. The most annoying thing is poorly documented code :(
> > > > > > > By the way, is it required to add test related to fail-over
> > > > scenarios?
> > > > > > >
> > > > > > > I just checked, the IEP page and I still cannot find Jira
> tickets
> > > > that
> > > > > > > should cover existing limitations/improvements.
> > > > > > > I would suggest creating the following tasks, at least:
> > > > > > >  - MVCC support
> > > > > > >  - Near caches
> > > > > > >  - Additional metrics (number of violations, number of repaired
> > > > entries
> > > > > > > etc)
> > > > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> > > feature)
> > > > > > >  - Thin clients support
> > > > > > >  - Perhaps, it would be useful to support different strategies
> to
> > > > > resolve
> > > > > > > inconsistencies
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > S.
> > > > > > >
> > > > > > >
> > > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> > > > > > >
> > > > > > >> Folks,
> > > > > > >>
> > > > > > >> Thanks to everyone for tips and reviews.
> > > > > > >> Yardstick checked, no performance drop found.
> > > > > > >> Additional measurement: RR get() is just up to 7% slower than
> > > > regular
> > > > > > get
> > > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > > >> Code merged to the master.
> > > > > > >> "Must have" tasks created and attached to the IEP.
> > > > > > >>
> > > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <
> av@apache.org
> > >
> > > > > wrote:
> > > > > > >>
> > > > > > >> > Folks,
> > > > > > >> >
> > > > > > >> > Just a minor update.
> > > > > > >> >
> > > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green now
> > > (~10
> > > > > > tests
> > > > > > >> > left, started with 6k :)).
> > > > > > >> > During the analisys, I've found some tests with
> > > > > > >> > - unexpected repairs at tx caches
> > > > > > >> > - inconsistent state after the test finished (different
> > entries
> > > > > across
> > > > > > >> the
> > > > > > >> > topology)
> > > > > > >> > For example,
> > > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> > > > > obsolete
> > > > > > >> > versions on backups in case of retry, fixed [2]
> > > > > > >> > - initial cache load generates not equal versions on
> backups,
> > > > fixed
> > > > > > [3]
> > > > > > >> > - testAccountTxNodeRestart causes unexpected repairs
> (entries
> > > have
> > > > > > >> > different versions), to be investigated.
> > > > > > >> >
> > > > > > >> > What's next?
> > > > > > >> >
> > > > > > >> > 1) Going to merge the solution once "RunAll with ReadRepair
> > > > enabled"
> > > > > > >> > becomes fully green.
> > > > > > >> > 2) Going to add special check after each test which will
> > ensure
> > > > > caches
> > > > > > >> > content after the test is consistent.
> > > > > > >> > 2.1) The Same check can (should?) be injected to
> > > > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > > > >> > 3) Update Jepsen tests with RR checks.
> > > > > > >> > 4) Introduce per partition RR.
> > > > > > >> >
> > > > > > >> > So, the final goal is to be sure that Ignite produces only
> > > > > consistent
> > > > > > >> data
> > > > > > >> > and to have a feature to solve consistency in case we gain
> > > > > > inconsistent
> > > > > > >> > state somehow.
> > > > > > >> >
> > > > > > >> > Limitations?
> > > > > > >> >
> > > > > > >> > Currently, RR has some limitations, but they are not related
> > to
> > > > real
> > > > > > >> > production cases.
> > > > > > >> > In case someone interested to support, for example, MVCC or
> > near
> > > > > > caches,
> > > > > > >> > please, feel free to contribute.
> > > > > > >> >
> > > > > > >> > [1]
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > > >> > [2]
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > > >> > [3]
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > > >> >
> > > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <
> > av@apache.org
> > > >
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> >> Slava,
> > > > > > >> >>
> > > > > > >> >> >> I will take a look at your pull request if you don't
> mind.
> > > > > > >> >> Great news!
> > > > > > >> >>
> > > > > > >> >> >> In any way, could you please update the IEP page with
> the
> > > list
> > > > > of
> > > > > > >> >> >> constraints/limitations of the proposed approach, TODOs,
> > > etc?
> > > > > > >> >> Not sure we should keep this at IEP until list (#4 from
> > > original
> > > > > > >> letter)
> > > > > > >> >> is not confirmed.
> > > > > > >> >>
> > > > > > >> >> Currently, I'm checking RunAll with RR enabled to almost
> each
> > > get
> > > > > > >> request.
> > > > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > > > > !ctx.mvccEnabled()
> > > > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > > > >> >>
> > > > > > >> >> >> For instance, I would like to see all these limitations
> on
> > > the
> > > > > IEP
> > > > > > >> >> page as
> > > > > > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> > > > > epic/umbrella
> > > > > > >> >> ticket
> > > > > > >> >> >> in order to track all activities related to `Read
> Repair`
> > > > > feature.
> > > > > > >> >> Let's do this at merge day to be sure useless issues will
> not
> > > be
> > > > > > >> created.
> > > > > > >> >>
> > > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > > > >> >>
> > > > > > >> >>> Hi Anton,
> > > > > > >> >>>
> > > > > > >> >>> I will take a look at your pull request if you don't mind.
> > > > > > >> >>>
> > > > > > >> >>> In any way, could you please update the IEP page with the
> > list
> > > > of
> > > > > > >> >>> constraints/limitations of the proposed approach, TODOs,
> > etc?
> > > > > > >> >>> For instance, I would like to see all these limitations on
> > the
> > > > IEP
> > > > > > >> page
> > > > > > >> >>> as
> > > > > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> > > > epic/umbrella
> > > > > > >> ticket
> > > > > > >> >>> in order to track all activities related to `Read Repair`
> > > > feature.
> > > > > > >> >>>
> > > > > > >> >>> Thanks,
> > > > > > >> >>> S.
> > > > > > >> >>>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Dmitriy Pavlov <dp...@apache.org>.
Ok,  thank you

пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <ni...@apache.org>:

> I did the review.
>
> пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:
>
> > Igniters, who did a review of
> > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge?
> I've
> > checked both PR   https://github.com/apache/ignite/pull/5656  and Issue,
> > and dev.list thread and didn't find any LGTM.
> >
> > Anton, since you've rejected lazy consensus in our process, we have RTC
> in
> > that (core) module. So I'd like to know if the fix was covered by the
> > review.
> >
> > Because you're a committer, a reviewer can be non-committer. So, who was
> a
> > reviewer? Or was process ignored?
> >
> > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <slava.koptilin@gmail.com
> >:
> >
> > > Hello Anton,
> > >
> > > > I'd like to propose you to provide fixes as a PR since you have a
> > vision
> > > of how it should be made. I'll review them and merge shortly.
> > > Could you please take a look at PR:
> > > https://github.com/apache/ignite/pull/6689
> > >
> > > > Since your comments mostly about Javadoc (does this mean that my
> > solution
> > > is so great that you ask me only to fix Javadocs :) ?),
> > > In my humble opinion, I would consider this feature as experimental one
> > (It
> > > does not seem production-ready).
> > > Let me clarify this with the following simple example:
> > >
> > >     try {
> > >         atomicCache.withReadRepair().getAll(keys);
> > >     }
> > >     catch (CacheException e) {
> > >         // What should be done here from the end-user point of view?
> > >     }
> > >
> > > 1. Should I consider that my cluster is broken? There is no answer! The
> > > false-positive result is possible.
> > > 2. What should be done here in order to check/resolve the issue?
> > Perhaps, I
> > > should restart a node (which one?), restart the whole cluster, put a
> new
> > > value...
> > > 3. IgniteConsistencyViolationException is absolutely useless. It does
> not
> > > provide any information about the issue and possible way to fix it.
> > >
> > > It seems that transactional caches are covered much better.
> > >
> > > > Mostly agree with you, but
> > > > - MVCC is not production ready,
> > > > - not sure near support really required,
> > > > - metrics are better for monitoring, but the Event is enough for my
> > wish
> > > to
> > > > cover AI with consistency check,
> > > > - do we really need Platforms and Thin Client support?
> > > Well, near caches are widely used and fully transactional, so I think
> it
> > > makes sense to support the feature for near caches too.
> > > .Net is already aware of 'ReadRepair'. It seems to me, that it can be
> > > easily supported for C++. I don't see a reason why it should not be
> done
> > :)
> > >
> > > > Do you mean per partition check and recovery? That's a good idea,
> but I
> > > found it's not easy to imagine API to for such tool.
> > > Yep, perhaps it can be done on the idle cluster via `idle-verify`
> command
> > > with additional flag. Agreed, that this approach is not the best one
> > > definitely.
> > >
> > > Thanks,
> > > S.
> > >
> > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
> > >
> > > > Slava,
> > > >
> > > > Thanks for your review first!
> > > >
> > > > >> Anyway, I left some comments in your pull-request at github.
> Please
> > > take
> > > > a
> > > > >> look. The most annoying thing is poorly documented code :(
> > > > Since your comments mostly about Javadoc (does this mean that my
> > solution
> > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > I'd like to propose you to provide fixes as a PR since you have a
> > vision
> > > of
> > > > how it should be made. I'll review them and merge shortly.
> > > >
> > > > >> By the way, is it required to add test related to fail-over
> > scenarios?
> > > > The best check is to use RR at real code.
> > > > For example, I'm injecting RR now to the test with concurrent
> > > modifications
> > > > and restarts [1].
> > > >
> > > > >> I just checked, the IEP page and I still cannot find Jira tickets
> > that
> > > > >> should cover existing limitations/improvements.
> > > > >> I would suggest creating the following tasks, at least:
> > > > Mostly agree with you, but
> > > > - MVCC is not production ready,
> > > > - not sure near support really required,
> > > > - metrics are better for monitoring, but the Event is enough for my
> > wish
> > > to
> > > > cover AI with consistency check,
> > > > - do we really need Platforms and Thin Client support?
> > > > Also, we should not produce stillborn issue.
> > > > All limitations listed at proxy creation method and they definitely
> are
> > > not
> > > > showstoppers and may be fixed later if someone interested.
> > > > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> > > almost
> > > > impossible (require much more time that feature's cost) to support
> them
> > > > all.
> > > > I will be pretty happy in case someone will do this and provide help
> if
> > > > necessary!
> > > >
> > > > >> Perhaps, it would be a good idea to think about the recovery too
> > > > Do you mean per partition check and recovery?
> > > > That's a good idea, but I found it's not easy to imagine API to for
> > such
> > > > tool.
> > > > In case you ready to assist with proper API/design this will
> definitely
> > > > help.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > > >
> > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > > slava.koptilin@gmail.com>
> > > > wrote:
> > > >
> > > > > Perhaps, it would be a good idea to think about the recovery tool/
> > > > > control-utility command that will allow achieving the same goal.
> > > > > If I am not mistaken it was already proposed in the email thread.
> > > > >
> > > > > Thanks,
> > > > > S.
> > > > >
> > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> > > slava.koptilin@gmail.com
> > > > >:
> > > > >
> > > > > > Hi Anton,
> > > > > >
> > > > > > Well, the ReadRepair feature is finally merged and that is good
> > news
> > > :)
> > > > > >
> > > > > > Unfortunately, I cannot find a consensus about the whole
> > > functionality
> > > > in
> > > > > > any of these topics:
> > > > > >  -
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > > >  -
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > > Also, there are no comments/discussion in JIRA. That makes me sad
> > :(
> > > > > > especially when a feature is huge, not obvious and involves
> > changing
> > > > > public
> > > > > > API (and that is the case, I think).
> > > > > >
> > > > > > Anyway, I left some comments in your pull-request at github.
> Please
> > > > take
> > > > > a
> > > > > > look. The most annoying thing is poorly documented code :(
> > > > > > By the way, is it required to add test related to fail-over
> > > scenarios?
> > > > > >
> > > > > > I just checked, the IEP page and I still cannot find Jira tickets
> > > that
> > > > > > should cover existing limitations/improvements.
> > > > > > I would suggest creating the following tasks, at least:
> > > > > >  - MVCC support
> > > > > >  - Near caches
> > > > > >  - Additional metrics (number of violations, number of repaired
> > > entries
> > > > > > etc)
> > > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> > feature)
> > > > > >  - Thin clients support
> > > > > >  - Perhaps, it would be useful to support different strategies to
> > > > resolve
> > > > > > inconsistencies
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > S.
> > > > > >
> > > > > >
> > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> > > > > >
> > > > > >> Folks,
> > > > > >>
> > > > > >> Thanks to everyone for tips and reviews.
> > > > > >> Yardstick checked, no performance drop found.
> > > > > >> Additional measurement: RR get() is just up to 7% slower than
> > > regular
> > > > > get
> > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > >> Code merged to the master.
> > > > > >> "Must have" tasks created and attached to the IEP.
> > > > > >>
> > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org
> >
> > > > wrote:
> > > > > >>
> > > > > >> > Folks,
> > > > > >> >
> > > > > >> > Just a minor update.
> > > > > >> >
> > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green now
> > (~10
> > > > > tests
> > > > > >> > left, started with 6k :)).
> > > > > >> > During the analisys, I've found some tests with
> > > > > >> > - unexpected repairs at tx caches
> > > > > >> > - inconsistent state after the test finished (different
> entries
> > > > across
> > > > > >> the
> > > > > >> > topology)
> > > > > >> > For example,
> > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> > > > obsolete
> > > > > >> > versions on backups in case of retry, fixed [2]
> > > > > >> > - initial cache load generates not equal versions on backups,
> > > fixed
> > > > > [3]
> > > > > >> > - testAccountTxNodeRestart causes unexpected repairs (entries
> > have
> > > > > >> > different versions), to be investigated.
> > > > > >> >
> > > > > >> > What's next?
> > > > > >> >
> > > > > >> > 1) Going to merge the solution once "RunAll with ReadRepair
> > > enabled"
> > > > > >> > becomes fully green.
> > > > > >> > 2) Going to add special check after each test which will
> ensure
> > > > caches
> > > > > >> > content after the test is consistent.
> > > > > >> > 2.1) The Same check can (should?) be injected to
> > > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > > >> > 3) Update Jepsen tests with RR checks.
> > > > > >> > 4) Introduce per partition RR.
> > > > > >> >
> > > > > >> > So, the final goal is to be sure that Ignite produces only
> > > > consistent
> > > > > >> data
> > > > > >> > and to have a feature to solve consistency in case we gain
> > > > > inconsistent
> > > > > >> > state somehow.
> > > > > >> >
> > > > > >> > Limitations?
> > > > > >> >
> > > > > >> > Currently, RR has some limitations, but they are not related
> to
> > > real
> > > > > >> > production cases.
> > > > > >> > In case someone interested to support, for example, MVCC or
> near
> > > > > caches,
> > > > > >> > please, feel free to contribute.
> > > > > >> >
> > > > > >> > [1]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > >> > [2]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > >> > [3]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > >> >
> > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <
> av@apache.org
> > >
> > > > > wrote:
> > > > > >> >
> > > > > >> >> Slava,
> > > > > >> >>
> > > > > >> >> >> I will take a look at your pull request if you don't mind.
> > > > > >> >> Great news!
> > > > > >> >>
> > > > > >> >> >> In any way, could you please update the IEP page with the
> > list
> > > > of
> > > > > >> >> >> constraints/limitations of the proposed approach, TODOs,
> > etc?
> > > > > >> >> Not sure we should keep this at IEP until list (#4 from
> > original
> > > > > >> letter)
> > > > > >> >> is not confirmed.
> > > > > >> >>
> > > > > >> >> Currently, I'm checking RunAll with RR enabled to almost each
> > get
> > > > > >> request.
> > > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > > > !ctx.mvccEnabled()
> > > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > > >> >>
> > > > > >> >> >> For instance, I would like to see all these limitations on
> > the
> > > > IEP
> > > > > >> >> page as
> > > > > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> > > > epic/umbrella
> > > > > >> >> ticket
> > > > > >> >> >> in order to track all activities related to `Read Repair`
> > > > feature.
> > > > > >> >> Let's do this at merge day to be sure useless issues will not
> > be
> > > > > >> created.
> > > > > >> >>
> > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > > >> >>
> > > > > >> >>> Hi Anton,
> > > > > >> >>>
> > > > > >> >>> I will take a look at your pull request if you don't mind.
> > > > > >> >>>
> > > > > >> >>> In any way, could you please update the IEP page with the
> list
> > > of
> > > > > >> >>> constraints/limitations of the proposed approach, TODOs,
> etc?
> > > > > >> >>> For instance, I would like to see all these limitations on
> the
> > > IEP
> > > > > >> page
> > > > > >> >>> as
> > > > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> > > epic/umbrella
> > > > > >> ticket
> > > > > >> >>> in order to track all activities related to `Read Repair`
> > > feature.
> > > > > >> >>>
> > > > > >> >>> Thanks,
> > > > > >> >>> S.
> > > > > >> >>>
> > > > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <
> av@apache.org
> > >:
> > > > > >> >>>
> > > > > >> >>> > Igniters,
> > > > > >> >>> > I'm glad to introduce Read Repair feature [0] provides
> > > > additional
> > > > > >> >>> > consistency guarantee for Ignite.
> > > > > >> >>> >
> > > > > >> >>> > 1) Why we need it?
> > > > > >> >>> > The detailed explanation can be found at IEP-31 [1].
> > > > > >> >>> > In short, because of bugs, it's possible to gain an
> > > inconsistent
> > > > > >> state.
> > > > > >> >>> > We need additional features to handle this case.
> > > > > >> >>> >
> > > > > >> >>> > Currently we able to check cluster using Idle_verify [2]
> > > > feature,
> > > > > >> but
> > > > > >> >>> it
> > > > > >> >>> > will not fix the data, will not even tell which entries
> are
> > > > > broken.
> > > > > >> >>> > Read Repair is a feature to understand which entries are
> > > broken
> > > > > and
> > > > > >> to
> > > > > >> >>> fix
> > > > > >> >>> > them.
> > > > > >> >>> >
> > > > > >> >>> > 1) How it works?
> > > > > >> >>> > IgniteCache now able to provide special proxy [3]
> > > > > withReadRepair().
> > > > > >> >>> > This proxy guarantee that data will be gained from all
> > owners
> > > > and
> > > > > >> >>> compared.
> > > > > >> >>> > In the case of consistency violation situation, data will
> be
> > > > > >> recovered
> > > > > >> >>> and
> > > > > >> >>> > a special event recorded.
> > > > > >> >>> >
> > > > > >> >>> > 3) Naming?
> > > > > >> >>> > Feature name based on Cassandra's Read Repair feature [4],
> > > which
> > > > > is
> > > > > >> >>> pretty
> > > > > >> >>> > similar.
> > > > > >> >>> >
> > > > > >> >>> > 4) Limitations which can be fixed in the future?
> > > > > >> >>> >   * MVCC and Near caches are not supported.
> > > > > >> >>> >   * Atomic caches can be checked (false positive case is
> > > > possible
> > > > > on
> > > > > >> >>> this
> > > > > >> >>> > check), but can't be recovered.
> > > > > >> >>> >   * Partial entry removal can't be recovered.
> > > > > >> >>> >   * Entries streamed using data streamer (using not a
> > > > "cache.put"
> > > > > >> based
> > > > > >> >>> > updater) and loaded by cache.load
> > > > > >> >>> >   are perceived as inconsistent since they may have
> > different
> > > > > >> versions
> > > > > >> >>> for
> > > > > >> >>> > same keys.
> > > > > >> >>> >   * Only explicit get operations are supported
> > (getAndReplace,
> > > > > >> >>> getAndPut,
> > > > > >> >>> > etc can be supported in future).
> > > > > >> >>> >
> > > > > >> >>> > 5) What's left?
> > > > > >> >>> >   * SQL/ThinClient/etc support.
> > > > > >> >>> >   * Metrics (found/repaired).
> > > > > >> >>> >   * Simple per-partition recovery feature able to work in
> > the
> > > > > >> >>> background in
> > > > > >> >>> > addition to per-entry recovery feature.
> > > > > >> >>> >
> > > > > >> >>> > 6) Is code checked?
> > > > > >> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> > > > > >> >>> >   * Pull Request #6575 [6] (RunAll with the feature
> enabled
> > > for
> > > > > >> every
> > > > > >> >>> get()
> > > > > >> >>> > request) - has a limited amount of failures (because of
> data
> > > > > >> streamer,
> > > > > >> >>> > cache.load, etc).
> > > > > >> >>> >
> > > > > >> >>> > Thoughts?
> > > > > >> >>> >
> > > > > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > > >> >>> > [1]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > > >> >>> > [2]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > >> >>> > [3]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > >> >>> > [4]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >> >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Nikolay Izhikov <ni...@apache.org>.
I did the review.

пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dp...@apache.org>:

> Igniters, who did a review of
> https://issues.apache.org/jira/browse/IGNITE-10663 before the merge? I've
> checked both PR   https://github.com/apache/ignite/pull/5656  and Issue,
> and dev.list thread and didn't find any LGTM.
>
> Anton, since you've rejected lazy consensus in our process, we have RTC in
> that (core) module. So I'd like to know if the fix was covered by the
> review.
>
> Because you're a committer, a reviewer can be non-committer. So, who was a
> reviewer? Or was process ignored?
>
> пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <sl...@gmail.com>:
>
> > Hello Anton,
> >
> > > I'd like to propose you to provide fixes as a PR since you have a
> vision
> > of how it should be made. I'll review them and merge shortly.
> > Could you please take a look at PR:
> > https://github.com/apache/ignite/pull/6689
> >
> > > Since your comments mostly about Javadoc (does this mean that my
> solution
> > is so great that you ask me only to fix Javadocs :) ?),
> > In my humble opinion, I would consider this feature as experimental one
> (It
> > does not seem production-ready).
> > Let me clarify this with the following simple example:
> >
> >     try {
> >         atomicCache.withReadRepair().getAll(keys);
> >     }
> >     catch (CacheException e) {
> >         // What should be done here from the end-user point of view?
> >     }
> >
> > 1. Should I consider that my cluster is broken? There is no answer! The
> > false-positive result is possible.
> > 2. What should be done here in order to check/resolve the issue?
> Perhaps, I
> > should restart a node (which one?), restart the whole cluster, put a new
> > value...
> > 3. IgniteConsistencyViolationException is absolutely useless. It does not
> > provide any information about the issue and possible way to fix it.
> >
> > It seems that transactional caches are covered much better.
> >
> > > Mostly agree with you, but
> > > - MVCC is not production ready,
> > > - not sure near support really required,
> > > - metrics are better for monitoring, but the Event is enough for my
> wish
> > to
> > > cover AI with consistency check,
> > > - do we really need Platforms and Thin Client support?
> > Well, near caches are widely used and fully transactional, so I think it
> > makes sense to support the feature for near caches too.
> > .Net is already aware of 'ReadRepair'. It seems to me, that it can be
> > easily supported for C++. I don't see a reason why it should not be done
> :)
> >
> > > Do you mean per partition check and recovery? That's a good idea, but I
> > found it's not easy to imagine API to for such tool.
> > Yep, perhaps it can be done on the idle cluster via `idle-verify` command
> > with additional flag. Agreed, that this approach is not the best one
> > definitely.
> >
> > Thanks,
> > S.
> >
> > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
> >
> > > Slava,
> > >
> > > Thanks for your review first!
> > >
> > > >> Anyway, I left some comments in your pull-request at github. Please
> > take
> > > a
> > > >> look. The most annoying thing is poorly documented code :(
> > > Since your comments mostly about Javadoc (does this mean that my
> solution
> > > is so great that you ask me only to fix Javadocs :) ?),
> > > I'd like to propose you to provide fixes as a PR since you have a
> vision
> > of
> > > how it should be made. I'll review them and merge shortly.
> > >
> > > >> By the way, is it required to add test related to fail-over
> scenarios?
> > > The best check is to use RR at real code.
> > > For example, I'm injecting RR now to the test with concurrent
> > modifications
> > > and restarts [1].
> > >
> > > >> I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > >> should cover existing limitations/improvements.
> > > >> I would suggest creating the following tasks, at least:
> > > Mostly agree with you, but
> > > - MVCC is not production ready,
> > > - not sure near support really required,
> > > - metrics are better for monitoring, but the Event is enough for my
> wish
> > to
> > > cover AI with consistency check,
> > > - do we really need Platforms and Thin Client support?
> > > Also, we should not produce stillborn issue.
> > > All limitations listed at proxy creation method and they definitely are
> > not
> > > showstoppers and may be fixed later if someone interested.
> > > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> > almost
> > > impossible (require much more time that feature's cost) to support them
> > > all.
> > > I will be pretty happy in case someone will do this and provide help if
> > > necessary!
> > >
> > > >> Perhaps, it would be a good idea to think about the recovery too
> > > Do you mean per partition check and recovery?
> > > That's a good idea, but I found it's not easy to imagine API to for
> such
> > > tool.
> > > In case you ready to assist with proper API/design this will definitely
> > > help.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > >
> > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > slava.koptilin@gmail.com>
> > > wrote:
> > >
> > > > Perhaps, it would be a good idea to think about the recovery tool/
> > > > control-utility command that will allow achieving the same goal.
> > > > If I am not mistaken it was already proposed in the email thread.
> > > >
> > > > Thanks,
> > > > S.
> > > >
> > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> > slava.koptilin@gmail.com
> > > >:
> > > >
> > > > > Hi Anton,
> > > > >
> > > > > Well, the ReadRepair feature is finally merged and that is good
> news
> > :)
> > > > >
> > > > > Unfortunately, I cannot find a consensus about the whole
> > functionality
> > > in
> > > > > any of these topics:
> > > > >  -
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > >  -
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > Also, there are no comments/discussion in JIRA. That makes me sad
> :(
> > > > > especially when a feature is huge, not obvious and involves
> changing
> > > > public
> > > > > API (and that is the case, I think).
> > > > >
> > > > > Anyway, I left some comments in your pull-request at github. Please
> > > take
> > > > a
> > > > > look. The most annoying thing is poorly documented code :(
> > > > > By the way, is it required to add test related to fail-over
> > scenarios?
> > > > >
> > > > > I just checked, the IEP page and I still cannot find Jira tickets
> > that
> > > > > should cover existing limitations/improvements.
> > > > > I would suggest creating the following tasks, at least:
> > > > >  - MVCC support
> > > > >  - Near caches
> > > > >  - Additional metrics (number of violations, number of repaired
> > entries
> > > > > etc)
> > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> feature)
> > > > >  - Thin clients support
> > > > >  - Perhaps, it would be useful to support different strategies to
> > > resolve
> > > > > inconsistencies
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thanks,
> > > > > S.
> > > > >
> > > > >
> > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> > > > >
> > > > >> Folks,
> > > > >>
> > > > >> Thanks to everyone for tips and reviews.
> > > > >> Yardstick checked, no performance drop found.
> > > > >> Additional measurement: RR get() is just up to 7% slower than
> > regular
> > > > get
> > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > >> Code merged to the master.
> > > > >> "Must have" tasks created and attached to the IEP.
> > > > >>
> > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org>
> > > wrote:
> > > > >>
> > > > >> > Folks,
> > > > >> >
> > > > >> > Just a minor update.
> > > > >> >
> > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green now
> (~10
> > > > tests
> > > > >> > left, started with 6k :)).
> > > > >> > During the analisys, I've found some tests with
> > > > >> > - unexpected repairs at tx caches
> > > > >> > - inconsistent state after the test finished (different entries
> > > across
> > > > >> the
> > > > >> > topology)
> > > > >> > For example,
> > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> > > obsolete
> > > > >> > versions on backups in case of retry, fixed [2]
> > > > >> > - initial cache load generates not equal versions on backups,
> > fixed
> > > > [3]
> > > > >> > - testAccountTxNodeRestart causes unexpected repairs (entries
> have
> > > > >> > different versions), to be investigated.
> > > > >> >
> > > > >> > What's next?
> > > > >> >
> > > > >> > 1) Going to merge the solution once "RunAll with ReadRepair
> > enabled"
> > > > >> > becomes fully green.
> > > > >> > 2) Going to add special check after each test which will ensure
> > > caches
> > > > >> > content after the test is consistent.
> > > > >> > 2.1) The Same check can (should?) be injected to
> > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > >> > 3) Update Jepsen tests with RR checks.
> > > > >> > 4) Introduce per partition RR.
> > > > >> >
> > > > >> > So, the final goal is to be sure that Ignite produces only
> > > consistent
> > > > >> data
> > > > >> > and to have a feature to solve consistency in case we gain
> > > > inconsistent
> > > > >> > state somehow.
> > > > >> >
> > > > >> > Limitations?
> > > > >> >
> > > > >> > Currently, RR has some limitations, but they are not related to
> > real
> > > > >> > production cases.
> > > > >> > In case someone interested to support, for example, MVCC or near
> > > > caches,
> > > > >> > please, feel free to contribute.
> > > > >> >
> > > > >> > [1]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > >> > [2]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > >> > [3]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > >> >
> > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org
> >
> > > > wrote:
> > > > >> >
> > > > >> >> Slava,
> > > > >> >>
> > > > >> >> >> I will take a look at your pull request if you don't mind.
> > > > >> >> Great news!
> > > > >> >>
> > > > >> >> >> In any way, could you please update the IEP page with the
> list
> > > of
> > > > >> >> >> constraints/limitations of the proposed approach, TODOs,
> etc?
> > > > >> >> Not sure we should keep this at IEP until list (#4 from
> original
> > > > >> letter)
> > > > >> >> is not confirmed.
> > > > >> >>
> > > > >> >> Currently, I'm checking RunAll with RR enabled to almost each
> get
> > > > >> request.
> > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > > !ctx.mvccEnabled()
> > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > >> >>
> > > > >> >> >> For instance, I would like to see all these limitations on
> the
> > > IEP
> > > > >> >> page as
> > > > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> > > epic/umbrella
> > > > >> >> ticket
> > > > >> >> >> in order to track all activities related to `Read Repair`
> > > feature.
> > > > >> >> Let's do this at merge day to be sure useless issues will not
> be
> > > > >> created.
> > > > >> >>
> > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > >> >>
> > > > >> >>> Hi Anton,
> > > > >> >>>
> > > > >> >>> I will take a look at your pull request if you don't mind.
> > > > >> >>>
> > > > >> >>> In any way, could you please update the IEP page with the list
> > of
> > > > >> >>> constraints/limitations of the proposed approach, TODOs, etc?
> > > > >> >>> For instance, I would like to see all these limitations on the
> > IEP
> > > > >> page
> > > > >> >>> as
> > > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> > epic/umbrella
> > > > >> ticket
> > > > >> >>> in order to track all activities related to `Read Repair`
> > feature.
> > > > >> >>>
> > > > >> >>> Thanks,
> > > > >> >>> S.
> > > > >> >>>
> > > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av@apache.org
> >:
> > > > >> >>>
> > > > >> >>> > Igniters,
> > > > >> >>> > I'm glad to introduce Read Repair feature [0] provides
> > > additional
> > > > >> >>> > consistency guarantee for Ignite.
> > > > >> >>> >
> > > > >> >>> > 1) Why we need it?
> > > > >> >>> > The detailed explanation can be found at IEP-31 [1].
> > > > >> >>> > In short, because of bugs, it's possible to gain an
> > inconsistent
> > > > >> state.
> > > > >> >>> > We need additional features to handle this case.
> > > > >> >>> >
> > > > >> >>> > Currently we able to check cluster using Idle_verify [2]
> > > feature,
> > > > >> but
> > > > >> >>> it
> > > > >> >>> > will not fix the data, will not even tell which entries are
> > > > broken.
> > > > >> >>> > Read Repair is a feature to understand which entries are
> > broken
> > > > and
> > > > >> to
> > > > >> >>> fix
> > > > >> >>> > them.
> > > > >> >>> >
> > > > >> >>> > 1) How it works?
> > > > >> >>> > IgniteCache now able to provide special proxy [3]
> > > > withReadRepair().
> > > > >> >>> > This proxy guarantee that data will be gained from all
> owners
> > > and
> > > > >> >>> compared.
> > > > >> >>> > In the case of consistency violation situation, data will be
> > > > >> recovered
> > > > >> >>> and
> > > > >> >>> > a special event recorded.
> > > > >> >>> >
> > > > >> >>> > 3) Naming?
> > > > >> >>> > Feature name based on Cassandra's Read Repair feature [4],
> > which
> > > > is
> > > > >> >>> pretty
> > > > >> >>> > similar.
> > > > >> >>> >
> > > > >> >>> > 4) Limitations which can be fixed in the future?
> > > > >> >>> >   * MVCC and Near caches are not supported.
> > > > >> >>> >   * Atomic caches can be checked (false positive case is
> > > possible
> > > > on
> > > > >> >>> this
> > > > >> >>> > check), but can't be recovered.
> > > > >> >>> >   * Partial entry removal can't be recovered.
> > > > >> >>> >   * Entries streamed using data streamer (using not a
> > > "cache.put"
> > > > >> based
> > > > >> >>> > updater) and loaded by cache.load
> > > > >> >>> >   are perceived as inconsistent since they may have
> different
> > > > >> versions
> > > > >> >>> for
> > > > >> >>> > same keys.
> > > > >> >>> >   * Only explicit get operations are supported
> (getAndReplace,
> > > > >> >>> getAndPut,
> > > > >> >>> > etc can be supported in future).
> > > > >> >>> >
> > > > >> >>> > 5) What's left?
> > > > >> >>> >   * SQL/ThinClient/etc support.
> > > > >> >>> >   * Metrics (found/repaired).
> > > > >> >>> >   * Simple per-partition recovery feature able to work in
> the
> > > > >> >>> background in
> > > > >> >>> > addition to per-entry recovery feature.
> > > > >> >>> >
> > > > >> >>> > 6) Is code checked?
> > > > >> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> > > > >> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled
> > for
> > > > >> every
> > > > >> >>> get()
> > > > >> >>> > request) - has a limited amount of failures (because of data
> > > > >> streamer,
> > > > >> >>> > cache.load, etc).
> > > > >> >>> >
> > > > >> >>> > Thoughts?
> > > > >> >>> >
> > > > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > >> >>> > [1]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > >> >>> > [2]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > >> >>> > [3]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > >> >>> > [4]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> > > > >> >>> >
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Dmitriy Pavlov <dp...@apache.org>.
Igniters, who did a review of
https://issues.apache.org/jira/browse/IGNITE-10663 before the merge? I've
checked both the PR https://github.com/apache/ignite/pull/5656 and the issue,
and the dev list thread, and didn't find any LGTM.

Anton, since you've rejected lazy consensus in our process, we have RTC in
that (core) module. So I'd like to know if the fix was covered by the
review.

Because you're a committer, the reviewer can be a non-committer. So, who was
the reviewer? Or was the process ignored?

пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <sl...@gmail.com>:

> Hello Anton,
>
> > I'd like to propose you to provide fixes as a PR since you have a vision
> of how it should be made. I'll review them and merge shortly.
> Could you please take a look at PR:
> https://github.com/apache/ignite/pull/6689
>
> > Since your comments mostly about Javadoc (does this mean that my solution
> is so great that you ask me only to fix Javadocs :) ?),
> In my humble opinion, I would consider this feature as experimental one (It
> does not seem production-ready).
> Let me clarify this with the following simple example:
>
>     try {
>         atomicCache.withReadRepair().getAll(keys);
>     }
>     catch (CacheException e) {
>         // What should be done here from the end-user point of view?
>     }
>
> 1. Should I consider that my cluster is broken? There is no answer! The
> false-positive result is possible.
> 2. What should be done here in order to check/resolve the issue? Perhaps, I
> should restart a node (which one?), restart the whole cluster, put a new
> value...
> 3. IgniteConsistencyViolationException is absolutely useless. It does not
> provide any information about the issue and possible way to fix it.
>
> It seems that transactional caches are covered much better.
>
> > Mostly agree with you, but
> > - MVCC is not production ready,
> > - not sure near support really required,
> > - metrics are better for monitoring, but the Event is enough for my wish
> to
> > cover AI with consistency check,
> > - do we really need Platforms and Thin Client support?
> Well, near caches are widely used and fully transactional, so I think it
> makes sense to support the feature for near caches too.
> .Net is already aware of 'ReadRepair'. It seems to me, that it can be
> easily supported for C++. I don't see a reason why it should not be done :)
>
> > Do you mean per partition check and recovery? That's a good idea, but I
> found it's not easy to imagine API to for such tool.
> Yep, perhaps it can be done on the idle cluster via `idle-verify` command
> with additional flag. Agreed, that this approach is not the best one
> definitely.
>
> Thanks,
> S.
>
> чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:
>
> > Slava,
> >
> > Thanks for your review first!
> >
> > >> Anyway, I left some comments in your pull-request at github. Please
> take
> > a
> > >> look. The most annoying thing is poorly documented code :(
> > Since your comments mostly about Javadoc (does this mean that my solution
> > is so great that you ask me only to fix Javadocs :) ?),
> > I'd like to propose you to provide fixes as a PR since you have a vision
> of
> > how it should be made. I'll review them and merge shortly.
> >
> > >> By the way, is it required to add test related to fail-over scenarios?
> > The best check is to use RR at real code.
> > For example, I'm injecting RR now to the test with concurrent
> modifications
> > and restarts [1].
> >
> > >> I just checked, the IEP page and I still cannot find Jira tickets that
> > >> should cover existing limitations/improvements.
> > >> I would suggest creating the following tasks, at least:
> > Mostly agree with you, but
> > - MVCC is not production ready,
> > - not sure near support really required,
> > - metrics are better for monitoring, but the Event is enough for my wish
> to
> > cover AI with consistency check,
> > - do we really need Platforms and Thin Client support?
> > Also, we should not produce stillborn issue.
> > All limitations listed at proxy creation method and they definitely are
> not
> > showstoppers and may be fixed later if someone interested.
> > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> almost
> > impossible (require much more time that feature's cost) to support them
> > all.
> > I will be pretty happy in case someone will do this and provide help if
> > necessary!
> >
> > >> Perhaps, it would be a good idea to think about the recovery too
> > Do you mean per partition check and recovery?
> > That's a good idea, but I found it's not easy to imagine API to for such
> > tool.
> > In case you ready to assist with proper API/design this will definitely
> > help.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> >
> > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > slava.koptilin@gmail.com>
> > wrote:
> >
> > > Perhaps, it would be a good idea to think about the recovery tool/
> > > control-utility command that will allow achieving the same goal.
> > > If I am not mistaken it was already proposed in the email thread.
> > >
> > > Thanks,
> > > S.
> > >
> > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> slava.koptilin@gmail.com
> > >:
> > >
> > > > Hi Anton,
> > > >
> > > > Well, the ReadRepair feature is finally merged and that is good news
> :)
> > > >
> > > > Unfortunately, I cannot find a consensus about the whole
> functionality
> > in
> > > > any of these topics:
> > > >  -
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > >  -
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > > > especially when a feature is huge, not obvious and involves changing
> > > public
> > > > API (and that is the case, I think).
> > > >
> > > > Anyway, I left some comments in your pull-request at github. Please
> > take
> > > a
> > > > look. The most annoying thing is poorly documented code :(
> > > > By the way, is it required to add test related to fail-over
> scenarios?
> > > >
> > > > I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > > should cover existing limitations/improvements.
> > > > I would suggest creating the following tasks, at least:
> > > >  - MVCC support
> > > >  - Near caches
> > > >  - Additional metrics (number of violations, number of repaired
> entries
> > > > etc)
> > > >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> > > >  - Thin clients support
> > > >  - Perhaps, it would be useful to support different strategies to
> > resolve
> > > > inconsistencies
> > > >
> > > > What do you think?
> > > >
> > > > Thanks,
> > > > S.
> > > >
> > > >
> > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> > > >
> > > >> Folks,
> > > >>
> > > >> Thanks to everyone for tips and reviews.
> > > >> Yardstick checked, no performance drop found.
> > > >> Additional measurement: RR get() is just up to 7% slower than
> regular
> > > get
> > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > >> Code merged to the master.
> > > >> "Must have" tasks created and attached to the IEP.
> > > >>
> > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org>
> > wrote:
> > > >>
> > > >> > Folks,
> > > >> >
> > > >> > Just a minor update.
> > > >> >
> > > >> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10
> > > tests
> > > >> > left, started with 6k :)).
> > > >> > During the analisys, I've found some tests with
> > > >> > - unexpected repairs at tx caches
> > > >> > - inconsistent state after the test finished (different entries
> > across
> > > >> the
> > > >> > topology)
> > > >> > For example,
> > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> > obsolete
> > > >> > versions on backups in case of retry, fixed [2]
> > > >> > - initial cache load generates not equal versions on backups,
> fixed
> > > [3]
> > > >> > - testAccountTxNodeRestart causes unexpected repairs (entries have
> > > >> > different versions), to be investigated.
> > > >> >
> > > >> > What's next?
> > > >> >
> > > >> > 1) Going to merge the solution once "RunAll with ReadRepair
> enabled"
> > > >> > becomes fully green.
> > > >> > 2) Going to add special check after each test which will ensure
> > caches
> > > >> > content after the test is consistent.
> > > >> > 2.1) The Same check can (should?) be injected to
> > > >> > awaitPartitionMapExchange() and similar methods.
> > > >> > 3) Update Jepsen tests with RR checks.
> > > >> > 4) Introduce per partition RR.
> > > >> >
> > > >> > So, the final goal is to be sure that Ignite produces only
> > consistent
> > > >> data
> > > >> > and to have a feature to solve consistency in case we gain
> > > inconsistent
> > > >> > state somehow.
> > > >> >
> > > >> > Limitations?
> > > >> >
> > > >> > Currently, RR has some limitations, but they are not related to
> real
> > > >> > production cases.
> > > >> > In case someone interested to support, for example, MVCC or near
> > > caches,
> > > >> > please, feel free to contribute.
> > > >> >
> > > >> > [1]
> > > >> >
> > > >>
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > >> > [2]
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > >> > [3]
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > >> >
> > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org>
> > > wrote:
> > > >> >
> > > >> >> Slava,
> > > >> >>
> > > >> >> >> I will take a look at your pull request if you don't mind.
> > > >> >> Great news!
> > > >> >>
> > > >> >> >> In any way, could you please update the IEP page with the list
> > of
> > > >> >> >> constraints/limitations of the proposed approach, TODOs, etc?
> > > >> >> Not sure we should keep this at IEP until list (#4 from original
> > > >> letter)
> > > >> >> is not confirmed.
> > > >> >>
> > > >> >> Currently, I'm checking RunAll with RR enabled to almost each get
> > > >> request.
> > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> > !ctx.mvccEnabled()
> > > >> >> For now I have 60 failed tests and amount decreasing.
> > > >> >>
> > > >> >> >> For instance, I would like to see all these limitations on the
> > IEP
> > > >> >> page as
> > > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> > epic/umbrella
> > > >> >> ticket
> > > >> >> >> in order to track all activities related to `Read Repair`
> > feature.
> > > >> >> Let's do this at merge day to be sure useless issues will not be
> > > >> created.
> > > >> >>
> > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > > >> >> slava.koptilin@gmail.com> wrote:
> > > >> >>
> > > >> >>> Hi Anton,
> > > >> >>>
> > > >> >>> I will take a look at your pull request if you don't mind.
> > > >> >>>
> > > >> >>> In any way, could you please update the IEP page with the list
> of
> > > >> >>> constraints/limitations of the proposed approach, TODOs, etc?
> > > >> >>> For instance, I would like to see all these limitations on the
> IEP
> > > >> page
> > > >> >>> as
> > > >> >>> JIRA tickets. Perhaps, it would be good to create an
> epic/umbrella
> > > >> ticket
> > > >> >>> in order to track all activities related to `Read Repair`
> feature.
> > > >> >>>
> > > >> >>> Thanks,
> > > >> >>> S.
> > > >> >>>
> > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
> > > >> >>>
> > > >> >>> > Igniters,
> > > >> >>> > I'm glad to introduce Read Repair feature [0] provides
> > additional
> > > >> >>> > consistency guarantee for Ignite.
> > > >> >>> >
> > > >> >>> > 1) Why we need it?
> > > >> >>> > The detailed explanation can be found at IEP-31 [1].
> > > >> >>> > In short, because of bugs, it's possible to gain an
> inconsistent
> > > >> state.
> > > >> >>> > We need additional features to handle this case.
> > > >> >>> >
> > > >> >>> > Currently we able to check cluster using Idle_verify [2]
> > feature,
> > > >> but
> > > >> >>> it
> > > >> >>> > will not fix the data, will not even tell which entries are
> > > broken.
> > > >> >>> > Read Repair is a feature to understand which entries are
> broken
> > > and
> > > >> to
> > > >> >>> fix
> > > >> >>> > them.
> > > >> >>> >
> > > >> >>> > 1) How it works?
> > > >> >>> > IgniteCache now able to provide special proxy [3]
> > > withReadRepair().
> > > >> >>> > This proxy guarantee that data will be gained from all owners
> > and
> > > >> >>> compared.
> > > >> >>> > In the case of consistency violation situation, data will be
> > > >> recovered
> > > >> >>> and
> > > >> >>> > a special event recorded.
> > > >> >>> >
> > > >> >>> > 3) Naming?
> > > >> >>> > Feature name based on Cassandra's Read Repair feature [4],
> which
> > > is
> > > >> >>> pretty
> > > >> >>> > similar.
> > > >> >>> >
> > > >> >>> > 4) Limitations which can be fixed in the future?
> > > >> >>> >   * MVCC and Near caches are not supported.
> > > >> >>> >   * Atomic caches can be checked (false positive case is
> > possible
> > > on
> > > >> >>> this
> > > >> >>> > check), but can't be recovered.
> > > >> >>> >   * Partial entry removal can't be recovered.
> > > >> >>> >   * Entries streamed using data streamer (using not a
> > "cache.put"
> > > >> based
> > > >> >>> > updater) and loaded by cache.load
> > > >> >>> >   are perceived as inconsistent since they may have different
> > > >> versions
> > > >> >>> for
> > > >> >>> > same keys.
> > > >> >>> >   * Only explicit get operations are supported (getAndReplace,
> > > >> >>> getAndPut,
> > > >> >>> > etc can be supported in future).
> > > >> >>> >
> > > >> >>> > 5) What's left?
> > > >> >>> >   * SQL/ThinClient/etc support.
> > > >> >>> >   * Metrics (found/repaired).
> > > >> >>> >   * Simple per-partition recovery feature able to work in the
> > > >> >>> background in
> > > >> >>> > addition to per-entry recovery feature.
> > > >> >>> >
> > > >> >>> > 6) Is code checked?
> > > >> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> > > >> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled
> for
> > > >> every
> > > >> >>> get()
> > > >> >>> > request) - has a limited amount of failures (because of data
> > > >> streamer,
> > > >> >>> > cache.load, etc).
> > > >> >>> >
> > > >> >>> > Thoughts?
> > > >> >>> >
> > > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > >> >>> > [1]
> > > >> >>> >
> > > >> >>> >
> > > >> >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > >> >>> > [2]
> > > >> >>> >
> > > >> >>> >
> > > >> >>>
> > > >>
> > >
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > >> >>> > [3]
> > > >> >>> >
> > > >> >>> >
> > > >> >>>
> > > >>
> > >
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > >> >>> > [4]
> > > >> >>> >
> > > >> >>> >
> > > >> >>>
> > > >>
> > >
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> > > >> >>> >
> > > >> >>>
> > > >> >>
> > > >>
> > > >
> > >
> >
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Hello Anton,

> I'd like to propose you to provide fixes as a PR since you have a vision
of how it should be made. I'll review them and merge shortly.
Could you please take a look at PR:
https://github.com/apache/ignite/pull/6689

> Since your comments mostly about Javadoc (does this mean that my solution
is so great that you ask me only to fix Javadocs :) ?),
In my humble opinion, I would consider this feature an experimental one (it
does not seem production-ready).
Let me clarify this with the following simple example:

    try {
        atomicCache.withReadRepair().getAll(keys);
    }
    catch (CacheException e) {
        // What should be done here from the end-user point of view?
    }

1. Should I consider that my cluster is broken? There is no answer! A
false-positive result is possible.
2. What should be done here in order to check/resolve the issue? Perhaps I
should restart a node (which one?), restart the whole cluster, put a new
value...
3. IgniteConsistencyViolationException is absolutely useless. It does not
provide any information about the issue or a possible way to fix it.

It seems that transactional caches are covered much better.
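
For comparison, a minimal sketch of the transactional case, assuming the
behavior described earlier in this thread (values are read from all owners,
inconsistent entries are repaired and a special event is recorded); here
`ignite`, the cache name and `keys` are just placeholders:

    IgniteCache<Integer, String> txCache =
        ignite.<Integer, String>cache("tx-cache").withReadRepair();

    // Each entry is compared across all owners; if a backup diverges, the
    // entry is repaired transparently and an event is recorded, so the
    // caller simply uses the result as with a regular get.
    Map<Integer, String> vals = txCache.getAll(keys);

For the atomic case above there is no such repair (it can only be checked,
with possible false positives), which is why the question of what the user
should do with the exception matters so much.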

> Mostly agree with you, but
> - MVCC is not production ready,
> - not sure near support really required,
> - metrics are better for monitoring, but the Event is enough for my wish
to
> cover AI with consistency check,
> - do we really need Platforms and Thin Client support?
Well, near caches are widely used and fully transactional, so I think it
makes sense to support the feature for near caches too.
.Net is already aware of 'ReadRepair'. It seems to me that it can be
easily supported for C++. I don't see a reason why it should not be done :)

> Do you mean per partition check and recovery? That's a good idea, but I
found it's not easy to imagine API to for such tool.
Yep, perhaps it can be done on an idle cluster via the `idle-verify` command
with an additional flag. Agreed, this approach is definitely not the best
one.

Thanks,
S.

чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av...@apache.org>:

> Slava,
>
> Thanks for your review first!
>
> >> Anyway, I left some comments in your pull-request at github. Please take
> a
> >> look. The most annoying thing is poorly documented code :(
> Since your comments mostly about Javadoc (does this mean that my solution
> is so great that you ask me only to fix Javadocs :) ?),
> I'd like to propose you to provide fixes as a PR since you have a vision of
> how it should be made. I'll review them and merge shortly.
>
> >> By the way, is it required to add test related to fail-over scenarios?
> The best check is to use RR at real code.
> For example, I'm injecting RR now to the test with concurrent modifications
> and restarts [1].
>
> >> I just checked, the IEP page and I still cannot find Jira tickets that
> >> should cover existing limitations/improvements.
> >> I would suggest creating the following tasks, at least:
> Mostly agree with you, but
> - MVCC is not production ready,
> - not sure near support really required,
> - metrics are better for monitoring, but the Event is enough for my wish to
> cover AI with consistency check,
> - do we really need Platforms and Thin Client support?
> Also, we should not produce stillborn issue.
> All limitations listed at proxy creation method and they definitely are not
> showstoppers and may be fixed later if someone interested.
> Сoming back to AI 3.0 discussion, we have A LOT of features and it's almost
> impossible (require much more time that feature's cost) to support them
> all.
> I will be pretty happy in case someone will do this and provide help if
> necessary!
>
> >> Perhaps, it would be a good idea to think about the recovery too
> Do you mean per partition check and recovery?
> That's a good idea, but I found it's not easy to imagine API to for such
> tool.
> In case you ready to assist with proper API/design this will definitely
> help.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11973
>
> On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> slava.koptilin@gmail.com>
> wrote:
>
> > Perhaps, it would be a good idea to think about the recovery tool/
> > control-utility command that will allow achieving the same goal.
> > If I am not mistaken it was already proposed in the email thread.
> >
> > Thanks,
> > S.
> >
> > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <slava.koptilin@gmail.com
> >:
> >
> > > Hi Anton,
> > >
> > > Well, the ReadRepair feature is finally merged and that is good news :)
> > >
> > > Unfortunately, I cannot find a consensus about the whole functionality
> in
> > > any of these topics:
> > >  -
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > >  -
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > > especially when a feature is huge, not obvious and involves changing
> > public
> > > API (and that is the case, I think).
> > >
> > > Anyway, I left some comments in your pull-request at github. Please
> take
> > a
> > > look. The most annoying thing is poorly documented code :(
> > > By the way, is it required to add test related to fail-over scenarios?
> > >
> > > I just checked, the IEP page and I still cannot find Jira tickets that
> > > should cover existing limitations/improvements.
> > > I would suggest creating the following tasks, at least:
> > >  - MVCC support
> > >  - Near caches
> > >  - Additional metrics (number of violations, number of repaired entries
> > > etc)
> > >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> > >  - Thin clients support
> > >  - Perhaps, it would be useful to support different strategies to
> resolve
> > > inconsistencies
> > >
> > > What do you think?
> > >
> > > Thanks,
> > > S.
> > >
> > >
> > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> > >
> > >> Folks,
> > >>
> > >> Thanks to everyone for tips and reviews.
> > >> Yardstick checked, no performance drop found.
> > >> Additional measurement: RR get() is just up to 7% slower than regular
> > get
> > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > >> Code merged to the master.
> > >> "Must have" tasks created and attached to the IEP.
> > >>
> > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org>
> wrote:
> > >>
> > >> > Folks,
> > >> >
> > >> > Just a minor update.
> > >> >
> > >> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10
> > tests
> > >> > left, started with 6k :)).
> > >> > During the analisys, I've found some tests with
> > >> > - unexpected repairs at tx caches
> > >> > - inconsistent state after the test finished (different entries
> across
> > >> the
> > >> > topology)
> > >> > For example,
> > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> obsolete
> > >> > versions on backups in case of retry, fixed [2]
> > >> > - initial cache load generates not equal versions on backups, fixed
> > [3]
> > >> > - testAccountTxNodeRestart causes unexpected repairs (entries have
> > >> > different versions), to be investigated.
> > >> >
> > >> > What's next?
> > >> >
> > >> > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> > >> > becomes fully green.
> > >> > 2) Going to add special check after each test which will ensure
> caches
> > >> > content after the test is consistent.
> > >> > 2.1) The Same check can (should?) be injected to
> > >> > awaitPartitionMapExchange() and similar methods.
> > >> > 3) Update Jepsen tests with RR checks.
> > >> > 4) Introduce per partition RR.
> > >> >
> > >> > So, the final goal is to be sure that Ignite produces only
> consistent
> > >> data
> > >> > and to have a feature to solve consistency in case we gain
> > inconsistent
> > >> > state somehow.
> > >> >
> > >> > Limitations?
> > >> >
> > >> > Currently, RR has some limitations, but they are not related to real
> > >> > production cases.
> > >> > In case someone interested to support, for example, MVCC or near
> > caches,
> > >> > please, feel free to contribute.
> > >> >
> > >> > [1]
> > >> >
> > >>
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > >> > [2]
> > >> >
> > >>
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > >> > [3]
> > >> >
> > >>
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > >> >
> > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org>
> > wrote:
> > >> >
> > >> >> Slava,
> > >> >>
> > >> >> >> I will take a look at your pull request if you don't mind.
> > >> >> Great news!
> > >> >>
> > >> >> >> In any way, could you please update the IEP page with the list
> of
> > >> >> >> constraints/limitations of the proposed approach, TODOs, etc?
> > >> >> Not sure we should keep this at IEP until list (#4 from original
> > >> letter)
> > >> >> is not confirmed.
> > >> >>
> > >> >> Currently, I'm checking RunAll with RR enabled to almost each get
> > >> request.
> > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > >> >> ctx.config().getBackups() > 0 && !ctx.isNear() &&
> !ctx.mvccEnabled()
> > >> >> For now I have 60 failed tests and amount decreasing.
> > >> >>
> > >> >> >> For instance, I would like to see all these limitations on the
> IEP
> > >> >> page as
> > >> >> >> JIRA tickets. Perhaps, it would be good to create an
> epic/umbrella
> > >> >> ticket
> > >> >> >> in order to track all activities related to `Read Repair`
> feature.
> > >> >> Let's do this at merge day to be sure useless issues will not be
> > >> created.
> > >> >>
> > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> > >> >> slava.koptilin@gmail.com> wrote:
> > >> >>
> > >> >>> Hi Anton,
> > >> >>>
> > >> >>> I will take a look at your pull request if you don't mind.
> > >> >>>
> > >> >>> In any way, could you please update the IEP page with the list of
> > >> >>> constraints/limitations of the proposed approach, TODOs, etc?
> > >> >>> For instance, I would like to see all these limitations on the IEP
> > >> page
> > >> >>> as
> > >> >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> > >> ticket
> > >> >>> in order to track all activities related to `Read Repair` feature.
> > >> >>>
> > >> >>> Thanks,
> > >> >>> S.
> > >> >>>
> > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
> > >> >>>
> > >> >>> > Igniters,
> > >> >>> > I'm glad to introduce Read Repair feature [0] provides
> additional
> > >> >>> > consistency guarantee for Ignite.
> > >> >>> >
> > >> >>> > 1) Why we need it?
> > >> >>> > The detailed explanation can be found at IEP-31 [1].
> > >> >>> > In short, because of bugs, it's possible to gain an inconsistent
> > >> state.
> > >> >>> > We need additional features to handle this case.
> > >> >>> >
> > >> >>> > Currently we able to check cluster using Idle_verify [2]
> feature,
> > >> but
> > >> >>> it
> > >> >>> > will not fix the data, will not even tell which entries are
> > broken.
> > >> >>> > Read Repair is a feature to understand which entries are broken
> > and
> > >> to
> > >> >>> fix
> > >> >>> > them.
> > >> >>> >
> > >> >>> > 1) How it works?
> > >> >>> > IgniteCache now able to provide special proxy [3]
> > withReadRepair().
> > >> >>> > This proxy guarantee that data will be gained from all owners
> and
> > >> >>> compared.
> > >> >>> > In the case of consistency violation situation, data will be
> > >> recovered
> > >> >>> and
> > >> >>> > a special event recorded.
> > >> >>> >
> > >> >>> > 3) Naming?
> > >> >>> > Feature name based on Cassandra's Read Repair feature [4], which
> > is
> > >> >>> pretty
> > >> >>> > similar.
> > >> >>> >
> > >> >>> > 4) Limitations which can be fixed in the future?
> > >> >>> >   * MVCC and Near caches are not supported.
> > >> >>> >   * Atomic caches can be checked (false positive case is
> possible
> > on
> > >> >>> this
> > >> >>> > check), but can't be recovered.
> > >> >>> >   * Partial entry removal can't be recovered.
> > >> >>> >   * Entries streamed using data streamer (using not a
> "cache.put"
> > >> based
> > >> >>> > updater) and loaded by cache.load
> > >> >>> >   are perceived as inconsistent since they may have different
> > >> versions
> > >> >>> for
> > >> >>> > same keys.
> > >> >>> >   * Only explicit get operations are supported (getAndReplace,
> > >> >>> getAndPut,
> > >> >>> > etc can be supported in future).
> > >> >>> >
> > >> >>> > 5) What's left?
> > >> >>> >   * SQL/ThinClient/etc support.
> > >> >>> >   * Metrics (found/repaired).
> > >> >>> >   * Simple per-partition recovery feature able to work in the
> > >> >>> background in
> > >> >>> > addition to per-entry recovery feature.
> > >> >>> >
> > >> >>> > 6) Is code checked?
> > >> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> > >> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for
> > >> every
> > >> >>> get()
> > >> >>> > request) - has a limited amount of failures (because of data
> > >> streamer,
> > >> >>> > cache.load, etc).
> > >> >>> >
> > >> >>> > Thoughts?
> > >> >>> >
> > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > >> >>> > [1]
> > >> >>> >
> > >> >>> >
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > >> >>> > [2]
> > >> >>> >
> > >> >>> >
> > >> >>>
> > >>
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > >> >>> > [3]
> > >> >>> >
> > >> >>> >
> > >> >>>
> > >>
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > >> >>> > [4]
> > >> >>> >
> > >> >>> >
> > >> >>>
> > >>
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> > >> >>> >
> > >> >>>
> > >> >>
> > >>
> > >
> >
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Slava,

First of all, thanks for your review!

>> Anyway, I left some comments in your pull-request at github. Please take a
>> look. The most annoying thing is poorly documented code :(
Since your comments are mostly about Javadoc (does this mean the solution is
so good that only the Javadocs need fixing? :)), I'd like to propose that you
provide the fixes as a PR, since you have a vision of how it should look.
I'll review and merge them shortly.

>> By the way, is it required to add test related to fail-over scenarios?
The best check is to use RR in real code.
For example, I'm currently injecting RR into a test with concurrent
modifications and restarts [1], roughly along the lines of the sketch below.
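
A minimal sketch of that kind of read loop (this is not the actual test from
[1]; the cache name, the key range and the assumption of an already started
node are mine):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class ReadRepairReadLoop {
        public static void main(String[] args) {
            // Assumes a running cluster with a TRANSACTIONAL cache "tx-cache"
            // (backups > 0) that other threads keep updating while nodes restart.
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<Integer, Integer> rrCache =
                    ignite.<Integer, Integer>cache("tx-cache").withReadRepair();

                // Each get() through the proxy reads the entry from all owners,
                // compares the copies and repairs/records an event on divergence,
                // so the test only has to verify that no unexpected events fired.
                for (int key = 0; key < 1_000; key++)
                    rrCache.get(key);
            }
        }
    }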

>> I just checked, the IEP page and I still cannot find Jira tickets that
>> should cover existing limitations/improvements.
>> I would suggest creating the following tasks, at least:
I mostly agree with you, but:
- MVCC is not production-ready,
- I'm not sure near cache support is really required,
- metrics are better for monitoring, but the event is enough for my goal of
covering AI with a consistency check (see the listener sketch below),
- do we really need Platforms and Thin Client support?
Also, we should not produce stillborn issues.
All limitations are listed at the proxy creation method, and they are
definitely not showstoppers; they can be fixed later if someone is interested.
Coming back to the AI 3.0 discussion: we have A LOT of features, and it's
almost impossible (it requires much more time than the feature is worth) to
support them all.
I will be pretty happy if someone does this and will provide help if
necessary!
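
For reference, consuming that event locally could look roughly like the
sketch below. The event-type constant and the need to enable it via
setIncludeEventTypes() are my assumptions about how the feature is exposed,
so please check the Events API of the actual build:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.events.Event;
    import org.apache.ignite.events.EventType;
    import org.apache.ignite.lang.IgnitePredicate;

    class ViolationListener {
        // Assumes 'ignite' is a started node and that EVT_CONSISTENCY_VIOLATION
        // is the constant recorded by Read Repair (the name is an assumption here).
        static void register(Ignite ignite) {
            IgnitePredicate<Event> lsnr = evt -> {
                System.out.println("Consistency violation detected/repaired: " + evt);
                return true; // keep listening
            };

            ignite.events().localListen(lsnr, EventType.EVT_CONSISTENCY_VIOLATION);
        }
    }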

>> Perhaps, it would be a good idea to think about the recovery too
Do you mean a per-partition check and recovery?
That's a good idea, but I found it's not easy to come up with an API for such
a tool.
If you're ready to assist with a proper API/design, that will definitely help.

[1] https://issues.apache.org/jira/browse/IGNITE-11973

On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <sl...@gmail.com>
wrote:

> Perhaps, it would be a good idea to think about the recovery tool/
> control-utility command that will allow achieving the same goal.
> If I am not mistaken it was already proposed in the email thread.
>
> Thanks,
> S.
>
> ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <sl...@gmail.com>:
>
> > Hi Anton,
> >
> > Well, the ReadRepair feature is finally merged and that is good news :)
> >
> > Unfortunately, I cannot find a consensus about the whole functionality in
> > any of these topics:
> >  -
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> >  -
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > especially when a feature is huge, not obvious and involves changing
> public
> > API (and that is the case, I think).
> >
> > Anyway, I left some comments in your pull-request at github. Please take
> a
> > look. The most annoying thing is poorly documented code :(
> > By the way, is it required to add test related to fail-over scenarios?
> >
> > I just checked, the IEP page and I still cannot find Jira tickets that
> > should cover existing limitations/improvements.
> > I would suggest creating the following tasks, at least:
> >  - MVCC support
> >  - Near caches
> >  - Additional metrics (number of violations, number of repaired entries
> > etc)
> >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> >  - Thin clients support
> >  - Perhaps, it would be useful to support different strategies to resolve
> > inconsistencies
> >
> > What do you think?
> >
> > Thanks,
> > S.
> >
> >
> > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
> >
> >> Folks,
> >>
> >> Thanks to everyone for tips and reviews.
> >> Yardstick checked, no performance drop found.
> >> Additional measurement: RR get() is just up to 7% slower than regular
> get
> >> on real benchmarks (8 clients, 4 servers, 3 backups)
> >> Code merged to the master.
> >> "Must have" tasks created and attached to the IEP.
> >>
> >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org> wrote:
> >>
> >> > Folks,
> >> >
> >> > Just a minor update.
> >> >
> >> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10
> tests
> >> > left, started with 6k :)).
> >> > During the analisys, I've found some tests with
> >> > - unexpected repairs at tx caches
> >> > - inconsistent state after the test finished (different entries across
> >> the
> >> > topology)
> >> > For example,
> >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> >> > versions on backups in case of retry, fixed [2]
> >> > - initial cache load generates not equal versions on backups, fixed
> [3]
> >> > - testAccountTxNodeRestart causes unexpected repairs (entries have
> >> > different versions), to be investigated.
> >> >
> >> > What's next?
> >> >
> >> > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> >> > becomes fully green.
> >> > 2) Going to add special check after each test which will ensure caches
> >> > content after the test is consistent.
> >> > 2.1) The Same check can (should?) be injected to
> >> > awaitPartitionMapExchange() and similar methods.
> >> > 3) Update Jepsen tests with RR checks.
> >> > 4) Introduce per partition RR.
> >> >
> >> > So, the final goal is to be sure that Ignite produces only consistent
> >> data
> >> > and to have a feature to solve consistency in case we gain
> inconsistent
> >> > state somehow.
> >> >
> >> > Limitations?
> >> >
> >> > Currently, RR has some limitations, but they are not related to real
> >> > production cases.
> >> > In case someone interested to support, for example, MVCC or near
> caches,
> >> > please, feel free to contribute.
> >> >
> >> > [1]
> >> >
> >>
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> >> > [2]
> >> >
> >>
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> >> > [3]
> >> >
> >>
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> >> >
> >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org>
> wrote:
> >> >
> >> >> Slava,
> >> >>
> >> >> >> I will take a look at your pull request if you don't mind.
> >> >> Great news!
> >> >>
> >> >> >> In any way, could you please update the IEP page with the list of
> >> >> >> constraints/limitations of the proposed approach, TODOs, etc?
> >> >> Not sure we should keep this at IEP until list (#4 from original
> >> letter)
> >> >> is not confirmed.
> >> >>
> >> >> Currently, I'm checking RunAll with RR enabled to almost each get
> >> request.
> >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> >> >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
> >> >> For now I have 60 failed tests and amount decreasing.
> >> >>
> >> >> >> For instance, I would like to see all these limitations on the IEP
> >> >> page as
> >> >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> >> >> ticket
> >> >> >> in order to track all activities related to `Read Repair` feature.
> >> >> Let's do this at merge day to be sure useless issues will not be
> >> created.
> >> >>
> >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> >> >> slava.koptilin@gmail.com> wrote:
> >> >>
> >> >>> Hi Anton,
> >> >>>
> >> >>> I will take a look at your pull request if you don't mind.
> >> >>>
> >> >>> In any way, could you please update the IEP page with the list of
> >> >>> constraints/limitations of the proposed approach, TODOs, etc?
> >> >>> For instance, I would like to see all these limitations on the IEP
> >> page
> >> >>> as
> >> >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> >> ticket
> >> >>> in order to track all activities related to `Read Repair` feature.
> >> >>>
> >> >>> Thanks,
> >> >>> S.
> >> >>>
> >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
> >> >>>
> >> >>> > Igniters,
> >> >>> > I'm glad to introduce Read Repair feature [0] provides additional
> >> >>> > consistency guarantee for Ignite.
> >> >>> >
> >> >>> > 1) Why we need it?
> >> >>> > The detailed explanation can be found at IEP-31 [1].
> >> >>> > In short, because of bugs, it's possible to gain an inconsistent
> >> state.
> >> >>> > We need additional features to handle this case.
> >> >>> >
> >> >>> > Currently we able to check cluster using Idle_verify [2] feature,
> >> but
> >> >>> it
> >> >>> > will not fix the data, will not even tell which entries are
> broken.
> >> >>> > Read Repair is a feature to understand which entries are broken
> and
> >> to
> >> >>> fix
> >> >>> > them.
> >> >>> >
> >> >>> > 1) How it works?
> >> >>> > IgniteCache now able to provide special proxy [3]
> withReadRepair().
> >> >>> > This proxy guarantee that data will be gained from all owners and
> >> >>> compared.
> >> >>> > In the case of consistency violation situation, data will be
> >> recovered
> >> >>> and
> >> >>> > a special event recorded.
> >> >>> >
> >> >>> > 3) Naming?
> >> >>> > Feature name based on Cassandra's Read Repair feature [4], which
> is
> >> >>> pretty
> >> >>> > similar.
> >> >>> >
> >> >>> > 4) Limitations which can be fixed in the future?
> >> >>> >   * MVCC and Near caches are not supported.
> >> >>> >   * Atomic caches can be checked (false positive case is possible
> on
> >> >>> this
> >> >>> > check), but can't be recovered.
> >> >>> >   * Partial entry removal can't be recovered.
> >> >>> >   * Entries streamed using data streamer (using not a "cache.put"
> >> based
> >> >>> > updater) and loaded by cache.load
> >> >>> >   are perceived as inconsistent since they may have different
> >> versions
> >> >>> for
> >> >>> > same keys.
> >> >>> >   * Only explicit get operations are supported (getAndReplace,
> >> >>> getAndPut,
> >> >>> > etc can be supported in future).
> >> >>> >
> >> >>> > 5) What's left?
> >> >>> >   * SQL/ThinClient/etc support.
> >> >>> >   * Metrics (found/repaired).
> >> >>> >   * Simple per-partition recovery feature able to work in the
> >> >>> background in
> >> >>> > addition to per-entry recovery feature.
> >> >>> >
> >> >>> > 6) Is code checked?
> >> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> >> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for
> >> every
> >> >>> get()
> >> >>> > request) - has a limited amount of failures (because of data
> >> streamer,
> >> >>> > cache.load, etc).
> >> >>> >
> >> >>> > Thoughts?
> >> >>> >
> >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> >> >>> > [1]
> >> >>> >
> >> >>> >
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> >> >>> > [2]
> >> >>> >
> >> >>> >
> >> >>>
> >>
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> >> >>> > [3]
> >> >>> >
> >> >>> >
> >> >>>
> >>
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> >> >>> > [4]
> >> >>> >
> >> >>> >
> >> >>>
> >>
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> >> >>> > [5] https://github.com/apache/ignite/pull/5656
> >> >>> > [6] https://github.com/apache/ignite/pull/6575
> >> >>> >
> >> >>>
> >> >>
> >>
> >
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Perhaps it would be a good idea to think about a recovery tool /
control-utility command that would allow achieving the same goal.
If I am not mistaken, it was already proposed in this email thread.

Thanks,
S.

ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <sl...@gmail.com>:

> Hi Anton,
>
> Well, the ReadRepair feature is finally merged and that is good news :)
>
> Unfortunately, I cannot find a consensus about the whole functionality in
> any of these topics:
>  -
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
>  -
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> Also, there are no comments/discussion in JIRA. That makes me sad :(
> especially when a feature is huge, not obvious and involves changing public
> API (and that is the case, I think).
>
> Anyway, I left some comments in your pull-request at github. Please take a
> look. The most annoying thing is poorly documented code :(
> By the way, is it required to add test related to fail-over scenarios?
>
> I just checked, the IEP page and I still cannot find Jira tickets that
> should cover existing limitations/improvements.
> I would suggest creating the following tasks, at least:
>  - MVCC support
>  - Near caches
>  - Additional metrics (number of violations, number of repaired entries
> etc)
>  - Ignite C++ (It looks like, .Net is already aware of that feature)
>  - Thin clients support
>  - Perhaps, it would be useful to support different strategies to resolve
> inconsistencies
>
> What do you think?
>
> Thanks,
> S.
>
>
> ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:
>
>> Folks,
>>
>> Thanks to everyone for tips and reviews.
>> Yardstick checked, no performance drop found.
>> Additional measurement: RR get() is just up to 7% slower than regular get
>> on real benchmarks (8 clients, 4 servers, 3 backups)
>> Code merged to the master.
>> "Must have" tasks created and attached to the IEP.
>>
>> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org> wrote:
>>
>> > Folks,
>> >
>> > Just a minor update.
>> >
>> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
>> > left, started with 6k :)).
>> > During the analisys, I've found some tests with
>> > - unexpected repairs at tx caches
>> > - inconsistent state after the test finished (different entries across
>> the
>> > topology)
>> > For example,
>> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
>> > versions on backups in case of retry, fixed [2]
>> > - initial cache load generates not equal versions on backups, fixed [3]
>> > - testAccountTxNodeRestart causes unexpected repairs (entries have
>> > different versions), to be investigated.
>> >
>> > What's next?
>> >
>> > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
>> > becomes fully green.
>> > 2) Going to add special check after each test which will ensure caches
>> > content after the test is consistent.
>> > 2.1) The Same check can (should?) be injected to
>> > awaitPartitionMapExchange() and similar methods.
>> > 3) Update Jepsen tests with RR checks.
>> > 4) Introduce per partition RR.
>> >
>> > So, the final goal is to be sure that Ignite produces only consistent
>> data
>> > and to have a feature to solve consistency in case we gain inconsistent
>> > state somehow.
>> >
>> > Limitations?
>> >
>> > Currently, RR has some limitations, but they are not related to real
>> > production cases.
>> > In case someone interested to support, for example, MVCC or near caches,
>> > please, feel free to contribute.
>> >
>> > [1]
>> >
>> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
>> > [2]
>> >
>> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
>> > [3]
>> >
>> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
>> >
>> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org> wrote:
>> >
>> >> Slava,
>> >>
>> >> >> I will take a look at your pull request if you don't mind.
>> >> Great news!
>> >>
>> >> >> In any way, could you please update the IEP page with the list of
>> >> >> constraints/limitations of the proposed approach, TODOs, etc?
>> >> Not sure we should keep this at IEP until list (#4 from original
>> letter)
>> >> is not confirmed.
>> >>
>> >> Currently, I'm checking RunAll with RR enabled to almost each get
>> request.
>> >> "Almost" means: readRepair = !ctx.readThrough() &&
>> >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
>> >> For now I have 60 failed tests and amount decreasing.
>> >>
>> >> >> For instance, I would like to see all these limitations on the IEP
>> >> page as
>> >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
>> >> ticket
>> >> >> in order to track all activities related to `Read Repair` feature.
>> >> Let's do this at merge day to be sure useless issues will not be
>> created.
>> >>
>> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
>> >> slava.koptilin@gmail.com> wrote:
>> >>
>> >>> Hi Anton,
>> >>>
>> >>> I will take a look at your pull request if you don't mind.
>> >>>
>> >>> In any way, could you please update the IEP page with the list of
>> >>> constraints/limitations of the proposed approach, TODOs, etc?
>> >>> For instance, I would like to see all these limitations on the IEP
>> page
>> >>> as
>> >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
>> ticket
>> >>> in order to track all activities related to `Read Repair` feature.
>> >>>
>> >>> Thanks,
>> >>> S.
>> >>>
>> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
>> >>>
>> >>> > Igniters,
>> >>> > I'm glad to introduce Read Repair feature [0] provides additional
>> >>> > consistency guarantee for Ignite.
>> >>> >
>> >>> > 1) Why we need it?
>> >>> > The detailed explanation can be found at IEP-31 [1].
>> >>> > In short, because of bugs, it's possible to gain an inconsistent
>> state.
>> >>> > We need additional features to handle this case.
>> >>> >
>> >>> > Currently we able to check cluster using Idle_verify [2] feature,
>> but
>> >>> it
>> >>> > will not fix the data, will not even tell which entries are broken.
>> >>> > Read Repair is a feature to understand which entries are broken and
>> to
>> >>> fix
>> >>> > them.
>> >>> >
>> >>> > 1) How it works?
>> >>> > IgniteCache now able to provide special proxy [3] withReadRepair().
>> >>> > This proxy guarantee that data will be gained from all owners and
>> >>> compared.
>> >>> > In the case of consistency violation situation, data will be
>> recovered
>> >>> and
>> >>> > a special event recorded.
>> >>> >
>> >>> > 3) Naming?
>> >>> > Feature name based on Cassandra's Read Repair feature [4], which is
>> >>> pretty
>> >>> > similar.
>> >>> >
>> >>> > 4) Limitations which can be fixed in the future?
>> >>> >   * MVCC and Near caches are not supported.
>> >>> >   * Atomic caches can be checked (false positive case is possible on
>> >>> this
>> >>> > check), but can't be recovered.
>> >>> >   * Partial entry removal can't be recovered.
>> >>> >   * Entries streamed using data streamer (using not a "cache.put"
>> based
>> >>> > updater) and loaded by cache.load
>> >>> >   are perceived as inconsistent since they may have different
>> versions
>> >>> for
>> >>> > same keys.
>> >>> >   * Only explicit get operations are supported (getAndReplace,
>> >>> getAndPut,
>> >>> > etc can be supported in future).
>> >>> >
>> >>> > 5) What's left?
>> >>> >   * SQL/ThinClient/etc support.
>> >>> >   * Metrics (found/repaired).
>> >>> >   * Simple per-partition recovery feature able to work in the
>> >>> background in
>> >>> > addition to per-entry recovery feature.
>> >>> >
>> >>> > 6) Is code checked?
>> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
>> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for
>> every
>> >>> get()
>> >>> > request) - has a limited amount of failures (because of data
>> streamer,
>> >>> > cache.load, etc).
>> >>> >
>> >>> > Thoughts?
>> >>> >
>> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
>> >>> > [1]
>> >>> >
>> >>> >
>> >>>
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
>> >>> > [2]
>> >>> >
>> >>> >
>> >>>
>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> >>> > [3]
>> >>> >
>> >>> >
>> >>>
>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
>> >>> > [4]
>> >>> >
>> >>> >
>> >>>
>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>> >>> > [5] https://github.com/apache/ignite/pull/5656
>> >>> > [6] https://github.com/apache/ignite/pull/6575
>> >>> >
>> >>>
>> >>
>>
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Hi Anton,

Well, the ReadRepair feature is finally merged and that is good news :)

Unfortunately, I cannot find a consensus about the whole functionality in
any of these topics:
 -
http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
 -
http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
Also, there are no comments/discussions in JIRA. That makes me sad :(
especially when a feature is huge, not obvious, and involves changing the
public API (and that is the case, I think).

Anyway, I left some comments in your pull request on GitHub. Please take a
look. The most annoying thing is the poorly documented code :(
By the way, is it required to add tests related to fail-over scenarios?

I just checked the IEP page, and I still cannot find JIRA tickets that
would cover the existing limitations/improvements.
I would suggest creating the following tasks, at least:
 - MVCC support
 - Near caches
 - Additional metrics (number of violations, number of repaired entries, etc.)
 - Ignite C++ (it looks like .NET is already aware of the feature)
 - Thin client support
 - Perhaps, it would be useful to support different strategies for resolving
inconsistencies

What do you think?

Thanks,
S.


ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av...@apache.org>:

> Folks,
>
> Thanks to everyone for tips and reviews.
> Yardstick checked, no performance drop found.
> Additional measurement: RR get() is just up to 7% slower than regular get
> on real benchmarks (8 clients, 4 servers, 3 backups)
> Code merged to the master.
> "Must have" tasks created and attached to the IEP.
>
> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org> wrote:
>
> > Folks,
> >
> > Just a minor update.
> >
> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
> > left, started with 6k :)).
> > During the analisys, I've found some tests with
> > - unexpected repairs at tx caches
> > - inconsistent state after the test finished (different entries across
> the
> > topology)
> > For example,
> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> > versions on backups in case of retry, fixed [2]
> > - initial cache load generates not equal versions on backups, fixed [3]
> > - testAccountTxNodeRestart causes unexpected repairs (entries have
> > different versions), to be investigated.
> >
> > What's next?
> >
> > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> > becomes fully green.
> > 2) Going to add special check after each test which will ensure caches
> > content after the test is consistent.
> > 2.1) The Same check can (should?) be injected to
> > awaitPartitionMapExchange() and similar methods.
> > 3) Update Jepsen tests with RR checks.
> > 4) Introduce per partition RR.
> >
> > So, the final goal is to be sure that Ignite produces only consistent
> data
> > and to have a feature to solve consistency in case we gain inconsistent
> > state somehow.
> >
> > Limitations?
> >
> > Currently, RR has some limitations, but they are not related to real
> > production cases.
> > In case someone interested to support, for example, MVCC or near caches,
> > please, feel free to contribute.
> >
> > [1]
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > [2]
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > [3]
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> >
> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org> wrote:
> >
> >> Slava,
> >>
> >> >> I will take a look at your pull request if you don't mind.
> >> Great news!
> >>
> >> >> In any way, could you please update the IEP page with the list of
> >> >> constraints/limitations of the proposed approach, TODOs, etc?
> >> Not sure we should keep this at IEP until list (#4 from original letter)
> >> is not confirmed.
> >>
> >> Currently, I'm checking RunAll with RR enabled to almost each get
> request.
> >> "Almost" means: readRepair = !ctx.readThrough() &&
> >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
> >> For now I have 60 failed tests and amount decreasing.
> >>
> >> >> For instance, I would like to see all these limitations on the IEP
> >> page as
> >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> >> ticket
> >> >> in order to track all activities related to `Read Repair` feature.
> >> Let's do this at merge day to be sure useless issues will not be
> created.
> >>
> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> >> slava.koptilin@gmail.com> wrote:
> >>
> >>> Hi Anton,
> >>>
> >>> I will take a look at your pull request if you don't mind.
> >>>
> >>> In any way, could you please update the IEP page with the list of
> >>> constraints/limitations of the proposed approach, TODOs, etc?
> >>> For instance, I would like to see all these limitations on the IEP page
> >>> as
> >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> ticket
> >>> in order to track all activities related to `Read Repair` feature.
> >>>
> >>> Thanks,
> >>> S.
> >>>
> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
> >>>
> >>> > Igniters,
> >>> > I'm glad to introduce Read Repair feature [0] provides additional
> >>> > consistency guarantee for Ignite.
> >>> >
> >>> > 1) Why we need it?
> >>> > The detailed explanation can be found at IEP-31 [1].
> >>> > In short, because of bugs, it's possible to gain an inconsistent
> state.
> >>> > We need additional features to handle this case.
> >>> >
> >>> > Currently we able to check cluster using Idle_verify [2] feature, but
> >>> it
> >>> > will not fix the data, will not even tell which entries are broken.
> >>> > Read Repair is a feature to understand which entries are broken and
> to
> >>> fix
> >>> > them.
> >>> >
> >>> > 1) How it works?
> >>> > IgniteCache now able to provide special proxy [3] withReadRepair().
> >>> > This proxy guarantee that data will be gained from all owners and
> >>> compared.
> >>> > In the case of consistency violation situation, data will be
> recovered
> >>> and
> >>> > a special event recorded.
> >>> >
> >>> > 3) Naming?
> >>> > Feature name based on Cassandra's Read Repair feature [4], which is
> >>> pretty
> >>> > similar.
> >>> >
> >>> > 4) Limitations which can be fixed in the future?
> >>> >   * MVCC and Near caches are not supported.
> >>> >   * Atomic caches can be checked (false positive case is possible on
> >>> this
> >>> > check), but can't be recovered.
> >>> >   * Partial entry removal can't be recovered.
> >>> >   * Entries streamed using data streamer (using not a "cache.put"
> based
> >>> > updater) and loaded by cache.load
> >>> >   are perceived as inconsistent since they may have different
> versions
> >>> for
> >>> > same keys.
> >>> >   * Only explicit get operations are supported (getAndReplace,
> >>> getAndPut,
> >>> > etc can be supported in future).
> >>> >
> >>> > 5) What's left?
> >>> >   * SQL/ThinClient/etc support.
> >>> >   * Metrics (found/repaired).
> >>> >   * Simple per-partition recovery feature able to work in the
> >>> background in
> >>> > addition to per-entry recovery feature.
> >>> >
> >>> > 6) Is code checked?
> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
> >>> get()
> >>> > request) - has a limited amount of failures (because of data
> streamer,
> >>> > cache.load, etc).
> >>> >
> >>> > Thoughts?
> >>> >
> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> >>> > [1]
> >>> >
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> >>> > [2]
> >>> >
> >>> >
> >>>
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> >>> > [3]
> >>> >
> >>> >
> >>>
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> >>> > [4]
> >>> >
> >>> >
> >>>
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> >>> > [5] https://github.com/apache/ignite/pull/5656
> >>> > [6] https://github.com/apache/ignite/pull/6575
> >>> >
> >>>
> >>
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Folks,

Thanks to everyone for the tips and reviews.
Yardstick was checked; no performance drop was found.
Additional measurement: an RR get() is only up to 7% slower than a regular get
on real benchmarks (8 clients, 4 servers, 3 backups).
The code is merged to master.
"Must have" tasks have been created and attached to the IEP.

On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av...@apache.org> wrote:

> Folks,
>
> Just a minor update.
>
> RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
> left, started with 6k :)).
> During the analisys, I've found some tests with
> - unexpected repairs at tx caches
> - inconsistent state after the test finished (different entries across the
> topology)
> For example,
> - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> versions on backups in case of retry, fixed [2]
> - initial cache load generates not equal versions on backups, fixed [3]
> - testAccountTxNodeRestart causes unexpected repairs (entries have
> different versions), to be investigated.
>
> What's next?
>
> 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> becomes fully green.
> 2) Going to add special check after each test which will ensure caches
> content after the test is consistent.
> 2.1) The Same check can (should?) be injected to
> awaitPartitionMapExchange() and similar methods.
> 3) Update Jepsen tests with RR checks.
> 4) Introduce per partition RR.
>
> So, the final goal is to be sure that Ignite produces only consistent data
> and to have a feature to solve consistency in case we gain inconsistent
> state somehow.
>
> Limitations?
>
> Currently, RR has some limitations, but they are not related to real
> production cases.
> In case someone interested to support, for example, MVCC or near caches,
> please, feel free to contribute.
>
> [1]
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> [2]
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> [3]
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
>
> On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org> wrote:
>
>> Slava,
>>
>> >> I will take a look at your pull request if you don't mind.
>> Great news!
>>
>> >> In any way, could you please update the IEP page with the list of
>> >> constraints/limitations of the proposed approach, TODOs, etc?
>> Not sure we should keep this at IEP until list (#4 from original letter)
>> is not confirmed.
>>
>> Currently, I'm checking RunAll with RR enabled to almost each get request.
>> "Almost" means: readRepair = !ctx.readThrough() &&
>> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
>> For now I have 60 failed tests and amount decreasing.
>>
>> >> For instance, I would like to see all these limitations on the IEP
>> page as
>> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
>> ticket
>> >> in order to track all activities related to `Read Repair` feature.
>> Let's do this at merge day to be sure useless issues will not be created.
>>
>> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
>> slava.koptilin@gmail.com> wrote:
>>
>>> Hi Anton,
>>>
>>> I will take a look at your pull request if you don't mind.
>>>
>>> In any way, could you please update the IEP page with the list of
>>> constraints/limitations of the proposed approach, TODOs, etc?
>>> For instance, I would like to see all these limitations on the IEP page
>>> as
>>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
>>> in order to track all activities related to `Read Repair` feature.
>>>
>>> Thanks,
>>> S.
>>>
>>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
>>>
>>> > Igniters,
>>> > I'm glad to introduce Read Repair feature [0] provides additional
>>> > consistency guarantee for Ignite.
>>> >
>>> > 1) Why we need it?
>>> > The detailed explanation can be found at IEP-31 [1].
>>> > In short, because of bugs, it's possible to gain an inconsistent state.
>>> > We need additional features to handle this case.
>>> >
>>> > Currently we able to check cluster using Idle_verify [2] feature, but
>>> it
>>> > will not fix the data, will not even tell which entries are broken.
>>> > Read Repair is a feature to understand which entries are broken and to
>>> fix
>>> > them.
>>> >
>>> > 1) How it works?
>>> > IgniteCache now able to provide special proxy [3] withReadRepair().
>>> > This proxy guarantee that data will be gained from all owners and
>>> compared.
>>> > In the case of consistency violation situation, data will be recovered
>>> and
>>> > a special event recorded.
>>> >
>>> > 3) Naming?
>>> > Feature name based on Cassandra's Read Repair feature [4], which is
>>> pretty
>>> > similar.
>>> >
>>> > 4) Limitations which can be fixed in the future?
>>> >   * MVCC and Near caches are not supported.
>>> >   * Atomic caches can be checked (false positive case is possible on
>>> this
>>> > check), but can't be recovered.
>>> >   * Partial entry removal can't be recovered.
>>> >   * Entries streamed using data streamer (using not a "cache.put" based
>>> > updater) and loaded by cache.load
>>> >   are perceived as inconsistent since they may have different versions
>>> for
>>> > same keys.
>>> >   * Only explicit get operations are supported (getAndReplace,
>>> getAndPut,
>>> > etc can be supported in future).
>>> >
>>> > 5) What's left?
>>> >   * SQL/ThinClient/etc support.
>>> >   * Metrics (found/repaired).
>>> >   * Simple per-partition recovery feature able to work in the
>>> background in
>>> > addition to per-entry recovery feature.
>>> >
>>> > 6) Is code checked?
>>> >   * Pull Request #5656 [5] (feature) - has green TC.
>>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
>>> get()
>>> > request) - has a limited amount of failures (because of data streamer,
>>> > cache.load, etc).
>>> >
>>> > Thoughts?
>>> >
>>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
>>> > [1]
>>> >
>>> >
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
>>> > [2]
>>> >
>>> >
>>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>>> > [3]
>>> >
>>> >
>>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
>>> > [4]
>>> >
>>> >
>>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>>> > [5] https://github.com/apache/ignite/pull/5656
>>> > [6] https://github.com/apache/ignite/pull/6575
>>> >
>>>
>>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Folks,

Just a minor update.

RunAll [1] with the ReadRepair proxy enabled is almost green now (~10 tests
left, started with 6k :)).
During the analysis, I've found some tests with
- unexpected repairs on tx caches,
- an inconsistent state after the test finished (different entries across the
topology).
For example:
- testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
versions on backups in case of retry, fixed [2],
- the initial cache load generates unequal versions on backups, fixed [3],
- testAccountTxNodeRestart causes unexpected repairs (entries have
different versions), to be investigated.

What's next?

1) Merge the solution once "RunAll with ReadRepair enabled" becomes fully
green.
2) Add a special check after each test which ensures that the cache content
is consistent once the test finishes (see the sketch after this list).
2.1) The same check can (should?) be injected into awaitPartitionMapExchange()
and similar methods.
3) Update the Jepsen tests with RR checks.
4) Introduce per-partition RR.
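
A rough sketch of what the per-test check from item 2 could look like (purely
illustrative, not the actual test-framework hook; skipping of caches the proxy
does not support is omitted for brevity):

    import javax.cache.Cache;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    class ConsistencyCheck {
        // Read every key of every cache through the Read Repair proxy: a divergence
        // between owners is either repaired (tx caches) or at least reported.
        static void assertCachesConsistent(Ignite ignite) {
            for (String cacheName : ignite.cacheNames()) {
                IgniteCache<Object, Object> cache = ignite.cache(cacheName);
                IgniteCache<Object, Object> rrCache = cache.withReadRepair();

                for (Cache.Entry<Object, Object> e : cache) // scans all entries
                    rrCache.get(e.getKey());
            }
        }
    }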

So, the final goal is to be sure that Ignite produces only consistent data
and to have a feature that restores consistency in case we somehow end up in
an inconsistent state.

Limitations?

Currently, RR has some limitations, but they are not related to real
production cases.
If someone is interested in supporting, for example, MVCC or near caches,
please feel free to contribute.

[1]
https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
[2]
https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
[3]
https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13

On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av...@apache.org> wrote:

> Slava,
>
> >> I will take a look at your pull request if you don't mind.
> Great news!
>
> >> In any way, could you please update the IEP page with the list of
> >> constraints/limitations of the proposed approach, TODOs, etc?
> Not sure we should keep this at IEP until list (#4 from original letter)
> is not confirmed.
>
> Currently, I'm checking RunAll with RR enabled to almost each get request.
> "Almost" means: readRepair = !ctx.readThrough() &&
> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
> For now I have 60 failed tests and amount decreasing.
>
> >> For instance, I would like to see all these limitations on the IEP page
> as
> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> ticket
> >> in order to track all activities related to `Read Repair` feature.
> Let's do this at merge day to be sure useless issues will not be created.
>
> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> slava.koptilin@gmail.com> wrote:
>
>> Hi Anton,
>>
>> I will take a look at your pull request if you don't mind.
>>
>> In any way, could you please update the IEP page with the list of
>> constraints/limitations of the proposed approach, TODOs, etc?
>> For instance, I would like to see all these limitations on the IEP page as
>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
>> in order to track all activities related to `Read Repair` feature.
>>
>> Thanks,
>> S.
>>
>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
>>
>> > Igniters,
>> > I'm glad to introduce Read Repair feature [0] provides additional
>> > consistency guarantee for Ignite.
>> >
>> > 1) Why we need it?
>> > The detailed explanation can be found at IEP-31 [1].
>> > In short, because of bugs, it's possible to gain an inconsistent state.
>> > We need additional features to handle this case.
>> >
>> > Currently we able to check cluster using Idle_verify [2] feature, but it
>> > will not fix the data, will not even tell which entries are broken.
>> > Read Repair is a feature to understand which entries are broken and to
>> fix
>> > them.
>> >
>> > 1) How it works?
>> > IgniteCache now able to provide special proxy [3] withReadRepair().
>> > This proxy guarantee that data will be gained from all owners and
>> compared.
>> > In the case of consistency violation situation, data will be recovered
>> and
>> > a special event recorded.
>> >
>> > 3) Naming?
>> > Feature name based on Cassandra's Read Repair feature [4], which is
>> pretty
>> > similar.
>> >
>> > 4) Limitations which can be fixed in the future?
>> >   * MVCC and Near caches are not supported.
>> >   * Atomic caches can be checked (false positive case is possible on
>> this
>> > check), but can't be recovered.
>> >   * Partial entry removal can't be recovered.
>> >   * Entries streamed using data streamer (using not a "cache.put" based
>> > updater) and loaded by cache.load
>> >   are perceived as inconsistent since they may have different versions
>> for
>> > same keys.
>> >   * Only explicit get operations are supported (getAndReplace,
>> getAndPut,
>> > etc can be supported in future).
>> >
>> > 5) What's left?
>> >   * SQL/ThinClient/etc support.
>> >   * Metrics (found/repaired).
>> >   * Simple per-partition recovery feature able to work in the
>> background in
>> > addition to per-entry recovery feature.
>> >
>> > 6) Is code checked?
>> >   * Pull Request #5656 [5] (feature) - has green TC.
>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
>> get()
>> > request) - has a limited amount of failures (because of data streamer,
>> > cache.load, etc).
>> >
>> > Thoughts?
>> >
>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
>> > [1]
>> >
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
>> > [2]
>> >
>> >
>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> > [3]
>> >
>> >
>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
>> > [4]
>> >
>> >
>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>> > [5] https://github.com/apache/ignite/pull/5656
>> > [6] https://github.com/apache/ignite/pull/6575
>> >
>>
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Slava,

>> I will take a look at your pull request if you don't mind.
Great news!

>> In any way, could you please update the IEP page with the list of
>> constraints/limitations of the proposed approach, TODOs, etc?
I'm not sure we should put this on the IEP page until the list (#4 from the
original letter) is confirmed.

Currently, I'm checking RunAll with RR enabled for almost every get request.
"Almost" means: readRepair = !ctx.readThrough() &&
ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
For now I have 60 failed tests, and the amount is decreasing.
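
Spelled out as a small helper with a note on each condition (assuming ctx here
is the internal cache context, GridCacheContext; this just restates the
expression above):

    import org.apache.ignite.internal.processors.cache.GridCacheContext;

    class ReadRepairGate {
        // Mirrors the condition above; only caches where RR currently makes sense.
        static boolean readRepairApplicable(GridCacheContext<?, ?> ctx) {
            return !ctx.readThrough()             // skip caches with read-through enabled
                && ctx.config().getBackups() > 0  // need backup copies to compare against
                && !ctx.isNear()                  // near caches are not supported yet
                && !ctx.mvccEnabled();            // MVCC caches are not supported yet
        }
    }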

>> For instance, I would like to see all these limitations on the IEP page
as
>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
>> in order to track all activities related to `Read Repair` feature.
Let's do this on merge day to be sure useless issues will not be created.

On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <sl...@gmail.com>
wrote:

> Hi Anton,
>
> I will take a look at your pull request if you don't mind.
>
> In any way, could you please update the IEP page with the list of
> constraints/limitations of the proposed approach, TODOs, etc?
> For instance, I would like to see all these limitations on the IEP page as
> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
> in order to track all activities related to `Read Repair` feature.
>
> Thanks,
> S.
>
> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:
>
> > Igniters,
> > I'm glad to introduce Read Repair feature [0] provides additional
> > consistency guarantee for Ignite.
> >
> > 1) Why we need it?
> > The detailed explanation can be found at IEP-31 [1].
> > In short, because of bugs, it's possible to gain an inconsistent state.
> > We need additional features to handle this case.
> >
> > Currently we able to check cluster using Idle_verify [2] feature, but it
> > will not fix the data, will not even tell which entries are broken.
> > Read Repair is a feature to understand which entries are broken and to
> fix
> > them.
> >
> > 1) How it works?
> > IgniteCache now able to provide special proxy [3] withReadRepair().
> > This proxy guarantee that data will be gained from all owners and
> compared.
> > In the case of consistency violation situation, data will be recovered
> and
> > a special event recorded.
> >
> > 3) Naming?
> > Feature name based on Cassandra's Read Repair feature [4], which is
> pretty
> > similar.
> >
> > 4) Limitations which can be fixed in the future?
> >   * MVCC and Near caches are not supported.
> >   * Atomic caches can be checked (false positive case is possible on this
> > check), but can't be recovered.
> >   * Partial entry removal can't be recovered.
> >   * Entries streamed using data streamer (using not a "cache.put" based
> > updater) and loaded by cache.load
> >   are perceived as inconsistent since they may have different versions
> for
> > same keys.
> >   * Only explicit get operations are supported (getAndReplace, getAndPut,
> > etc can be supported in future).
> >
> > 5) What's left?
> >   * SQL/ThinClient/etc support.
> >   * Metrics (found/repaired).
> >   * Simple per-partition recovery feature able to work in the background
> in
> > addition to per-entry recovery feature.
> >
> > 6) Is code checked?
> >   * Pull Request #5656 [5] (feature) - has green TC.
> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
> get()
> > request) - has a limited amount of failures (because of data streamer,
> > cache.load, etc).
> >
> > Thoughts?
> >
> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > [2]
> >
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > [3]
> >
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > [4]
> >
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > [5] https://github.com/apache/ignite/pull/5656
> > [6] https://github.com/apache/ignite/pull/6575
> >
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Hi Anton,

I will take a look at your pull request if you don't mind.

In any case, could you please update the IEP page with the list of
constraints/limitations of the proposed approach, TODOs, etc.?
For instance, I would like to see all these limitations on the IEP page as
JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
in order to track all activities related to the `Read Repair` feature.

Thanks,
S.

чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av...@apache.org>:

> Igniters,
> I'm glad to introduce Read Repair feature [0] provides additional
> consistency guarantee for Ignite.
>
> 1) Why we need it?
> The detailed explanation can be found at IEP-31 [1].
> In short, because of bugs, it's possible to gain an inconsistent state.
> We need additional features to handle this case.
>
> Currently we able to check cluster using Idle_verify [2] feature, but it
> will not fix the data, will not even tell which entries are broken.
> Read Repair is a feature to understand which entries are broken and to fix
> them.
>
> 1) How it works?
> IgniteCache now able to provide special proxy [3] withReadRepair().
> This proxy guarantee that data will be gained from all owners and compared.
> In the case of consistency violation situation, data will be recovered and
> a special event recorded.
>
> 3) Naming?
> Feature name based on Cassandra's Read Repair feature [4], which is pretty
> similar.
>
> 4) Limitations which can be fixed in the future?
>   * MVCC and Near caches are not supported.
>   * Atomic caches can be checked (false positive case is possible on this
> check), but can't be recovered.
>   * Partial entry removal can't be recovered.
>   * Entries streamed using data streamer (using not a "cache.put" based
> updater) and loaded by cache.load
>   are perceived as inconsistent since they may have different versions for
> same keys.
>   * Only explicit get operations are supported (getAndReplace, getAndPut,
> etc can be supported in future).
>
> 5) What's left?
>   * SQL/ThinClient/etc support.
>   * Metrics (found/repaired).
>   * Simple per-partition recovery feature able to work in the background in
> addition to per-entry recovery feature.
>
> 6) Is code checked?
>   * Pull Request #5656 [5] (feature) - has green TC.
>   * Pull Request #6575 [6] (RunAll with the feature enabled for every get()
> request) - has a limited amount of failures (because of data streamer,
> cache.load, etc).
>
> Thoughts?
>
> [0] https://issues.apache.org/jira/browse/IGNITE-10663
> [1]
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> [2]
>
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> [3]
>
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> [4]
>
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> [5] https://github.com/apache/ignite/pull/5656
> [6] https://github.com/apache/ignite/pull/6575
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Anton Vinogradov <av...@apache.org>.
Nikolay,

>> Should we consider it's removal in Ignite 3
Don't think so.

My initial ReadRepair implementation uses the entry version to detect
inconsistency.
The strategy can be changed later (most likely it will be), or we can even
provide the ability to plug in a custom strategy.
The data streamer and cache.load cases can be supported by a simple value
check as a second phase of the checks, roughly as sketched below.
As I mentioned privately, all limitations are limitations of the initial
version and can be eliminated in the future.
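
A hypothetical sketch of that two-phase idea; the type and method names below
are made up for illustration and are not the merged implementation:

    import java.util.Collection;

    /** Hypothetical view of one copy of an entry as seen on one owner. */
    interface VersionedCopy {
        Object version();
        Object value();

        /** Two-phase check: entry versions first, raw values as a fallback. */
        static boolean looksConsistent(Collection<VersionedCopy> copies) {
            // Phase 1: all owners report the same version -> consistent.
            if (copies.stream().map(VersionedCopy::version).distinct().count() <= 1)
                return true;

            // Phase 2 (idea from this thread): tolerate the version skew produced by
            // the data streamer / cache.load as long as the values themselves match.
            return copies.stream().map(VersionedCopy::value).distinct().count() <= 1;
        }
    }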

Currently, the main goal is to merge the solution so the work can continue.
* After the merge, it will be possible to split the work on this task
(metrics, whole-partition fix, data streamers, etc.).
* It seems I found a consistency issue while checking my feature (a backup can
be outdated for some time after a tx finishes in FULL_SYNC).
The merge will provide the ability to investigate the bug deeper and fix it if
confirmed.

On Thu, Jun 20, 2019 at 2:22 PM Nikolay Izhikov <ni...@apache.org> wrote:

> Anton.
>
> I worried about this limitation:
>
> > Entries streamed using data streamer (using not a "cache.put" based
> updater) and loaded by cache.load.
>
> As we discussed privately in this modes
>
> *ALL ENTRIES ON ALL OWNERS WILL HAVE DIFFERENT VERSIONS*
>
> Why we need this modes, in the first place?
> Should we consider it's removal in Ignite 3 or should we fix them?
>
> В Чт, 20/06/2019 в 14:15 +0300, Anton Vinogradov пишет:
> > Igniters,
> > I'm glad to introduce Read Repair feature [0] provides additional
> > consistency guarantee for Ignite.
> >
> > 1) Why we need it?
> > The detailed explanation can be found at IEP-31 [1].
> > In short, because of bugs, it's possible to gain an inconsistent state.
> > We need additional features to handle this case.
> >
> > Currently we able to check cluster using Idle_verify [2] feature, but it
> > will not fix the data, will not even tell which entries are broken.
> > Read Repair is a feature to understand which entries are broken and to
> fix
> > them.
> >
> > 1) How it works?
> > IgniteCache now able to provide special proxy [3] withReadRepair().
> > This proxy guarantee that data will be gained from all owners and
> compared.
> > In the case of consistency violation situation, data will be recovered
> and
> > a special event recorded.
> >
> > 3) Naming?
> > Feature name based on Cassandra's Read Repair feature [4], which is
> pretty
> > similar.
> >
> > 4) Limitations which can be fixed in the future?
> >   * MVCC and Near caches are not supported.
> >   * Atomic caches can be checked (false positive case is possible on this
> > check), but can't be recovered.
> >   * Partial entry removal can't be recovered.
> >   * Entries streamed using data streamer (using not a "cache.put" based
> > updater) and loaded by cache.load
> >   are perceived as inconsistent since they may have different versions
> for
> > same keys.
> >   * Only explicit get operations are supported (getAndReplace, getAndPut,
> > etc can be supported in future).
> >
> > 5) What's left?
> >   * SQL/ThinClient/etc support.
> >   * Metrics (found/repaired).
> >   * Simple per-partition recovery feature able to work in the background
> in
> > addition to per-entry recovery feature.
> >
> > 6) Is code checked?
> >   * Pull Request #5656 [5] (feature) - has green TC.
> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
> get()
> > request) - has a limited amount of failures (because of data streamer,
> > cache.load, etc).
> >
> > Thoughts?
> >
> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > [1]
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > [2]
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > [3]
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > [4]
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > [5] https://github.com/apache/ignite/pull/5656
> > [6] https://github.com/apache/ignite/pull/6575
>

Re: Read Repair (ex. Consistency Check) - review request #2

Posted by Nikolay Izhikov <ni...@apache.org>.
Anton.

I'm worried about this limitation:

> Entries streamed using data streamer (using not a "cache.put" based updater) and loaded by cache.load.

As we discussed privately, in these modes

*ALL ENTRIES ON ALL OWNERS WILL HAVE DIFFERENT VERSIONS*

Why do we need these modes in the first place?
Should we consider their removal in Ignite 3, or should we fix them?

В Чт, 20/06/2019 в 14:15 +0300, Anton Vinogradov пишет:
> Igniters,
> I'm glad to introduce Read Repair feature [0] provides additional
> consistency guarantee for Ignite.
> 
> 1) Why we need it?
> The detailed explanation can be found at IEP-31 [1].
> In short, because of bugs, it's possible to gain an inconsistent state.
> We need additional features to handle this case.
> 
> Currently we able to check cluster using Idle_verify [2] feature, but it
> will not fix the data, will not even tell which entries are broken.
> Read Repair is a feature to understand which entries are broken and to fix
> them.
> 
> 1) How it works?
> IgniteCache now able to provide special proxy [3] withReadRepair().
> This proxy guarantee that data will be gained from all owners and compared.
> In the case of consistency violation situation, data will be recovered and
> a special event recorded.
> 
> 3) Naming?
> Feature name based on Cassandra's Read Repair feature [4], which is pretty
> similar.
> 
> 4) Limitations which can be fixed in the future?
>   * MVCC and Near caches are not supported.
>   * Atomic caches can be checked (false positive case is possible on this
> check), but can't be recovered.
>   * Partial entry removal can't be recovered.
>   * Entries streamed using data streamer (using not a "cache.put" based
> updater) and loaded by cache.load
>   are perceived as inconsistent since they may have different versions for
> same keys.
>   * Only explicit get operations are supported (getAndReplace, getAndPut,
> etc can be supported in future).
> 
> 5) What's left?
>   * SQL/ThinClient/etc support.
>   * Metrics (found/repaired).
>   * Simple per-partition recovery feature able to work in the background in
> addition to per-entry recovery feature.
> 
> 6) Is code checked?
>   * Pull Request #5656 [5] (feature) - has green TC.
>   * Pull Request #6575 [6] (RunAll with the feature enabled for every get()
> request) - has a limited amount of failures (because of data streamer,
> cache.load, etc).
> 
> Thoughts?
> 
> [0] https://issues.apache.org/jira/browse/IGNITE-10663
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> [2]
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> [3]
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> [4]
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> [5] https://github.com/apache/ignite/pull/5656
> [6] https://github.com/apache/ignite/pull/6575