Posted to user@ignite.apache.org by eugene miretsky <eu...@gmail.com> on 2018/09/07 20:45:38 UTC

Partition map exchange in detail

Hello,

Our cluster occasionally fails with "partition map exchange failure"
errors. I have searched around and it seems that a lot of people have had a
similar issue in the past. My high-level understanding is that when one of
the nodes fails (out of memory, exception, GC, etc.) the nodes fail to exchange
partition maps. However, I have a few questions:
1) When does partition map exchange happen? Periodically, when a node
joins, etc.
2) Is it done in the same thread as communication SPI, or is a separate
worker?
3) How does the exchange happen? Via a coordinator, peer to peer, etc?
4) What does the exchange block?
5) When is the exchange retried?
6) How to resolve the error? The only thing I have seen online is to
decrease failureDetectionTimeout

Our settings are
- Zookeeper SPI
- Persistence enabled
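
For concreteness, a minimal sketch of this kind of setup (ZookeeperDiscoverySpi
plus native persistence). This is not our actual configuration: hosts and
timeouts are placeholders, and it assumes Ignite 2.5+ with the ignite-zookeeper
module on the classpath.

// Minimal sketch only, not our real config; hosts and timeouts are placeholders.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class NodeStartup {
    public static void main(String[] args) {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts
        zkSpi.setSessionTimeout(30_000);

        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);
        cfg.setDataStorageConfiguration(storageCfg);
        cfg.setFailureDetectionTimeout(10_000); // the timeout mentioned below

        Ignite ignite = Ignition.start(cfg);
        ignite.cluster().active(true); // persistent clusters need explicit activation
    }
}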

Cheers,
Eugene

Re: Partition map exchange in detail

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Hello Eugene,

I hope you meant PME (partition map exchange) instead of NPE :)

> What constitutes a transaction in this context?
If I am not mistaken, it is about Ignite transactions.
Please take a look at this page
https://apacheignite.readme.io/docs/transactions

> Does it mean that if the cluster constantly receives transaction
requests, NPE will never happen?
> Or will all transactions that were received after the NPE request wait
for the NPE to complete?
Transactions that were initiated after the PME request will wait until the
PME is completed.
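
To make that concrete, a minimal sketch of an explicit Ignite transaction with
a timeout. The cache name "accounts" is hypothetical (the cache must be
configured with atomicityMode=TRANSACTIONAL); the point is that a transaction
left running or uncommitted is exactly the kind of pending operation a PME has
to wait for.

// Minimal sketch of an explicit Ignite transaction with a timeout.
import java.util.concurrent.TimeUnit;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class TransferExample {
    static void transfer(Ignite ignite, int fromKey, int toKey, long amount) {
        IgniteCache<Integer, Long> cache = ignite.cache("accounts"); // hypothetical cache name

        // 5 second timeout, 0 = no limit on the number of participating entries.
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                TimeUnit.SECONDS.toMillis(5), 0)) {
            cache.put(fromKey, cache.get(fromKey) - amount);
            cache.put(toKey, cache.get(toKey) + amount);
            tx.commit(); // left uncommitted, this would be a "pending" transaction that PME waits for
        }
    }
}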

Thanks,
S.

Wed, Sep 12, 2018 at 22:51, eugene miretsky <eu...@gmail.com>:

> Makes sense.
> I think the actual issue that was affecting me is
> https://issues.apache.org/jira/browse/IGNITE-9562. (which IEP-25 should
> solve).
>
> Final 2 questions:
> 1) If the NPE waits for all pending transactions
>   a) What constitutes a transaction in this context? (any query, a SQL
> transaction, etc)
>   b) Does it mean that if the cluster constantly receives transaction
> requests, NPE will never happen? (Or will all transactions that were
> received after the NPE request wait for the NPE to complete?)
> 2) Any other advice on how to avoid NPE? (transaction timeouts, graceful
> shutdown/restart of nodes, etc)
>
> Cheers,
> Eugene
>
>
>
>
>
> On Wed, Sep 12, 2018 at 12:18 PM Pavel Kovalenko <jo...@gmail.com>
> wrote:
>
>> Eugene,
>>
>> In the case where Zookeeper Discovery is enabled and there is a communication
>> problem between some nodes, a subset of the problem nodes will be automatically
>> killed to reach a cluster state where each node can communicate with every
>> other node without problems. So, you're absolutely right, dead nodes will be
>> removed from the cluster and will not participate in PME.
>> IEP-25 is trying to solve a more general problem related only to PME.
>> Network problems are only part of the problems that can happen during PME. A
>> node may break down before it even tries to send a message because of
>> unexpected exceptions (e.g. NullPointer, Runtime, or Assertion errors). In
>> general, IEP-25 tries to defend us from any kind of unexpected problem to make
>> sure that PME will not be blocked in such cases and the cluster will continue
>> to live.
>>
>>
>> Wed, Sep 12, 2018 at 18:53, eugene miretsky <eugene.miretsky@gmail.com
>> >:
>>
>>> Hi Pavel,
>>>
>>> The issue we are discussing is PME failing because one node cannot
>>> communicate with another node; that's what IEP-25 is trying to solve. But in
>>> that case (where one node is either down, or there is a communication
>>> problem between two nodes) I would expect the split brain resolver to kick
>>> in, and shut down one of the nodes. I would also expect the dead node to be
>>> removed from the cluster, and no longer take part in PME.
>>>
>>>
>>>
>>> On Wed, Sep 12, 2018 at 11:25 AM Pavel Kovalenko <jo...@gmail.com>
>>> wrote:
>>>
>>>> Hi Eugene,
>>>>
>>>> Sorry, but I didn't catch the meaning of your question about Zookeeper
>>>> Discovery. Could you please re-phrase it?
>>>>
>>>> Wed, Sep 12, 2018 at 17:54, Ilya Lantukh <il...@gridgain.com>:
>>>>
>>>>> Pavel K., can you please answer about Zookeeper discovery?
>>>>>
>>>>> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the patience with my questions - just trying to understand
>>>>>> the system better.
>>>>>>
>>>>>> 3) I was referring to
>>>>>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>>>>>> How come it doesn't get the node to shut down?
>>>>>> 4) Are there any docs/JIRAs that explain how counters are used, and
>>>>>> why they are required in the state?
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <il...@gridgain.com>
>>>>>> wrote:
>>>>>>
>>>>>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>>>>>> 4) Partition map states include update counters, which are
>>>>>>> incremented on every cache update and play an important role in new state
>>>>>>> calculation. So, technically, every cache operation can lead to a partition
>>>>>>> map change, and for obvious reasons we can't route them through the
>>>>>>> coordinator. Ignite is a more complex system than Akka or Kafka, and such
>>>>>>> simple solutions won't work here (in the general case). However, it is true
>>>>>>> that PME could be simplified or completely avoided in certain cases, and the
>>>>>>> community is currently working on such optimizations (
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-9558 for example).
>>>>>>>
>>>>>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <
>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>
>>>>>>>> 2b) I had a few situations where the cluster went into a state
>>>>>>>> where PME constantly failed, and could never recover. I think the root
>>>>>>>> cause was that a transaction got stuck and didn't timeout/rollback.  I will
>>>>>>>> try to reproduce it again and get back to you
>>>>>>>> 3) If a node is down, I would expect it to get detected and the
>>>>>>>> node to get removed from the cluster. In such a case, PME should not even be
>>>>>>>> attempted with that node. Hence you would expect PME to fail very rarely
>>>>>>>> (any faulty node will be removed before it has a chance to fail PME)
>>>>>>>> 4) Don't all partition map changes go through the coordinator? I
>>>>>>>> believe a lot of distributed systems work in this way (all decisions are
>>>>>>>> made by the coordinator/leader) - In Akka the leader is responsible for
>>>>>>>> making all cluster membership changes, in Kafka the controller does the
>>>>>>>> leader election.
>>>>>>>>
>>>>>>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <
>>>>>>>> ilantukh@gridgain.com> wrote:
>>>>>>>>
>>>>>>>>> 1) It is.
>>>>>>>>> 2a) Ignite has retry mechanics for all messages, including
>>>>>>>>> PME-related ones.
>>>>>>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>>>>>>> 3) Sorry, I didn't understand your question. If a node is down,
>>>>>>>>> but DiscoverySpi doesn't detect it, it isn't a PME-related problem.
>>>>>>>>> 4) How can you ensure that partition maps on the coordinator are the
>>>>>>>>> *latest* without "freezing" cluster state for some time?
>>>>>>>>>
>>>>>>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <
>>>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> We are using persistence, so I am not sure if shutting down nodes
>>>>>>>>>> will be the desired outcome for us since we would need to modify the
>>>>>>>>>> baseline topology.
>>>>>>>>>>
>>>>>>>>>> A couple more follow up questions
>>>>>>>>>>
>>>>>>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>>>>>>> Spark client, so new nodes are created/destroyed every time.
>>>>>>>>>> 2) It sounds to me like there is a potential for the cluster to
>>>>>>>>>> get into a deadlock if
>>>>>>>>>>    a) single PME message is lost (PME never finishes, there are
>>>>>>>>>> no retries, and all future operations are blocked on the pending PME)
>>>>>>>>>>    b) one of the nodes has a  long running/stuck pending operation
>>>>>>>>>> 3) Under what circumstance can PME fail, while DiscoverySpi
>>>>>>>>>> fails to detect the node being down? We are using ZookeeperSpi, so I would
>>>>>>>>>> expect the split brain resolver to shut down the node.
>>>>>>>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>>>>>>>> topology/partition map of the cluster through regular gossip?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Eugene
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <
>>>>>>>>>> ilantukh@gridgain.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Eugene,
>>>>>>>>>>>
>>>>>>>>>>> 1) PME happens when topology is modified (TopologyVersion is
>>>>>>>>>>> incremented). The most common events that trigger it are: node
>>>>>>>>>>> start/stop/fail, cluster activation/deactivation, dynamic cache start/stop.
>>>>>>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger
>>>>>>>>>>> PME are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>>>>>>> 3) All nodes wait for all pending cache operations to finish and
>>>>>>>>>>> then send their local partition maps to the coordinator (oldest node). Then
>>>>>>>>>>> coordinator calculates new global partition maps and sends them to every
>>>>>>>>>>> node.
>>>>>>>>>>> 4) All cache operations.
>>>>>>>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>>>>>>>> working on PME failure handling that should kick out all problematic nodes
>>>>>>>>>>> after a timeout is reached (see
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>>>>>>> for details), but it isn't done yet.
>>>>>>>>>>> 6) You shouldn't consider a PME failure as an error by itself, but
>>>>>>>>>>> rather as a result of some other error. The most common reason for a PME
>>>>>>>>>>> hang-up is a pending cache operation that couldn't finish. Check your logs -
>>>>>>>>>>> they should list pending transactions and atomic updates. Search for the "Found
>>>>>>>>>>> long running" substring.
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>>>>>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Our cluster occasionally fails with "partition map exchange
>>>>>>>>>>>> failure" errors. I have searched around and it seems that a lot of people
>>>>>>>>>>>> have had a similar issue in the past. My high-level understanding is that
>>>>>>>>>>>> when one of the nodes fails (out of memory, exception, GC etc.) nodes fail
>>>>>>>>>>>> to exchange partition maps. However, I have a few questions
>>>>>>>>>>>> 1) When does partition map exchange happen? Periodically, when
>>>>>>>>>>>> a node joins, etc.
>>>>>>>>>>>> 2) Is it done in the same thread as communication SPI, or is a
>>>>>>>>>>>> separate worker?
>>>>>>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to
>>>>>>>>>>>> peer, etc?
>>>>>>>>>>>> 4) What does the exchange block?
>>>>>>>>>>>> 5) When is the exchange retried?
>>>>>>>>>>>> 6) How to resolve the error? The only thing I have seen online
>>>>>>>>>>>> is to decrease failureDetectionTimeout
>>>>>>>>>>>>
>>>>>>>>>>>> Our settings are
>>>>>>>>>>>> - Zookeeper SPI
>>>>>>>>>>>> - Persistence enabled
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Eugene
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Ilya
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Ilya
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Ilya
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Ilya
>>>>>
>>>>

Re: Partition map exchange in detail

Posted by eugene miretsky <eu...@gmail.com>.
Makes sense.
I think the actual issue that was affecting me is
https://issues.apache.org/jira/browse/IGNITE-9562. (which IEP-25 should
solve).

Final 2 questions:
1) If the NPE waits for all pending transactions
  a) What constitutes a transaction in this context? (any query, a SQL
transaction, etc)
  b) Does it mean that if the cluster constantly receives transaction
requests, NPE will never happen? (Or will all transactions that were
received after the NPE request wait for the NPE to complete?)
2) Any other advice on how to avoid NPE? (transaction timeouts, graceful
shutdown/restart of nodes, etc)
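
On the transaction-timeout idea specifically, a rough sketch of the
configuration I have in mind, assuming Ignite 2.5+ (where
TransactionConfiguration also has setTxTimeoutOnPartitionMapExchange); the
values are only illustrative:

// Rough sketch, values illustrative only. Assumes Ignite 2.5+.
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class TxTimeoutConfig {
    public static IgniteConfiguration configure() {
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setDefaultTxTimeout(30_000);                // cap for ordinary transactions
        txCfg.setTxTimeoutOnPartitionMapExchange(20_000); // rolls back txs that block a pending PME

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setTransactionConfiguration(txCfg);
        return cfg;
    }
}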

Cheers,
Eugene

Re: Partition map exchange in detail

Posted by Pavel Kovalenko <jo...@gmail.com>.
Eugene,

In the case where Zookeeper Discovery is enabled and there is a communication
problem between some nodes, a subset of the problem nodes will be automatically
killed to reach a cluster state where each node can communicate with every
other node without problems. So, you're absolutely right, dead nodes will be
removed from the cluster and will not participate in PME.
IEP-25 is trying to solve a more general problem related only to PME.
Network problems are only part of the problems that can happen during PME. A
node may break down before it even tries to send a message because of
unexpected exceptions (e.g. NullPointer, Runtime, or Assertion errors). In
general, IEP-25 tries to defend us from any kind of unexpected problem to make
sure that PME will not be blocked in such cases and the cluster will continue
to live.
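
As a side note (separate from IEP-25): since Ignite 2.5 a node can also be
configured with a failure handler, so that critical failures such as unexpected
exceptions in system worker threads stop the node rather than leave it
half-alive inside the topology. A minimal sketch, with an illustrative timeout:

// Sketch only, not part of IEP-25; uses the FailureHandler API available since Ignite 2.5.
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandling {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Try a graceful stop first; halt the JVM if the stop doesn't finish in 60 s.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 60_000));
        return cfg;
    }
}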



Re: Partition map exchange in detail

Posted by eugene miretsky <eu...@gmail.com>.
Hi Pavel,

The issue we are discussing is PME failing because one node cannot
communicate with another node; that's what IEP-25 is trying to solve. But in
that case (where one node is either down, or there is a communication
problem between two nodes) I would expect the split brain resolver to kick
in, and shut down one of the nodes. I would also expect the dead node to be
removed from the cluster, and no longer take part in PME.

Re: Partition map exchange in detail

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hi Eugene,

Sorry, but I didn't catch the meaning of your question about Zookeeper
Discovery. Could you please re-phrase it?


Re: Partition map exchange in detail

Posted by Ilya Lantukh <il...@gridgain.com>.
Pavel K., can you please answer about Zookeeper discovery?

-- 
Best regards,
Ilya

Re: Partition map exchange in detail

Posted by eugene miretsky <eu...@gmail.com>.
Thanks for the patience with my questions - just trying to understand the
system better.

3) I was referring to
https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
How come it doesn't get the node to shut down?
4) Are there any docs/JIRAs that explain how counters are used, and why
they are required in the state?

Cheers,
Eugene



Re: Partition map exchange in detail

Posted by Ilya Lantukh <il...@gridgain.com>.
3) Such mechanics will be implemented in IEP-25 (linked above).
4) Partition map states include update counters, which are incremented on
every cache update and play an important role in calculating the new state.
So, technically, every cache operation can lead to a partition map change,
and for obvious reasons we can't route them all through the coordinator.
Ignite is a more complex system than Akka or Kafka, and such simple
solutions won't work here (in the general case). However, it is true that
PME could be simplified or avoided entirely in certain cases, and the
community is currently working on such optimizations
(https://issues.apache.org/jira/browse/IGNITE-9558 for example).
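
To make the role of update counters a bit more concrete, here is a purely
conceptual sketch in plain Java. It is not Ignite's internal code and uses
no Ignite APIs; the class and method names are made up. It only illustrates
the idea: every cache update bumps a per-partition counter on the node that
applied it, and during exchange the coordinator can treat the node with the
highest counter for a partition as holding the most recent copy.

// Conceptual sketch only -- not Ignite internals, just an illustration of
// why per-partition update counters matter during PME.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class PartitionCounterSketch {
    // partition id -> (node id -> update counter reported by that node)
    private final Map<Integer, Map<UUID, Long>> localMaps = new HashMap<>();

    // Each node bumps its counter for a partition on every cache update.
    void onCacheUpdate(UUID nodeId, int partition) {
        localMaps.computeIfAbsent(partition, p -> new HashMap<>())
                 .merge(nodeId, 1L, Long::sum);
    }

    // During exchange the coordinator picks, per partition, the node that
    // reported the highest counter as the most up-to-date copy.
    Map<Integer, UUID> calculateNewOwners() {
        Map<Integer, UUID> owners = new HashMap<>();
        localMaps.forEach((part, counters) -> owners.put(part,
            Collections.max(counters.entrySet(),
                Map.Entry.comparingByValue()).getKey()));
        return owners;
    }
}

In the real exchange the coordinator also has to account for partition
states and affinity assignment, so this covers only the counter part of the
picture.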

On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <eu...@gmail.com>
wrote:

> 2b) I had a few situations where the cluster went into a state where PME
> constantly failed, and could never recover. I think the root cause was that
> a transaction got stuck and didn't timeout/rollback.  I will try to
> reproduce it again and get back to you
> 3) If a node is down, I would expect it to get detected and the node to
> get removed from the cluster. In such case, PME should not even be
> attempted with that node. Hence you would expect PME to fail very rarely
> (any faulty node will be removed before it has a chance to fail PME)
> 4) Don't all partition map changes go through the coordinator? I believe a
> lot of distributed systems work in this way (all decisions are made by the
> coordinator/leader) - In Akka the leader is responsible for making all
> cluster membership changes, in Kafka the controller does the leader
> election.
>
> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <il...@gridgain.com>
> wrote:
>
>> 1) It is.
>> 2a) Ignite has retry mechanics for all messages, including PME-related
>> ones.
>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>> 3) Sorry, I didn't understand your question. If a node is down, but
>> DiscoverySpi doesn't detect it, it isn't PME-related problem.
>> 4) How can you ensure that partition maps on coordinator are *latest *without
>> "freezing" cluster state for some time?
>>
>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks!
>>>
>>> We are using persistence, so I am not sure if shutting down nodes will
>>> be the desired outcome for us since we would need to modify the baseline
>>> topolgy.
>>>
>>> A couple more follow up questions
>>>
>>> 1) Is PME triggered when client nodes join us well? We are using Spark
>>> client, so new nodes are created/destroy every time.
>>> 2) It sounds to me like there is a pontential for the cluster to get
>>> into a deadlock if
>>>    a) single PME message is lost (PME never finishes, there are no
>>> retries, and all future operations are blocked on the pending PME)
>>>    b) one of the nodes has a  long running/stuck pending operation
>>> 3) Under what circumastance can PME fail, while DiscoverySpi fails to
>>> detect the node being down? We are using ZookeeperSpi so I would expect the
>>> split brain resolver to shut down the node.
>>> 4) Why is PME needed? Doesn't the coordinator know the altest
>>> toplogy/pertition map of the cluster through regualr gossip?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <il...@gridgain.com>
>>> wrote:
>>>
>>>> Hi Eugene,
>>>>
>>>> 1) PME happens when topology is modified (TopologyVersion is
>>>> incremented). The most common events that trigger it are: node
>>>> start/stop/fail, cluster activation/deactivation, dynamic cache start/stop.
>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
>>>> transferred using DiscoverySpi instead of CommunicationSpi.
>>>> 3) All nodes wait for all pending cache operations to finish and then
>>>> send their local partition maps to the coordinator (oldest node). Then
>>>> coordinator calculates new global partition maps and sends them to every
>>>> node.
>>>> 4) All cache operations.
>>>> 5) Exchange is never retried. Ignite community is currently working on
>>>> PME failure handling that should kick all problematic nodes after timeout
>>>> is reached (see https://cwiki.apache.org/confluence/display/IGNITE/IEP-
>>>> 25%3A+Partition+Map+Exchange+hangs+resolving for details), but it
>>>> isn't done yet.
>>>> 6) You shouldn't consider PME failure as a error by itself, but rather
>>>> as a result of some other error. The most common reason of PME hang-up is
>>>> pending cache operation that couldn't finish. Check your logs - it should
>>>> list pending transactions and atomic updates. Search for "Found long
>>>> running" substring.
>>>>
>>>> Hope this helps.
>>>>
>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Out cluster occasionally fails with "partition map exchange failure"
>>>>> errors, I have searched around and it seems that a lot of people have had a
>>>>> similar issue in the past. My high-level understanding is that when one of
>>>>> the nodes fails (out of memory, exception, GC etc.) nodes fail to exchange
>>>>> partition maps. However, I have a few questions
>>>>> 1) When does partition map exchange happen? Periodically, when a node
>>>>> joins, etc.
>>>>> 2) Is it done in the same thread as communication SPI, or is a
>>>>> separate worker?
>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc?
>>>>> 4) What does the exchange block?
>>>>> 5) When is the exchange retried?
>>>>> 5) How to resolve the error? The only thing I have seen online is to
>>>>> decrease failureDetectionTimeout
>>>>>
>>>>> Our settings are
>>>>> - Zookeeper SPI
>>>>> - Persistence enabled
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Ilya
>>>>
>>>
>>
>>
>> --
>> Best regards,
>> Ilya
>>
>


-- 
Best regards,
Ilya

Re: Partition map exchange in detail

Posted by eugene miretsky <eu...@gmail.com>.
2b) I had a few situations where the cluster went into a state where PME
constantly failed and could never recover. I think the root cause was that
a transaction got stuck and didn't time out or roll back (see the
transaction-timeout sketch below). I will try to reproduce it again and
get back to you.
3) If a node is down, I would expect it to be detected and the node to be
removed from the cluster. In that case, PME should not even be attempted
with that node, so you would expect PME to fail very rarely (any faulty
node would be removed before it has a chance to fail PME).
4) Don't all partition map changes go through the coordinator? I believe a
lot of distributed systems work this way (all decisions are made by the
coordinator/leader) - in Akka the leader is responsible for making all
cluster membership changes, and in Kafka the controller does the leader
election.
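
(For what it's worth, giving transactions explicit timeouts is one way to
reduce the chance of a stuck transaction blocking PME. Below is a minimal
sketch assuming a recent Ignite 2.x and the standard transactions API; the
cache name "demo" and the timeout values are placeholders, and
setTxTimeoutOnPartitionMapExchange only exists in newer releases, so treat
that call as an assumption for your version.)

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class TxTimeoutSketch {
    public static void main(String[] args) {
        TransactionConfiguration txCfg = new TransactionConfiguration();
        // Roll back transactions running longer than 30 seconds.
        txCfg.setDefaultTxTimeout(30_000);
        // Roll back transactions that block a pending PME (newer versions).
        txCfg.setTxTimeoutOnPartitionMapExchange(20_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setTransactionConfiguration(txCfg);

        try (Ignite ignite = Ignition.start(cfg)) {
            CacheConfiguration<Integer, String> cacheCfg =
                new CacheConfiguration<>("demo");
            cacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cacheCfg);

            // An explicit per-transaction timeout (10s); 0 means unknown tx size.
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC,
                    TransactionIsolation.REPEATABLE_READ,
                    10_000, 0)) {
                cache.put(1, "value");
                tx.commit();
            }
        }
    }
}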

On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <il...@gridgain.com> wrote:

> 1) It is.
> 2a) Ignite has retry mechanics for all messages, including PME-related
> ones.
> 2b) In this situation PME will hang, but it isn't a "deadlock".
> 3) Sorry, I didn't understand your question. If a node is down, but
> DiscoverySpi doesn't detect it, it isn't PME-related problem.
> 4) How can you ensure that partition maps on coordinator are *latest *without
> "freezing" cluster state for some time?
>
> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eugene.miretsky@gmail.com
> > wrote:
>
>> Thanks!
>>
>> We are using persistence, so I am not sure if shutting down nodes will be
>> the desired outcome for us since we would need to modify the baseline
>> topolgy.
>>
>> A couple more follow up questions
>>
>> 1) Is PME triggered when client nodes join us well? We are using Spark
>> client, so new nodes are created/destroy every time.
>> 2) It sounds to me like there is a pontential for the cluster to get into
>> a deadlock if
>>    a) single PME message is lost (PME never finishes, there are no
>> retries, and all future operations are blocked on the pending PME)
>>    b) one of the nodes has a  long running/stuck pending operation
>> 3) Under what circumastance can PME fail, while DiscoverySpi fails to
>> detect the node being down? We are using ZookeeperSpi so I would expect the
>> split brain resolver to shut down the node.
>> 4) Why is PME needed? Doesn't the coordinator know the altest
>> toplogy/pertition map of the cluster through regualr gossip?
>>
>> Cheers,
>> Eugene
>>
>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <il...@gridgain.com>
>> wrote:
>>
>>> Hi Eugene,
>>>
>>> 1) PME happens when topology is modified (TopologyVersion is
>>> incremented). The most common events that trigger it are: node
>>> start/stop/fail, cluster activation/deactivation, dynamic cache start/stop.
>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
>>> transferred using DiscoverySpi instead of CommunicationSpi.
>>> 3) All nodes wait for all pending cache operations to finish and then
>>> send their local partition maps to the coordinator (oldest node). Then
>>> coordinator calculates new global partition maps and sends them to every
>>> node.
>>> 4) All cache operations.
>>> 5) Exchange is never retried. Ignite community is currently working on
>>> PME failure handling that should kick all problematic nodes after timeout
>>> is reached (see
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>> for details), but it isn't done yet.
>>> 6) You shouldn't consider PME failure as a error by itself, but rather
>>> as a result of some other error. The most common reason of PME hang-up is
>>> pending cache operation that couldn't finish. Check your logs - it should
>>> list pending transactions and atomic updates. Search for "Found long
>>> running" substring.
>>>
>>> Hope this helps.
>>>
>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Out cluster occasionally fails with "partition map exchange failure"
>>>> errors, I have searched around and it seems that a lot of people have had a
>>>> similar issue in the past. My high-level understanding is that when one of
>>>> the nodes fails (out of memory, exception, GC etc.) nodes fail to exchange
>>>> partition maps. However, I have a few questions
>>>> 1) When does partition map exchange happen? Periodically, when a node
>>>> joins, etc.
>>>> 2) Is it done in the same thread as communication SPI, or is a separate
>>>> worker?
>>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc?
>>>> 4) What does the exchange block?
>>>> 5) When is the exchange retried?
>>>> 5) How to resolve the error? The only thing I have seen online is to
>>>> decrease failureDetectionTimeout
>>>>
>>>> Our settings are
>>>> - Zookeeper SPI
>>>> - Persistence enabled
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Ilya
>>>
>>
>
>
> --
> Best regards,
> Ilya
>

Re: Partition map exchange in detail

Posted by Ilya Lantukh <il...@gridgain.com>.
1) It is (a client-node join sketch follows below).
2a) Ignite has retry mechanics for all messages, including PME-related ones.
2b) In this situation PME will hang, but it isn't a "deadlock".
3) Sorry, I didn't understand your question. If a node is down but
DiscoverySpi doesn't detect it, that isn't a PME-related problem.
4) How can you ensure that the partition maps on the coordinator are the
*latest* without "freezing" the cluster state for some time?

On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eu...@gmail.com>
wrote:

> Thanks!
>
> We are using persistence, so I am not sure if shutting down nodes will be
> the desired outcome for us since we would need to modify the baseline
> topolgy.
>
> A couple more follow up questions
>
> 1) Is PME triggered when client nodes join us well? We are using Spark
> client, so new nodes are created/destroy every time.
> 2) It sounds to me like there is a pontential for the cluster to get into
> a deadlock if
>    a) single PME message is lost (PME never finishes, there are no
> retries, and all future operations are blocked on the pending PME)
>    b) one of the nodes has a  long running/stuck pending operation
> 3) Under what circumastance can PME fail, while DiscoverySpi fails to
> detect the node being down? We are using ZookeeperSpi so I would expect the
> split brain resolver to shut down the node.
> 4) Why is PME needed? Doesn't the coordinator know the altest
> toplogy/pertition map of the cluster through regualr gossip?
>
> Cheers,
> Eugene
>
> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <il...@gridgain.com> wrote:
>
>> Hi Eugene,
>>
>> 1) PME happens when topology is modified (TopologyVersion is
>> incremented). The most common events that trigger it are: node
>> start/stop/fail, cluster activation/deactivation, dynamic cache start/stop.
>> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
>> transferred using DiscoverySpi instead of CommunicationSpi.
>> 3) All nodes wait for all pending cache operations to finish and then
>> send their local partition maps to the coordinator (oldest node). Then
>> coordinator calculates new global partition maps and sends them to every
>> node.
>> 4) All cache operations.
>> 5) Exchange is never retried. Ignite community is currently working on
>> PME failure handling that should kick all problematic nodes after timeout
>> is reached (see https://cwiki.apache.org/confluence/display/IGNITE/IEP-
>> 25%3A+Partition+Map+Exchange+hangs+resolving for details), but it isn't
>> done yet.
>> 6) You shouldn't consider PME failure as a error by itself, but rather as
>> a result of some other error. The most common reason of PME hang-up is
>> pending cache operation that couldn't finish. Check your logs - it should
>> list pending transactions and atomic updates. Search for "Found long
>> running" substring.
>>
>> Hope this helps.
>>
>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Out cluster occasionally fails with "partition map exchange failure"
>>> errors, I have searched around and it seems that a lot of people have had a
>>> similar issue in the past. My high-level understanding is that when one of
>>> the nodes fails (out of memory, exception, GC etc.) nodes fail to exchange
>>> partition maps. However, I have a few questions
>>> 1) When does partition map exchange happen? Periodically, when a node
>>> joins, etc.
>>> 2) Is it done in the same thread as communication SPI, or is a separate
>>> worker?
>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc?
>>> 4) What does the exchange block?
>>> 5) When is the exchange retried?
>>> 5) How to resolve the error? The only thing I have seen online is to
>>> decrease failureDetectionTimeout
>>>
>>> Our settings are
>>> - Zookeeper SPI
>>> - Persistence enabled
>>>
>>> Cheers,
>>> Eugene
>>>
>>
>>
>>
>> --
>> Best regards,
>> Ilya
>>
>


-- 
Best regards,
Ilya

Re: Partition map exchange in detail

Posted by eugene miretsky <eu...@gmail.com>.
Thanks!

We are using persistence, so I am not sure if shutting down nodes would be
the desired outcome for us, since we would need to modify the baseline
topology.

A couple more follow-up questions:

1) Is PME triggered when client nodes join as well? We are using the Spark
client, so new nodes are created/destroyed every time.
2) It sounds to me like there is a potential for the cluster to get into a
deadlock if
   a) a single PME message is lost (PME never finishes, there are no
retries, and all future operations are blocked on the pending PME)
   b) one of the nodes has a long-running/stuck pending operation
3) Under what circumstance can PME fail while DiscoverySpi fails to detect
the node being down? We are using ZookeeperSpi, so I would expect the
split-brain resolver to shut down the node.
4) Why is PME needed? Doesn't the coordinator know the latest
topology/partition map of the cluster through regular gossip?

Cheers,
Eugene

On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <il...@gridgain.com> wrote:

> Hi Eugene,
>
> 1) PME happens when topology is modified (TopologyVersion is incremented).
> The most common events that trigger it are: node start/stop/fail, cluster
> activation/deactivation, dynamic cache start/stop.
> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
> transferred using DiscoverySpi instead of CommunicationSpi.
> 3) All nodes wait for all pending cache operations to finish and then send
> their local partition maps to the coordinator (oldest node). Then
> coordinator calculates new global partition maps and sends them to every
> node.
> 4) All cache operations.
> 5) Exchange is never retried. Ignite community is currently working on PME
> failure handling that should kick all problematic nodes after timeout is
> reached (see
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
> for details), but it isn't done yet.
> 6) You shouldn't consider PME failure as a error by itself, but rather as
> a result of some other error. The most common reason of PME hang-up is
> pending cache operation that couldn't finish. Check your logs - it should
> list pending transactions and atomic updates. Search for "Found long
> running" substring.
>
> Hope this helps.
>
> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Hello,
>>
>> Out cluster occasionally fails with "partition map exchange failure"
>> errors, I have searched around and it seems that a lot of people have had a
>> similar issue in the past. My high-level understanding is that when one of
>> the nodes fails (out of memory, exception, GC etc.) nodes fail to exchange
>> partition maps. However, I have a few questions
>> 1) When does partition map exchange happen? Periodically, when a node
>> joins, etc.
>> 2) Is it done in the same thread as communication SPI, or is a separate
>> worker?
>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc?
>> 4) What does the exchange block?
>> 5) When is the exchange retried?
>> 5) How to resolve the error? The only thing I have seen online is to
>> decrease failureDetectionTimeout
>>
>> Our settings are
>> - Zookeeper SPI
>> - Persistence enabled
>>
>> Cheers,
>> Eugene
>>
>
>
>
> --
> Best regards,
> Ilya
>

Re: Partition map exchange in detail

Posted by Ilya Lantukh <il...@gridgain.com>.
Hi Eugene,

1) PME happens when topology is modified (TopologyVersion is incremented).
The most common events that trigger it are: node start/stop/fail, cluster
activation/deactivation, dynamic cache start/stop.
2) It is done by a separate ExchangeWorker. Events that trigger PME are
transferred using DiscoverySpi instead of CommunicationSpi.
3) All nodes wait for all pending cache operations to finish and then send
their local partition maps to the coordinator (oldest node). Then
coordinator calculates new global partition maps and sends them to every
node.
4) All cache operations.
5) Exchange is never retried. The Ignite community is currently working on
PME failure handling that should kick out all problematic nodes after a
timeout is reached (see
https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
for details), but it isn't done yet.
6) You shouldn't consider a PME failure as an error by itself, but rather
as a result of some other error. The most common reason for a PME hang-up
is a pending cache operation that couldn't finish. Check your logs - they
should list pending transactions and atomic updates. Search for the "Found
long running" substring.

Hope this helps.

On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <eu...@gmail.com>
wrote:

> Hello,
>
> Out cluster occasionally fails with "partition map exchange failure"
> errors, I have searched around and it seems that a lot of people have had a
> similar issue in the past. My high-level understanding is that when one of
> the nodes fails (out of memory, exception, GC etc.) nodes fail to exchange
> partition maps. However, I have a few questions
> 1) When does partition map exchange happen? Periodically, when a node
> joins, etc.
> 2) Is it done in the same thread as communication SPI, or is a separate
> worker?
> 3) How does the exchange happen? Via a coordinator, peer to peer, etc?
> 4) What does the exchange block?
> 5) When is the exchange retried?
> 5) How to resolve the error? The only thing I have seen online is to
> decrease failureDetectionTimeout
>
> Our settings are
> - Zookeeper SPI
> - Persistence enabled
>
> Cheers,
> Eugene
>



-- 
Best regards,
Ilya