Posted to user@ignite.apache.org by Cong Guo <na...@gmail.com> on 2020/11/12 23:22:26 UTC

Nodes failed to join the cluster after restarting

Hi,

I have a 3-node cluster with persistence enabled. All three nodes are
in the baseline topology. The Ignite version is 2.8.1.
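
For context, a persistence-enabled cluster like this is typically configured
along the following lines. This is only an illustrative sketch: the host names,
ports, and the explicit activation step are assumptions, not details taken from
the setup described here.

    import java.util.Arrays;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ClusterNode {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Enable native persistence for the default data region.
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            cfg.setDataStorageConfiguration(storageCfg);

            // TCP discovery with a static IP finder; the host names are placeholders.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList(
                "node1:47500..47509", "node2:47500..47509", "node3:47500..47509"));
            TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
            discoverySpi.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(discoverySpi);

            Ignite ignite = Ignition.start(cfg);

            // With persistence enabled the cluster starts inactive; the first
            // activation sets the baseline topology to the current server nodes.
            ignite.cluster().active(true);
        }
    }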

When I restarted the first node, it encountered an error and failed to join the
cluster. The error message is "Caused by:
org.apache.ignite.spi.IgniteSpiException: Attempting to join node with larger
distributed metastorage version id. The node is most likely in invalid
state and can't be joined." I tried several times but got the same error.

Then I restarted the second node, and it encountered the same error. After I
restarted the third node, the other two nodes were able to start successfully
and join the cluster. I did not change the baseline topology when restarting
the nodes. I cannot reproduce this error now.

I found that someone else has had the same problem:
http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html

The answer there was corruption in the metastorage. I do not see any issues
with the metastorage files, and it would be very unlikely for files on two
different machines to be corrupted at the same time. Is it possible that
this is another bug like https://issues.apache.org/jira/browse/IGNITE-12850?

Do you have any documentation about how the version id is updated and read?
Could you please show me where in the source code the version id is read
when a node starts and where it is updated when a node stops?
Thank you!

Re: Nodes failed to join the cluster after restarting

Posted by Ivan Bessonov <be...@gmail.com>.
Hi,

Sadly, the logs from the latest message show nothing. There are no visible
issues with the code either; I have already checked it. Sorry to say, but what
we need is additional logging in the Ignite code and a stable reproducer, and
we have neither.

I don't think you should worry about it; it's most likely a bug that only
occurs once.


-- 
Sincerely yours,
Ivan Bessonov

Re: Nodes failed to join the cluster after restarting

Posted by Cong Guo <na...@gmail.com>.
Hi,

I have attached the log from the only working node while the two others were
being restarted. There is no error message other than the "failed to join"
message, and I do not see any clue in the log. I cannot reproduce this issue
either; that's why I am asking about the code. Maybe you know of certain
suspicious places. Thank you.


Re: Nodes failed to join the cluster after restarting

Posted by Ivan Bessonov <be...@gmail.com>.
Sorry, I see that you use TcpDiscoverySpi.



-- 
Sincerely yours,
Ivan Bessonov

Re: Nodes failed to join the cluster after restarting

Posted by Ivan Bessonov <be...@gmail.com>.
Hello,

these parameters are configured automatically; I know that you don't
set them yourself. And given that all of the "automatic" configuration has
already been completed, the chances of seeing the same bug again are low.

Understanding the reason is tricky; we would need to debug the starting
node or at least add more logging. Is this possible? I see that you're asking
me about the code.

Knowing the content of "ver" and "histCache.toArray()" in
"org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
would certainly help.
More specifically: ver.id() and
Arrays.stream(histCache.toArray()).map(item
-> Arrays.toString(item.keys())).collect(Collectors.joining(","))
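
To make this concrete, here is a small self-contained sketch of the two values
being asked for. The HistoryItem class and the sample keys below are stand-ins
invented for illustration; the real "ver" and "histCache" live inside
DistributedMetaStorageImpl.

    import java.util.Arrays;
    import java.util.stream.Collectors;

    public class MetastorageDebugSketch {
        /** Simplified stand-in for the internal history item type; only keys() matters here. */
        static class HistoryItem {
            private final String[] keys;
            HistoryItem(String... keys) { this.keys = keys; }
            String[] keys() { return keys; }
        }

        public static void main(String[] args) {
            long verId = 42L;                  // stands in for ver.id()
            HistoryItem[] histCache = {        // stands in for histCache.toArray()
                new HistoryItem("baselineAutoAdjustEnabled"),
                new HistoryItem("baselineAutoAdjustTimeout", "sql.disabledFunctions")
            };

            // The expression suggested above, applied to the stand-in values.
            String keysDump = Arrays.stream(histCache)
                .map(item -> Arrays.toString(item.keys()))
                .collect(Collectors.joining(","));

            System.out.println("ver.id() = " + verId + ", histCache keys = " + keysDump);
        }
    }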

Honestly, I have no idea how your situation is even possible; otherwise
we would have found the solution rather quickly. Needless to say, I can't
reproduce it. The error message that you see was created for the case when you
join your node to the wrong cluster.

Do you have any custom code that runs during node start? And one more
question: which discovery SPI are you using, TCP or ZooKeeper?



-- 
Sincerely yours,
Ivan Bessonov

Re: Nodes failed to join the cluster after restarting

Posted by Cong Guo <na...@gmail.com>.
Hi,

The parameter values on the two other nodes are the same. Actually, I do not
configure these values; when you enable native persistence, you see these
log lines by default. Nothing is special. When this error occurs
on the restarting node, nothing happens on the two other nodes. When I restart
the second node, it also fails with the same error.

I will still need to restart the nodes in the future, one by one and
without stopping the service, so this issue may happen again. The workaround
requires deactivating the cluster and stopping the service, which does not work
in a production environment.

I think we need to fix this bug, or at least understand the reason so we can
avoid it. Could you please tell me where this version value could be
modified when a node has just started? Do you have any guesses about this bug
now? I can help analyze the code. Thank you.


Re: Nodes failed to join the cluster after restarting

Posted by Ivan Bessonov <be...@gmail.com>.
Thank you for the reply!

Right now the only existing distributed properties I see are these:
- Baseline parameter 'baselineAutoAdjustEnabled' was changed from 'null' to
'false'
- Baseline parameter 'baselineAutoAdjustTimeout' was changed from 'null' to
'300000'
- SQL parameter 'sql.disabledFunctions' was changed from 'null' to
'[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'

I wonder what values they have on nodes that rejected the new node. I
suggest sending logs of those nodes as well.
Right now I believe that this bug won't happen again on your installation,
but it only makes it more elusive...
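
For what it is worth, the first two properties in the list above also have
public setters on IgniteCluster, so they can be set explicitly instead of
relying on the automatic defaults. A minimal sketch, assuming the 2.8 API
(the configuration file name is a placeholder):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class BaselineAutoAdjustExample {
        public static void main(String[] args) {
            // "example-config.xml" is a placeholder for your own configuration file.
            try (Ignite ignite = Ignition.start("example-config.xml")) {
                // These calls drive the first two distributed properties shown above.
                ignite.cluster().baselineAutoAdjustEnabled(false);
                ignite.cluster().baselineAutoAdjustTimeout(300_000);
            }
        }
    }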

The most probable reason is that the node (somehow) initialized some
properties with defaults before joining the cluster, while the cluster didn't
have those values at all.
The rule is that an activated cluster can't accept changed properties
from a joining node. So the workaround would be deactivating the cluster,
joining the node, and activating it again. But as I said, I don't think
you'll ever see this bug again.


-- 
Sincerely yours,
Ivan Bessonov

Re: Nodes failed to join the cluster after restarting

Posted by Cong Guo <na...@gmail.com>.
Hi,

Please find the attached log for a complete but failed reboot. You can see
the exceptions.


Re: Nodes failed to join the cluster after restarting

Posted by Ivan Bessonov <be...@gmail.com>.
Hello,

there must be a bug somewhere during node start: the node updates its
distributed metastorage content and then tries to join an already activated
cluster, thus creating a conflict. It's hard to tell which exact data
caused the conflict, especially without any logs.

The topic that you mentioned (
http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
seems to be about the same problem, but the issue
https://issues.apache.org/jira/browse/IGNITE-12850 is not related to it.

If you have logs from those unsuccessful restart attempts, it would be very
helpful.

Sadly, the distributed metastorage is an internal component for storing
settings and has no public documentation, and the developer documentation is
probably outdated and incomplete. But just in case: the "version id" that the
message refers to is held in the field
"org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver",
and it is incremented on every distributed metastorage setting update. You can
find your error message in the same class.
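
To illustrate the idea behind that check (this is not Ignite's actual code;
the names and structure below are invented for explanation only):

    public class JoinVersionCheckSketch {

        static final class MetaStorageVersion {
            final long id; // incremented on every distributed metastorage update
            MetaStorageVersion(long id) { this.id = id; }
        }

        /** Returns an error message if the joining node claims a newer version than the cluster. */
        static String validateJoin(MetaStorageVersion clusterVer, MetaStorageVersion joiningVer) {
            if (joiningVer.id > clusterVer.id)
                return "Attempting to join node with larger distributed metastorage version id. "
                    + "The node is most likely in invalid state and can't be joined.";

            return null; // join is allowed
        }

        public static void main(String[] args) {
            System.out.println(validateJoin(new MetaStorageVersion(10), new MetaStorageVersion(12)));
        }
    }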

Please follow up with more questions and logs if possible; I hope
we'll figure it out.

Thank you!


-- 
Sincerely yours,
Ivan Bessonov