Posted to user@ignite.apache.org by John Smith <ja...@gmail.com> on 2020/05/07 16:11:04 UTC

Cache was inconsistent state

Hi, I'm running Ignite 2.7.0 on 3 nodes deployed on VMs running Ubuntu.

I checked the state of the cluster by going to: /ignite?cmd=currentState
And the response was:
{"successStatus":0,"error":null,"sessionToken":null,"response":true}
I also checked: /ignite?cmd=size&cacheName=....

2 nodes were reporting 3 million records.
1 node was reporting 2 million records.
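
(For reference, a minimal sketch of scripting those two REST checks against each node; the host, port and cache name below are placeholders and assume the ignite-rest-http module is enabled. Requires Java 11+ for the HTTP client.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestCheck {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://127.0.0.1:8080/ignite"; // host/port are assumptions
        String[] cmds = {
            "?cmd=currentState",
            "?cmd=size&cacheName=myCache" // cache name is a placeholder
        };
        for (String cmd : cmds) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(base + cmd)).GET().build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(cmd + " -> " + resp.body());
        }
    }
}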

When I connected to visor and ran the node command... the details were
wrong: it only showed 2 server nodes and only 1 client, but 3 server
nodes actually exist and more clients are connected.

So I rebooted the node that was claiming 2 million records instead of 3,
and when I re-ran the node command it displayed all the proper nodes.
Also, after the reboot all the nodes started reporting 2 million records
instead of 3 million, so was there some sort of rebalancing or correction
(the cache has a 90 day TTL)?
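
(Side note on the TTL: a hedged sketch of how a 90-day, creation-time TTL is typically configured on a cache; the cache name and key/value types are placeholders, not our actual configuration.)

import java.util.concurrent.TimeUnit;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import org.apache.ignite.configuration.CacheConfiguration;

public class TtlCacheConfig {
    public static CacheConfiguration<String, String> ninetyDayCache() {
        CacheConfiguration<String, String> ccfg = new CacheConfiguration<>("myCache"); // placeholder name
        // Entries expire 90 days after creation; expired entries are removed lazily.
        ccfg.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.DAYS, 90)));
        return ccfg;
    }
}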



Before reboot
+=============================================================================================================================+
| # |       Node ID8(@), IP       |            Consistent ID             | Node Type | Up Time  | CPUs | CPU Load | Free Heap |
+=============================================================================================================================+
| 0 | xxxxxx(@n0), xxxxxx.69      | xxxxxx                               | Server    | 20:25:30 | 4    | 1.27 %   | 84.00 %   |
| 1 | xxxxxx(@n1), xxxxxx.1       | xxxxxx                               | Client    | 13:12:01 | 3    | 0.67 %   | 74.00 %   |
| 2 | xxxxxx(@n2), xxxxxx.63      | xxxxxx                               | Server    | 16:55:05 | 4    | 6.57 %   | 84.00 %   |
+-----------------------------------------------------------------------------------------------------------------------------+

After reboot
+=============================================================================================================================+
| # |       Node ID8(@), IP       |            Consistent ID             | Node Type | Up Time  | CPUs | CPU Load | Free Heap |
+=============================================================================================================================+
| 0 | xxxxxx(@n0), xxxxxx.69      | xxxxxx                               | Server    | 21:13:45 | 4    | 0.77 %   | 56.00 %   |
| 1 | xxxxxx(@n1), xxxxxx.1       | xxxxxx                               | Client    | 14:00:17 | 3    | 0.77 %   | 56.00 %   |
| 2 | xxxxxx(@n2), xxxxxx.63      | xxxxxx                               | Server    | 17:43:20 | 4    | 1.00 %   | 60.00 %   |
| 3 | xxxxxx(@n3), xxxxxx.65      | xxxxxx                               | Client    | 01:42:45 | 4    | 4.10 %   | 56.00 %   |
| 4 | xxxxxx(@n4), xxxxxx.65      | xxxxxx                               | Client    | 01:42:45 | 4    | 3.93 %   | 56.00 %   |
| 5 | xxxxxx(@n5), xxxxxx.1       | xxxxxx                               | Client    | 16:59:53 | 2    | 0.67 %   | 91.00 %   |
| 6 | xxxxxx(@n6), xxxxxx.79      | xxxxxx                               | Server    | 00:41:31 | 4    | 1.00 %   | 97.00 %   |
+-----------------------------------------------------------------------------------------------------------------------------+

Re: Cache was inconsistent state

Posted by Evgenii Zhuravlev <e....@gmail.com>.
John,

Yes, client nodes should have this parameter too.
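
For illustration only, a minimal sketch of a client node start-up; the point is that a node started in client mode still needs the same discovery configuration and the same -Djava.net.preferIPv4Stack=true JVM flag as the servers. The class name is a placeholder.

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true); // explicitly a client node
        // Configure the same DiscoverySpi here as on the server nodes,
        // and start this JVM with -Djava.net.preferIPv4Stack=true as well.
        Ignition.start(cfg);
    }
}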

Evgenii

Re: Cache was inconsistent state

Posted by John Smith <ja...@gmail.com>.
I mean, should both the prefer-IPv4 flag and the Zookeeper discovery be on
the "central" cluster as well as on all nodes specifically marked as client = true?

Re: Cache was inconsistent state

Posted by John Smith <ja...@gmail.com>.
Should it be on client nodes as well, i.e. the ones that specifically set client = true?

Re: Cache was inconsistent state

Posted by Evgenii Zhuravlev <e....@gmail.com>.
John,

It looks like a split-brain. They were in one cluster at first. I'm not
sure what the reason for this was; it could be a network problem or
something else.

I saw in the logs that you use both IPv4 and IPv6. I would recommend using
only one of them to avoid problems - just add -Djava.net.preferIPv4Stack=true
to all nodes in the cluster.
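
A hedged note on where that flag goes: it is a JVM argument, so pass it on the java command line (or via the JVM options of whatever script starts the node). For a node started programmatically, a fallback sketch; the class name and config path are placeholders:

import org.apache.ignite.Ignition;

public class ServerNodeStartup {
    public static void main(String[] args) {
        // Preferred: java -Djava.net.preferIPv4Stack=true -cp ... ServerNodeStartup
        // Fallback: set the property before any networking/Ignite classes initialize.
        System.setProperty("java.net.preferIPv4Stack", "true");
        Ignition.start("config/ignite-server.xml"); // path is a placeholder
    }
}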

Also, to avoid split-brain situations, you can use Zookeeper Discovery:
https://apacheignite.readme.io/docs/zookeeper-discovery#failures-and-split-brain-handling
or implement a segmentation resolver. More information about the latter can
be found on the forum, for example, here:
http://apache-ignite-users.70518.x6.nabble.com/split-brain-problem-and-GridSegmentationProcessor-td14590.html
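
A minimal configuration sketch for the ZooKeeper Discovery option, assuming the ignite-zookeeper module is on the classpath; the connection string, root path and timeouts below are placeholders, not values from this thread:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoveryStartup {
    public static void main(String[] args) {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder ensemble
        zkSpi.setZkRootPath("/ignite");   // znode used by this cluster
        zkSpi.setSessionTimeout(30_000);  // ms
        zkSpi.setJoinTimeout(10_000);     // ms

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi); // the same SPI goes on every node, clients included

        Ignite ignite = Ignition.start(cfg);
    }
}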

Evgenii

Re: Cache was inconsistent state

Posted by John Smith <ja...@gmail.com>.
How though? It's the same cluster! We haven't changed anything;
this happened on its own...

All I did was reboot the node and the cluster fixed itself.

Re: Cache was inconsistent state

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi John,

*Yes, it looks like they are in different clusters:*
*Metrics from the node with a problem:*
[15:17:28,668][INFO][grid-timeout-worker-#23%xxxxxx%][IgniteKernal%xxxxxx]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=5bbf262e, name=xxxxxx, uptime=93 days, 19:36:10.921]
    ^-- H/N/C [hosts=3, nodes=4, CPUs=10]

*Metrics from another node:*
[15:17:05,635][INFO][grid-timeout-worker-#23%xxxxxx%][IgniteKernal%xxxxxx]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=dddefdcd, name=xxxxxx, uptime=19 days, 16:49:48.381]
    ^-- H/N/C [hosts=6, nodes=7, CPUs=21]

*The same topology version on 2 nodes shows different member nodes:*
[03:56:17,643][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
Topology snapshot [ver=1036, locNode=5bbf262e, servers=1, clients=3,
state=ACTIVE, CPUs=10, offheap=10.0GB, heap=13.0GB]
[03:56:17,643][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
  ^-- Baseline [id=0, size=3, online=1, offline=2]

*And*

[03:56:43,388][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
Topology snapshot [ver=1036, locNode=4394fdd4, servers=2, clients=2,
state=ACTIVE, CPUs=15, offheap=20.0GB, heap=19.0GB]
[03:56:43,389][INFO][disco-event-worker-#42%xxxxxx%][GridDiscoveryManager]
  ^-- Baseline [id=0, size=3, online=2, offline=1]

So, it's just 2 different clusters.
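
As an aside, a rough sketch of how each node's own view of the topology and baseline can be dumped programmatically, so the two views can be compared without digging through logs; the class name is a placeholder and the discovery settings (which must match the cluster being checked) are omitted:

import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.BaselineNode;
import org.apache.ignite.configuration.IgniteConfiguration;

public class TopologyCheck {
    public static void main(String[] args) {
        // Joins as a client just to inspect the topology.
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Topology version: " + ignite.cluster().topologyVersion());
            System.out.println("Servers seen:     " + ignite.cluster().forServers().nodes().size());
            System.out.println("Clients seen:     " + ignite.cluster().forClients().nodes().size());
            Collection<BaselineNode> baseline = ignite.cluster().currentBaselineTopology();
            System.out.println("Baseline size:    " + (baseline == null ? 0 : baseline.size()));
        }
    }
}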

Best Regards,
Evgenii

Re: Cache was inconsistent state

Posted by John Smith <ja...@gmail.com>.
Hi Evgenii, here are the logs.

https://www.dropbox.com/s/ke71qsoqg588kc8/ignite-logs.zip?dl=0

Re: Cache was inconsistent state

Posted by John Smith <ja...@gmail.com>.
Ok, let me try to get them...

Re: Cache was inconsistent state

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi,

It looks like the third server node was not a part of this cluster before
restart. Can you share full logs from all server nodes?

Evgenii
