Posted to user@ignite.apache.org by Айсина Роза Мунеровна <ro...@sbermarket.ru> on 2022/11/22 10:17:08 UTC

Crash recovery for Ignite persistence cluster (lost partitions case)

Hola!
I have a problem recovering from a cluster crash when persistence is enabled.

Our setup is:
- 5 VM nodes with 40 GB RAM and a 200 GB disk each,
- persistence is enabled (on a separate disk on each VM),
- all cluster actions are performed through Ansible playbooks,
- all caches are either partitioned with backups = 1 or replicated,
- the cluster starts as a service running ignite.sh,
- baseline auto adjust is disabled.

Also, following the docs on the partition loss policy, I added -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true to JVM_OPTS so that a stopping node waits for partition rebalancing.
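
For reference, a minimal sketch of how I understand the equivalent programmatic setup (illustrative only, not our actual config; the class name is made up and the cache name is just an example):

import org.apache.ignite.Ignition;
import org.apache.ignite.ShutdownPolicy;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Native persistence on the default data region (kept on the separate disk).
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        cfg.setDataStorageConfiguration(storage);

        // Roughly what -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true enables:
        // a stopping node waits until the remaining nodes hold all backups.
        cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);

        // A partitioned cache with a single backup, as in our setup.
        CacheConfiguration<Long, Object> cache =
            new CacheConfiguration<>("PUBLIC_StoreProductFeatures");
        cache.setCacheMode(CacheMode.PARTITIONED);
        cache.setBackups(1);
        cfg.setCacheConfiguration(cache);

        // Start the node with this configuration.
        Ignition.start(cfg);
    }
}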

The problem we have: after shutting down several nodes (2 of 5) one after another, an exception about lost partitions is raised.

Caused by: org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostPart [cacheName=PUBLIC_StoreProductFeatures, part=512]

But in the logs of the dead nodes I see that the shutdown hooks were called as expected on both nodes:

[2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...


And the baseline topology looks like this (with 2 offline nodes, as expected):

Cluster state: active
Current topology version: 23
Baseline auto adjustment disabled: softTimeout=30000

Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)

Baseline nodes:
    ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
    ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
    ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
    ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
    ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
--------------------------------------------------------------------------------
Number of baseline nodes: 5

Other nodes not found.


So my questions are:

1) What can I do to recover from the lost partitions problem after shutting down several nodes? I thought that with a graceful shutdown this problem would not occur.

Right now I can recover by returning one of the offline nodes to the cluster (starting the service) and running the reset_lost_partitions command for the broken cache. After this the cache becomes available.

2) What can I do to prevent this problem in a scenario with automatic cluster deployment? Should I add a reset_lost_partitions step after activation or redeploy?
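
For reference, a minimal sketch of the same reset done through the Java API instead of control.sh (assuming an already-started node or thick client handle; the cache name is the one from the error above, the class name is made up):

import java.util.Collections;
import org.apache.ignite.Ignite;

public class ResetLostPartitions {
    // Java counterpart of the control.sh reset_lost_partitions command.
    // Only makes sense once the former partition owners have rejoined,
    // otherwise the data they held stays lost.
    public static void reset(Ignite ignite) {
        ignite.resetLostPartitions(Collections.singleton("PUBLIC_StoreProductFeatures"));
    }
}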

Please help.
Thanks in advance!

Best regards,
Rose.



Re: Crash recovery for Ignite persistence cluster (lost partitions case)

Posted by Айсина Роза Мунеровна <ro...@sbermarket.ru>.
Hello Ilya!

Thank you very much for your explanations! It is now much clearer what happens when nodes are shut down.

Our main goal is to restore the cluster after unexpected errors and to recover all data that was persisted on disk.

So I enabled baseline auto adjust and tested these scenarios (completeness was checked by counting the number of rows in each cache):

1. Number of dead nodes <= number of backups (in my case 1)

- after the node died, it was automatically removed from the baseline topology
- no errors, all data is complete

2. Bringing the dead node from scenario 1 back to life

- started the dead node from scenario 1 and it was soon added back to the topology
- no errors, all data is complete, no need to run reset_lost_partitions

3. Shutting down all nodes and bringing them back to life

- shut down all nodes; the service ended with SIGKILL (as this is an invalid state)
- then brought all nodes up again - the cluster became active automatically
- no errors, all data is complete, but the first queries were very slow ("Query execution is too long" warnings)

4. Number of dead nodes > number of backups (in my case 2)

- shut down 2 nodes and they became OFFLINE in the topology
- they were not removed from the baseline topology!
- Failed to execute query because cache partition has been lostPart [cacheName=PUBLIC_StoreProductFeatures, part=2]
- data is incomplete

5. Bringing the dead nodes from scenario 4 back to life

- started the dead nodes from scenario 4 and they soon became ONLINE in the topology
- the lost-partition error remained
- after running reset_lost_partitions the error disappeared and the data became complete


For scenarios 4 and 5, am I right that reset_lost_partitions must be called only after the dead nodes have been brought back to life? Otherwise the data that was only on the dead nodes will be lost, although the cluster will no longer report the lost-partition error?
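
For reference, a minimal sketch (illustrative only, assuming an already-started node or thick client handle; the class name is made up) of listing which caches still report lost partitions before running reset_lost_partitions:

import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

public class LostPartitionsReport {
    // Prints every cache that still reports lost partitions.
    // If nothing is printed, there is nothing to reset.
    public static void print(Ignite ignite) {
        for (String cacheName : ignite.cacheNames()) {
            IgniteCache<?, ?> cache = ignite.cache(cacheName);
            Collection<Integer> lost = cache.lostPartitions();
            if (!lost.isEmpty())
                System.out.printf("Cache %s: %d lost partitions %s%n",
                    cacheName, lost.size(), lost);
        }
    }
}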

And another question: in case of unexpected errors like OOM, will bringing the nodes back online help?


Thanks in advance!

Best regards,
Rose.


On 23 Nov 2022, at 4:50 PM, Ilya Shishkov <sh...@gmail.com> wrote:

Hi Роза,

In addition to my previous answer:
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Failed with result 'timeout'.
Your nodes were killed (SIGKILL), so there was no graceful shutdown. As I said earlier, you should trigger a rebalance (i.e. remove the nodes you are stopping from the baseline) and wait for rebalancing to finish. After rebalancing, the nodes removed from the baseline can be shut down gracefully. You can read about this feature in [1].

1. https://ignite.apache.org/docs/latest/starting-nodes#shutting-down-nodes

Tue, 22 Nov 2022 at 22:11, Ilya Shishkov <sh...@gmail.com>:
You can read about the baseline topology in the documentation [1]. Manual baseline management can be done by means of the control script [2].

Links:
1. https://ignite.apache.org/docs/latest/clustering/baseline-topology
2. https://ignite.apache.org/docs/latest/tools/control-script#activation-deactivation-and-topology-management

Tue, 22 Nov 2022 at 21:58, Ilya Shishkov <sh...@gmail.com>:
There is a typo here:
> Lost partitions are expected behaviour in case of partition because you have only 1 backup and lost two nodes.

I mean that lost partitions are expected behaviour for partitioned caches when the number of offline nodes is greater than the number of backups. In your case there is 1 backup and 2 offline nodes.

Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <sh...@gmail.com>:
Hi,
> 1) What can I do to recover from partitions lost problem after shutting down several nodes?
> I thought that in case of graceful shutdown this problem must be solved.
> Now I can recover by returning one of offline nodes to cluster (starting the service) and running reset_lost_partitions command for broken cache. After this cache becomes available.

Are the caches with lost partitions replicated or partitioned? Lost partitions are expected behaviour in case of partition because you have only 1 backup and lost two nodes. If you want the cluster data to remain fully available with 2 nodes offline, you should set 2 backups for partitioned caches.

As for graceful shutdown: why do you expect that data would not be lost? If you have 1 backup and 1 offline node, then some partitions are left without backups, because those backups remain inaccessible while their owner is offline. So, if you shut down another node holding such partitions, they will be lost.

So, for persistent clusters, if you are in a situation where you must work for a long time with offline nodes (but without partition loss), you should trigger a rebalance. This can be done manually or automatically by changing the baseline.
After rebalancing, the number of data copies will be restored.
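
For example, a minimal sketch of resetting the baseline to the currently alive nodes through the Java API (the control script can do the same; the class and method names here are illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCluster;

public class BaselineUpdate {
    // Sets the baseline to the current topology version, i.e. to the nodes
    // that are online right now. Offline nodes drop out of the baseline and
    // their partitions are rebalanced to the remaining nodes.
    public static void removeOfflineNodes(Ignite ignite) {
        IgniteCluster cluster = ignite.cluster();
        cluster.setBaselineTopology(cluster.topologyVersion());
    }
}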

Now you should bring back at least one of the nodes in order to make the partitions available. But if you need a full set of primary and backup partitions, you need all baseline nodes in the cluster.

> 2) What can I do to prevent this problem in a scenario with automatic cluster deployment? Should I add a reset_lost_partitions command after activation or redeploy?

I don't fully understand what you mean, but there are no problems with automatic deployments as such. In most cases, partition loss indicates that the cluster is in an invalid state.

Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна <ro...@sbermarket.ru>:
Hi Sumit!

Thanks for your reply!

Yeah, I have used the reset_lost_partitions utility many times.

The problem is that this command requires all baseline nodes to be present.
If I shut down a node, auto adjustment does not remove it from the baseline topology, and reset_lost_partitions fails with the error that all partition owners have left the grid and partition data has been lost.

So I remove them from the baseline manually; the operation succeeds, but the data that was on the offline nodes is lost.

What I am trying to understand is why graceful shutdown does not handle this situation when caches have backups and persistence is enabled.
How can we automatically bring up Ignite nodes if data is lost after a redeploy because the cluster can't handle the lost partitions problem?

Best regards,
Rose.

On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <su...@gmail.com> wrote:


Please check if this helps: https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
Also, is there any reason baseline auto adjustment is disabled?

On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <ro...@sbermarket.ru> wrote:
Hola again!

I discovered that enabling graceful shutdown via -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true does not work.

In the service logs I see that nothing happens when SIGTERM comes :(
Eventually the stop action times out and SIGKILL is sent, which causes an ungraceful shutdown.
The timeout is set to 10 minutes.

Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Failed with result 'timeout'.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.

I also enabled the DEBUG level and see that nothing happens after rebalancing starts (this is the end of the log):

[2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
[2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...

I forgot to add that the service is started with service.sh, not ignite.sh.

Please help!



Re: Crash recover for Ignite persistence cluster (lost partitions case)

Posted by Ilya Shishkov <sh...@gmail.com>.
There is a typo here:
> Lost partitions are expected behaviour in case of partition because you
have only 1 backup and lost two nodes.

I mean, that lost partitions are expected behaviour in case of partitioned
caches when the number of offline nodes is more than the number of backups.
In your case there are 1 backup and 2 offline nodes.

вт, 22 нояб. 2022 г. в 21:56, Ilya Shishkov <sh...@gmail.com>:

> Hi,
> > 1) What can I do to recover from partitions lost problem after shutting
> down several nodes?
> > I thought that in case of graceful shutdown this problem must be solved.
> > Now I can recover by returning *one* of offline nodes to cluster
> (starting the service) and running *reset_lost_partitions* command for
> broken cache. After this cache becomes available.
>
> Are caches with lost partitions replicated or partitioned? Lost partitions
> are expected behaviour in case of partition because you have only 1 backup
> and lost two nodes. If you want from cluster data to remain fully available
> in case of 2 nodes, you should set 2 backups for partitioned caches.
>
> As for graceful shutdown: why do you expect that data would not be lost?
> If you have 1 backup and 1 offline node, then there are some partitions
> without backups, because the latter remains inaccessible while their owner
> is offline. So, if you shutdown another one node with such partitions, they
> will be lost.
>
> So, for persistent clusters if you are in a situation, when you should
> work a long time without backups (i.e. with offline nodes, BUT without
> partition loss), you should trigger a rebalance. It can be done manually or
> automatically by changing the baseline.
> After rebalancing, the amount of data copies will be restored.
>
> Now you should bring back at least one of the nodes, in order to make
> partitions available. But if you need a full set of primary and partitions
> you need all baseline nodes in the cluster.
>
> 2) What can I do to prevent this problem in scenario with automatic
> cluster deployment? Should I add *reset_lost_partitions* command after
> activation or redeploy?
>
> I don't fully understand what you mean, but there are no problems with
> automatic deployments. In most cases, the situation with
> partition losses tells that cluster is in invalid state.
>
> вт, 22 нояб. 2022 г. в 19:49, Айсина Роза Мунеровна <
> roza.aysina@sbermarket.ru>:
>
>> Hi Sumit!
>>
>> Thanks for your reply!
>>
>> Yeah, I have used this utility reset_lost_partitions many times.
>>
>> The problem is that this function requires all baseline nodes to be
>> present.
>> If I shutdown node auto adjustment does not remove this node from
>> baseline topology and reset_lost_partitions ends with error that all
>> partition owners have left the grid, partition data has been lost.
>>
>> So I remove them manually and this operation succeeds but with loss of
>> data on offline nodes.
>>
>> What I am trying to understand is that why graceful shutdown do not
>> handles this situation in case of backup caches and persistance.
>> How can we automatically raise Ignite nodes if after redeploy data is
>> lost because cluster can’t handle lost partitions problem?
>>
>> Best regards,
>> Rose.
>>
>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <su...@gmail.com>
>> wrote:
>>
>> Внимание: Внешний отправитель!
>> Если вы не знаете отправителя - не открывайте вложения, не переходите по
>> ссылкам, не пересылайте письмо!
>>
>> Please check if this helps:
>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>> Also any reason baseline auto adjustment is disabled?
>>
>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
>> roza.aysina@sbermarket.ru> wrote:
>>
>>> Hola again!
>>>
>>> I discovered that enabling graceful shutdown via does not work.
>>>
>>> In service logs I see that nothing happens when *SIGTERM* comes :(
>>> Eventually stopping action has been timed out and *SIGKILL* has been
>>> sent which causes ungraceful shutdown.
>>> Timeout is set to *10 minutes*.
>>>
>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
>>> In-Memory Computing Platform Service...
>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
>>> In-Memory Computing Platform Service.
>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
>>> In-Memory Computing Platform Service...
>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>> apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out.
>>> Killing.
>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>> apache-ignite@config.xml.service: Killing process 11135 (java) with
>>> signal SIGKILL.
>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
>>> apache-ignite@config.xml.service: Failed with result 'timeout'.
>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
>>> In-Memory Computing Platform Service.
>>>
>>>
>>> I also enabled *DEBUG* level and see that nothing happens after
>>> rebalancing started (this is the end of log):
>>>
>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
>>> hook...
>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in
>>> progress (ignoring): Shutdown in progress
>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
>>> have sufficient backups and local rebalance completion...
>>>
>>>
>>> I forgot to add that service is tarted with *service.sh*, not
>>> *ignite.sh*.
>>>
>>> Please help!
>>>
>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
>>> roza.aysina@sbermarket.ru> wrote:
>>>
>>> Hola!
>>> I have a problem recovering from cluster crash in case when persistence
>>> is enabled.
>>>
>>> Our setup is
>>> - 5 VM nodes with 40G Ram and 200GB disk,
>>> - persistence is enabled (on separate disk on each VM),
>>> - all cluster actions are made through Ansible playbooks,
>>> - all caches are either partitioned with backups = 1 or replicated,
>>> - cluster starts as the service with running ignite.sh,
>>> - baseline auto adjust is disabled.
>>>
>>> Also following the docs about partition loss policy I have added
>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait
>>> until partition rebalancing.
>>>
>>> What problem we have: after shutting down several nodes (2 go 5) one
>>> after another exception about lost partitions is raised.
>>>
>>> *Caused by:
>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>> Failed to execute query because cache partition has been lostPart
>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>
>>> But in logs of dead nodes I see that all shutdown hooks are called as
>>> expected on both nodes:
>>>
>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
>>> hook...
>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
>>> have sufficient backups and local rebalance completion...
>>>
>>>
>>>
>>> And baseline topology looks like this (with 2 offline nodes as
>>> expected):
>>>
>>> Cluster state: active
>>> Current topology version: 23
>>> Baseline auto adjustment disabled: softTimeout=30000
>>>
>>> Current topology version: 23 (Coordinator:
>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
>>> Order=3)
>>>
>>> Baseline nodes:
>>>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b,
>>> Address=172.17.0.1, State=ONLINE, Order=3
>>>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b,
>>> Address=172.17.0.1, State=ONLINE, Order=21
>>>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e,
>>> Address=172.17.0.1, State=ONLINE, Order=5
>>>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>>
>>> --------------------------------------------------------------------------------
>>> Number of baseline nodes: 5
>>>
>>> Other nodes not found.
>>>
>>>
>>>
>>> So my questions are:
>>>
>>> 1) What can I do to recover from partitions lost problem after shutting
>>> down several nodes? I thought that in case of graceful shutdown this
>>> problem must be solved.
>>>
>>> Now I can recover by returning *one* of offline nodes to cluster
>>> (starting the service) and running *reset_lost_partitions* command for
>>> broken cache. After this cache becomes available.
>>>
>>> 2) What can I do to prevent this problem in scenario with automatic
>>> cluster deployment? Should I add *reset_lost_partitions* command after
>>> activation or redeploy?
>>>
>>> Please help.
>>> Thanks in advance!
>>>
>>> Best regards,
>>> Rose.
>>>
>>>
>>
>>
>> --
>> Regards,
>> Sumit Deshinge
>>
>>
>>
>

Re: Crash recover for Ignite persistence cluster (lost partitions case)

Posted by Ilya Shishkov <sh...@gmail.com>.
Hi,
> 1) What can I do to recover from partitions lost problem after shutting
down several nodes?
> I thought that in case of graceful shutdown this problem must be solved.
> Now I can recover by returning *one* of offline nodes to cluster
(starting the service) and running *reset_lost_partitions* command for
broken cache. After this cache becomes available.

Are the caches with lost partitions replicated or partitioned? For partitioned
caches, lost partitions are expected behaviour here, because you have only 1
backup and lost two nodes. If you want the data to remain fully available with
2 nodes down, you should set 2 backups for partitioned caches.
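
For illustration, a minimal Java sketch of such a cache configuration (the
class name, cache name, key/value types and the loss policy below are
placeholders, not taken from your setup):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheWithTwoBackups {
    public static void main(String[] args) {
        // Node configuration (persistence, data regions, discovery) omitted.
        try (Ignite ignite = Ignition.start()) {
            // Hypothetical cache name and types, for illustration only.
            CacheConfiguration<Long, String> cfg =
                new CacheConfiguration<>("exampleCache");
            cfg.setCacheMode(CacheMode.PARTITIONED);
            // Two backups keep every partition available even when any two
            // baseline nodes are offline at the same time.
            cfg.setBackups(2);
            // Fail fast on lost partitions instead of returning partial data.
            cfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
            ignite.getOrCreateCache(cfg);
        }
    }
}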

As for graceful shutdown: why do you expect that data would not be lost? If
you have 1 backup and 1 offline node, then some partitions are left without
backups, because those backup copies stay inaccessible while their owner is
offline. So, if you shut down another node that holds such partitions, they
will be lost.

So, for persistent clusters: if you find yourself in a situation where you
have to run for a long time without some backups (i.e. with offline nodes,
BUT without partition loss), you should trigger a rebalance. It can be done
manually or automatically by changing the baseline.
After rebalancing, the number of data copies is restored.
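
One possible way to change the baseline from code is sketched below; the same
can be done with the control.sh --baseline commands. This is only an
illustrative sketch: setting the baseline to the currently alive server nodes
drops the offline nodes, and their partitions are rebalanced onto the
remaining ones, so do it deliberately.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ShrinkBaseline {
    public static void main(String[] args) {
        // Join the cluster with whatever configuration it already uses (omitted).
        try (Ignite ignite = Ignition.start()) {
            // Make the currently alive server nodes the new baseline.
            // Offline baseline nodes are dropped, which triggers rebalancing
            // of their partitions onto the remaining nodes.
            ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());
        }
    }
}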

For now, you should bring back at least one of the offline nodes in order to
make the lost partitions available again. But if you need a full set of
primary and backup partitions, you need all baseline nodes back in the cluster.

2) What can I do to prevent this problem in scenario with automatic cluster
deployment? Should I add *reset_lost_partitions* command after activation
or redeploy?

I don't fully understand what you mean, but automatic deployments as such are
not a problem. In most cases, a partition-loss situation indicates that the
cluster is in an invalid state.

On Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна <
roza.aysina@sbermarket.ru> wrote:

> Hi Sumit!
>
> Thanks for your reply!
>
> Yeah, I have used this utility reset_lost_partitions many times.
>
> The problem is that this function requires all baseline nodes to be
> present.
> If I shutdown node auto adjustment does not remove this node from baseline
> topology and reset_lost_partitions ends with error that all partition
> owners have left the grid, partition data has been lost.
>
> So I remove them manually and this operation succeeds but with loss of
> data on offline nodes.
>
> What I am trying to understand is that why graceful shutdown do not
> handles this situation in case of backup caches and persistance.
> How can we automatically raise Ignite nodes if after redeploy data is lost
> because cluster can’t handle lost partitions problem?
>
> Best regards,
> Rose.
>
> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <su...@gmail.com>
> wrote:
>
>
> Please check if this helps:
> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
> Also any reason baseline auto adjustment is disabled?
>
> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
> roza.aysina@sbermarket.ru> wrote:
>
>> Hola again!
>>
>> I discovered that enabling graceful shutdown via does not work.
>>
>> In service logs I see that nothing happens when *SIGTERM* comes :(
>> Eventually stopping action has been timed out and *SIGKILL* has been
>> sent which causes ungraceful shutdown.
>> Timeout is set to *10 minutes*.
>>
>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
>> In-Memory Computing Platform Service...
>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
>> In-Memory Computing Platform Service.
>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
>> In-Memory Computing Platform Service...
>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>> apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out.
>> Killing.
>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>> apache-ignite@config.xml.service: Killing process 11135 (java) with
>> signal SIGKILL.
>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
>> apache-ignite@config.xml.service: Failed with result 'timeout'.
>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
>> In-Memory Computing Platform Service.
>>
>>
>> I also enabled *DEBUG* level and see that nothing happens after
>> rebalancing started (this is the end of log):
>>
>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
>> hook...
>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in
>> progress (ignoring): Shutdown in progress
>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
>> have sufficient backups and local rebalance completion...
>>
>>
>> I forgot to add that service is tarted with *service.sh*, not *ignite.sh*
>> .
>>
>> Please help!
>>
>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
>> roza.aysina@sbermarket.ru> wrote:
>>
>> Hola!
>> I have a problem recovering from cluster crash in case when persistence
>> is enabled.
>>
>> Our setup is
>> - 5 VM nodes with 40G Ram and 200GB disk,
>> - persistence is enabled (on separate disk on each VM),
>> - all cluster actions are made through Ansible playbooks,
>> - all caches are either partitioned with backups = 1 or replicated,
>> - cluster starts as the service with running ignite.sh,
>> - baseline auto adjust is disabled.
>>
>> Also following the docs about partition loss policy I have added
>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait until
>> partition rebalancing.
>>
>> What problem we have: after shutting down several nodes (2 go 5) one
>> after another exception about lost partitions is raised.
>>
>> *Caused by:
>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>> Failed to execute query because cache partition has been lostPart
>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>
>> But in logs of dead nodes I see that all shutdown hooks are called as
>> expected on both nodes:
>>
>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
>> hook...
>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
>> have sufficient backups and local rebalance completion...
>>
>>
>>
>> And baseline topology looks like this (with 2 offline nodes as expected):
>>
>> Cluster state: active
>> Current topology version: 23
>> Baseline auto adjustment disabled: softTimeout=30000
>>
>> Current topology version: 23 (Coordinator:
>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
>> Order=3)
>>
>> Baseline nodes:
>>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b,
>> Address=172.17.0.1, State=ONLINE, Order=3
>>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b,
>> Address=172.17.0.1, State=ONLINE, Order=21
>>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e,
>> Address=172.17.0.1, State=ONLINE, Order=5
>>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>
>> --------------------------------------------------------------------------------
>> Number of baseline nodes: 5
>>
>> Other nodes not found.
>>
>>
>>
>> So my questions are:
>>
>> 1) What can I do to recover from partitions lost problem after shutting
>> down several nodes? I thought that in case of graceful shutdown this
>> problem must be solved.
>>
>> Now I can recover by returning *one* of offline nodes to cluster
>> (starting the service) and running *reset_lost_partitions* command for
>> broken cache. After this cache becomes available.
>>
>> 2) What can I do to prevent this problem in scenario with automatic
>> cluster deployment? Should I add *reset_lost_partitions* command after
>> activation or redeploy?
>>
>> Please help.
>> Thanks in advance!
>>
>> Best regards,
>> Rose.
>>
>>
>
>
> --
> Regards,
> Sumit Deshinge
>
>
>

Re: Crash recover for Ignite persistence cluster (lost partitions case)

Posted by Айсина Роза Мунеровна <ro...@sbermarket.ru>.
Hi Sumit!

Thanks for your reply!

Yeah, I have used this utility reset_lost_partitions many times.

The problem is that this command requires all baseline nodes to be present.
If I shut down a node, auto adjustment does not remove it from the baseline topology, and reset_lost_partitions fails with an error saying that all partition owners have left the grid and the partition data has been lost.

So I remove the offline nodes from the baseline manually; after that the operation succeeds, but the data that lived only on those offline nodes is lost.
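
(For reference, the reset step can also be issued through the Java API, which does roughly what control.sh --cache reset_lost_partitions does; a minimal sketch, with only the cache name taken from the stack trace earlier in the thread and the rest assumed:)

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ResetLostPartitions {
    public static void main(String[] args) {
        // Join the cluster (configuration omitted); a client node works too.
        try (Ignite ignite = Ignition.start()) {
            // Marks lost partitions of the cache as usable again. Data that
            // lived only on the offline owners is NOT recovered by this call.
            ignite.resetLostPartitions(Collections.singleton("PUBLIC_StoreProductFeatures"));
        }
    }
}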

What I am trying to understand is why graceful shutdown does not handle this situation when caches have backups and persistence is enabled.
How can we bring Ignite nodes back up automatically if, after a redeploy, data is lost because the cluster cannot handle the lost-partitions problem?

Best regards,
Rose.

On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <su...@gmail.com> wrote:


Please check if this helps: https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
Also any reason baseline auto adjustment is disabled?

On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <ro...@sbermarket.ru>> wrote:
Hola again!

I discovered that enabling graceful shutdown via does not work.

In service logs I see that nothing happens when SIGTERM comes :(
Eventually stopping action has been timed out and SIGKILL has been sent which causes ungraceful shutdown.
Timeout is set to 10 minutes.

Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Failed with result 'timeout'.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.

I also enabled DEBUG level and see that nothing happens after rebalancing started (this is the end of log):

[2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
[2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...

I forgot to add that service is tarted with service.sh, not ignite.sh.

Please help!

On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <ro...@sbermarket.ru>> wrote:

Hola!
I have a problem recovering from cluster crash in case when persistence is enabled.

Our setup is
- 5 VM nodes with 40G Ram and 200GB disk,
- persistence is enabled (on separate disk on each VM),
- all cluster actions are made through Ansible playbooks,
- all caches are either partitioned with backups = 1 or replicated,
- cluster starts as the service with running ignite.sh,
- baseline auto adjust is disabled.

Also following the docs about partition loss policy I have added -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true to JVM_OPTS to wait until partition rebalancing.

What problem we have: after shutting down several nodes (2 go 5) one after another exception about lost partitions is raised.

Caused by: org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostPart [cacheName=PUBLIC_StoreProductFeatures, part=512]

But in logs of dead nodes I see that all shutdown hooks are called as expected on both nodes:

[2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...


And baseline topology looks like this (with 2 offline nodes as expected):

Cluster state: active
Current topology version: 23
Baseline auto adjustment disabled: softTimeout=30000

Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)

Baseline nodes:
    ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
    ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
    ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
    ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
    ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
--------------------------------------------------------------------------------
Number of baseline nodes: 5

Other nodes not found.


So my questions are:

1) What can I do to recover from partitions lost problem after shutting down several nodes? I thought that in case of graceful shutdown this problem must be solved.

Now I can recover by returning one of offline nodes to cluster (starting the service) and running reset_lost_partitions command for broken cache. After this cache becomes available.

2) What can I do to prevent this problem in scenario with automatic cluster deployment? Should I add reset_lost_partitions command after activation or redeploy?

Please help.
Thanks in advance!

Best regards,
Rose.



--
Regards,
Sumit Deshinge




Re: Crash recover for Ignite persistence cluster (lost partitions case)

Posted by Sumit Deshinge <su...@gmail.com>.
Please check if this helps:
https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
Also, is there any reason why baseline auto adjustment is disabled?

On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
roza.aysina@sbermarket.ru> wrote:

> Hola again!
>
> I discovered that enabling graceful shutdown via does not work.
>
> In service logs I see that nothing happens when *SIGTERM* comes :(
> Eventually stopping action has been timed out and *SIGKILL* has been sent
> which causes ungraceful shutdown.
> Timeout is set to *10 minutes*.
>
> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
> In-Memory Computing Platform Service...
> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
> In-Memory Computing Platform Service.
> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
> In-Memory Computing Platform Service...
> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
> apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out.
> Killing.
> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
> apache-ignite@config.xml.service: Killing process 11135 (java) with
> signal SIGKILL.
> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
> apache-ignite@config.xml.service: Failed with result 'timeout'.
> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
> In-Memory Computing Platform Service.
>
>
> I also enabled *DEBUG* level and see that nothing happens after
> rebalancing started (this is the end of log):
>
> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
> hook...
> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress
> (ignoring): Shutdown in progress
> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
> have sufficient backups and local rebalance completion...
>
>
> I forgot to add that service is tarted with *service.sh*, not *ignite.sh*
> .
>
> Please help!
>
> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
> roza.aysina@sbermarket.ru> wrote:
>
> Hola!
> I have a problem recovering from cluster crash in case when persistence is
> enabled.
>
> Our setup is
> - 5 VM nodes with 40G Ram and 200GB disk,
> - persistence is enabled (on separate disk on each VM),
> - all cluster actions are made through Ansible playbooks,
> - all caches are either partitioned with backups = 1 or replicated,
> - cluster starts as the service with running ignite.sh,
> - baseline auto adjust is disabled.
>
> Also following the docs about partition loss policy I have added
> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait until
> partition rebalancing.
>
> What problem we have: after shutting down several nodes (2 go 5) one after
> another exception about lost partitions is raised.
>
> *Caused by:
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute query because cache partition has been lostPart
> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>
> But in logs of dead nodes I see that all shutdown hooks are called as
> expected on both nodes:
>
> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
> hook...
> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
> have sufficient backups and local rebalance completion...
>
>
>
> And baseline topology looks like this (with 2 offline nodes as expected):
>
> Cluster state: active
> Current topology version: 23
> Baseline auto adjustment disabled: softTimeout=30000
>
> Current topology version: 23 (Coordinator:
> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
> Order=3)
>
> Baseline nodes:
>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
> State=ONLINE, Order=3
>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1,
> State=ONLINE, Order=21
>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1,
> State=ONLINE, Order=5
>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>
> --------------------------------------------------------------------------------
> Number of baseline nodes: 5
>
> Other nodes not found.
>
>
>
> So my questions are:
>
> 1) What can I do to recover from partitions lost problem after shutting
> down several nodes? I thought that in case of graceful shutdown this
> problem must be solved.
>
> Now I can recover by returning *one* of offline nodes to cluster
> (starting the service) and running *reset_lost_partitions* command for
> broken cache. After this cache becomes available.
>
> 2) What can I do to prevent this problem in scenario with automatic
> cluster deployment? Should I add *reset_lost_partitions* command after
> activation or redeploy?
>
> Please help.
> Thanks in advance!
>
> Best regards,
> Rose.
>
>


-- 
Regards,
Sumit Deshinge

Re: Crash recover for Ignite persistence cluster (lost partitions case)

Posted by Айсина Роза Мунеровна <ro...@sbermarket.ru>.
Hola again!

I discovered that enabling graceful shutdown this way does not work.

In the service logs I see that nothing happens when SIGTERM arrives :(
Eventually the stop action times out and SIGKILL is sent, which causes an ungraceful shutdown.
The timeout is set to 10 minutes.

Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ignite@config.xml.service: Failed with result 'timeout'.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.

I also enabled the DEBUG level and see that nothing happens after rebalancing has started (this is the end of the log):

[2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
[2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...

I forgot to add that the service is started with service.sh, not ignite.sh.
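
For completeness: besides the IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN system property, the graceful shutdown policy can also be set in the node configuration itself. A minimal, assumed sketch (not our actual config):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.ShutdownPolicy;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulShutdownNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // GRACEFUL: on stop, the node waits until its data has enough copies
        // on other nodes, so stopping it does not leave partitions without owners.
        cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);
        // Persistence / data region configuration omitted for brevity.
        Ignite ignite = Ignition.start(cfg);
        // The shutdown hook (e.g. on SIGTERM) should then honour this policy.
    }
}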

Please help!

On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <ro...@sbermarket.ru> wrote:

Hola!
I have a problem recovering from cluster crash in case when persistence is enabled.

Our setup is
- 5 VM nodes with 40G Ram and 200GB disk,
- persistence is enabled (on separate disk on each VM),
- all cluster actions are made through Ansible playbooks,
- all caches are either partitioned with backups = 1 or replicated,
- cluster starts as the service with running ignite.sh,
- baseline auto adjust is disabled.

Also following the docs about partition loss policy I have added -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true to JVM_OPTS to wait until partition rebalancing.

What problem we have: after shutting down several nodes (2 go 5) one after another exception about lost partitions is raised.

Caused by: org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostPart [cacheName=PUBLIC_StoreProductFeatures, part=512]

But in logs of dead nodes I see that all shutdown hooks are called as expected on both nodes:

[2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...


And baseline topology looks like this (with 2 offline nodes as expected):

Cluster state: active
Current topology version: 23
Baseline auto adjustment disabled: softTimeout=30000

Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)

Baseline nodes:
    ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
    ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
    ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
    ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
    ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
--------------------------------------------------------------------------------
Number of baseline nodes: 5

Other nodes not found.


So my questions are:

1) What can I do to recover from partitions lost problem after shutting down several nodes? I thought that in case of graceful shutdown this problem must be solved.

Now I can recover by returning one of offline nodes to cluster (starting the service) and running reset_lost_partitions command for broken cache. After this cache becomes available.

2) What can I do to prevent this problem in scenario with automatic cluster deployment? Should I add reset_lost_partitions command after activation or redeploy?

Please help.
Thanks in advance!

Best regards,
Rose.
