Posted to issues@ignite.apache.org by "Vipul Thakur (Jira)" <ji...@apache.org> on 2024/01/11 17:00:00 UTC

[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

    [ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805709#comment-17805709 ] 

Vipul Thakur edited comment on IGNITE-21059 at 1/11/24 4:59 PM:
----------------------------------------------------------------

Hi [~zstan], [~cos]

 

I ran another test today in my local environment, changing the transaction concurrency to OPTIMISTIC and the isolation level to SERIALIZABLE with a 5 s transaction timeout, and ran a long-running load at low traffic. We have multiple JMS listeners that communicate with Ignite while writing data, and during the load I restarted one node to mimic a change in the network topology of the cluster. The first time I did this nothing happened, but when I repeated it with another node we observed the same issue we are observing in prod.
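
For reference, the transaction settings used in this test correspond roughly to the sketch below. This only illustrates the OPTIMISTIC / SERIALIZABLE / 5 s timeout combination described above; the class, method and cache names ("eventCache") are hypothetical placeholders, not taken from our actual code.

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class TxSettingsSketch {
    /** Writes one entry using the transaction settings from the test run. */
    public static void writeWithTestTxSettings(Ignite ignite) {
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.OPTIMISTIC,   // concurrency mode used in the test
                TransactionIsolation.SERIALIZABLE,   // isolation level used in the test
                5_000,                               // 5 s transaction timeout, in ms
                0)) {                                // txSize hint, 0 = unknown
            IgniteCache<String, String> cache = ignite.cache("eventCache"); // placeholder cache name
            cache.put("some-key", "some-value");
            tx.commit();
        }
    }
}
{code}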

The write service's listeners went into a choked state and my queue started piling up.

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the pod logs, and the logs from all 3 nodes in that environment.

 

We have increased the WAL size to 512 MB, reduced the transaction timeout to 5 seconds, and rolled back failureDetectionTimeout and clientFailureDetectionTimeout to their default values.
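
For clarity, the tuning above corresponds roughly to the sketch below. I am assuming the WAL change maps to the WAL segment size and that the failure-detection values are simply left unset so the defaults apply; our real settings live in the attached cache-config XML, so the exact properties there may differ.

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class TuningSketch {
    /** Builds a configuration reflecting the tuning described above. */
    public static IgniteConfiguration tunedConfiguration() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // WAL: 512 MB (assumption: this refers to the WAL segment size).
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.setWalSegmentSize(512 * 1024 * 1024);
        cfg.setDataStorageConfiguration(storageCfg);

        // Default transaction timeout reduced to 5 s.
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setDefaultTxTimeout(5_000);
        cfg.setTransactionConfiguration(txCfg);

        // failureDetectionTimeout / clientFailureDetectionTimeout are not set,
        // so they stay at their defaults (10 s and 30 s respectively).
        return cfg;
    }
}
{code}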

Please help us with your observations.

 

I have also modified my code to detect thread deadlocks, as shown below:

 

!image-2024-01-11-22-28-51-501.png|width=638,height=248!
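
Since the screenshot may not come through in the plain-text archive, here is a minimal sketch of the kind of JVM-level deadlock detection I mean. This is an assumption about the approach (using the standard java.lang.management ThreadMXBean API), not a copy of the code in the image; the class name and logging are placeholders.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatcher implements Runnable {
    private final ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();

    @Override
    public void run() {
        // Returns the IDs of threads deadlocked on monitors or ownable synchronizers, or null.
        long[] deadlockedIds = threadMxBean.findDeadlockedThreads();
        if (deadlockedIds != null) {
            ThreadInfo[] infos = threadMxBean.getThreadInfo(deadlockedIds, true, true);
            for (ThreadInfo info : infos) {
                // Dump the stack trace and lock owners so the choked listener threads can be identified.
                System.err.println("Deadlocked thread detected: " + info);
            }
        }
    }
}
{code}

Such a check can be scheduled periodically (for example with a ScheduledExecutorService) alongside the JMS listeners so that a deadlock shows up in the logs as soon as it forms.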



> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
> --------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-21059
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21059
>             Project: Ignite
>          Issue Type: Bug
>          Components: binary, clients
>    Affects Versions: 2.14
>            Reporter: Vipul Thakur
>            Priority: Critical
>         Attachments: Ignite_server_logs.zip, cache-config-1.xml, client-service.zip, digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, ignite-server-nohup-1.out, ignite-server-nohup.out, ignite_issue_1101.zip, image-2024-01-11-22-28-51-501.png, image.png, long_txn_.png, nohup_12.out
>
>
> We recently upgraded from 2.7.6 to 2.14 due to an issue observed in our production environment where the cluster would hang during partition map exchange.
> Please see the ticket below, which I created a while back for Ignite 2.7.6:
> https://issues.apache.org/jira/browse/IGNITE-13298
> We migrated to Apache Ignite 2.14 and the upgrade went smoothly, but on the third day we saw the cluster traffic dip again.
> Our cluster has 5 nodes with 400 GB of RAM and more than 1 TB of SSD.
> Please find the attached config (added as an attachment for review).
> I have also added the server logs from the time the issue happened.
> We have set the transaction timeout as well as the socket timeout at both the server and client end for our write operations, but sometimes the cluster still goes into a hang state: all our get calls get stuck, our JMS listener threads slowly start to freeze, and every thread reaches a choked state after some time.
> As a result, our read services, which do not even use transactions to retrieve data, also start to choke, ultimately leading to a dip in end-user traffic.
> We were hoping the product upgrade would help, but that has not been the case so far.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)