Posted to dev@qpid.apache.org by "Martijn Fennema (Jira)" <ji...@apache.org> on 2021/08/04 09:51:00 UTC

[jira] [Comment Edited] (PROTON-2411) Simultaneous idle timeout sequencing errors

    [ https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392838#comment-17392838 ] 

Martijn Fennema edited comment on PROTON-2411 at 8/4/21, 9:50 AM:
------------------------------------------------------------------

I'm a colleague of [~Wiggelinkhuizen] and have been investigating this issue in his absence.

After some further investigation we figured out that the issue occurs approximately three and a half weeks after the (re)boot of the machine Proton is running on. Not entirely coincidentally, three and a half weeks is roughly 2^31 milliseconds (2,147,483.648 seconds, about 24.86 days). Since we build the code in 32-bit, this pointed towards an integer overflow.
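
As a quick sanity check on that arithmetic (a standalone snippet, not Proton code):
{code:c}
#include <stdio.h>
#include <stdint.h>

int main(void)
{
  int64_t ms = INT64_C(1) << 31;        /* 2147483648 ms */
  double days = ms / 1000.0 / 86400.0;  /* ms -> s -> days */
  printf("2^31 ms = %.2f days = %.2f weeks\n", days, days / 7.0);
  return 0;  /* prints: 2^31 ms = 24.86 days = 3.55 weeks */
}
{code}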

With this in mind we took a closer look at the release notes of the last few versions and stumbled upon PROTON-985: "Modify pn_transport_tick to explicitly use a monotonic clock, not wall clock time."

While searching the code for instances of CLOCK_MONOTONIC, we found the following suspicious code:

 
{code:c}
int64_t pn_proactor_now_64(void)
{
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  /* On a 32-bit build, tv_sec is a 32-bit integer, so tv_sec * 1000 is
     computed in 32-bit arithmetic and overflows after ~24.8 days of uptime. */
  return t.tv_sec * 1000 + t.tv_nsec / 1000000;
}
{code}
Since we compile the Proton code in 32-bit, the timespec struct's members are 32-bit integers. The multiplication t.tv_sec * 1000 is then performed in 32-bit arithmetic, so the computed return value overflows once the time since the last boot passes 2,147,483 seconds (2^31 / 1000).
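
The effect can be demonstrated in isolation. A minimal sketch (standalone, not Proton code; signed overflow is formally undefined behaviour, but on typical 32-bit targets it wraps to a negative value):
{code:c}
#include <stdio.h>
#include <stdint.h>

int main(void)
{
  /* One second past the overflow point of 2147483 s since boot. */
  int32_t tv_sec  = 2147484;  /* stands in for a 32-bit time_t */
  int32_t tv_nsec = 0;

  int32_t broken = tv_sec * 1000 + tv_nsec / 1000000;          /* wraps */
  int64_t fixed  = (int64_t)tv_sec * 1000 + tv_nsec / 1000000; /* ok    */

  printf("32-bit math: %d ms\n", (int)broken);        /* negative value */
  printf("64-bit math: %lld ms\n", (long long)fixed); /* 2147484000 ms  */
  return 0;
}
{code}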

 

To verify that this was indeed the issue we had been encountering, some tests were performed. We modified the code so that the monotonic time had an offset of 2,146,883 seconds (10 minutes before the overflow point). After a reboot of the VM and waiting for 10 minutes, this consistently resulted in the issues mentioned in the ticket.
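
The offset was injected along these lines (a hypothetical sketch of the test shim, not the actual patch we used; the constant 2146883 is the offset mentioned above):
{code:c}
#include <time.h>
#include <stdint.h>

/* Pretend the machine has already been up for 2146883 s, so the 32-bit
   overflow point (2147483 s) is reached ~10 minutes after startup.
   The return expression deliberately keeps the broken 32-bit math. */
int64_t pn_proactor_now_64(void)
{
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  int32_t tv_sec = (int32_t)t.tv_sec + 2146883;  /* test offset */
  return tv_sec * 1000 + t.tv_nsec / 1000000;
}
{code}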

 After applying the following fix we performed the test again (with the offset applied).
{code:c}
int64_t pn_proactor_now_64(void)
{
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  /* Cast tv_sec to int64_t so the multiplication is performed in
     64-bit arithmetic instead of overflowing a 32-bit int. */
  return ((int64_t)t.tv_sec) * 1000 + t.tv_nsec / 1000000;
}
{code}
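
A single cast suffices: widening t.tv_sec to int64_t makes the multiplication (and, via the usual arithmetic conversions, the subsequent addition) 64-bit, while t.tv_nsec / 1000000 is at most 999 and cannot overflow on its own.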
 

After a reboot of the VM and waiting for 10 minutes, the system behaved normally and the crashes/issues mentioned in the ticket no longer occurred.

 I submitted a PR ([https://github.com/apache/qpid-proton/pull/327]) with the mentioned fix.



> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
>                 Key: PROTON-2411
>                 URL: https://issues.apache.org/jira/browse/PROTON-2411
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: proton-c
>    Affects Versions: proton-c-0.34.0
>            Reporter: Jaap Wiggelinkhuizen
>            Priority: Critical
>         Attachments: p2411_0.diff
>
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client software together with Qpid Dispatch Router 1.16.0. We updated to these versions not long ago; before that we used Proton 0.25.0 and Dispatch 1.3.0. Our application runs on several VMs with a router on each VM. All clients connect to the local router only, and the routers connect to each other in a hub-and-spoke pattern. In both the client configuration and the router configuration we have configured an idle timeout of 30 seconds.
> On July 4th we were confronted with an incident in production where many of our client processes reported problems regarding the idle timeouts. These client processes had already been running stably for more than 3 weeks. The problem appeared in two flavors:
>  # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout expired”
>  # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing error”
> On each VM at least 3 processes showed one of these problems within a total time window of less than a minute. So far we haven't found any cause in the underlying hardware, hypervisor, network or operating system.
> Although we don't know the root cause of the problems, we can resolve the first situation by using the proper reconnect settings (by mistake we handled on_transport_error() as a fatal situation; we will correct that so that only on_transport_close() is handled as fatal). The second situation is stranger, however, because it results in an abort within Proton itself. The comments in epoll_timer.c explain that this error occurs when a connection timer is moved backwards a second time. We don't understand how this can suddenly happen.
>  
> Last Sunday the problem occurred again at two more production sites where our software had been operational for just over 3 weeks. And again it happened on all VMs within a short timeframe. Interestingly, it has so far only occurred on Sunday mornings. Maybe it has something to do with how long the software has been running and the fact that on Sunday mornings there is less messaging traffic, i.e. more heartbeats?
>  
> Unfortunately we haven't been able to reproduce the issue at our test facilities and hence cannot provide a reproducer.


