You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@qpid.apache.org by "Radu Dan (Created) (JIRA)" <ji...@apache.org> on 2012/02/20 17:59:34 UTC
[jira] [Created] (QPID-3860) Python client enters busy loop on heartbeat timeout

Python client enters busy loop on heartbeat timeout
---------------------------------------------------

                 Key: QPID-3860
                 URL: https://issues.apache.org/jira/browse/QPID-3860
             Project: Qpid
          Issue Type: Bug
          Components: Python Client
    Affects Versions: 0.14
         Environment: Client: Linux radudan.activesoft.ro 3.2.3-2.fc16.x86_64 #1 SMP Fri Feb 3 20:08:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Server: Linux ophelia.activesoft.ro 2.6.32-220.2.1.el6.x86_64 #1 SMP Fri Dec 23 02:21:33 CST 2011 x86_64 x86_64 x86_64 GNU/Linux

            Reporter: Radu Dan
            Priority: Critical


Let's asume the following code, that's linked against and connected to qpid 0.14:

#!/bin/env python
import sys, time, logging
from qpid.messaging import *
logging.basicConfig(level=logging.INFO)

connection = Connection(
        host="ophelia.activesoft.ro",
        port=5672,
        heartbeat=1,
        reconnect=True,
        reconnect_interval_min=1,
        reconnect_interval_max=1,
        reconnect_limit=5,
        reconnect_timeout=5
)

connection.open()
session = connection.session()
sender = session.sender("test; {create: always}")

id = 0
while True:
        try:
                sender.send({
                        "hello": "world"
                })
                logging.info("Sent %d" % id)
                id += 1
        except KeyboardInterrupt:
                break
        except:
                logging.info("Error")
        finally:
                time.sleep(1)

The script will try to publish a message on the "test" queue to a broker, using a heartbeat value of 1s, and will reconnect at most 5 times with a timeout of 1s before giving up and reporting the message as failed. This works as expected when gracefully shutting the broker down (service qpidd stop):

<<daemon is running>>
INFO:root:Sent 0
INFO:root:Sent 1
INFO:root:Sent 2
INFO:root:Sent 3
INFO:root:Sent 4
<<daemon has been stopped>>
WARNING:qpid.messaging:recoverable error[attempt 0]: connection aborted
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
<<daemon is still stopped, yet a reconnect apparently succeeds>>
WARNING:qpid.messaging:reconnect succeeded: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 0]: [Errno 104] Connection reset by peer
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 1]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 2]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 3]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 4]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 5]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
<<gives up here after 5 attempts>>
INFO:root:Error
INFO:root:Error
INFO:root:Error
<<daemon has been started>>
INFO:root:Error
INFO:root:Error

I don't know how exactly it gets a reconnect succeeded right after the daemon is killed, but it seems to reconnect 5 times, afterwards it will fail with some exception. However, it makes no more attempts to establish a connection when this state is reached. Now I'm not entirely sure about the philosophy behind the API, but wouldn't it be more useful to try and reconnect if down whenever a new message is sent? Right now the default behaviour is to block until the connection is back up; Tweaking the reconnect parameters will yield a call that throws an exception after a while, but that invalidates the entire connection>session>sender chain

This behaviour can be duplicated by kill -9, because the kernel will shutdown and close the TCP sockets associated with any killed processes. However, an even greater problem arises when disconnecting the interface (I just pulled the ether cable out of my test box, but service network stop should probably do the same thing). This way, the socket never really closes because it's waiting for the other end to shutdown(), which may never come because the other end might have already given up on the connection (via a hard reset for example). Since most linux TCP stacks are EXTREMELY patient with regards to socket timeouts, it's up to the heartbeat mechanism to detect a connection failure and switch to the same state as a shutdown/close would.

However, what happens is something like this:

INFO:root:Sent 0
INFO:root:Sent 1
INFO:root:Sent 2
INFO:root:Sent 3
INFO:root:Sent 4
WARNING:qpid.messaging:recoverable error[attempt 0]: heartbeat timeout

At this point, the python process enters a busy loop of some sort from which it will never recover (plugging the ether cable back in doesn't seem to do anything). Switching logging to debug doesn't output anything past the heartbeat timeout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org