Posted to users@qpid.apache.org by Fraser Adams <fr...@blueyonder.co.uk> on 2014/10/04 19:40:18 UTC

Messenger doesn't seem to have any clean way to recover from errors.

Is there any way to recover from Messenger errors short of completely 
freeing the messenger instance and starting with a new one?


I've been deliberately making it fail, for example by starting a 
messenger with subscriptions like this:
amqp://~0.0.0.0,localhost:5672

With no broker running, the first subscription should succeed and the 
second one should fail.

In my case it's a bit more awkward because it's fully asynchronous, but 
what I see is that it creates a connection instance to localhost:5672, 
because in pn_connect there is this test:

   if (connect(sock, addr->ai_addr, addr->ai_addrlen) == -1) {
     if (errno != EINPROGRESS) {
       pn_i_error_from_errno(io->error, "connect");
       freeaddrinfo(addr);
       close(sock);
       return PN_INVALID_SOCKET;
     }
   }

With my connect on a non-blocking socket, errno is set to EINPROGRESS, 
so the socket is treated as valid, but the connection subsequently fails.
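
For anyone following along, here is a minimal sketch (plain Python sockets, not proton code) of why EINPROGRESS can't be treated as success: the real outcome of a non-blocking connect only shows up later, once the socket becomes writable, via SO_ERROR. The function names are made up for illustration, and it assumes POSIX-style behaviour (Windows reports the in-progress state and failures differently).

```python
import errno
import select
import socket


def start_connect(host, port):
    """Begin a non-blocking TCP connect; return (sock, immediate_errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    # EINPROGRESS only means "not finished yet", not "succeeded".
    if err not in (0, errno.EINPROGRESS):
        sock.close()
        return None, err
    return sock, 0


def finish_connect(sock, timeout=5.0):
    """Wait for the connect to resolve; return 0 on success, else an errno."""
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        return errno.ETIMEDOUT
    # The deferred result of connect() is reported through SO_ERROR.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)


def try_connect(host, port, timeout=5.0):
    """Return (sock, 0) on success or (None, errno) on failure."""
    sock, err = start_connect(host, port)
    if sock is None:
        return None, err
    err = finish_connect(sock, timeout)
    if err != 0:
        sock.close()  # release the descriptor of the failed connection
        return None, err
    return sock, 0
```

The point of try_connect is that the descriptor is closed as soon as the deferred failure is seen, which is the tidy-up step I can't find a clean way to trigger through Messenger.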


I've actually got a listener that can detect the Connection refused, but 
what I can't seem to do is to cleanly clear the connection object.

I've tried all sorts of hacks around 
pn_messenger_resolve/pni_messenger_reclaim. In that case 
pn_messenger_resolve found the connection object for the name 
"localhost:5672" OK. I then tried a pni_messenger_reclaim hack to clear 
it, but that didn't seem to close the underlying socket.

I also tried to find the relevant selectable (via 
pn_messenger_selectable) that matched the file descriptor of the failed 
connection, and then tried a pni_connection_finalize(sel) hack. In that 
case I seem to free up the connection and the underlying socket gets 
closed, but when I subsequently try to connect (to the working 
amqp://~0.0.0.0), although I get an accept on the right file descriptor, 
I then get an assertion failure:
assertion failed at messenger.c,151,pni_context at Error


So in short, given that a connection object gets created because of a 
connect on a non-blocking socket which subsequently and asynchronously 
fails, there doesn't seem to be any way to tidy up that failed 
connection.

To be clear, if I have the subscriptions
amqp://~0.0.0.0,localhost:5672

and ignore any errors and don't bother to try to tidy up, and I 
subsequently do a client connection to amqp://0.0.0.0, my client connects 
fine, but on the next file descriptor up from the one created by the 
failed localhost:5672 connection. So basically my failed subscription has 
leaked a connection. That is, the listen fd for amqp://~0.0.0.0 is 3, the 
(failed) fd for localhost:5672 is 4, and when I connect to 
amqp://~0.0.0.0 the accept fd is 5. It really should be 4, but I can't 
get shot of the connection object etc. for localhost:5672.
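
The fd numbering above follows from the POSIX rule that a new descriptor gets the lowest number not currently open. A small sketch of that reasoning, using plain Python sockets as stand-ins (nothing here is proton API, next_fd and fd_reuse_demo are made-up helpers, and it assumes a POSIX system where socket descriptors share the ordinary fd table):

```python
import os
import socket


def next_fd():
    """Report which descriptor number the OS would hand out next."""
    fd = os.open(os.devnull, os.O_RDONLY)
    os.close(fd)
    return fd


def fd_reuse_demo():
    """Return (failed_fd, next fd while leaked, next fd after close)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Stand-in for the outgoing connection that later fails.
    failed = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    failed_fd = failed.fileno()
    # While the failed socket stays open, new descriptors land above it.
    while_leaked = next_fd()
    # Once it is closed, its number is the lowest free one again.
    failed.close()
    after_close = next_fd()
    listener.close()
    return failed_fd, while_leaked, after_close
```

So an accept landing one descriptor above the failed connection is a fair sign that the failed socket was never closed.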

The only way to deal with it seems to be to free and create a new 
messenger when anything fails, which is a pain because the subscription 
amqp://~0.0.0.0 is actually fine.


TBH messenger's error handling is driving me nuts. It has been mentioned 
in a few threads that it might be better to give up on messenger and 
just use engine.

Is messenger really irredeemably broken? Without decent error 
handling/recovery it's very little use in a production environment.

Frase





---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org


Re: Messenger doesn't seem to have any clean way to recover from errors.

Posted by Fraser Adams <fr...@blueyonder.co.uk>.
On 14/10/14 15:05, Marcel Meulemans wrote:
> I think the main problem here is not in the messenger but a problem in
> the protocol engine I discovered a while ago:
> https://issues.apache.org/jira/browse/PROTON-644?filter=-2 ... there
> is no way for the messenger to recover or handle the error because it
> can not see the connection fail.
>
Hi Marcel,
I *think* that technically proton-c/src/posix is part of messenger. I 
thought that the main Engine stuff was agnostic about the actual IO 
stuff (certainly, looking at the examples here: 
https://github.com/rhs/qpid-proton-demo from the thread "Proton 
tutorial: synchronous request-response", I can see low-level socket code 
in the common.py file).


That's by the by though. I had a quick look at what you were suggesting, 
so I may have misunderstood, but it gives me a few concerns (possibly 
parochially :-))

So, full disclosure and all that, I'm doing something a little bit weird 
in that I'm writing JavaScript bindings that use a tool called 
emscripten to compile proton-c to JavaScript and then I have binding 
code wrapping that to make it idiomatically JavaScript. The main thing 
you have to understand with that is that the ability of Messenger to 
behave non-blocking/asynchronously is very much my friend. If you know 
JavaScript you'll know that it's fundamentally asynchronous.

So your suggestion "This fix make the pn_connect function block by using 
select to check the result of connect with a timeout (currently 10 sec). 
Another option would be to be to set the socket to non blocking after 
the connect, but this can can cause connect to block for minutes." 
rather makes me shudder, hopefully you can see why :-)


As it happens I've been meaning to follow up on my original post as I am 
now making a fair bit of headway on my quest for better error handling 
(imagine how much fun it is when *everything* is asynchronous).

I think that my problem (and I suspect yours and everyone else's) turns 
out to be trusting the internal Messenger event loop too much.

What I have ended up doing is to make use of the "passive" feature 
(pn_messenger_set_passive), which allows you to make use of a few API 
calls: pn_messenger_selectable, pn_selectable_readable, 
pn_selectable_capacity, pn_selectable_writable, pn_selectable_pending, 
pn_selectable_is_terminal, pn_selectable_fd and pn_selectable_free.

There's not a huge amount by way of examples for this API, but class 
Pump in <proton>/tests/python/proton_tests/messenger.py (around line 
941) pretty much illustrates what to do.

In messenger.c pni_wait and pn_messenger_process are the important bits 
(those get bypassed when you do pn_messenger_set_passive)

What ends up being useful is that around the bit (in python)

     count = 0
     for s in readable:
       s.readable()
       count += 1
     for s in writable:
       s.writable()
       count += 1
     return count

or similarly

     if (events & PN_READABLE) {
       pn_selectable_readable(sel);
     }
     if (events & PN_WRITABLE) {
       pn_selectable_writable(sel);
       doMessengerTick = false;
     }
     if (events & PN_EXPIRED) {
       pn_selectable_expired(sel);
     }

you can check for errors relating to particular file descriptors around 
the pn_selectable_readable and pn_selectable_writable calls (the 
selectable passed in has an associated file descriptor).

What I mean is that pn_selectable_readable ends up calling 
pni_connection_readable, so you should be able to error check around there.
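
A rough sketch of that pattern: drive the descriptors yourself and check for an error straight after each callback, so a failed descriptor can be closed and dropped without touching the healthy ones. SocketSelectable and pump_once are hypothetical stand-ins built on plain Python sockets, not the proton selectable API, and only the read side is shown (the writable side would be checked the same way).

```python
import select
import socket


class SocketSelectable:
    """Hypothetical stand-in for a messenger selectable (not proton API)."""

    def __init__(self, sock):
        self.sock = sock
        self.error = None      # set by a callback when the descriptor fails
        self.received = b""

    def fileno(self):
        return self.sock.fileno()

    def readable(self):
        try:
            data = self.sock.recv(4096)
        except OSError as exc:
            self.error = exc
            return
        if data:
            self.received += data
        else:
            self.error = ConnectionError("peer closed the connection")


def pump_once(selectables, timeout=1.0):
    """Service ready selectables once; return (count, failed)."""
    readable, _, _ = select.select(selectables, [], [], timeout)
    count, failed = 0, []
    for s in readable:
        s.readable()
        count += 1
        if s.error is not None:   # error check right after the callback
            failed.append(s)
    for s in failed:
        s.sock.close()            # tidy up the file descriptor
        selectables.remove(s)     # and stop servicing it
    return count, failed
```

Because the caller owns the loop, one failed descriptor gets cleaned up on its own and the other selectables carry on being serviced.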

To be fair, the current error handling in pn_error_report leaves a lot to 
be desired (it currently does an fprintf!!!!), but Dominic Evans has put 
a patch together (https://issues.apache.org/jira/browse/PROTON-571). 
Unfortunately that hasn't been pulled through yet, but I believe this 
approach is the way forward.

I'm quite fortunate because in my JavaScript binding I can monkey 
patch, so I intercepted fprintf and set an error variable that I can 
check after my call to pn_selectable_readable. That should become 
redundant in due course, once Dominic's patch gets pulled through.

So by using the selectable API I'm now able to trap connection and bind 
errors etc. and recover from them, including tidying up file 
descriptors. I'm still trying a few things out (AKA tinkering about :-)) 
but this approach seems to be the best way forward for getting access to 
low-level connection/transport type detail in order to trap and recover 
from errors.

In an ideal world it might be nice if the internal event loop handled 
this a bit more elegantly but now that I've figured out the selectable 
API I quite like having my own control over this stuff.


Hope this helps? I'd be interested in the views of anyone else who has 
been interested in this thread.

In case you are curious I've attached the EventDispatch stuff from the 
JavaScript binding, it might not make total sense in isolation, but 
hopefully you can see the parallels with the stuff in 
<proton>/tests/python/proton_tests/messenger.py (and my cheeky fprintf 
;-> that kind of amuses me TBH, though it'll be nice when the underlying 
code is fixed).

Cheers,
Frase




Re: Messenger doesn't seem to have any clean way to recover from errors.

Posted by Marcel Meulemans <m....@tkhinnovations.com>.
I think the main problem here is not in the messenger but a problem in
the protocol engine I discovered a while ago:
https://issues.apache.org/jira/browse/PROTON-644?filter=-2 ... there
is no way for the messenger to recover or handle the error because it
can not see the connection fail.

-- 
Marcel

