Posted to dev@qpid.apache.org by "Charles E. Rolke (Jira)" <ji...@apache.org> on 2021/04/26 16:06:00 UTC

[jira] [Created] (DISPATCH-2081) Fallback test fail - router not detecting drained link

Charles E. Rolke created DISPATCH-2081:
------------------------------------------

             Summary: Fallback test fail - router not detecting drained link
                 Key: DISPATCH-2081
                 URL: https://issues.apache.org/jira/browse/DISPATCH-2081
             Project: Qpid Dispatch
          Issue Type: Bug
          Components: Routing Engine
    Affects Versions: 1.15.0
         Environment: h4.  
            Reporter: Charles E. Rolke


h3. History

The fallback dest tests, particularly the SwitchoverTest subclasses, have a long history of persistent, intermittent failures. See DISPATCH-1361 and DISPATCH-1786. CI tests running on Ubuntu xenial fail more frequently than on any other platform.
h3. Recreating the failure

The only way to get any clue at all is to get access to the router logs after a test failure. On the CI systems this is not an option.

A reproducer was created that usually fails before 1000 switchover tests have run. It is an Ubuntu xenial docker image that is run with *--cpus=0.8*. The CPU limit piles slow upon slow until the internal scheduling lines up just right for the failure. Loop on *ctest -VV -R fallback_dest* and, after the test finally fails, get the log files out of the docker image.
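
The loop can be scripted. The following is only a sketch of one way to do it, not the script that was actually used: the helper name {{run_until_failure}} and the run limit are made up for illustration, and the command is passed in by the caller.

```python
import subprocess


def run_until_failure(cmd, max_runs=1000):
    """Run cmd repeatedly.

    Returns the 1-based number of the first run that exits non-zero,
    or None if all max_runs runs pass.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run
    return None
```

Inside the CPU-limited container, from the build directory, it would be called as {{run_until_failure(["ctest", "-VV", "-R", "fallback_dest"])}}. It stops at the first failing run so that the router logs from that run are still on disk.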
h3. Analyzing the logs
h4. Get the Scraper web page

Run the command:

{{  scraper -f I*.log E*.log > fallback_dest.html}}

Then view the resulting web page.
h4.  Navigating the web page

Nice web page. Now what? The tests are designed to help you a little here. The failing case was test_35. This test uses router address *dest.35* for link sources and targets, which makes the test pretty easy to isolate in the more than 1,000,000 lines of web page. The address first appears in lists of addresses and then is used for real in an attach launched by the self test.
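
As a rough cross-check outside the web page, the raw log files can be searched for the same address. This is a sketch only: {{lines_mentioning}} is a hypothetical helper, it assumes nothing about the log format beyond plain text, and it does a simple substring match (so *dest.35* would also match a longer address that starts the same way).

```python
import glob


def lines_mentioning(address, patterns):
    """Yield (path, line_number, line) for each log line containing address.

    patterns is a list of glob patterns such as ["I*.log", "E*.log"].
    """
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            # errors="replace" keeps going if a log has undecodable bytes
            with open(path, errors="replace") as log:
                for number, line in enumerate(log, start=1):
                    if address in line:
                        yield path, number, line.rstrip("\n")
```

For example, {{list(lines_mentioning("dest.35", ["I*.log", "E*.log"]))}} covers the same files that were given to scraper.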
h3. What happened?
 * This test sets up a sender to INTA, a primary receiver to INTB and a fallback receiver in EA1.
 * Surprisingly, the fallback receiver connects before the primary receiver despite the order in the test source code. Not to worry.
 * Then the test sends 300 messages that are received and accepted by the primary receiver.
 * The primary receiver closes.
 * The sender starts sending 300 messages to the fallback receiver.
 * These messages go into INTA and get forwarded to INTB. INTB has no destination for them so they are released.
 * When the sender gets the released status it sends more.
 * Pretty soon the sender has sent 1,700 messages.
 * Somewhere along the way INTB deletes address M0dest.35.
 * Eventually router INTA sends a DRAIN to the sender.
 * The test sender sends enough messages to consume the remaining credit.
 * Then all message traffic stops.
 * The test sits there for a minute and then times out.

h3. What went wrong

It looks like the router started a drain cycle with the sender but the sender never sent a FLOW back with drain=true.

Proton python does not spontaneously send flow with drain=true. It is up to the application, in this case the fallback_dest self test code, to do that. Furthermore, if the application has consumed all the credit then proton will not send a flow with drain=true even if sender.drained() is called. Proton python sends the flow only if the drained function consumed any credits outside of message flow.

If the router is waiting for that flow then, with this test setup, it will never come.
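
The sender-side behavior described above can be illustrated with a toy model. This is not proton's implementation; {{ToySender}} and its fields are invented for illustration and only mimic the rule that a flow with drain=true goes out when drained() actually gives back unused credit.

```python
class ToySender:
    """Simplified model of the sender-side credit rule (not proton code)."""

    def __init__(self):
        self.credit = 0
        self.drain = False
        self.flows_sent = []  # FLOW frames this sender has emitted

    def on_flow(self, credit, drain):
        """The peer (the router) granted credit, possibly in drain mode."""
        self.credit = credit
        self.drain = drain

    def send(self):
        """Send one message if credit allows; return True if it was sent."""
        if self.credit == 0:
            return False
        self.credit -= 1
        return True

    def drained(self):
        """Give back unused credit; return how many credits were given back.

        A flow with drain=true is emitted only if there was credit left.
        """
        if not self.drain or self.credit == 0:
            return 0
        returned = self.credit
        self.credit = 0
        self.flows_sent.append({"drain": True})
        return returned


sender = ToySender()
sender.on_flow(credit=5, drain=True)
while sender.send():
    pass                      # the test consumes all remaining credit
print(sender.drained())       # 0
print(sender.flows_sent)      # []
```

In this model the sender that spends every credit on messages never emits the flow, which is the stall seen in test_35. Had it stopped with credit left over and then called drained(), one flow with drain=true would have been recorded.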

Note: Knowing now that the issue is drain related, the web page helps find the drain. In the Table of Contents click on the link for Noteworthy Log Lines. There was one 'Flow with drain set' entry. Clicking on the lozenge shows the line number link. Clicking on that link takes you to the flow performative for the router issuing it.
h3. What's the fix?
 # Track client credits and, when the credit drops to zero, let that satisfy the ongoing drain cycle. Do this even without receiving the flow with drain=true.
 # Don't send a drain to begin with. Come up with another way of dealing with the client's stream of messages internally that does not involve a drain.
 # The test client could be gimmicked to detect when it has consumed all but one credit. Then it could call drained() so proton python could consume the last credit via a drain cycle and send the AMQP flow with drain=true. This may get the test to pass but it won't help people in the real world who use the proton python client.
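
The first option can be sketched as per-link bookkeeping. This is a toy model under assumed names ({{ToyDrainTracker}} and its methods are hypothetical), not router code: it only shows the drain cycle ending either on a flow with drain=true or when transfers use up the outstanding credit.

```python
class ToyDrainTracker:
    """Hypothetical router-side bookkeeping for one inbound link."""

    def __init__(self):
        self.credit = 0        # credit issued to the sender and not yet used
        self.draining = False  # True while a drain cycle is in progress

    def issue_credit(self, credit):
        self.credit += credit

    def start_drain(self):
        # With no credit outstanding there is nothing to drain.
        self.draining = self.credit > 0

    def on_transfer(self):
        """A message arrived and consumed one credit."""
        if self.credit > 0:
            self.credit -= 1
        self._maybe_finish()

    def on_flow(self, drain):
        """A flow arrived; drain=true means the sender gave back its credit."""
        if drain:
            self.credit = 0
        self._maybe_finish()

    def _maybe_finish(self):
        # Zero credit satisfies the drain, with or without a flow.
        if self.draining and self.credit == 0:
            self.draining = False
```

With this bookkeeping, a sender that spends its last credits on messages ends the drain cycle on the final transfer, so the router is not left waiting for a flow that proton python will not send.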

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
