Posted to dev@tomcat.apache.org by bu...@apache.org on 2008/06/23 23:13:02 UTC

DO NOT REPLY [Bug 45261] New: Concurrent node failure leads to inconsistent views.

https://issues.apache.org/bugzilla/show_bug.cgi?id=45261

           Summary: Concurrent node failure leads to inconsistent views.
           Product: Tomcat 6
           Version: 6.0.16
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Cluster
        AssignedTo: tomcat-dev@jakarta.apache.org
        ReportedBy: robert.newson@gmail.com


Created an attachment (id=22166)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=22166)
Demonstrate view inconsistency.

In a four node cluster, using NonBlockingCoordinator, if two nodes fail at the
same time, the remaining two nodes get different views and never converge.

When the other nodes restart, they never install a view at all.

I've attached the relevant demo code. Run it on 4 machines, wait for view
installation, then CTRL-C two of them. The other two will never print the same
UniqueId. If a new node is then started, its view is always null.

Immediately after the two node failure, one of the surviving nodes logs this
stack trace:

WARN - Member send is failing for:tcp://{-64, -88, -91, 34}:4000 ; Setting to suspect and retrying.
ERROR - Error processing coordination message. Could be fatal.
org.apache.catalina.tribes.ChannelException: Send failed, attempt:2 max:1; Faulty members:tcp://{-64, -88, -91, 34}:4000;
        at org.apache.catalina.tribes.transport.nio.ParallelNioSender.doLoop(ParallelNioSender.java:172)
        at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:78)
        at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(PooledParallelSender.java:53)
        at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(ReplicationTransmitter.java:80)
        at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoordinator.java:78)
        at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
        at org.apache.catalina.tribes.group.interceptors.NonBlockingCoordinator.handleMyToken(NonBlockingCoordina


-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


DO NOT REPLY [Bug 45261] Concurrent node failure leads to inconsistent views.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45261





--- Comment #1 from Robert Newson <ro...@gmail.com>  2008-06-25 13:50:40 PST ---

So, I understand this better now and have a proposed fix.

Here's the procedure to reproduce the problem.

1) start four nodes.
2) see a view installation with four members.
3) kill two non-coordinator nodes in quick succession (a second or two)

From this point onwards, until it is killed, the coordinator oscillates
between two states. It recognizes that the state is inconsistent because it
receives heartbeats from the other node, and the UniqueId of that node's view
does not match the coordinator's. It then forces an election, which fails
because the coordinator believes an election is already running. This cycle
repeats forever.

When the first node crashes, memberDisappeared() is called on the coordinator,
which then starts sending messages as part of an election. One of these sends
throws a ChannelException with a connection timeout (it was attempting to send
to the second node, which had just crashed). The coordinator never handles this
case, leaving the 'election in progress' flag set forever.

Clearing suggestedviewId when the ChannelException is thrown is my proposed fix:

@@ -500,6 +500,7 @@ public class NonBlockingCoordinator extends ChannelInterceptorBase {
                 processCoordMessage(cmsg, msg.getAddress());
             }catch ( ChannelException x ) {
                 log.error("Error processing coordination message. Could be fatal.",x);
+                suggestedviewId = null;
             }

This should probably only be done under some circumstances, so it isn't
obviously a safe patch. Hopefully the author will have a better fix!
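
To illustrate the failure mode and the effect of the patch outside Tomcat,
here is a minimal, self-contained sketch. It is not the Tribes code:
ToyCoordinator, startElection(), sendToken() and the clearOnFailure switch are
hypothetical names, and the election is modelled as a single synchronous send.
Only the idea of a non-null suggestedViewId meaning "election in progress" is
taken from the description above.

```java
import java.io.IOException;

public class ElectionFlagSketch {

    /** Hypothetical stand-in for the coordinator; NOT the Tribes class. */
    static class ToyCoordinator {
        private final boolean clearOnFailure;
        // Non-null means "an election is in progress" (assumption modelled
        // on the suggestedviewId field touched by the patch).
        private String suggestedViewId;
        private int nextId = 0;

        ToyCoordinator(boolean clearOnFailure) {
            this.clearOnFailure = clearOnFailure;
        }

        /**
         * Returns true if an election was started, false if it was refused
         * because one is believed to be running already.
         */
        boolean startElection(boolean peerReachable) {
            if (suggestedViewId != null) {
                return false; // refuse: an election is already "running"
            }
            suggestedViewId = "view-" + (++nextId);
            try {
                sendToken(peerReachable);
                suggestedViewId = null; // election completed normally
            } catch (IOException x) {
                // Unpatched behaviour: log and carry on, marker stays set.
                // Patched behaviour: clear the marker so a retry is possible.
                if (clearOnFailure) {
                    suggestedViewId = null;
                }
            }
            return true;
        }

        boolean electionInProgress() {
            return suggestedViewId != null;
        }

        // Simulates a send to a peer that may have just crashed.
        private void sendToken(boolean peerReachable) throws IOException {
            if (!peerReachable) {
                throw new IOException("send failed: peer is down");
            }
        }
    }

    public static void main(String[] args) {
        ToyCoordinator unpatched = new ToyCoordinator(false);
        unpatched.startElection(false); // send fails, marker is left set
        System.out.println("unpatched retry accepted: " + unpatched.startElection(true));

        ToyCoordinator patched = new ToyCoordinator(true);
        patched.startElection(false);   // send fails, marker is cleared
        System.out.println("patched retry accepted: " + patched.startElection(true));
    }
}
```

In this toy model the unpatched coordinator refuses every later election once
a send has failed, which matches the endless cycle described above, while the
patched one accepts the retry. Whether unconditional clearing is safe in the
real interceptor is exactly the open question.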




DO NOT REPLY [Bug 45261] Concurrent node failure leads to inconsistent views.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45261


Filip Hanik <fh...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|tomcat-                     |fhanik@apache.org
                   |dev@jakarta.apache.org      |




--- Comment #2 from Filip Hanik <fh...@apache.org>  2008-06-26 08:05:07 PST ---
Hi Rob,
the non-blocking coordinator is still a work in progress. It's one piece of code
that got a bit overcomplicated once I started developing it, and I think it
can be greatly simplified.

I will take a look at this at the beginning of next week.

Filip

