Posted to user@geronimo.apache.org by Trygve Hardersen <tr...@jotta.no> on 2009/10/30 17:39:34 UTC

2.2 in production

Hello

We have been using Geronimo 2.2-SNAPSHOT in production for a good month now,
and I thought I'd share some experiences with the community, and maybe get
some help. We are an online backup service; check out jottabackup.com if
you're interested.

Generally our experience has been very positive. We're using the GBean
framework for custom server components, DB connection pools against MySQL
databases, stateless service EJBs and various MDBs, and of course the
web tier (Jetty). Everything is running smoothly, and we're very much looking
forward to 2.2 being released so we can "release" our own software.
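
For context, our custom server components are plain GBeans along these lines.
This is a simplified sketch from memory rather than our actual code, and the
class and attribute names are purely illustrative:

import org.apache.geronimo.gbean.GBeanInfo;
import org.apache.geronimo.gbean.GBeanInfoBuilder;
import org.apache.geronimo.gbean.GBeanLifecycle;

// Minimal illustrative GBean: the kernel drives it through the
// GBeanLifecycle callbacks, and the static GBeanInfo describes its
// attributes and constructor to the deployer.
public class HeartbeatService implements GBeanLifecycle {

    private final String nodeName;

    public HeartbeatService(String nodeName) {
        this.nodeName = nodeName;
    }

    public void doStart() throws Exception {
        // start background work for this node here
    }

    public void doStop() throws Exception {
        // release resources here
    }

    public void doFail() {
        // best-effort cleanup if the kernel fails this GBean
    }

    public static final GBeanInfo GBEAN_INFO;

    static {
        GBeanInfoBuilder builder = GBeanInfoBuilder.createStatic("HeartbeatService", HeartbeatService.class);
        builder.addAttribute("nodeName", String.class, true);
        builder.setConstructor(new String[] {"nodeName"});
        GBEAN_INFO = builder.getBeanInfo();
    }

    public static GBeanInfo getGBeanInfo() {
        return GBEAN_INFO;
    }
}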

The issues we're having are related to WADI clustering with Jetty. First, we
can't use Jetty7 because of GERONIMO-4846, so we're using Jetty6, which works
fine. The more serious issue is that we often cannot update our servers
without downtime. This is what happens:

We have 2 application servers (AS-000 and AS-001) running dynamic WADI HTTP
session replication between them. When updating, we first stop one, AS-000 in
this case. That works fine, and the active sessions are migrated to AS-001:

23:43:18,160 INFO  [SimpleStateManager]
=============================
New Partition Balancing
Partition Balancing
    Size [24]
    Partition[0] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[1] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[2] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[3] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[4] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[5] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[6] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[7] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[8] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[9] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[10] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[11] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[12] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[13] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[14] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[15] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[16] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[17] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[18] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[19] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[20] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[21] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[22] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
    Partition[23] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
=============================


23:43:28,539 INFO  [TcpFailureDetector] Verification complete. Member
disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0,
10, 100}:4000,{10, 0, 10, 100},4000, alive=41104531,id={-4 -32 54 90 -109
-17 65 64 -117 40 -110 -14 36 93 -12 -118 }, payload={-84 -19 0 5 115 114 0
50 111 ...(423)}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79
84 84 65 95 87 65 68 ...(10)}, ]]

23:43:28,540 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{10, 0,
10, 100}:4000


We then update AS-000 and try to start it, but it fails to rejoin the WADI
cluster:


23:46:30,784 INFO  [ReceiverBase] Receiver Server Socket bound to:/
10.0.10.100:4000

23:46:30,864 INFO  [ChannelInterceptorBase] memberStart
local:org.apache.catalina.tribes.membership.MemberImpl[tcp://
10.0.10.100:4000,10.0.10.100,4000, alive=0,id={-103 34 80 -19 68 -51 70 -91
-108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50 111
...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
notify:false peer:AS-000

23:46:30,868 INFO  [McastService] Setting cluster mcast soTimeout to 500

23:46:30,908 INFO  [McastService] Sleeping for 1000 milliseconds to
establish cluster membership, start level:4

23:46:31,139 INFO  [ChannelInterceptorBase] memberAdded:tcp://{10, 0, 10,
101}:4000

23:46:31,908 INFO  [McastService] Done sleeping, membership established,
start level:4

23:46:31,912 INFO  [McastService] Sleeping for 1000 milliseconds to
establish cluster membership, start level:8

23:46:31,927 INFO  [BufferPool] Created a buffer pool with max
size:104857600 bytes of type:org.apache.catalina.tribes.io.BufferPool15Impl

23:46:32,912 INFO  [McastService] Done sleeping, membership established,
start level:8

23:46:32,912 INFO  [ChannelInterceptorBase] memberStart
local:org.apache.catalina.tribes.membership.MemberImpl[tcp://
10.0.10.100:4000,10.0.10.100,4000, alive=272,id={-103 34 80 -19 68 -51 70
-91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50 111
...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
notify:false peer:AS-000

23:46:37,848 INFO  [DiscStore] Creating directory:
/usr/lib/jotta/jotta-as-prod-1.0-SNAPSHOT/var/temp/SessionStore

23:46:37,930 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-000; tcp://
10.0.10.100:4000]] owns singleton service [PartitionManager for ServiceSpace
[/]]

23:46:37,964 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-000; tcp://
10.0.10.100:4000]] resigns ownership of singleton service [PartitionManager
for ServiceSpace [/]]

23:47:40,065 ERROR [BasicServiceRegistry] Error while starting [Holder for
service
[org.codehaus.wadi.location.partitionmanager.SimplePartitionManager@7dc2445f]
named [PartitionManager] in space [ServiceSpace [/]]]

org.codehaus.wadi.location.partitionmanager.PartitionManagerException:
Partition [0] is unknown.

at
org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.waitForBoot(SimplePartitionManager.java:248)

at
org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.start(SimplePartitionManager.java:119)

at
org.codehaus.wadi.servicespace.basic.BasicServiceHolder.start(BasicServiceHolder.java:60)

at
org.codehaus.wadi.servicespace.basic.BasicServiceRegistry.start(BasicServiceRegistry.java:152)

at
org.codehaus.wadi.servicespace.basic.BasicServiceSpace.start(BasicServiceSpace.java:169)

at
org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doStart(BasicWADISessionManager.java:125)

at
org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance(GBeanInstance.java:948)

at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(GBeanInstanceState.java:269)

at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.start(GBeanInstanceState.java:103)

at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive(GBeanInstanceState.java:125)

at
org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive(GBeanInstance.java:538)

at
org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean(BasicKernel.java:377)

at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:456)

at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:493)

at
org.apache.geronimo.kernel.config.KernelConfigurationManager.start(KernelConfigurationManager.java:190)

at
org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConfiguration(SimpleConfigurationManager.java:546)

at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)

at
org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke(ReflectionMethodInvoker.java:34)

at
org.apache.geronimo.gbean.runtime.GBeanOperation.invoke(GBeanOperation.java:130)

at
org.apache.geronimo.gbean.runtime.GBeanInstance.invoke(GBeanInstance.java:815)

at org.apache.geronimo.gbean.runtime.RawInvoker.invoke(RawInvoker.java:57)

at
org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke(RawOperationInvoker.java:35)

at
org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept(ProxyMethodInterceptor.java:96)

at
org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$$628b9237.startConfiguration(<generated>)

at
org.apache.geronimo.system.main.EmbeddedDaemon.doStartup(EmbeddedDaemon.java:161)

at
org.apache.geronimo.system.main.EmbeddedDaemon.execute(EmbeddedDaemon.java:78)

at
org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main(MainConfigurationBootstrapper.java:45)

at org.apache.geronimo.cli.AbstractCLI.executeMain(AbstractCLI.java:65)

at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)

23:47:40,078 ERROR [BasicWADISessionManager] Failed to stop

org.codehaus.wadi.servicespace.ServiceSpaceNotFoundException:
ServiceSpaceName not found

at
org.codehaus.wadi.servicespace.basic.BasicServiceSpaceRegistry.unregister(BasicServiceSpaceRegistry.java:55)

at
org.codehaus.wadi.servicespace.basic.BasicServiceSpace.unregisterServiceSpace(BasicServiceSpace.java:228)

at
org.codehaus.wadi.servicespace.basic.BasicServiceSpace.stop(BasicServiceSpace.java:175)

at
org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doFail(BasicWADISessionManager.java:134)

at
org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance(GBeanInstance.java:978)

at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(GBeanInstanceState.java:269)

at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.start(GBeanInstanceState.java:103)

at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive(GBeanInstanceState.java:125)

at
org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive(GBeanInstance.java:538)

at
org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean(BasicKernel.java:377)

at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:456)

at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:493)

at
org.apache.geronimo.kernel.config.KernelConfigurationManager.start(KernelConfigurationManager.java:190)

at
org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConfiguration(SimpleConfigurationManager.java:546)

at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)

at
org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke(ReflectionMethodInvoker.java:34)

at
org.apache.geronimo.gbean.runtime.GBeanOperation.invoke(GBeanOperation.java:130)

at
org.apache.geronimo.gbean.runtime.GBeanInstance.invoke(GBeanInstance.java:815)

at org.apache.geronimo.gbean.runtime.RawInvoker.invoke(RawInvoker.java:57)

at
org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke(RawOperationInvoker.java:35)

at
org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept(ProxyMethodInterceptor.java:96)

at
org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$$628b9237.startConfiguration(<generated>)

at
org.apache.geronimo.system.main.EmbeddedDaemon.doStartup(EmbeddedDaemon.java:161)

at
org.apache.geronimo.system.main.EmbeddedDaemon.execute(EmbeddedDaemon.java:78)

at
org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main(MainConfigurationBootstrapper.java:45)

at org.apache.geronimo.cli.AbstractCLI.executeMain(AbstractCLI.java:65)

at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)


After this failure the server stops. Over at the running instance AS-001
this is logged:


23:46:31,909 INFO  [ChannelInterceptorBase] memberAdded:tcp://{10, 0, 10,
100}:4000

23:46:37,929 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-001; tcp://
10.0.10.101:4000]] owns singleton service [PartitionManager for ServiceSpace
[/]]

23:46:37,929 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:46:38,438 ERROR [BasicEnvelopeDispatcherManager] problem dispatching
message

java.lang.IllegalArgumentException:
org.codehaus.wadi.core.store.BasicStoreMotable is not a Session

at
org.codehaus.wadi.replication.manager.basic.SessionStateHandler.newExtractFullStateExternalizable(SessionStateHandler.java:105)

at
org.codehaus.wadi.replication.manager.basic.SessionStateHandler.extractFullState(SessionStateHandler.java:53)

at
org.codehaus.wadi.replication.manager.basic.CreateStorageCommand.execute(CreateStorageCommand.java:45)

at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:169)

at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:114)

at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:103)

at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondariesFollowingJoiningPeer(SyncSecondaryManager.java:75)

at
org.codehaus.wadi.replication.manager.basic.ReOrganizeSecondariesListener.receive(ReOrganizeSecondariesListener.java:53)

at
org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.notifyListeners(BasicServiceMonitor.java:124)

at
org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.processLifecycleEvent(BasicServiceMonitor.java:141)

at
org.codehaus.wadi.servicespace.basic.BasicServiceMonitor$ServiceLifecycleEndpoint.dispatch(BasicServiceMonitor.java:148)

at
org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch(ServiceEndpointWrapper.java:50)

at
org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)

at
org.codehaus.wadi.servicespace.basic.BasicServiceSpaceDispatcher$ExecuteInThread.execute(BasicServiceSpaceDispatcher.java:102)

at
org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager.onEnvelope(BasicEnvelopeDispatcherManager.java:100)

at
org.codehaus.wadi.group.impl.AbstractDispatcher.doOnEnvelope(AbstractDispatcher.java:104)

at
org.codehaus.wadi.group.impl.AbstractDispatcher.onEnvelope(AbstractDispatcher.java:100)

at
org.codehaus.wadi.servicespace.basic.ServiceSpaceEndpoint.dispatch(ServiceSpaceEndpoint.java:49)

at
org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch(ServiceEndpointWrapper.java:50)

at
org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)

at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

23:46:40,063 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:46:42,938 WARN  [BasicPartitionBalancerSingletonService] Rebalancing has
failed

org.codehaus.wadi.group.MessageExchangeException: No correlated messages
received within [5000]ms

at
org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)

at
org.codehaus.wadi.location.balancing.BasicPartitionBalancer.fetchBalancingInfoState(BasicPartitionBalancer.java:85)

at
org.codehaus.wadi.location.balancing.BasicPartitionBalancer.balancePartitions(BasicPartitionBalancer.java:69)

at
org.codehaus.wadi.location.balancing.BasicPartitionBalancerSingletonService.run(BasicPartitionBalancerSingletonService.java:85)

at java.lang.Thread.run(Thread.java:619)

23:46:42,939 WARN  [BasicPartitionBalancerSingletonService] Will retry
rebalancing in [500] ms

23:46:43,439 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:47:40,269 INFO  [TcpFailureDetector] Verification complete. Member
disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0,
10, 100}:4000,{10, 0, 10, 100},4000, alive=69401,id={-103 34 80 -19 68 -51
70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50
111 ...(423)}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79 84
84 65 95 87 65 68 ...(10)}, ]]

23:47:40,271 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{10, 0,
10, 100}:4000

23:47:40,271 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing


If I try to start AS-000 again, the same thing happens. If we stop AS-001,
the following is logged:


23:49:18,695 INFO  [SimpleStateManager] Evacuating partitions

23:49:18,699 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:49:23,698 WARN  [SimpleStateManager] Partition balancer has disappeared -
backing off for [1000]ms

23:49:24,699 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:49:29,698 WARN  [SimpleStateManager] Partition balancer has disappeared -
backing off for [1000]ms

23:49:30,699 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:49:35,699 WARN  [SimpleStateManager] Partition balancer has disappeared -
backing off for [1000]ms

23:49:36,700 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:49:41,699 WARN  [SimpleStateManager] Partition balancer has disappeared -
backing off for [1000]ms

23:49:42,701 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing

23:49:47,700 WARN  [SimpleStateManager] Partition balancer has disappeared -
backing off for [1000]ms

23:49:48,700 INFO  [SimpleStateManager] Evacuated

23:49:48,808 INFO  [AbstractExclusiveContextualiser] Unloaded sessions=[36]

23:49:48,843 INFO  [AbstractExclusiveContextualiser] Unloaded sessions=[13]

23:49:58,852 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-001; tcp://
10.0.10.101:4000]] resigns ownership of singleton service [PartitionManager
for ServiceSpace [/]]


However, AS-001 then just hangs, and we have to kill the process to get it
stopped. After this we can start AS-000, update AS-001, and it always seems
to rejoin the cluster without problems thereafter. The strange thing is that
this problem does not always occur; sometimes everything goes fine. I can't
find a consistent pattern. I've tried stopping AS-001 before AS-000, and I'm
sure no serializable object in the session has changed between the updated
and the running instance.
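
For what it's worth, the kind of session attribute we're talking about is a
plain Serializable class along these lines (an illustrative sketch, not our
actual code). The explicit serialVersionUID avoids an InvalidClassException
from a recomputed default UID, so the updated and the running node can
exchange the same serialized form as long as the fields are unchanged:

import java.io.Serializable;

// Illustrative session attribute. Pinning serialVersionUID keeps the
// serialized form compatible between the updated and the running
// instance as long as the fields themselves have not changed.
public class UserSessionState implements Serializable {

    private static final long serialVersionUID = 1L;

    private String username;
    private long lastLoginMillis;

    public String getUsername() {
        return username;
    }

    public void setUsername(String username) {
        this.username = username;
    }

    public long getLastLoginMillis() {
        return lastLoginMillis;
    }

    public void setLastLoginMillis(long lastLoginMillis) {
        this.lastLoginMillis = lastLoginMillis;
    }
}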


My gut feeling is that this is either a concurrency-related bug in WADI or a
network timeout problem. During normal operation I sometimes see
messages like this in the log files:

17:14:08,869 INFO  [TcpFailureDetector] Received
memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10,
0, 10, 101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22 10
71 76 -72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5 115 114 0
50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
]] message. Will verify.
17:14:08,870 INFO  [TcpFailureDetector] Verification complete. Member still
alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0, 10,
101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22 10 71 76
-72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5 115 114 0 50 111
...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]

And lately, as traffic has increased, I'm seeing errors like this:


16:22:43,524 WARN  [UpdateReplicationCommand] Update has not been properly
cascaded due to a communication failure. If a targeted node has been lost,
state will be re-balanced automatically.

org.codehaus.wadi.servicespace.ServiceInvocationException:
org.codehaus.wadi.group.MessageExchangeException: No correlated messages
received within [2000]ms

at
org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)

at
org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$a901e91b.mergeUpdate(<generated>)

at
org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)

at
org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)

at
org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)

at
org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)

at
org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)

at
org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)

at
org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)

at
org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)

at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)

at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)

at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)

at
org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)

at
org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)

at
org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)

at org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)

at
org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)

at
org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)

at
org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)

at
org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:150)

at
org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:142)

at
org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)

at
org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)

at
org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)

at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)

at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)

at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)

at
org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)

at
org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)

at
org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)

at
org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)

at
org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)

at
org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)

at
org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)

at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)

at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)

at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)

at org.mortbay.jetty.Server.handle(Server.java:326)

at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)

at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)

at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)

at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)

at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)

at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)

at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)

at
org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)

at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated
messages received within [2000]ms

at
org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)

at
org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)

at
org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)

at
org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)

... 49 more


Does anyone have insight into what might be causing this, or know where
these timeouts can be increased, if they are configurable?

I'm thinking that a static WADI configuration might be more stable than the
dynamic setup we have now, which relies on multicasting. Does anyone have
experience with similar setups?

Thanks!

Trygve Hardersen - Jotta

Re: 2.2 in production

Posted by Quintin Beukes <qu...@skywalk.co.za>.
Hey,

This isn't related to this problem. Just figured I'd mention it.

The hostnames of the members are printed as tcp://{-64, -88, 1, 60}:4000. I
checked it out, and this is because the hostname is built like so:
this.hostname = org.apache.catalina.tribes.util.Arrays.toString(host);

where "host" is a byte[]. So 192, read as a signed 8-bit value, becomes -64.
Is it supposed to be like this, or should I fix/report it?
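
To make the signed-byte behaviour concrete, here is a small self-contained
example (just an illustration, not the Tribes code) showing why 192 prints
as -64, and how masking with 0xFF recovers the unsigned octets:

public class SignedByteDemo {
    public static void main(String[] args) {
        // 192.168.1.60 stored as raw bytes; byte is signed in Java,
        // so 192 becomes -64 and 168 becomes -88.
        byte[] host = {(byte) 192, (byte) 168, 1, 60};

        // Arrays.toString prints the signed values: [-64, -88, 1, 60]
        System.out.println(java.util.Arrays.toString(host));

        // Masking each byte with 0xFF recovers the unsigned octets.
        StringBuilder ip = new StringBuilder();
        for (int i = 0; i < host.length; i++) {
            if (i > 0) {
                ip.append('.');
            }
            ip.append(host[i] & 0xFF);
        }
        System.out.println(ip); // 192.168.1.60
    }
}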

Quintin Beukes


On Tue, Nov 3, 2009 at 5:35 PM, Trygve Hardersen <tr...@jotta.no> wrote:

>
>
> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com>wrote:
>>
>>
>> Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release.
>> Can we look for a new WADI release, soon? Once we know the problem is fixed?
>>
>> Trygve, the sooner we get confirmation that your issue is resolved, the
>> sooner we can start finalizing the 2.2 release.
>>
>> --kevan
>>
>
> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it on
> our test environment. No obvious issues so I'll go ahead and deploy this to
> production either later this evening (EU time) or tomorrow. Then it needs to
> run for a few days before I can confirm if the issue has really been
> resolved.
>
> BTW I got this on our test system:
>
> AS-000:
> 16:23:17,773 INFO  [TcpFailureDetector] Received
> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
> message. Will verify.
> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member still
> alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1,
> 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111 -47 79 83
> -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111
> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> AS-001:
> 16:23:18,446 INFO  [TcpFailureDetector] Received
> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
> ]] message. Will verify.
> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member still
> alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1,
> 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23 109 93 75
> 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0 50 111
> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> And then:
>
> AS-000
> 16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64,
> -88, 1, 61}:4000
> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
> partition rebalancing
> 16:30:02,600 INFO  [SimpleStateManager]
> =============================
> New Partition Balancing
> Partition Balancing
>     Size [24]
>     Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
> =============================
>
> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we werent
> notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88,
> 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111 -47 79
> 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111
> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88, 1,
> 61}:4000
>
> AS-001
> Nothing....
>
> There is practically no load on this network. Anyway I'll try this with
> load and see what happens.
>
> Many thanks again!
>
> Trygve
>

Re: 2.2 in production

Posted by Trygve Hardersen <tr...@jotta.no>.
Great, thanks again!

We've had many more updates of our servers since I last reported, and WADI
failover handling has proven stable so far.

Trygve

On Sat, Nov 14, 2009 at 7:59 AM, Gianny Damour <
gianny.damour@optusnet.com.au> wrote:

> Hi Trygve,
>
> Thanks for your test.
>
> These warnings indicate that a session replication message was not
> acknowledged within a 2000ms timeframe. The message has not been lost and
> will be processed by back-up nodes as expected.
>
>
> Kevan, I just deployed release 2.1.2 of WADI for geronimo 2.2.
>
>
> Thanks,
> Gianny
>
>
>
> On 09/11/2009, at 11:38 PM, Trygve Hardersen wrote:
>
>  Hello again
>>
>> The system has now been running for a few days and I've not seen the hang
>> on startup so I believe the issue has been resolved. Great!
>>
>> I am still seeing some warnings like this:
>>
>> 10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been properly
>> cascaded due to a communication failure. If a targeted node has been lost,
>> state will be re-balanced automatically.
>> org.codehaus.wadi.servicespace.ServiceInvocationException:
>> org.codehaus.wadi.group.MessageExchangeException: No correlated messages
>> received within [2000]ms
>>        at
>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
>>        at
>> org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
>>        at
>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
>>        at
>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
>>        at
>> org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
>>        at
>> org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
>>        at
>> org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
>>        at
>> org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
>>        at
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
>>        at
>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
>>        at
>> org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
>>        at
>> org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
>>        at
>> org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
>>        at
>> org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
>>        at
>> org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:154)
>>        at
>> org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:146)
>>        at
>> org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
>>        at
>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
>>        at
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>        at
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
>>        at
>> org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
>>        at
>> org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
>>        at
>> org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
>>        at
>> org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
>>        at
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>        at
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>        at
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>        at
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>        at
>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
>>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>        at
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>        at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
>>        at
>> org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:619)
>> Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated
>> messages received within [2000]ms
>>        at
>> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
>>        at
>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
>>        ... 49 more
>>
>> Not sure if I should worry about that.
>>
>> Many thanks for your help!
>>
>> Trygve
>>
>>
>> On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <tr...@jotta.no> wrote:
>>
>>
>> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com>
>> wrote:
>>
>> Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release.
>> Can we look for a new WADI release, soon? Once we know the problem is fixed?
>>
>> Trygve, the sooner we get confirmation that your issue is resolved, the
>> sooner we can start finalizing the 2.2 release.
>>
>> --kevan
>>
>> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it on
>> our test environment. No obvious issues so I'll go ahead and deploy this to
>> production either later this evening (EU time) or tomorrow. Then it needs to
>> run for a few days before I can confirm if the issue has really been
>> resolved.
>>
>> BTW I got this on our test system:
>>
>> AS-000:
>> 16:23:17,773 INFO  [TcpFailureDetector] Received
>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>> message. Will verify.
>> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member
>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>>
>> AS-001:
>> 16:23:18,446 INFO  [TcpFailureDetector] Received
>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>> ]] message. Will verify.
>> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member
>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>> ]]
>>
>> And then:
>>
>> AS-000
>> 16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64,
>> -88, 1, 61}:4000
>> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 16:30:02,600 INFO  [SimpleStateManager]
>> =============================
>> New Partition Balancing
>> Partition Balancing
>>    Size [24]
>>    Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>> =============================
>>
>> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we
>> werent notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111
>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
>> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88,
>> 1, 61}:4000
>>
>> AS-001
>> Nothing....
>>
>> There is practically no load on this network. Anyway I'll try this with
>> load and see what happens.
>>
>> Many thanks again!
>>
>> Trygve
>>
>>
>

Re: 2.2 in production

Posted by Trygve Sanne Hardersen <tr...@hypobytes.com>.
FYI, I know that there are still issues with WADI on this server, though I no
longer have access to the log files. I can, however, confirm that this
particular issue was resolved.

Thanks again!

Trygve

On Thu, Mar 4, 2010 at 4:50 PM, Jack Cai <gr...@gmail.com> wrote:

> +1
>
> -Jack
>
>
> On Thu, Mar 4, 2010 at 4:12 PM, Shawn Jiang <ge...@gmail.com> wrote:
>
>> WADI is still 2.1.1 in both the 2.2 branch and trunk somehow. If there is no
>> objection, I'll upgrade it to 2.1.2.
>>
>>
>> On Sat, Nov 14, 2009 at 2:59 PM, Gianny Damour <
>> gianny.damour@optusnet.com.au> wrote:
>>
>>> Hi Trygve,
>>>
>>> Thanks for your test.
>>>
>>> These warnings indicate that a session replication message was not
>>> acknowledged within a 2000ms timeframe. The message has not been lost and
>>> will be processed by back-up nodes as expected.
>>>
>>>
>>> Kevan, I just deployed release 2.1.2 of WADI for geronimo 2.2.
>>>
>>>
>>> Thanks,
>>> Gianny
>>>
>>>
>>>
>>> On 09/11/2009, at 11:38 PM, Trygve Hardersen wrote:
>>>
>>>  Hello again
>>>>
>>>> The system has now been running for a few days and I've not seen the
>>>> hang on startup so I believe the issue has been resolved. Great!
>>>>
>>>> I am still seeing some warnings like this:
>>>>
>>>> 10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been
>>>> properly cascaded due to a communication failure. If a targeted node has
>>>> been lost, state will be re-balanced automatically.
>>>> org.codehaus.wadi.servicespace.ServiceInvocationException:
>>>> org.codehaus.wadi.group.MessageExchangeException: No correlated messages
>>>> received within [2000]ms
>>>>        at
>>>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
>>>>        at
>>>> org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
>>>>        at
>>>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
>>>>        at
>>>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
>>>>        at
>>>> org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
>>>>        at
>>>> org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
>>>>        at
>>>> org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
>>>>        at
>>>> org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
>>>>        at
>>>> org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
>>>>        at
>>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
>>>>        at
>>>> org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
>>>>        at
>>>> org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
>>>>        at
>>>> org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
>>>>        at
>>>> org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
>>>>        at
>>>> org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:154)
>>>>        at
>>>> org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:146)
>>>>        at
>>>> org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
>>>>        at
>>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
>>>>        at
>>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>>>        at
>>>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
>>>>        at
>>>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
>>>>        at
>>>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>>>        at
>>>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>>>        at
>>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>>>        at
>>>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>>>        at
>>>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
>>>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
>>>>        at
>>>> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>>>>        at
>>>> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>>>        at
>>>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>>>        at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
>>>>        at
>>>> org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
>>>>        at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>        at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>        at java.lang.Thread.run(Thread.java:619)
>>>> Caused by: org.codehaus.wadi.group.MessageExchangeException: No
>>>> correlated messages received within [2000]ms
>>>>        at
>>>> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>>>>        at
>>>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
>>>>        at
>>>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
>>>>        at
>>>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
>>>>        ... 49 more
>>>>
>>>> Not sure if I should worry about that.
>>>>
>>>> Many thanks for your help!
>>>>
>>>> Trygve
>>>>
>>>>
>>>> On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <tr...@jotta.no>
>>>> wrote:
>>>>
>>>>
>>>> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Gianny. I'd like to see this included in the Geronimo 2.2
>>>> release. Can we look for a new WADI release, soon? Once we know the problem
>>>> is fixed?
>>>>
>>>> Trygve, the sooner we get confirmation that your issue is resolved, the
>>>> sooner we can start finalizing the 2.2 release.
>>>>
>>>> --kevan
>>>>
>>>> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it
>>>> on our test environment. No obvious issues so I'll go ahead and deploy this
>>>> to production either later this evening (EU time) or tomorrow. Then it needs
>>>> to run for a few days before I can confirm if the issue has really been
>>>> resolved.
>>>>
>>>> BTW I got this on our test system:
>>>>
>>>> AS-000:
>>>> 16:23:17,773 INFO  [TcpFailureDetector] Received
>>>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>>>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>>>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>>>> message. Will verify.
>>>> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member
>>>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>>>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>>>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>>>>
>>>> AS-001:
>>>> 16:23:18,446 INFO  [TcpFailureDetector] Received
>>>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>>>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>>>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>>>> ]] message. Will verify.
>>>> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member
>>>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>>>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>>>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>>>> ]]
>>>>
>>>> And then:
>>>>
>>>> AS-000
>>>> 16:30:02,576 INFO  [ChannelInterceptorBase]
>>>> memberDisappeared:tcp://{-64, -88, 1, 61}:4000
>>>> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
>>>> partition rebalancing
>>>> 16:30:02,600 INFO  [SimpleStateManager]
>>>> =============================
>>>> New Partition Balancing
>>>> Partition Balancing
>>>>    Size [24]
>>>>    Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>>    Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>>> version [3]; mergeVersion [0]
>>>> =============================
>>>>
>>>> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we
>>>> werent notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111
>>>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>>>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
>>>> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88,
>>>> 1, 61}:4000
>>>>
>>>> AS-001
>>>> Nothing....
>>>>
>>>> There is practically no load on this network. Anyway I'll try this with
>>>> load and see what happens.
>>>>
>>>> Many thanks again!
>>>>
>>>> Trygve
>>>>
>>>>
>>>
>>
>>
>> --
>> Shawn
>>
>
>


-- 
HypoBytes Ltd.
Trygve Sanne Hardersen
Akersveien 24F
0177 Oslo
Norway

hypobytes.com
+47 40 55 30 25

Re: 2.2 in production

Posted by Jack Cai <gr...@gmail.com>.
+1

-Jack

On Thu, Mar 4, 2010 at 4:12 PM, Shawn Jiang <ge...@gmail.com> wrote:

> WADI is still at 2.1.1 in both the 2.2 branch and trunk somehow. If there are
> no objections, I'll upgrade it to 2.1.2.
>
>
> On Sat, Nov 14, 2009 at 2:59 PM, Gianny Damour <
> gianny.damour@optusnet.com.au> wrote:
>
>> Hi Trygve,
>>
>> Thanks for your test.
>>
>> These warnings indicate that a session replication message was not
>> acknowledged within a 2000ms timeframe. The message has not been lost and
>> will be processed by back-up nodes as expected.
>>
>>
>> Kevan, I just deployed release 2.1.2 of WADI for geronimo 2.2.
>>
>>
>> Thanks,
>> Gianny
>>
>>
>>
>> On 09/11/2009, at 11:38 PM, Trygve Hardersen wrote:
>>
>>  Hello again
>>>
>>> The system has now been running for a few days and I've not seen the hang
>>> on startup so I believe the issue has been resolved. Great!
>>>
>>> I am still seeing some warnings like this:
>>>
>>> 10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been
>>> properly cascaded due to a communication failure. If a targeted node has
>>> been lost, state will be re-balanced automatically.
>>> org.codehaus.wadi.servicespace.ServiceInvocationException:
>>> org.codehaus.wadi.group.MessageExchangeException: No correlated messages
>>> received within [2000]ms
>>>        at
>>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
>>>        at
>>> org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
>>>        at
>>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
>>>        at
>>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
>>>        at
>>> org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
>>>        at
>>> org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
>>>        at
>>> org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
>>>        at
>>> org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
>>>        at
>>> org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
>>>        at
>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
>>>        at
>>> org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
>>>        at
>>> org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
>>>        at
>>> org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
>>>        at
>>> org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
>>>        at
>>> org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:154)
>>>        at
>>> org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:146)
>>>        at
>>> org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
>>>        at
>>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
>>>        at
>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>>        at
>>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>>        at
>>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
>>>        at
>>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
>>>        at
>>> org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
>>>        at
>>> org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
>>>        at
>>> org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
>>>        at
>>> org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
>>>        at
>>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
>>>        at
>>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>>        at
>>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>>        at
>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>>        at
>>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>>        at
>>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
>>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
>>>        at
>>> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>>>        at
>>> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>>        at
>>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>>        at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
>>>        at
>>> org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>        at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>        at java.lang.Thread.run(Thread.java:619)
>>> Caused by: org.codehaus.wadi.group.MessageExchangeException: No
>>> correlated messages received within [2000]ms
>>>        at
>>> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>>>        at
>>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
>>>        at
>>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
>>>        at
>>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
>>>        ... 49 more
>>>
>>> Not sure if I should worry about that.
>>>
>>> Many thanks for your help!
>>>
>>> Trygve
>>>
>>>
>>> On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <tr...@jotta.no>
>>> wrote:
>>>
>>>
>>> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com>
>>> wrote:
>>>
>>> Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release.
>>> Can we look for a new WADI release, soon? Once we know the problem is fixed?
>>>
>>> Trygve, the sooner we get confirmation that your issue is resolved, the
>>> sooner we can start finalizing the 2.2 release.
>>>
>>> --kevan
>>>
>>> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it
>>> on our test environment. No obvious issues so I'll go ahead and deploy this
>>> to production either later this evening (EU time) or tomorrow. Then it needs
>>> to run for a few days before I can confirm if the issue has really been
>>> resolved.
>>>
>>> BTW I got this on our test system:
>>>
>>> AS-000:
>>> 16:23:17,773 INFO  [TcpFailureDetector] Received
>>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>>> message. Will verify.
>>> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member
>>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>>>
>>> AS-001:
>>> 16:23:18,446 INFO  [TcpFailureDetector] Received
>>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>>> ]] message. Will verify.
>>> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member
>>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>>> ]]
>>>
>>> And then:
>>>
>>> AS-000
>>> 16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64,
>>> -88, 1, 61}:4000
>>> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
>>> partition rebalancing
>>> 16:30:02,600 INFO  [SimpleStateManager]
>>> =============================
>>> New Partition Balancing
>>> Partition Balancing
>>>    Size [24]
>>>    Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>>    Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>>> version [3]; mergeVersion [0]
>>> =============================
>>>
>>> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we
>>> werent notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111
>>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
>>> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88,
>>> 1, 61}:4000
>>>
>>> AS-001
>>> Nothing....
>>>
>>> There is practically no load on this network. Anyway I'll try this with
>>> load and see what happens.
>>>
>>> Many thanks again!
>>>
>>> Trygve
>>>
>>>
>>
>
>
> --
> Shawn
>

Re: 2.2 in production

Posted by Shawn Jiang <ge...@gmail.com>.
WADI is still at 2.1.1 in both the 2.2 branch and trunk somehow. If there are
no objections, I'll upgrade it to 2.1.2.
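
For anyone who wants to double-check which WADI build a running server actually
loaded, a quick check along these lines works; it assumes the wadi-core jar
carries Implementation-Version metadata in its manifest, which snapshot builds
may not:

import org.codehaus.wadi.core.manager.StandardManager;

// Prints the WADI version reported by the wadi-core jar's manifest.
// getImplementationVersion() returns null if the manifest has no version entry.
public class WadiVersionCheck {
    public static void main(String[] args) {
        Package pkg = StandardManager.class.getPackage();
        String version = (pkg != null) ? pkg.getImplementationVersion() : null;
        System.out.println("WADI core version: " + (version != null ? version : "unknown"));
    }
}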

On Sat, Nov 14, 2009 at 2:59 PM, Gianny Damour <
gianny.damour@optusnet.com.au> wrote:

> Hi Trygve,
>
> Thanks for your test.
>
> These warnings indicate that a session replication message was not
> acknowledged within a 2000ms timeframe. The message has not been lost and
> will be processed by back-up nodes as expected.
>
>
> Kevan, I just deployed release 2.1.2 of WADI for geronimo 2.2.
>
>
> Thanks,
> Gianny
>
>
>
> On 09/11/2009, at 11:38 PM, Trygve Hardersen wrote:
>
>  Hello again
>>
>> The system has now been running for a few days and I've not seen the hang
>> on startup so I believe the issue has been resolved. Great!
>>
>> I am still seeing some warnings like this:
>>
>> 10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been properly
>> cascaded due to a communication failure. If a targeted node has been lost,
>> state will be re-balanced automatically.
>> org.codehaus.wadi.servicespace.ServiceInvocationException:
>> org.codehaus.wadi.group.MessageExchangeException: No correlated messages
>> received within [2000]ms
>>        at
>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
>>        at
>> org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
>>        at
>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
>>        at
>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
>>        at
>> org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
>>        at
>> org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
>>        at
>> org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
>>        at
>> org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
>>        at
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
>>        at
>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
>>        at
>> org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
>>        at
>> org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
>>        at
>> org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
>>        at
>> org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
>>        at
>> org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:154)
>>        at
>> org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:146)
>>        at
>> org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
>>        at
>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
>>        at
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>        at
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
>>        at
>> org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
>>        at
>> org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
>>        at
>> org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
>>        at
>> org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
>>        at
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>        at
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>        at
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>        at
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>        at
>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
>>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>        at
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>        at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
>>        at
>> org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:619)
>> Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated
>> messages received within [2000]ms
>>        at
>> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
>>        at
>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
>>        ... 49 more
>>
>> Not sure if I should worry about that.
>>
>> Many thanks for your help!
>>
>> Trygve
>>
>>
>> On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <tr...@jotta.no> wrote:
>>
>>
>> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com>
>> wrote:
>>
>> Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release.
>> Can we look for a new WADI release, soon? Once we know the problem is fixed?
>>
>> Trygve, the sooner we get confirmation that your issue is resolved, the
>> sooner we can start finalizing the 2.2 release.
>>
>> --kevan
>>
>> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it on
>> our test environment. No obvious issues so I'll go ahead and deploy this to
>> production either later this evening (EU time) or tomorrow. Then it needs to
>> run for a few days before I can confirm if the issue has really been
>> resolved.
>>
>> BTW I got this on our test system:
>>
>> AS-000:
>> 16:23:17,773 INFO  [TcpFailureDetector] Received
>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>> message. Will verify.
>> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member
>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>>
>> AS-001:
>> 16:23:18,446 INFO  [TcpFailureDetector] Received
>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>> ]] message. Will verify.
>> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member
>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
>> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>> ]]
>>
>> And then:
>>
>> AS-000
>> 16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64,
>> -88, 1, 61}:4000
>> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 16:30:02,600 INFO  [SimpleStateManager]
>> =============================
>> New Partition Balancing
>> Partition Balancing
>>    Size [24]
>>    Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
>> version [3]; mergeVersion [0]
>> =============================
>>
>> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we
>> werent notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
>> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111
>> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
>> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88,
>> 1, 61}:4000
>>
>> AS-001
>> Nothing....
>>
>> There is practically no load on this network. Anyway I'll try this with
>> load and see what happens.
>>
>> Many thanks again!
>>
>> Trygve
>>
>>
>


-- 
Shawn

Re: 2.2 in production

Posted by Gianny Damour <gi...@optusnet.com.au>.
Hi Trygve,

Thanks for your test.

These warnings indicate that a session replication message was not  
acknowledged within a 2000ms timeframe. The message has not been lost  
and will be processed by back-up nodes as expected.
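
In other words, the exchange is a correlated request/acknowledge with a fixed
rendezvous timeout. The sketch below is illustrative only, not the actual WADI
code: the sender publishes the update, waits up to 2000ms for the matching ack,
and on timeout logs the warning while the update itself still reaches the
backup peer.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Shape of the correlated request/ack pattern behind the warning (not WADI's code).
public class RendezvousSketch {
    private final Map<String, CountDownLatch> pending =
            new ConcurrentHashMap<String, CountDownLatch>();

    // Sender side: publish an update, then wait up to 2000ms for the ack.
    public void sendUpdate(String correlationId, Runnable publish) throws InterruptedException {
        CountDownLatch ack = new CountDownLatch(1);
        pending.put(correlationId, ack);
        publish.run();  // hand the update to the transport
        try {
            if (!ack.await(2000, TimeUnit.MILLISECONDS)) {
                // Same situation as the WARN above: no correlated ack in time.
                // The update is already on the wire; the peer applies it whenever
                // it arrives, so only the confirmation is missing, not the data.
                System.err.println("No correlated messages received within [2000]ms for " + correlationId);
            }
        } finally {
            pending.remove(correlationId);
        }
    }

    // Receiver side: called when the peer's acknowledgement finally shows up.
    public void onAck(String correlationId) {
        CountDownLatch ack = pending.get(correlationId);
        if (ack != null) {
            ack.countDown();
        }
    }
}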


Kevan, I just deployed release 2.1.2 of WADI for geronimo 2.2.


Thanks,
Gianny


On 09/11/2009, at 11:38 PM, Trygve Hardersen wrote:

> Hello again
>
> The system has now been running for a few days and I've not seen  
> the hang on startup so I believe the issue has been resolved. Great!
>
> I am still seeing some warnings like this:
>
> 10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been  
> properly cascaded due to a communication failure. If a targeted  
> node has been lost, state will be re-balanced automatically.
> org.codehaus.wadi.servicespace.ServiceInvocationException:  
> org.codehaus.wadi.group.MessageExchangeException: No correlated  
> messages received within [2000]ms
> 	at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory 
> $ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
> 	at org.codehaus.wadi.replication.storage.ReplicaStorage$ 
> $EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
> 	at  
> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.c 
> ascadeUpdate(UpdateReplicationCommand.java:93)
> 	at  
> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.r 
> un(UpdateReplicationCommand.java:86)
> 	at  
> org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.upd 
> ate(SyncReplicationManager.java:138)
> 	at  
> org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager. 
> update(LoggingReplicationManager.java:100)
> 	at  
> org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessi 
> ng(AbstractReplicableSession.java:49)
> 	at  
> org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProces 
> sing(AtomicallyReplicableSession.java:58)
> 	at  
> org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess 
> (WADISessionAdaptor.java:77)
> 	at  
> org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete 
> (ClusteredSessionManager.java:60)
> 	at org.mortbay.jetty.servlet.SessionHandler.handle 
> (SessionHandler.java:198)
> 	at  
> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle 
> (ClusteredSessionHandler.java:59)
> 	at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler 
> $ActualHandler.handle(ClusteredSessionHandler.java:66)
> 	at org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler 
> $WebClusteredInvocation.invokeLocally 
> (AbstractClusteredPreHandler.java:71)
> 	at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler 
> $WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java: 
> 52)
> 	at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler 
> $WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java: 
> 64)
> 	at org.codehaus.wadi.web.impl.WebInvocation.invoke 
> (WebInvocation.java:116)
> 	at  
> org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocal 
> ly(MemoryContextualiser.java:71)
> 	at  
> org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser. 
> handle(AbstractExclusiveContextualiser.java:94)
> 	at  
> org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.con 
> textualise(AbstractMotingContextualiser.java:37)
> 	at org.codehaus.wadi.core.manager.StandardManager.processStateful 
> (StandardManager.java:154)
> 	at org.codehaus.wadi.core.manager.StandardManager.contextualise 
> (StandardManager.java:146)
> 	at org.codehaus.wadi.core.manager.ClusteredManager.contextualise 
> (ClusteredManager.java:81)
> 	at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler 
> $WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
> 	at  
> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle( 
> AbstractClusteredPreHandler.java:39)
> 	at  
> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle 
> (ClusteredSessionHandler.java:51)
> 	at org.mortbay.jetty.handler.ContextHandler.handle 
> (ContextHandler.java:765)
> 	at org.mortbay.jetty.webapp.WebAppContext.handle 
> (WebAppContext.java:417)
> 	at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access 
> $101(TwistyWebAppContext.java:41)
> 	at org.apache.geronimo.jetty6.handler.TwistyWebAppContext 
> $TwistyHandler.handle(TwistyWebAppContext.java:66)
> 	at  
> org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle 
> (ThreadClassloaderHandler.java:46)
> 	at org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle 
> (InstanceContextHandler.java:58)
> 	at org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle 
> (UserTransactionHandler.java:48)
> 	at  
> org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle 
> (ComponentContextHandler.java:47)
> 	at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle 
> (TwistyWebAppContext.java:60)
> 	at org.mortbay.jetty.handler.ContextHandlerCollection.handle 
> (ContextHandlerCollection.java:230)
> 	at org.mortbay.jetty.handler.HandlerCollection.handle 
> (HandlerCollection.java:114)
> 	at org.mortbay.jetty.handler.HandlerWrapper.handle 
> (HandlerWrapper.java:152)
> 	at org.mortbay.jetty.Server.handle(Server.java:326)
> 	at org.mortbay.jetty.HttpConnection.handleRequest 
> (HttpConnection.java:534)
> 	at org.mortbay.jetty.HttpConnection$RequestHandler.content 
> (HttpConnection.java:879)
> 	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
> 	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
> 	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> 	at org.mortbay.io.nio.SelectChannelEndPoint.run 
> (SelectChannelEndPoint.java:409)
> 	at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
> 	at org.apache.geronimo.pool.ThreadPool 
> $ContextClassLoaderRunnable.run(ThreadPool.java:344)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask 
> (ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run 
> (ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:619)
> Caused by: org.codehaus.wadi.group.MessageExchangeException: No  
> correlated messages received within [2000]ms
> 	at  
> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous 
> (AbstractDispatcher.java:174)
> 	at  
> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers 
> (BasicServiceInvoker.java:90)
> 	at org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke 
> (BasicServiceInvoker.java:69)
> 	at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory 
> $ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
> 	... 49 more
>
> Not sure if I should worry about that.
>
> Many thanks for your help!
>
> Trygve
>
>
> On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <tr...@jotta.no>  
> wrote:
>
>
> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller  
> <ke...@gmail.com> wrote:
>
> Thanks Gianny. I'd like to see this included in the Geronimo 2.2  
> release. Can we look for a new WADI release, soon? Once we know the  
> problem is fixed?
>
> Trygve, the sooner we get confirmation that your issue is resolved,  
> the sooner we can start finalizing the 2.2 release.
>
> --kevan
>
> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and  
> installed it on our test environment. No obvious issues so I'll go  
> ahead and deploy this to production either later this evening (EU  
> time) or tomorrow. Then it needs to run for a few days before I can  
> confirm if the issue has really been resolved.
>
> BTW I got this on our test system:
>
> AS-000:
> 16:23:17,773 INFO  [TcpFailureDetector] Received memberDisappeared 
> [org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88,  
> 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10  
> 111 -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5  
> 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95  
> 87 65 68 ...(10)}, ]] message. Will verify.
> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete.  
> Member still alive[org.apache.catalina.tribes.membership.MemberImpl 
> [tcp://{-64, -88, 1, 61}:4000,{-64, -88, 1, 61},4000,  
> alive=1814258,id={50 18 86 10 111 -47 79 83 -108 -4 82 -8 26 82 -79  
> -59 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={},  
> domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> AS-001:
> 16:23:18,446 INFO  [TcpFailureDetector] Received memberDisappeared 
> [org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88,  
> 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91  
> -23 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19  
> 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65  
> 95 87 65 68 ...(10)}, ]] message. Will verify.
> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete.  
> Member still alive[org.apache.catalina.tribes.membership.MemberImpl 
> [tcp://{-64, -88, 1, 60}:4000,{-64, -88, 1, 60},4000,  
> alive=2500759,id={107 -64 91 -23 109 93 75 116 -95 109 110 22 -85  
> 53 -52 85 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)},  
> command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> And then:
>
> AS-000
> 16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp:// 
> {-64, -88, 1, 61}:4000
> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService]  
> Queueing partition rebalancing
> 16:30:02,600 INFO  [SimpleStateManager]
> =============================
> New Partition Balancing
> Partition Balancing
>     Size [24]
>     Partition[0] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[1] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[2] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[3] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[4] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[5] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[6] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[7] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[8] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[9] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[10] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[11] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[12] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[13] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[14] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[15] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[16] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[17] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[18] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[19] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[20] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[21] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[22] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
>     Partition[23] owned by [TribesPeer [AS-000; tcp:// 
> 192.168.1.60:4000]]; version [3]; mergeVersion [0]
> =============================
>
> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though  
> we werent notified:org.apache.catalina.tribes.membership.MemberImpl 
> [tcp://{-64, -88, 1, 61}:4000,{-64, -88, 1, 61},4000,  
> alive=2221072,id={50 18 86 10 111 -47 79 83 -108 -4 82 -8 26 82 -79  
> -59 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={},  
> domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64,  
> -88, 1, 61}:4000
>
> AS-001
> Nothing....
>
> There is practically no load on this network. Anyway I'll try this  
> with load and see what happens.
>
> Many thanks again!
>
> Trygve
>


Re: 2.2 in production

Posted by Trygve Hardersen <tr...@jotta.no>.
Hello again

The system has now been running for a few days and I've not seen the hang on
startup so I believe the issue has been resolved. Great!

I am still seeing some warnings like this:

10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been properly
cascaded due to a communication failure. If a targeted node has been lost,
state will be re-balanced automatically.
org.codehaus.wadi.servicespace.ServiceInvocationException:
org.codehaus.wadi.group.MessageExchangeException: No correlated messages
received within [2000]ms
at
org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
at
org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
at
org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
at
org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
at
org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
at
org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
at
org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
at
org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
at
org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
at
org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
at
org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
at
org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
at org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
at
org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
at
org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
at
org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
at
org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:154)
at
org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:146)
at
org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
at
org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
at
org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at
org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
at
org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
at
org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
at
org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
at
org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
at
org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
at
org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
at
org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated
messages received within [2000]ms
at
org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
at
org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
at
org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
at
org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
... 49 more

Not sure if I should worry about that.

Many thanks for your help!

Trygve


On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <tr...@jotta.no> wrote:

>
>
> On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com> wrote:
>>
>>
>> Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release.
>> Can we look for a new WADI release, soon? Once we know the problem is fixed?
>>
>> Trygve, the sooner we get confirmation that your issue is resolved, the
>> sooner we can start finalizing the 2.2 release.
>>
>> --kevan
>>
>
> I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it on
> our test environment. No obvious issues so I'll go ahead and deploy this to
> production either later this evening (EU time) or tomorrow. Then it needs to
> run for a few days before I can confirm if the issue has really been
> resolved.
>
> BTW I got this on our test system:
>
> AS-000:
> 16:23:17,773 INFO  [TcpFailureDetector] Received
> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
> -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
> -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
> message. Will verify.
> 16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member still
> alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1,
> 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111 -47 79 83
> -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111
> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> AS-001:
> 16:23:18,446 INFO  [TcpFailureDetector] Received
> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
> -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
> 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
> ]] message. Will verify.
> 16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member still
> alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1,
> 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23 109 93 75
> 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0 50 111
> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> And then:
>
> AS-000
> 16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64,
> -88, 1, 61}:4000
> 16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
> partition rebalancing
> 16:30:02,600 INFO  [SimpleStateManager]
> =============================
> New Partition Balancing
> Partition Balancing
>     Size [24]
>     Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
>     Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
> version [3]; mergeVersion [0]
> =============================
>
> 16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we werent
> notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88,
> 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111 -47 79
> 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111
> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
> 16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88, 1,
> 61}:4000
>
> AS-001
> Nothing....
>
> There is practically no load on this network. Anyway I'll try this with
> load and see what happens.
>
> Many thanks again!
>
> Trygve
>

Re: 2.2 in production

Posted by Trygve Hardersen <tr...@jotta.no>.
On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <ke...@gmail.com> wrote:
>
>
> Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release.
> Can we look for a new WADI release, soon? Once we know the problem is fixed?
>
> Trygve, the sooner we get confirmation that your issue is resolved, the
> sooner we can start finalizing the 2.2 release.
>
> --kevan
>

I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it on
our test environment. No obvious issues, so I'll go ahead and deploy this to
production either later this evening (EU time) or tomorrow. Then it needs to
run for a few days before I can confirm if the issue has really been
resolved.

BTW I got this on our test system:

AS-000:
16:23:17,773 INFO  [TcpFailureDetector] Received
memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
-88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111
-47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50
111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
message. Will verify.
16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member still
alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1,
61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111 -47 79 83
-108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111
...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]

AS-001:
16:23:18,446 INFO  [TcpFailureDetector] Received
memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64,
-88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23
109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0
50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
]] message. Will verify.
16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member still
alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1,
60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23 109 93 75
116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0 50 111
...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]

And then:

AS-000
16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64,
-88, 1, 61}:4000
16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing
partition rebalancing
16:30:02,600 INFO  [SimpleStateManager]
=============================
New Partition Balancing
Partition Balancing
    Size [24]
    Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
    Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]];
version [3]; mergeVersion [0]
=============================

16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we werent
notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88,
1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111 -47 79
83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111
...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88, 1,
61}:4000

AS-001
Nothing....

There is practically no load on this network. Anyway I'll try this with load
and see what happens.

Many thanks again!

Trygve

Re: 2.2 in production

Posted by Kevan Miller <ke...@gmail.com>.
On Nov 2, 2009, at 9:52 AM, Trygve Hardersen wrote:

> Hi Gianny
>
> Many thanks for your help!
>
> I'll try to make a Geronimo build that uses this WADI version the  
> next time we update our servers. Will let you know what I find.
>
> The session size in our application should be small as only a couple  
> of simple objects are put there, though I've not really verified  
> this. I would not say the servers are under any significant load  
> when the messages start appearing, but again this is hard to  
> quantify. CPU and bandwidth are not constrained, at least.

Thanks Gianny. I'd like to see this included in the Geronimo 2.2  
release. Can we look for a new WADI release, soon? Once we know the  
problem is fixed?

Trygve, the sooner we get confirmation that your issue is resolved,  
the sooner we can start finalizing the 2.2 release.

--kevan

Re: 2.2 in production

Posted by Trygve Hardersen <tr...@jotta.no>.
Hi Gianny

Many thanks for your help!

I'll try to make a Geronimo build that uses this WADI version the next time
we update our servers. Will let you know what I find.

The session size in our application should be small as only a couple of
simple objects are put there, though I've not really verified this. I would
not say the servers are under any significant load when the messages start
appearing, but again this is hard to quantify. CPU and bandwidth are not
constrained, at least.
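
If it turns out to matter, I can get a rough number by serializing the
attributes we actually put into the session and counting the bytes. A
minimal sketch of that, using plain JDK serialization (SessionSizeEstimator
is just an illustrative name, not part of WADI or Geronimo):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public final class SessionSizeEstimator {

    /** Serializes the attribute map and returns its size in bytes. */
    public static int estimate(Map<String, Serializable> attributes) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(new HashMap<String, Serializable>(attributes));
        out.close();
        return bytes.size();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Serializable> attributes = new HashMap<String, Serializable>();
        attributes.put("userId", Long.valueOf(42));   // a couple of simple objects,
        attributes.put("locale", "no_NO");            // as in our application
        System.out.println("approx. serialized session state: "
                + estimate(attributes) + " bytes");
    }
}

That ignores whatever framing WADI adds around the state, but it should at
least tell us whether we are talking about hundreds of bytes or hundreds of
kilobytes per session.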

Trygve

On Sun, Nov 1, 2009 at 10:53 AM, Gianny Damour <
gianny.damour@optusnet.com.au> wrote:

> Hi Trygve,
>
> Thanks for reporting these problems with comprehensive log messages; they
> were very helpful to diagnose.
>
> The bug causing the inability to reliably restart without downtime has been
> identified and fixed. I just deployed new WADI 2.2-SNAPSHOT artifacts, which
> can be found here:
>
> http://snapshots.repository.codehaus.org/org/codehaus/wadi/
>
>
> Can you please confirm that this is now working as expected within your
> environment?
>
>
> Regarding the warning message
>
>
>  16:22:43,524 WARN  [UpdateReplicationCommand] Update has not been properly
>> cascaded due to a communication failure. If a targeted node has been lost,
>> state will be re-balanced automatically.
>>
>
> the timeout is not exposed via API. Even if this warning message is
> displayed, an update message has actually been queued for delivery to the
> relevant back-up nodes. However, there is no guarantee that in case of
> failure, the cluster will be able to restore the latest session state as the
> latest update message may not have actually been delivered to back-up nodes.
>
> I conducted load-testing on a single computer. As the network traffic going
> through the loopback interface does not actually touch the network card,
> this is a problem I was never able to observe.
>
> Do you have an idea of the traffic volume when this warning message starts
> to appear? Also, do you have an idea of the session size?
>
> Thanks,
> Gianny
>
>
> On 31/10/2009, at 3:39 AM, Trygve Hardersen wrote:
>
>  Hello
>>
>> We have been using Geronimo 2.2-SNAPSHOT in production for a good month
>> now, and I thought I'd share some experiences with the community, and maybe
>> get some help. We are an online backup service, check out jottabackup.com if you're interested.
>>
>> Generally our experience has been very positive. We're using the GBean
>> framework for custom server components, the DB connection pools against
>> MySQL databases, stateless service EJBs and various MDB, and of course the
>> web tier (Jetty). Everything is running smoothly and we're very much looking
>> forward to 2.2 being released so we can "release" our own software.
>>
>> The issues we're having are related to WADI clustering with Jetty. First
>> we can't use Jetty7 because of GERONIMO-4846, so we're using Jetty6 which
>> works fine. The more serious issue is that we often can not update our
>> servers without downtime. This is what happens:
>>
>> We have 2 application servers (AS-000 and AS-001) running dynamic WADI
>> HTTP session replication between them. When updating we first stop one,
>> AS-000 in this case. That works fine and the active sessions are migrated to
>> AS-001:
>>
>> 23:43:18,160 INFO  [SimpleStateManager]
>> =============================
>> New Partition Balancing
>> Partition Balancing
>>    Size [24]
>>    Partition[0] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[1] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[2] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[3] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[4] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[5] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[6] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[7] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[8] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[9] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[10] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[11] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[12] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[13] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[14] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[15] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[16] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[17] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[18] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[19] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[20] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[21] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[22] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>>    Partition[23] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]];
>> version [3]; mergeVersion [0]
>> =============================
>>
>> 23:43:28,539 INFO  [TcpFailureDetector] Verification complete. Member
>> disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0,
>> 10, 100}:4000,{10, 0, 10, 100},4000, alive=41104531,id={-4 -32 54 90 -109
>> -17 65 64 -117 40 -110 -14 36 93 -12 -118 }, payload={-84 -19 0 5 115 114 0
>> 50 111 ...(423)}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79
>> 84 84 65 95 87 65 68 ...(10)}, ]]
>> 23:43:28,540 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{10,
>> 0, 10, 100}:4000
>>
>> We then update AS-000 and try to start it, but it fails to rejoin the WADI
>> cluster:
>>
>> 23:46:30,784 INFO  [ReceiverBase] Receiver Server Socket bound to:/
>> 10.0.10.100:4000
>> 23:46:30,864 INFO  [ChannelInterceptorBase] memberStart
>> local:org.apache.catalina.tribes.membership.MemberImpl[tcp://
>> 10.0.10.100:4000,10.0.10.100,4000, alive=0,id={-103 34 80 -19 68 -51 70
>> -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50 111
>> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
>> notify:false peer:AS-000
>> 23:46:30,868 INFO  [McastService] Setting cluster mcast soTimeout to 500
>> 23:46:30,908 INFO  [McastService] Sleeping for 1000 milliseconds to
>> establish cluster membership, start level:4
>> 23:46:31,139 INFO  [ChannelInterceptorBase] memberAdded:tcp://{10, 0, 10,
>> 101}:4000
>> 23:46:31,908 INFO  [McastService] Done sleeping, membership established,
>> start level:4
>> 23:46:31,912 INFO  [McastService] Sleeping for 1000 milliseconds to
>> establish cluster membership, start level:8
>> 23:46:31,927 INFO  [BufferPool] Created a buffer pool with max
>> size:104857600 bytes of type:org.apache.catalina.tribes.io.BufferPool15Impl
>> 23:46:32,912 INFO  [McastService] Done sleeping, membership established,
>> start level:8
>> 23:46:32,912 INFO  [ChannelInterceptorBase] memberStart
>> local:org.apache.catalina.tribes.membership.MemberImpl[tcp://
>> 10.0.10.100:4000,10.0.10.100,4000, alive=272,id={-103 34 80 -19 68 -51 70
>> -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50 111
>> ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
>> notify:false peer:AS-000
>> 23:46:37,848 INFO  [DiscStore] Creating directory:
>> /usr/lib/jotta/jotta-as-prod-1.0-SNAPSHOT/var/temp/SessionStore
>> 23:46:37,930 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-000;
>> tcp://10.0.10.100:4000]] owns singleton service [PartitionManager for
>> ServiceSpace [/]]
>> 23:46:37,964 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-000;
>> tcp://10.0.10.100:4000]] resigns ownership of singleton service
>> [PartitionManager for ServiceSpace [/]]
>> 23:47:40,065 ERROR [BasicServiceRegistry] Error while starting [Holder for
>> service
>> [org.codehaus.wadi.location.partitionmanager.SimplePartitionManager@7dc2445f]
>> named [PartitionManager] in space [ServiceSpace [/]]]
>> org.codehaus.wadi.location.partitionmanager.PartitionManagerException: Partition
>> [0] is unknown.
>>        at
>> org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.waitForBoot(SimplePartitionManager.java:248)
>>        at
>> org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.start(SimplePartitionManager.java:119)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceHolder.start(BasicServiceHolder.java:60)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceRegistry.start(BasicServiceRegistry.java:152)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceSpace.start(BasicServiceSpace.java:169)
>>        at
>> org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doStart(BasicWADISessionManager.java:125)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance(GBeanInstance.java:948)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(GBeanInstanceState.java:269)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstanceState.start(GBeanInstanceState.java:103)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive(GBeanInstanceState.java:125)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive(GBeanInstance.java:538)
>>        at
>> org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean(BasicKernel.java:377)
>>        at
>> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:456)
>>        at
>> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:493)
>>        at
>> org.apache.geronimo.kernel.config.KernelConfigurationManager.start(KernelConfigurationManager.java:190)
>>        at
>> org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConfiguration(SimpleConfigurationManager.java:546)
>>        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at
>> org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke(ReflectionMethodInvoker.java:34)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanOperation.invoke(GBeanOperation.java:130)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstance.invoke(GBeanInstance.java:815)
>>        at
>> org.apache.geronimo.gbean.runtime.RawInvoker.invoke(RawInvoker.java:57)
>>        at
>> org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke(RawOperationInvoker.java:35)
>>        at
>> org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept(ProxyMethodInterceptor.java:96)
>>        at
>> org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$$628b9237.startConfiguration(<generated>)
>>        at
>> org.apache.geronimo.system.main.EmbeddedDaemon.doStartup(EmbeddedDaemon.java:161)
>>        at
>> org.apache.geronimo.system.main.EmbeddedDaemon.execute(EmbeddedDaemon.java:78)
>>        at
>> org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main(MainConfigurationBootstrapper.java:45)
>>        at
>> org.apache.geronimo.cli.AbstractCLI.executeMain(AbstractCLI.java:65)
>>        at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)
>> 23:47:40,078 ERROR [BasicWADISessionManager] Failed to stop
>> org.codehaus.wadi.servicespace.ServiceSpaceNotFoundException:
>> ServiceSpaceName not found
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceSpaceRegistry.unregister(BasicServiceSpaceRegistry.java:55)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceSpace.unregisterServiceSpace(BasicServiceSpace.java:228)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceSpace.stop(BasicServiceSpace.java:175)
>>        at
>> org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doFail(BasicWADISessionManager.java:134)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance(GBeanInstance.java:978)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(GBeanInstanceState.java:269)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstanceState.start(GBeanInstanceState.java:103)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive(GBeanInstanceState.java:125)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive(GBeanInstance.java:538)
>>        at
>> org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean(BasicKernel.java:377)
>>        at
>> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:456)
>>        at
>> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfigurationGBeans(ConfigurationUtil.java:493)
>>        at
>> org.apache.geronimo.kernel.config.KernelConfigurationManager.start(KernelConfigurationManager.java:190)
>>        at
>> org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConfiguration(SimpleConfigurationManager.java:546)
>>        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at
>> org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke(ReflectionMethodInvoker.java:34)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanOperation.invoke(GBeanOperation.java:130)
>>        at
>> org.apache.geronimo.gbean.runtime.GBeanInstance.invoke(GBeanInstance.java:815)
>>        at
>> org.apache.geronimo.gbean.runtime.RawInvoker.invoke(RawInvoker.java:57)
>>        at
>> org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke(RawOperationInvoker.java:35)
>>        at
>> org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept(ProxyMethodInterceptor.java:96)
>>        at
>> org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$$628b9237.startConfiguration(<generated>)
>>        at
>> org.apache.geronimo.system.main.EmbeddedDaemon.doStartup(EmbeddedDaemon.java:161)
>>        at
>> org.apache.geronimo.system.main.EmbeddedDaemon.execute(EmbeddedDaemon.java:78)
>>        at
>> org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main(MainConfigurationBootstrapper.java:45)
>>        at
>> org.apache.geronimo.cli.AbstractCLI.executeMain(AbstractCLI.java:65)
>>        at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)
>>
>> After this failure the server stops. Over at the running instance AS-001
>> this is logged:
>>
>> 23:46:31,909 INFO  [ChannelInterceptorBase] memberAdded:tcp://{10, 0, 10,
>> 100}:4000
>> 23:46:37,929 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-001;
>> tcp://10.0.10.101:4000]] owns singleton service [PartitionManager for
>> ServiceSpace [/]]
>> 23:46:37,929 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:46:38,438 ERROR [BasicEnvelopeDispatcherManager] problem dispatching
>> message
>> java.lang.IllegalArgumentException:
>> org.codehaus.wadi.core.store.BasicStoreMotable is not a Session
>>        at
>> org.codehaus.wadi.replication.manager.basic.SessionStateHandler.newExtractFullStateExternalizable(SessionStateHandler.java:105)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SessionStateHandler.extractFullState(SessionStateHandler.java:53)
>>        at
>> org.codehaus.wadi.replication.manager.basic.CreateStorageCommand.execute(CreateStorageCommand.java:45)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:169)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:114)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:103)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondariesFollowingJoiningPeer(SyncSecondaryManager.java:75)
>>        at
>> org.codehaus.wadi.replication.manager.basic.ReOrganizeSecondariesListener.receive(ReOrganizeSecondariesListener.java:53)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.notifyListeners(BasicServiceMonitor.java:124)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.processLifecycleEvent(BasicServiceMonitor.java:141)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceMonitor$ServiceLifecycleEndpoint.dispatch(BasicServiceMonitor.java:148)
>>        at
>> org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch(ServiceEndpointWrapper.java:50)
>>        at
>> org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceSpaceDispatcher$ExecuteInThread.execute(BasicServiceSpaceDispatcher.java:102)
>>        at
>> org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager.onEnvelope(BasicEnvelopeDispatcherManager.java:100)
>>        at
>> org.codehaus.wadi.group.impl.AbstractDispatcher.doOnEnvelope(AbstractDispatcher.java:104)
>>        at
>> org.codehaus.wadi.group.impl.AbstractDispatcher.onEnvelope(AbstractDispatcher.java:100)
>>        at
>> org.codehaus.wadi.servicespace.basic.ServiceSpaceEndpoint.dispatch(ServiceSpaceEndpoint.java:49)
>>        at
>> org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch(ServiceEndpointWrapper.java:50)
>>        at
>> org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:619)
>> 23:46:40,063 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:46:42,938 WARN  [BasicPartitionBalancerSingletonService] Rebalancing
>> has failed
>> org.codehaus.wadi.group.MessageExchangeException: No correlated messages
>> received within [5000]ms
>>        at
>> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>>        at
>> org.codehaus.wadi.location.balancing.BasicPartitionBalancer.fetchBalancingInfoState(BasicPartitionBalancer.java:85)
>>        at
>> org.codehaus.wadi.location.balancing.BasicPartitionBalancer.balancePartitions(BasicPartitionBalancer.java:69)
>>        at
>> org.codehaus.wadi.location.balancing.BasicPartitionBalancerSingletonService.run(BasicPartitionBalancerSingletonService.java:85)
>>        at java.lang.Thread.run(Thread.java:619)
>> 23:46:42,939 WARN  [BasicPartitionBalancerSingletonService] Will retry
>> rebalancing in [500] ms
>> 23:46:43,439 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:47:40,269 INFO  [TcpFailureDetector] Verification complete. Member
>> disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0,
>> 10, 100}:4000,{10, 0, 10, 100},4000, alive=69401,id={-103 34 80 -19 68 -51
>> 70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79 84
>> 84 65 95 87 65 68 ...(10)}, ]]
>> 23:47:40,271 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{10,
>> 0, 10, 100}:4000
>> 23:47:40,271 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>>
>> If I try to start AS-000 again the same thing happens. If we stop AS-001
>> the following is logged;
>>
>> 23:49:18,695 INFO  [SimpleStateManager] Evacuating partitions
>> 23:49:18,699 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:49:23,698 WARN  [SimpleStateManager] Partition balancer has disappeared
>> - backing off for [1000]ms
>> 23:49:24,699 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:49:29,698 WARN  [SimpleStateManager] Partition balancer has disappeared
>> - backing off for [1000]ms
>> 23:49:30,699 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:49:35,699 WARN  [SimpleStateManager] Partition balancer has disappeared
>> - backing off for [1000]ms
>> 23:49:36,700 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:49:41,699 WARN  [SimpleStateManager] Partition balancer has disappeared
>> - backing off for [1000]ms
>> 23:49:42,701 INFO  [BasicPartitionBalancerSingletonService] Queueing
>> partition rebalancing
>> 23:49:47,700 WARN  [SimpleStateManager] Partition balancer has disappeared
>> - backing off for [1000]ms
>> 23:49:48,700 INFO  [SimpleStateManager] Evacuated
>> 23:49:48,808 INFO  [AbstractExclusiveContextualiser] Unloaded
>> sessions=[36]
>> 23:49:48,843 INFO  [AbstractExclusiveContextualiser] Unloaded
>> sessions=[13]
>> 23:49:58,852 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-001;
>> tcp://10.0.10.101:4000]] resigns ownership of singleton service
>> [PartitionManager for ServiceSpace [/]]
>>
>> However, AS-001 then just hangs, and we have to kill the process to get
>> it stopped. After this we can start AS-000, update AS-001 and it always
>> seems to have no problem joining the cluster thereafter. The strange thing
>> is that this problem does not always occur, sometimes everything goes fine.
>> I can't find a consistent pattern but I've tried stopping AS-001 before
>> AS-000, and I'm sure no serializable object in the session has changed
>> between the updated and running instance.
>>
>> My gut feeling is that this is either a concurrency-related bug in WADI or
>> a network-timeout related problem. During normal operation I'm sometimes
>> seeing messages like this in the log files:
>>
>> 17:14:08,869 INFO  [TcpFailureDetector] Received
>> memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10,
>> 0, 10, 101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22 10
>> 71 76 -72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5 115 114 0
>> 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)},
>> ]] message. Will verify.
>> 17:14:08,870 INFO  [TcpFailureDetector] Verification complete. Member
>> still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0,
>> 10, 101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22 10 71
>> 76 -72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5 115 114 0 50
>> 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>> And lately, as traffic has increased, errors like this:
>>
>> 16:22:43,524 WARN  [UpdateReplicationCommand] Update has not been properly
>> cascaded due to a communication failure. If a targeted node has been lost,
>> state will be re-balanced automatically.
>> org.codehaus.wadi.servicespace.ServiceInvocationException:
>> org.codehaus.wadi.group.MessageExchangeException: No correlated messages
>> received within [2000]ms
>>        at
>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
>>        at
>> org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$a901e91b.mergeUpdate(<generated>)
>>        at
>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
>>        at
>> org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
>>        at
>> org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
>>        at
>> org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
>>        at
>> org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
>>        at
>> org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
>>        at
>> org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
>>        at
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
>>        at
>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
>>        at
>> org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
>>        at
>> org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
>>        at
>> org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
>>        at
>> org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
>>        at
>> org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:150)
>>        at
>> org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:142)
>>        at
>> org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
>>        at
>> org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
>>        at
>> org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
>>        at
>> org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
>>        at
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>        at
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
>>        at
>> org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
>>        at
>> org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
>>        at
>> org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
>>        at
>> org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
>>        at
>> org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
>>        at
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>        at
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>        at
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>        at
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>        at
>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
>>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>        at
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>        at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
>>        at
>> org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:619)
>> Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated
>> messages received within [2000]ms
>>        at
>> org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
>>        at
>> org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
>>        at
>> org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
>>        ... 49 more
>>
>> Does anyone have some insight into what might be causing this, or where
>> the timeouts can be increased if there are any?
>>
>> I'm thinking that a static WADI configuration can be more stable than the
>> dynamic setup we have now, which relies upon multicasting. Does anyone have
>> experience with similar setups?
>>
>> Thanks!
>>
>> Trygve Hardersen - Jotta
>>
>>
>>
>>
>>
>

Re: 2.2 in production

Posted by Gianny Damour <gi...@optusnet.com.au>.
Hi Trygve,

Thanks for reporting these problems with comprehensive log messages;  
they were very helpful to diagnose.

The bug causing the inability to reliably restart without downtime  
has been identified and fixed. I just deployed new WADI 2.2-SNAPSHOT  
artifacts, which can be found here:

http://snapshots.repository.codehaus.org/org/codehaus/wadi/


Can you please confirm that this is now working as expected within  
your environment?


Regarding the warning message

> 16:22:43,524 WARN  [UpdateReplicationCommand] Update has not been  
> properly cascaded due to a communication failure. If a targeted  
> node has been lost, state will be re-balanced automatically.

the timeout is not exposed via API. Even if this warning message is  
displayed, an update message has actually been queued for delivery to  
the relevant back-up nodes. However, there is no guarantee that in  
case of failure, the cluster will be able to restore the latest  
session state as the latest update message may not have actually been  
delivered to back-up nodes.
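
To make the semantics concrete, here is a minimal sketch of that
behaviour. It is illustrative only: BackupPeer and CascadeUpdateSketch
are made-up names, not the actual WADI classes.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CascadeUpdateSketch {

    /** Hypothetical stand-in for a peer holding a session replica. */
    public interface BackupPeer {
        void applyUpdate(byte[] sessionState) throws Exception;
    }

    private final ExecutorService sender = Executors.newCachedThreadPool();

    /**
     * Queues the update for every back-up peer, then waits at most timeoutMs
     * for acknowledgements. A timeout only means the acks did not arrive in
     * time; the updates themselves have already been handed off for delivery.
     */
    public boolean cascadeUpdate(List<BackupPeer> backups, final byte[] state,
            long timeoutMs) throws InterruptedException {
        final CountDownLatch acks = new CountDownLatch(backups.size());
        for (final BackupPeer peer : backups) {
            sender.submit(new Runnable() {
                public void run() {
                    try {
                        peer.applyUpdate(state);
                        acks.countDown();
                    } catch (Exception e) {
                        // a lost peer never acknowledges; rebalancing handles it later
                    }
                }
            });
        }
        boolean allAcked = acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        if (!allAcked) {
            System.err.println("WARN: update has not been properly cascaded within "
                    + timeoutMs + "ms");
        }
        return allAcked;
    }
}

The important point is that the updates are submitted before the bounded
await, so a timed-out acknowledgement does not mean the update was
discarded, only that its delivery could not be confirmed.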

I conducted load-testing on a single computer. As the network traffic  
going through the loopback interface does not actually touch the  
network card, this is a problem I was never able to observe.

Do you have an idea of the traffic volume when this warning message  
starts to appear? Also, do you have an idea of the session size?

Thanks,
Gianny

On 31/10/2009, at 3:39 AM, Trygve Hardersen wrote:

> Hello
>
> We have been using Geronimo 2.2-SNAPSHOT in production for a good  
> month now, and I thought I'd share some experiences with the  
> community, and maybe get some help. We are an online backup  
> service, check out jottabackup.com if you're interested.
>
> Generally our experience has been very positive. We're using the  
> GBean framework for custom server components, the DB connection  
> pools against MySQL databases, stateless service EJBs and various  
> MDB, and of course the web tier (Jetty). Everything is running  
> smoothly and we're very much looking forward to 2.2 being released  
> so we can "release" our own software.
>
> The issues we're having are related to WADI clustering with Jetty.  
> First we can't use Jetty7 because of GERONIMO-4846, so we're using  
> Jetty6 which works fine. The more serious issue is that we often  
> can not update our servers without downtime. This is what happens:
>
> We have 2 application servers (AS-000 and AS-001) running dynamic  
> WADI HTTP session replication between them. When updating we first  
> stop one, AS-000 in this case. That works fine and the active  
> sessions are migrated to AS-001:
>
> 23:43:18,160 INFO  [SimpleStateManager]
> =============================
> New Partition Balancing
> Partition Balancing
>     Size [24]
>     Partition[0] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[1] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[2] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[3] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[4] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[5] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[6] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[7] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[8] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[9] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[10] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[11] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[12] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[13] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[14] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[15] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[16] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[17] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[18] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[19] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[20] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[21] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[22] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
>     Partition[23] owned by [TribesPeer [AS-001; tcp:// 
> 10.0.10.101:4000]]; version [3]; mergeVersion [0]
> =============================
>
> 23:43:28,539 INFO  [TcpFailureDetector] Verification complete.  
> Member disappeared[org.apache.catalina.tribes.membership.MemberImpl 
> [tcp://{10, 0, 10, 100}:4000,{10, 0, 10, 100},4000,  
> alive=41104531,id={-4 -32 54 90 -109 -17 65 64 -117 40 -110 -14 36  
> 93 -12 -118 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)},  
> command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79 84 84 65  
> 95 87 65 68 ...(10)}, ]]
> 23:43:28,540 INFO  [ChannelInterceptorBase] memberDisappeared:tcp:// 
> {10, 0, 10, 100}:4000
>
> We then update AS-000 and try to start it, but it fails to rejoin  
> the WADI cluster:
>
> 23:46:30,784 INFO  [ReceiverBase] Receiver Server Socket bound to:/ 
> 10.0.10.100:4000
> 23:46:30,864 INFO  [ChannelInterceptorBase] memberStart  
> local:org.apache.catalina.tribes.membership.MemberImpl[tcp:// 
> 10.0.10.100:4000,10.0.10.100,4000, alive=0,id={-103 34 80 -19 68  
> -51 70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5  
> 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95  
> 87 65 68 ...(10)}, ] notify:false peer:AS-000
> 23:46:30,868 INFO  [McastService] Setting cluster mcast soTimeout  
> to 500
> 23:46:30,908 INFO  [McastService] Sleeping for 1000 milliseconds to  
> establish cluster membership, start level:4
> 23:46:31,139 INFO  [ChannelInterceptorBase] memberAdded:tcp://{10,  
> 0, 10, 101}:4000
> 23:46:31,908 INFO  [McastService] Done sleeping, membership  
> established, start level:4
> 23:46:31,912 INFO  [McastService] Sleeping for 1000 milliseconds to  
> establish cluster membership, start level:8
> 23:46:31,927 INFO  [BufferPool] Created a buffer pool with max size: 
> 104857600 bytes of type:org.apache.catalina.tribes.io.BufferPool15Impl
> 23:46:32,912 INFO  [McastService] Done sleeping, membership  
> established, start level:8
> 23:46:32,912 INFO  [ChannelInterceptorBase] memberStart  
> local:org.apache.catalina.tribes.membership.MemberImpl[tcp:// 
> 10.0.10.100:4000,10.0.10.100,4000, alive=272,id={-103 34 80 -19 68  
> -51 70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5  
> 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95  
> 87 65 68 ...(10)}, ] notify:false peer:AS-000
> 23:46:37,848 INFO  [DiscStore] Creating directory: /usr/lib/jotta/ 
> jotta-as-prod-1.0-SNAPSHOT/var/temp/SessionStore
> 23:46:37,930 INFO  [BasicSingletonServiceHolder] [TribesPeer  
> [AS-000; tcp://10.0.10.100:4000]] owns singleton service  
> [PartitionManager for ServiceSpace [/]]
> 23:46:37,964 INFO  [BasicSingletonServiceHolder] [TribesPeer  
> [AS-000; tcp://10.0.10.100:4000]] resigns ownership of singleton  
> service [PartitionManager for ServiceSpace [/]]
> 23:47:40,065 ERROR [BasicServiceRegistry] Error while starting  
> [Holder for service  
> [org.codehaus.wadi.location.partitionmanager.SimplePartitionManager@7d 
> c2445f] named [PartitionManager] in space [ServiceSpace [/]]]
> org.codehaus.wadi.location.partitionmanager.PartitionManagerException: 
>  Partition [0] is unknown.
> 	at  
> org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.wai 
> tForBoot(SimplePartitionManager.java:248)
> 	at  
> org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.sta 
> rt(SimplePartitionManager.java:119)
> 	at org.codehaus.wadi.servicespace.basic.BasicServiceHolder.start 
> (BasicServiceHolder.java:60)
> 	at org.codehaus.wadi.servicespace.basic.BasicServiceRegistry.start 
> (BasicServiceRegistry.java:152)
> 	at org.codehaus.wadi.servicespace.basic.BasicServiceSpace.start 
> (BasicServiceSpace.java:169)
> 	at  
> org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doStart 
> (BasicWADISessionManager.java:125)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance 
> (GBeanInstance.java:948)
> 	at  
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart( 
> GBeanInstanceState.java:269)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstanceState.start 
> (GBeanInstanceState.java:103)
> 	at  
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive 
> (GBeanInstanceState.java:125)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive 
> (GBeanInstance.java:538)
> 	at org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean 
> (BasicKernel.java:377)
> 	at  
> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration 
> GBeans(ConfigurationUtil.java:456)
> 	at  
> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration 
> GBeans(ConfigurationUtil.java:493)
> 	at  
> org.apache.geronimo.kernel.config.KernelConfigurationManager.start 
> (KernelConfigurationManager.java:190)
> 	at  
> org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConf 
> iguration(SimpleConfigurationManager.java:546)
> 	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke 
> (ReflectionMethodInvoker.java:34)
> 	at org.apache.geronimo.gbean.runtime.GBeanOperation.invoke 
> (GBeanOperation.java:130)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstance.invoke 
> (GBeanInstance.java:815)
> 	at org.apache.geronimo.gbean.runtime.RawInvoker.invoke 
> (RawInvoker.java:57)
> 	at org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke 
> (RawOperationInvoker.java:35)
> 	at  
> org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept 
> (ProxyMethodInterceptor.java:96)
> 	at org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$ 
> $628b9237.startConfiguration(<generated>)
> 	at org.apache.geronimo.system.main.EmbeddedDaemon.doStartup 
> (EmbeddedDaemon.java:161)
> 	at org.apache.geronimo.system.main.EmbeddedDaemon.execute 
> (EmbeddedDaemon.java:78)
> 	at  
> org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main 
> (MainConfigurationBootstrapper.java:45)
> 	at org.apache.geronimo.cli.AbstractCLI.executeMain 
> (AbstractCLI.java:65)
> 	at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)
> 23:47:40,078 ERROR [BasicWADISessionManager] Failed to stop
> org.codehaus.wadi.servicespace.ServiceSpaceNotFoundException:  
> ServiceSpaceName not found
> 	at  
> org.codehaus.wadi.servicespace.basic.BasicServiceSpaceRegistry.unregis 
> ter(BasicServiceSpaceRegistry.java:55)
> 	at  
> org.codehaus.wadi.servicespace.basic.BasicServiceSpace.unregisterServi 
> ceSpace(BasicServiceSpace.java:228)
> 	at org.codehaus.wadi.servicespace.basic.BasicServiceSpace.stop 
> (BasicServiceSpace.java:175)
> 	at  
> org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doFail 
> (BasicWADISessionManager.java:134)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance 
> (GBeanInstance.java:978)
> 	at  
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart( 
> GBeanInstanceState.java:269)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstanceState.start 
> (GBeanInstanceState.java:103)
> 	at  
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive 
> (GBeanInstanceState.java:125)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive 
> (GBeanInstance.java:538)
> 	at org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean 
> (BasicKernel.java:377)
> 	at  
> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration 
> GBeans(ConfigurationUtil.java:456)
> 	at  
> org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration 
> GBeans(ConfigurationUtil.java:493)
> 	at  
> org.apache.geronimo.kernel.config.KernelConfigurationManager.start 
> (KernelConfigurationManager.java:190)
> 	at  
> org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConf 
> iguration(SimpleConfigurationManager.java:546)
> 	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke 
> (ReflectionMethodInvoker.java:34)
> 	at org.apache.geronimo.gbean.runtime.GBeanOperation.invoke 
> (GBeanOperation.java:130)
> 	at org.apache.geronimo.gbean.runtime.GBeanInstance.invoke 
> (GBeanInstance.java:815)
> 	at org.apache.geronimo.gbean.runtime.RawInvoker.invoke 
> (RawInvoker.java:57)
> 	at org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke 
> (RawOperationInvoker.java:35)
> 	at  
> org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept 
> (ProxyMethodInterceptor.java:96)
> 	at org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$ 
> $628b9237.startConfiguration(<generated>)
> 	at org.apache.geronimo.system.main.EmbeddedDaemon.doStartup 
> (EmbeddedDaemon.java:161)
> 	at org.apache.geronimo.system.main.EmbeddedDaemon.execute 
> (EmbeddedDaemon.java:78)
> 	at  
> org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main 
> (MainConfigurationBootstrapper.java:45)
> 	at org.apache.geronimo.cli.AbstractCLI.executeMain 
> (AbstractCLI.java:65)
> 	at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)
>
> After this failure the server stops. Over at the running instance  
> AS-001 this is logged:
>
> 23:46:31,909 INFO  [ChannelInterceptorBase] memberAdded:tcp://{10,  
> 0, 10, 100}:4000
> 23:46:37,929 INFO  [BasicSingletonServiceHolder] [TribesPeer  
> [AS-001; tcp://10.0.10.101:4000]] owns singleton service  
> [PartitionManager for ServiceSpace [/]]
> 23:46:37,929 INFO  [BasicPartitionBalancerSingletonService]  
> Queueing partition rebalancing
> 23:46:38,438 ERROR [BasicEnvelopeDispatcherManager] problem  
> dispatching message
> java.lang.IllegalArgumentException:  
> org.codehaus.wadi.core.store.BasicStoreMotable is not a Session
> 	at  
> org.codehaus.wadi.replication.manager.basic.SessionStateHandler.newExt 
> ractFullStateExternalizable(SessionStateHandler.java:105)
> 	at  
> org.codehaus.wadi.replication.manager.basic.SessionStateHandler.extrac 
> tFullState(SessionStateHandler.java:53)
> 	at  
> org.codehaus.wadi.replication.manager.basic.CreateStorageCommand.execu 
> te(CreateStorageCommand.java:45)
> 	at  
> org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updat 
> eSecondaries(SyncSecondaryManager.java:169)
> 	at  
> org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:114)
>     at org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondaries(SyncSecondaryManager.java:103)
>     at org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updateSecondariesFollowingJoiningPeer(SyncSecondaryManager.java:75)
>     at org.codehaus.wadi.replication.manager.basic.ReOrganizeSecondariesListener.receive(ReOrganizeSecondariesListener.java:53)
>     at org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.notifyListeners(BasicServiceMonitor.java:124)
>     at org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.processLifecycleEvent(BasicServiceMonitor.java:141)
>     at org.codehaus.wadi.servicespace.basic.BasicServiceMonitor$ServiceLifecycleEndpoint.dispatch(BasicServiceMonitor.java:148)
>     at org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch(ServiceEndpointWrapper.java:50)
>     at org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)
>     at org.codehaus.wadi.servicespace.basic.BasicServiceSpaceDispatcher$ExecuteInThread.execute(BasicServiceSpaceDispatcher.java:102)
>     at org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager.onEnvelope(BasicEnvelopeDispatcherManager.java:100)
>     at org.codehaus.wadi.group.impl.AbstractDispatcher.doOnEnvelope(AbstractDispatcher.java:104)
>     at org.codehaus.wadi.group.impl.AbstractDispatcher.onEnvelope(AbstractDispatcher.java:100)
>     at org.codehaus.wadi.servicespace.basic.ServiceSpaceEndpoint.dispatch(ServiceSpaceEndpoint.java:49)
>     at org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch(ServiceEndpointWrapper.java:50)
>     at org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> 23:46:40,063 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:46:42,938 WARN  [BasicPartitionBalancerSingletonService] Rebalancing has failed
> org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [5000]ms
>     at org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>     at org.codehaus.wadi.location.balancing.BasicPartitionBalancer.fetchBalancingInfoState(BasicPartitionBalancer.java:85)
>     at org.codehaus.wadi.location.balancing.BasicPartitionBalancer.balancePartitions(BasicPartitionBalancer.java:69)
>     at org.codehaus.wadi.location.balancing.BasicPartitionBalancerSingletonService.run(BasicPartitionBalancerSingletonService.java:85)
>     at java.lang.Thread.run(Thread.java:619)
> 23:46:42,939 WARN  [BasicPartitionBalancerSingletonService] Will retry rebalancing in [500] ms
> 23:46:43,439 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:47:40,269 INFO  [TcpFailureDetector] Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0, 10, 100}:4000,{10, 0, 10, 100},4000, alive=69401,id={-103 34 80 -19 68 -51 70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
> 23:47:40,271 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{10, 0, 10, 100}:4000
> 23:47:40,271 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
>
> If I then try to start AS-000 again, the same thing happens. If we
> instead stop AS-001, the following is logged:
>
> 23:49:18,695 INFO  [SimpleStateManager] Evacuating partitions
> 23:49:18,699 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:49:23,698 WARN  [SimpleStateManager] Partition balancer has disappeared - backing off for [1000]ms
> 23:49:24,699 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:49:29,698 WARN  [SimpleStateManager] Partition balancer has disappeared - backing off for [1000]ms
> 23:49:30,699 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:49:35,699 WARN  [SimpleStateManager] Partition balancer has disappeared - backing off for [1000]ms
> 23:49:36,700 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:49:41,699 WARN  [SimpleStateManager] Partition balancer has disappeared - backing off for [1000]ms
> 23:49:42,701 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
> 23:49:47,700 WARN  [SimpleStateManager] Partition balancer has disappeared - backing off for [1000]ms
> 23:49:48,700 INFO  [SimpleStateManager] Evacuated
> 23:49:48,808 INFO  [AbstractExclusiveContextualiser] Unloaded sessions=[36]
> 23:49:48,843 INFO  [AbstractExclusiveContextualiser] Unloaded sessions=[13]
> 23:49:58,852 INFO  [BasicSingletonServiceHolder] [TribesPeer [AS-001; tcp://10.0.10.101:4000]] resigns ownership of singleton service [PartitionManager for ServiceSpace [/]]
>
> However, AS-001 then just hangs, and we have to kill the process to
> get it stopped. After that we can start AS-000 and update AS-001, and
> it always seems to rejoin the cluster without problems thereafter. The
> strange thing is that this problem does not always occur; sometimes
> everything goes fine. I can't find a consistent pattern. I have tried
> stopping AS-001 before AS-000, and I am sure no serializable object in
> the session has changed between the updated and the running instance.
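>
> To be concrete about the serialization point: what I watch for is that
> every object we put into the session stays Serializable and keeps a
> fixed serialVersionUID, so a node running the old build and a node
> running the new build can still exchange replicated copies during a
> rolling update. Here is a minimal sketch of what I mean (the class is
> made up for illustration, it is not from our application):
>
>     import java.io.Serializable;
>
>     // Hypothetical session attribute. Pinning serialVersionUID means the
>     // old and the new build agree on the serialized form's version, so a
>     // session replicated from an updated node can still be deserialized
>     // on a node that has not been updated yet (and vice versa), as long
>     // as the field changes themselves remain compatible.
>     public class UserSessionState implements Serializable {
>         private static final long serialVersionUID = 1L;
>
>         private int loginCount;
>
>         public void recordLogin() {
>             loginCount++;
>         }
>
>         public int getLoginCount() {
>             return loginCount;
>         }
>     }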
>
> My gut feeling is that this is either a concurrency-related bug in
> WADI or a network-timeout-related problem. During normal operation I
> sometimes see messages like this in the log files:
>
> 17:14:08,869 INFO  [TcpFailureDetector] Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0, 10, 101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22 10 71 76 -72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]] message. Will verify.
> 17:14:08,870 INFO  [TcpFailureDetector] Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0, 10, 101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22 10 71 76 -72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
>
> And lately, as traffic has increased, I'm seeing errors like this:
>
> 16:22:43,524 WARN  [UpdateReplicationCommand] Update has not been properly cascaded due to a communication failure. If a targeted node has been lost, state will be re-balanced automatically.
> org.codehaus.wadi.servicespace.ServiceInvocationException: org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [2000]ms
>     at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
>     at org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$a901e91b.mergeUpdate(<generated>)
>     at org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
>     at org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
>     at org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
>     at org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
>     at org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
>     at org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
>     at org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
>     at org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
>     at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
>     at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
>     at org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
>     at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
>     at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
>     at org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
>     at org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
>     at org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
>     at org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
>     at org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:150)
>     at org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:142)
>     at org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
>     at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
>     at org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
>     at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>     at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
>     at org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
>     at org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
>     at org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
>     at org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
>     at org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
>     at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>     at org.mortbay.jetty.Server.handle(Server.java:326)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>     at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
>     at org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [2000]ms
>     at org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
>     at org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
>     at org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
>     at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
>     ... 49 more
>
> Does anyone have insight into what might be causing this, or know
> where these timeouts can be configured, assuming they can be increased
> at all?
>
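> For what it's worth, my reading of the stack traces is that the failure
> comes out of a correlated request/reply rendezvous with a bounded wait
> (AbstractDispatcher.attemptMultiRendezVous), and the [2000]ms / [5000]ms
> values are simply that wait expiring. The following is only an
> illustrative sketch of that pattern, not WADI's actual code, to show how
> a slow peer (GC pause, saturated link, a member half-way through
> shutdown) produces exactly this message:
>
>     import java.util.concurrent.CountDownLatch;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.TimeUnit;
>
>     // Illustrative only: send a "request" to N peers and wait a bounded
>     // time for all correlated replies. If any peer replies later than the
>     // window, the waiter gives up, which is the moral equivalent of
>     // "No correlated messages received within [X]ms".
>     public class RendezvousSketch {
>         public static void main(String[] args) throws Exception {
>             final int peers = 2;
>             final long timeoutMillis = 2000;   // the window seen in our logs
>             final CountDownLatch replies = new CountDownLatch(peers);
>             ExecutorService pool = Executors.newFixedThreadPool(peers);
>
>             for (int i = 0; i < peers; i++) {
>                 final long replyDelay = 500L + i * 2000L; // second peer is too slow
>                 pool.submit(new Runnable() {
>                     public void run() {
>                         try {
>                             Thread.sleep(replyDelay);  // simulated peer work
>                             replies.countDown();       // correlated reply arrives
>                         } catch (InterruptedException e) {
>                             Thread.currentThread().interrupt();
>                         }
>                     }
>                 });
>             }
>
>             if (!replies.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
>                 System.out.println("No correlated messages received within ["
>                         + timeoutMillis + "]ms");
>             } else {
>                 System.out.println("All replies arrived in time");
>             }
>             pool.shutdownNow();
>         }
>     }
>
> If that reading is right, raising the timeout would only hide the
> symptom; the real question is why the peer stops answering in the
> first place.
>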
> I'm thinking that a static WADI configuration might be more stable
> than the dynamic setup we have now, which relies on multicasting (a
> rough sketch of what I mean follows below). Does anyone have
> experience with similar setups?
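>
> In case it helps the discussion, this is roughly what I mean by
> "static": instead of letting Tribes discover members over multicast,
> each node would be given an explicit list of its peers. The sketch
> below uses the raw Tribes API only to illustrate the idea; in Geronimo
> the equivalent wiring would have to go through the wadi-clustering
> configuration rather than application code, and the two addresses are
> simply our AS-000 and AS-001 boxes:
>
>     import org.apache.catalina.tribes.Channel;
>     import org.apache.catalina.tribes.group.GroupChannel;
>     import org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor;
>     import org.apache.catalina.tribes.membership.StaticMember;
>
>     // Illustrative sketch: a channel whose peers are listed statically
>     // instead of being discovered via multicast heartbeats. A real setup
>     // would also give each static member a unique id and a domain, and
>     // would disable the multicast membership service.
>     public class StaticMembershipSketch {
>         public static void main(String[] args) throws Exception {
>             GroupChannel channel = new GroupChannel();
>
>             StaticMembershipInterceptor statics = new StaticMembershipInterceptor();
>
>             StaticMember as000 = new StaticMember();
>             as000.setHost("10.0.10.100");
>             as000.setPort(4000);
>             statics.addStaticMember(as000);
>
>             StaticMember as001 = new StaticMember();
>             as001.setHost("10.0.10.101");
>             as001.setPort(4000);
>             statics.addStaticMember(as001);
>
>             channel.addInterceptor(statics);
>             channel.start(Channel.DEFAULT);
>             // ... application runs ...
>             channel.stop(Channel.DEFAULT);
>         }
>     }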
>
> Thanks!
>
> Trygve Hardersen - Jotta
>