Posted to users@tomcat.apache.org by Ossi <lo...@gmail.com> on 2010/10/29 12:17:35 UTC

BackupManager vs DeltaManager

Hi!

Should BackupManager work well with any number of nodes?
And with large clusters, should it work even better than DeltaManager?

We have large production clusters (10+ nodes) and have been evaluating
whether we can use BackupManager.

In a test cluster of 6 nodes it didn't work too well: request latency was
much higher, and the logs were full of the following errors:

2010-09-24 14:17:34,536 ERROR [tomcat-processor-53] (org.apache.catalina.tribes.tipis.AbstractReplicatedMap) Unable to replicate out data for a LazyReplicatedMap.get operation
org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 1, 8, 219}:4200;
    at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:97)
    at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(PooledParallelSender.java:53)
    at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(ReplicationTransmitter.java:80)
    at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoordinator.java:78)
    at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
    at org.apache.catalina.tribes.group.interceptors.MessageDispatchInterceptor.sendMessage(MessageDispatchInterceptor.java:73)
    at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
    at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage(TcpFailureDetector.java:87)
    at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
    at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:216)
    at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:175)
    at org.apache.catalina.tribes.group.RpcChannel.send(RpcChannel.java:89)
    at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.get(AbstractReplicatedMap.java:844)
    at org.apache.catalina.session.ManagerBase.findSession(ManagerBase.java:887)
    at org.apache.catalina.connector.Request.doGetSession(Request.java:2363)
    at org.apache.catalina.connector.Request.getSession(Request.java:2098)
    at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:833)
    at javax.servlet.http.HttpServletRequestWrapper.getSession(HttpServletRequestWrapper.java:216)
    at com.sulake.habboweb.util.TomcatSessionFixationPreventerFilter$RequestWrapper.getSession(TomcatSessionFixationPreventerFilter.java:72)
.....


Yes, I know that the documentation says: "Downside of the BackupManager: not
quite as battle tested as the delta manager". Maybe this is it. :)

Regards,
Ossi

Re: BackupManager vs DeltaManager

Posted by Pid <pi...@pidster.com>.
On 01/11/2010 14:44, Ossi wrote:
> On Fri, Oct 29, 2010 at 1:51 PM, Pid <pi...@pidster.com> wrote:
> 
>> On 29/10/2010 11:17, Ossi wrote:
>>> Hi!
>>>
>>> Should BackupManager work well with any number of nodes?
>>
>> Yes.
>>
>>> And with large clusters, should it work even better than DeltaManager?
>>
>> Yes.  *Should*.
>>
>>> We have large production clusters (10+ nodes) and have been evaluating
>>> whether we can use BackupManager.
>>>
>>> In a test cluster of 6 nodes it didn't work too well: request latency was
>>> much higher, and the logs were full of the following errors:
>>>
>>> 2010-09-24 14:17:34,536 ERROR [tomcat-processor-53] (org.apache.catalina.tribes.tipis.AbstractReplicatedMap) Unable to replicate out data for a LazyReplicatedMap.get operation
>>> org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 1, 8, 219}:4200;
>>
>> It's timing out for some reason.  You could try increasing the timeout.
>>
> 
> 
> Yes, I noticed that. However, it is using the same configuration as with
> DeltaManager, and we didn't get those errors with that.

It'll be a bit tedious, but it might be beneficial to look at a tcpdump
trace of the connection traffic to see what's happening.

> What could be the reason for those timeouts? How can I find out which
> operation is causing the timeout? For example, is it happening in the
> initialization/startup phase (so it couldn't connect at all), or is
> something in replication just taking a lot of time?

BackupManager doesn't replicate to the whole cluster; it replicates each
session to one designated backup node.  It does, however, replicate the map
of where all the sessions are to the whole cluster.

Maybe it's the latter that is the problem.
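
To make the difference concrete, here is a minimal sketch (not Tomcat code)
that counts how many copies of the session data a single session update
sends under each strategy. It assumes an all-to-all scheme, as DeltaManager
is documented to use, sends the change to every other node, while a
primary/backup scheme sends it to exactly one backup node. The class and
method names are made up for illustration, and the traffic for the session
location map is deliberately not modelled.

```java
// Illustrative model only; these names are hypothetical and not part of Tomcat.
final class ReplicationFanOut {

    // All-to-all replication: every other node receives the session change.
    static int copiesAllToAll(int nodes) {
        requireNodes(nodes);
        return nodes - 1;
    }

    // Primary/backup replication: only one designated backup node receives it.
    static int copiesPrimaryBackup(int nodes) {
        requireNodes(nodes);
        return nodes > 1 ? 1 : 0;
    }

    private static void requireNodes(int nodes) {
        if (nodes < 1) {
            throw new IllegalArgumentException("a cluster needs at least one node");
        }
    }

    public static void main(String[] args) {
        // Compare the two strategies for a few cluster sizes.
        for (int nodes : new int[] {2, 6, 10}) {
            System.out.printf("%d nodes: all-to-all=%d, primary/backup=%d%n",
                    nodes, copiesAllToAll(nodes), copiesPrimaryBackup(nodes));
        }
    }
}
```

The session data fan-out stays constant in the primary/backup case, which is
why it is expected to scale better; whatever still goes to every node (the
location map) is the part worth inspecting in a tcpdump trace.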

> I'll test this with different timeouts.
>
>> Does this occur on all cluster members, or just a few?
> 
> Sorry, I don't remember; it has been a while since we ran those tests, and
> apparently the logs are gone.
> I'll check this when I test next time.

OK.  Let us know.


p

Re: BackupManager vs DeltaManager

Posted by Ossi <lo...@gmail.com>.
On Fri, Oct 29, 2010 at 1:51 PM, Pid <pi...@pidster.com> wrote:

> On 29/10/2010 11:17, Ossi wrote:
> > Hi!
> >
> > Should BackupManager work well with any number of nodes?
>
> Yes.
>
> > And with large clusters, should it work even better than DeltaManager?
>
> Yes.  *Should*.
>
> > We have large production clusters (10+ nodes) and have been evaluating
> > whether we can use BackupManager.
> >
> > In a test cluster of 6 nodes it didn't work too well: request latency was
> > much higher, and the logs were full of the following errors:
> >
> > 2010-09-24 14:17:34,536 ERROR [tomcat-processor-53] (org.apache.catalina.tribes.tipis.AbstractReplicatedMap) Unable to replicate out data for a LazyReplicatedMap.get operation
> > org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 1, 8, 219}:4200;
>
> It's timing out for some reason.  You could try increasing the timeout.
>


Yes, I noticed that. However, it is using the same configuration as with
DeltaManager, and we didn't get those errors with that.

What could be the reason for those timeouts? How can I find out which
operation is causing the timeout? For example, is it happening in the
initialization/startup phase (so it couldn't connect at all), or is
something in replication just taking a lot of time?

I'll test this with different timeouts.


>
> Does this occur on all cluster members, or just a few?
>


Sorry, I don't remember; it has been a while since we ran those tests, and
apparently the logs are gone.
I'll check this when I test next time.




Re: BackupManager vs DeltaManager

Posted by Pid <pi...@pidster.com>.
On 29/10/2010 11:17, Ossi wrote:
> Hi!
> 
> Should BackupManager work well with any number of nodes?

Yes.

> And with large clusters, should it work even better than DeltaManager?

Yes.  *Should*.

> We have large production clusters (10+ nodes) and have been evaluating
> whether we can use BackupManager.
>
> In a test cluster of 6 nodes it didn't work too well: request latency was
> much higher, and the logs were full of the following errors:
>
> 2010-09-24 14:17:34,536 ERROR [tomcat-processor-53] (org.apache.catalina.tribes.tipis.AbstractReplicatedMap) Unable to replicate out data for a LazyReplicatedMap.get operation
> org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 1, 8, 219}:4200;

It's timing out for some reason.  You could try increasing the timeout.

Does this occur on all cluster members, or just a few?
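
One way to answer that from the logs is to count how often each member is
named after "Faulty members:". The following is a rough sketch of such a
tally, assuming every error keeps the member format seen in the log line
above (tcp://{a, b, c, d}:port) on a single line. FaultyMemberTally is a
hypothetical helper, not something that ships with Tomcat.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tomcat: tallies "Faulty members" entries.
final class FaultyMemberTally {

    // Matches one member such as tcp://{10, 1, 8, 219}:4200 and captures
    // the four address parts and the port.
    private static final Pattern MEMBER = Pattern.compile(
            "tcp://\\{(\\d+), (\\d+), (\\d+), (\\d+)\\}:(\\d+)");

    // Returns a map from "a.b.c.d:port" to the number of times it was reported.
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            int start = line.indexOf("Faulty members:");
            if (start < 0) {
                continue; // not a timeout report
            }
            Matcher m = MEMBER.matcher(line.substring(start));
            while (m.find()) {
                String member = m.group(1) + "." + m.group(2) + "."
                        + m.group(3) + "." + m.group(4) + ":" + m.group(5);
                counts.merge(member, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input taken from the error quoted in this thread.
        List<String> sample = Arrays.asList(
                "out(3000 ms.).; Faulty members:tcp://{10, 1, 8, 219}:4200;");
        System.out.println(tally(sample));
    }
}
```

If a single address dominates the tally, the problem is more likely that one
node or its network path than BackupManager in general.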


p

