Posted to gitbox@activemq.apache.org by GitBox <gi...@apache.org> on 2021/02/06 05:46:10 UTC

[GitHub] [activemq-artemis] franz1981 edited a comment on pull request #3392: ARTEMIS-3045 ReplicationManager can batch sent replicated packets

franz1981 edited a comment on pull request #3392:
URL: https://github.com/apache/activemq-artemis/pull/3392#issuecomment-774405162


   @clebertsuconic @michaelandrepearce @jbertram @brusdev @gtully 
   This change seems to perform best, but I cannot say I am satisfied, because I see a problem in both this and the original implementation, i.e. we are not propagating back-pressure back to the replicated journal and beyond.
   This reminds me of https://cassandra.apache.org/blog/2020/09/03/improving-resiliency.html for whoever is interested.
   
   TLDR:
   - in the master implementation we can make the broker go OOM by adding too many Runnables to the replication stream, because it can block awaiting Netty writability for 30 seconds and stop consuming the tasks
   - in this PR's first implementation the JCTools queue of outstanding packet requests can grow unbounded for the same reason and make the broker go OOM again
   - in this PR's last implementation the Netty internal outbound (off-heap) buffer can grow unbounded (see the writability sketch right after this list)
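   
   To make the last point concrete, here is a minimal sketch of how growth of the outbound buffer could be detected through Netty's writability callbacks; the class, handler name and water mark values are hypothetical, not code from this PR:
   
   ```java
   import io.netty.channel.Channel;
   import io.netty.channel.ChannelHandlerContext;
   import io.netty.channel.ChannelInboundHandlerAdapter;
   import io.netty.channel.WriteBufferWaterMark;
   
   // Hypothetical handler: reacts to Netty writability changes instead of
   // letting the outbound (off-heap) buffer grow unbounded.
   public class ReplicationWritabilityHandler extends ChannelInboundHandlerAdapter {
   
      private volatile boolean writable = true;
   
      public static ReplicationWritabilityHandler install(Channel channel) {
         // illustrative water marks: unwritable above 128 KiB pending, writable again below 64 KiB
         channel.config().setWriteBufferWaterMark(new WriteBufferWaterMark(64 * 1024, 128 * 1024));
         ReplicationWritabilityHandler handler = new ReplicationWritabilityHandler();
         channel.pipeline().addLast("replication-writability", handler);
         return handler;
      }
   
      @Override
      public void channelWritabilityChanged(ChannelHandlerContext ctx) {
         // becomes false once the pending outbound bytes exceed the high water mark
         writable = ctx.channel().isWritable();
         ctx.fireChannelWritabilityChanged();
      }
   
      // the replication producer could consult this before enqueueing more packets
      public boolean isWritable() {
         return writable;
      }
   }
   ```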
   
   Although the first 2 solutions seem better at first look, because they wait for enough room in the Netty buffer, we can still get OOM under the same circumstances, i.e. while awaiting the backup to catch up.
   
   We should:
   - track the amount of pending work for monitoring
   - try to propagate back-pressure all the way back to the clients (a rough sketch follows this list)
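   
   Something like the following could cover both points; the names and threshold are made up just to show the idea of a counter that is exported as a metric and doubles as a back-pressure signal:
   
   ```java
   import java.util.concurrent.atomic.AtomicLong;
   
   // Hypothetical tracker: counts bytes of replication packets that have been
   // submitted but not yet flushed to the backup, so the amount of pending work
   // can be exported as a metric and used as a back-pressure signal.
   public class PendingReplicationTracker {
   
      private final AtomicLong pendingBytes = new AtomicLong();
      private final long backPressureThreshold;
   
      public PendingReplicationTracker(long backPressureThreshold) {
         this.backPressureThreshold = backPressureThreshold;
      }
   
      public void onPacketSubmitted(int encodedSize) {
         pendingBytes.addAndGet(encodedSize);
      }
   
      public void onPacketFlushed(int encodedSize) {
         pendingBytes.addAndGet(-encodedSize);
      }
   
      // value to expose through metrics/JMX for monitoring
      public long pendingBytes() {
         return pendingBytes.get();
      }
   
      // true when producers (and ultimately clients) should be slowed down
      public boolean shouldApplyBackPressure() {
         return pendingBytes.get() > backPressureThreshold;
      }
   }
   ```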
   
   But given that Artemis clients are of very different types, probably we should (similar to Cassandra) set auto-read to false on client connections, although that means we rely on clients to save themselves from OOM.
   Or, depending on the client type, we can stop sending credits back to clients to slow them down.
   At worst we could also stop accepting new client connections (but that is too drastic, because maybe they won't make the broker replicate anything). A sketch of the auto-read idea is below.
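   
   A hand-wavy sketch of the Cassandra-style auto-read approach, assuming we can reach the Netty Channel of each client connection; the helper and how the channel set is collected are hypothetical:
   
   ```java
   import io.netty.channel.Channel;
   import java.util.Set;
   
   // Hypothetical helper: pauses/resumes reads on client connections while the
   // replication stream is backed up, Cassandra-style.
   public final class ClientReadThrottle {
   
      private ClientReadThrottle() {
      }
   
      public static void pauseClients(Set<Channel> clientChannels) {
         for (Channel channel : clientChannels) {
            // stops Netty from reading more requests from this client socket;
            // TCP flow control then pushes the back-pressure to the client side
            channel.config().setAutoRead(false);
         }
      }
   
      public static void resumeClients(Set<Channel> clientChannels) {
         for (Channel channel : clientChannels) {
            channel.config().setAutoRead(true);
         }
      }
   }
   ```
   
   pauseClients could be driven by the pending-work tracker above and resumeClients by the writability callback, but again this is just to illustrate the direction.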
   
   I cannot say what the best option is here, nor whether we already use some form of end-to-end protection that I am simply not seeing, but it doesn't seem to be the case, given that many parallel clients can still overload the broker well before receiving back the notification of a durable local write + backup notification.
   Any thoughts?
   
   IMO solving this correctly can bring a huge performance increase together with improved stability.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org