Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/05/27 09:25:11 UTC

[GitHub] [couchdb] janl commented on issue #2903: CouchDB 2.3 inter node replication latency Monitor

janl commented on issue #2903:
URL: https://github.com/apache/couchdb/issues/2903#issuecomment-634540821


   For posterity, from Slack:
   
   GimDew  7:48 AM
   Hello, is there any method to monitor the replication latency between nodes in a CouchDB 2.3 cluster with replication factor 3?
   
   jan:couchdb:  8:39 AM
   By default all writes go to all nodes that need to see the data, so there is no replication involved. Only if one node gets left behind, due to an outage or a network failure, is replication used to catch up again
   
   GimDew  9:56 AM
   @jan If we have connected our data source to only a single node in the cluster, how does the data get to the other nodes in the cluster, if not through replication?
   
   jan:couchdb:  10:03 AM
   you shouldn’t do that; instead, use a load balancer to spread all incoming requests across all nodes
   each request gets sent to all participating nodes by the node that receives it from your app
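
   (For illustration, a minimal Python sketch, not from the thread: it lists the cluster nodes you would put behind such a load balancer, via the /_membership endpoint. The address and admin credentials are assumptions.)

       # Minimal sketch: list the nodes to put behind the load balancer.
       import base64
       import json
       import urllib.request

       COUCH = "http://127.0.0.1:5984"  # hypothetical node address
       AUTH = {"Authorization": "Basic "
               + base64.b64encode(b"admin:password").decode()}  # hypothetical credentials

       req = urllib.request.Request(COUCH + "/_membership", headers=AUTH)
       with urllib.request.urlopen(req) as resp:
           membership = json.load(resp)

       print("cluster_nodes:", membership["cluster_nodes"])
       print("all_nodes:    ", membership["all_nodes"])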
   
   rnewson  10:37 AM
   that's "internal replication", which happens over Erlang RPC, not HTTP
   10:38
   (i.e., it is less visible)
   10:38
   there's an endpoint/metric for the internal repl backlog that can indicate a) that you are backed up and b) when that will likely clear
   10:39
   as jan implied, each node of the cluster is both a participant in data operations (read/write/etc) and can act as a coordinator for clustered operations. in a 3 node cluster with an n=3 db, every node is a participant, including the coordinator node
   10:40
   so the latency you talk of is either a) the time it took for the coordinator to return the HTTP response (by which point all the writes have happened), or b) the internal replication backlog, if there is one. Which do you refer to?
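
   (For reference, a small Python sketch, not from the thread, of measuring a) above: the time the coordinator takes to return the HTTP response for a write. The database name "mydb", address, and credentials are assumptions.)

       # Time a single document write as seen by the client (coordinator latency).
       import base64
       import json
       import time
       import urllib.request
       import uuid

       COUCH = "http://127.0.0.1:5984"  # hypothetical coordinator address
       AUTH = {"Authorization": "Basic "
               + base64.b64encode(b"admin:password").decode()}  # hypothetical credentials

       doc_id = "latency-probe-" + uuid.uuid4().hex
       body = json.dumps({"type": "latency-probe"}).encode()
       req = urllib.request.Request(
           f"{COUCH}/mydb/{doc_id}",
           data=body,
           headers={**AUTH, "Content-Type": "application/json"},
           method="PUT",
       )

       start = time.monotonic()
       with urllib.request.urlopen(req) as resp:
           resp.read()
       print(f"coordinator write latency: {(time.monotonic() - start) * 1000:.1f} ms")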
   
   jan:couchdb:  10:45 AM
   @rnewson do we have a metric beyond "this is how many internal replications are currently ongoing"?
   10:45
   It's a nice indicator, but too coarse when it comes to "how far"
   
   GimDew  10:48 AM
   Thanks @jan !
   
   rnewson  11:00 AM
   @jan I don't think so. I'm fuzzy on the details, but the Cloudant ops team often look at this metric as a sign of cluster health (or lack thereof)
   11:01
   internal_replication_jobs from the _node/blah/_system endpoint
   11:02
   capturing and graphing is something you'd have to do elsewhere tho
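
   (For example, a minimal Python polling sketch, not from the thread, for that metric. The node name ("blah" above), address, credentials, and interval are assumptions; shipping the numbers to a graphing system is the "elsewhere" part.)

       # Poll internal_replication_jobs from the _node/{node-name}/_system endpoint.
       import base64
       import json
       import time
       import urllib.request

       COUCH = "http://127.0.0.1:5984"   # hypothetical address
       NODE = "couchdb@127.0.0.1"        # hypothetical node name
       AUTH = {"Authorization": "Basic "
               + base64.b64encode(b"admin:password").decode()}  # hypothetical credentials

       while True:
           req = urllib.request.Request(f"{COUCH}/_node/{NODE}/_system", headers=AUTH)
           with urllib.request.urlopen(req) as resp:
               stats = json.load(resp)
           print(time.strftime("%H:%M:%S"),
                 "internal_replication_jobs =", stats["internal_replication_jobs"])
           time.sleep(10)  # arbitrary polling interval; graph the values elsewhere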
   
   jan:couchdb:  11:06 AM
   Yeah, I do the same. I usually don’t need finer granularity
   11:07
   It’s likely that folks coming from other databases, where watching the replication lag closely is important for consistency, want that for CouchDB, but it makes no sense in our world
   
   rnewson  11:08 AM
   depends what you mean by consistency but yeah
   11:09
   we block for the first 2 of 3 for reads and writes, so typically couch appears consistent. There are, as you know, several circumstances that will reveal the gaps in that.
   11:09
   (saying this for the channel, not for you)
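
   (As a toy illustration of the "first 2 of 3" point, not from the thread: the coordinator waits for a majority quorum of the n copies before answering.)

       # Majority quorum for n copies of a shard.
       def quorum(n: int) -> int:
           return n // 2 + 1

       for n in (1, 2, 3):
           print(f"n={n}: coordinator waits for {quorum(n)} of {n} responses")
       # n=3 -> 2, matching "we block for the first 2 of 3".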
   
   jan:couchdb:  11:13 AM
   Right
   11:14
   In a MySQL replication scenario, folks set things up to pretend they don't have a distributed system, so having a replica be an exact copy of the primary is important. So they watch their replication lag, so they can decide for how long they can trust that assumption
   11:14
   Not advocating that this is a great approach, but it's common over there, where the data model is undistributable.
   
   rnewson  11:20 AM
   right
   11:20
   a primary/secondary system. I remember those.

