Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2023/01/06 15:03:08 UTC

[GitHub] [ozone] sodonnel opened a new pull request, #4152: HDDS-7695. EC metrics related to replication commands don't add up

sodonnel opened a new pull request, #4152:
URL: https://github.com/apache/ozone/pull/4152

   ## What changes were proposed in this pull request?
   
   ```
       "EcReplicationCmdsSentTotal" : 0,
       "EcDeletionCmdsSentTotal" : 259,
       "EcReplicationCmdsCompletedTotal" : 51,
       "EcDeletionCmdsCompletedTotal" : 51,
       "EcReconstructionCmdsSentTotal" : 571,
       "EcReplicationCmdsTimeoutTotal" : 765,
       "EcDeletionCmdsTimeoutTotal" : 204
   ```
   
   The total of replication commands sent is 0, yet 765 are reported as timed out.
   
   I think the code is working as intended, but it is confusing.
   
   For commands sent, we have two metrics: EcReplicationCmdsSentTotal and EcReconstructionCmdsSentTotal. However, for completion and timeout we only have EcReplicationCmdsCompletedTotal and EcReplicationCmdsTimeoutTotal; there is no reconstruction completed / timeout metric. This is because we track completion in ContainerReplicaPendingOps, and all it sees is a replica that has been scheduled to be created. It doesn't know whether a simple copy or a reconstruction is going to create it.
   
   That can explain why "EcReplicationCmdsSentTotal=0" and "EcReplicationCmdsTimeoutTotal=765" - likely all these scheduled commands were actually reconstructions, as we have 571 of those sent.
   
   Why then do we have more EC replications completed and timed out than scheduled? An EC reconstruction can create multiple new replicas in a single command. It is tracked as a single command when sent, but when it completes in pending ops, one is counted per replica. So we can schedule a reconstruction to create 2 new replicas, and we will end up with 1 command sent and 2 in EcReplicationCmdsCompletedTotal.
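   
   The mismatch can be illustrated with a minimal sketch. This is not the Ozone code: the class `EcMetricsSketch`, its fields, and its methods are hypothetical names made up for illustration. It only assumes what is described above, namely that the sent counter moves once per command while pending ops counts once per replica.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   /** Hypothetical illustration only; none of these names exist in Ozone. */
   public class EcMetricsSketch {
     long reconstructionCmdsSent;   // incremented once per command sent
     long replicasCreated;          // incremented once per replica that appears
     long replicaCreateTimeouts;    // incremented once per replica that expires
   
     // Stand-in for pending ops: it only knows which replica indexes are expected,
     // not which kind of command is going to create them.
     private final List<Integer> pendingReplicaIndexes = new ArrayList<>();
   
     /** One reconstruction command may be asked to rebuild several replica indexes. */
     void sendReconstruction(int... missingIndexes) {
       reconstructionCmdsSent++;
       for (int index : missingIndexes) {
         pendingReplicaIndexes.add(index);
       }
     }
   
     /** Called when a new replica is reported; counts one per replica. */
     void replicaReported(int index) {
       if (pendingReplicaIndexes.remove(Integer.valueOf(index))) {
         replicasCreated++;
       }
     }
   
     /** Expires everything still pending; counts one timeout per replica. */
     void expirePending() {
       replicaCreateTimeouts += pendingReplicaIndexes.size();
       pendingReplicaIndexes.clear();
     }
   
     public static void main(String[] args) {
       EcMetricsSketch metrics = new EcMetricsSketch();
       metrics.sendReconstruction(2, 4);   // a single command, two replicas
       metrics.replicaReported(2);
       metrics.replicaReported(4);
       System.out.println(metrics.reconstructionCmdsSent + " command sent, "
           + metrics.replicasCreated + " replicas created");
     }
   }
   ```
   
   Running `main` prints `1 command sent, 2 replicas created`, which is the same shape as the numbers above: the per-replica counters can exceed the per-command counter.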
   
   To make this less confusing I have renamed the "complete" metrics in this PR to be Replicas created / deleted / timed out, rather than commands.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-7695
   
   ## How was this patch tested?
   
   Existing tests should cover this, as it's just a rename of variables / methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai merged pull request #4152: HDDS-7695. EC metrics related to replication commands don't add up

Posted by GitBox <gi...@apache.org>.
adoroszlai merged PR #4152:
URL: https://github.com/apache/ozone/pull/4152