You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by GitBox <gi...@apache.org> on 2020/08/14 18:18:11 UTC

[GitHub] [samza] Sanil15 opened a new pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Sanil15 opened a new pull request #1417:
URL: https://github.com/apache/samza/pull/1417


   **Changes:** Added a metric to <container-id>-failure-count to track failure count of a single container
   
   **API Changes:** None
   
   **Tests:** Tested the change with a yarn job deploy
   
   **Upgrade Instructions:** None
   
   **Usage Instructions:** None


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] rmatharu commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
rmatharu commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r473228119



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
##########
@@ -236,6 +237,12 @@ public void start() {
     Map<String, String> processorToHostMapping = state.jobModelManager.jobModel().getAllContainerLocality();
     containerAllocator.requestResources(processorToHostMapping);
 
+    // Initialize the per processor failure count to be 0
+    processorToHostMapping.keySet().forEach(processorId -> {

Review comment:
       See comment below on how/why this information isnt really in "state"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] Sanil15 commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
Sanil15 commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r474341307



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
##########
@@ -472,6 +479,9 @@ void onResourceCompletedWithUnknownStatus(SamzaResourceStatus resourceStatus, St
     LOG.info("Container ID: {} for Processor ID: {} failed with exit code: {}.", containerId, processorId, exitStatus);
     Instant now = Instant.now();
     state.failedContainers.incrementAndGet();
+    if (state.perProcessorFailureCount.get(processorId) != null) {
+      state.perProcessorFailureCount.get(processorId).incrementAndGet();
+    }

Review comment:
       This method is the helper to and is invoked from onResourceCompleted(...) which does the check for processorId to be legit, remeber that we also get redundant notifications so we cannot declare a container orphan / unknown, we need more testing to deem callback senarios as orphans and that work is beyond the scope of this change
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] cameronlee314 commented on pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
cameronlee314 commented on pull request #1417:
URL: https://github.com/apache/samza/pull/1417#issuecomment-675055404


   Can you please update the PR description to include what issue/symptom you are fixing with this? It's unclear why you need a container failure metric for each container specifically instead of an aggregate container failure metric.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] Sanil15 commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
Sanil15 commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r474342960



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/SamzaApplicationState.java
##########
@@ -115,6 +115,13 @@
    */
   public final ConcurrentHashMap<String, SamzaResourceStatus> failedProcessors = new ConcurrentHashMap<>(0);
 
+
+  /**
+   *  Map of the Samza processor ID to the count of failed attempts
+   *  Modified by AMRMCallbackThread
+   */
+  public final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>(0);
+

Review comment:
       Check SamzaApplicationState most of the state there is just used for metric emissions




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] f3flight commented on pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
f3flight commented on pull request #1417:
URL: https://github.com/apache/samza/pull/1417#issuecomment-675202391


   @cameronlee314 this is needed to be able to track individual container health issues and make informed ops decisions based on that data, this is useful for both containers with host affinity to detect unstable hosts, as well as containers w/o affinity - to detect issues caused by partitioning (i.e. when specific traffic goes to certain containers and causes instability from time to time).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] rmatharu commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
rmatharu commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r473215075



##########
File path: docs/learn/documentation/versioned/operations/monitoring.md
##########
@@ -369,6 +369,7 @@ All \<system\>, \<stream\>, \<partition\>, \<store-name\>, \<topic\>, are popula
 | | expired-preferred-host-requests | Number of expired resource-requests-for -preferred-host received by the cluster manager. |
 | | expired-any-host-requests | Number of expired resource-requests-for -any-host received by the cluster manager. |
 | | host-affinity-match-pct | Percentage of non-expired preferred host requests. This measures the % of resource-requests for which host-affinity provided the preferred host. |
+| | \<containerId\>-failure-count | Number of times a container identified by containerId has failed |

Review comment:
       I believe we decided to use "processorId" for 0,1,2.. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] Sanil15 commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
Sanil15 commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r474342670



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
##########
@@ -236,6 +237,12 @@ public void start() {
     Map<String, String> processorToHostMapping = state.jobModelManager.jobModel().getAllContainerLocality();
     containerAllocator.requestResources(processorToHostMapping);
 
+    // Initialize the per processor failure count to be 0
+    processorToHostMapping.keySet().forEach(processorId -> {

Review comment:
       replied there




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] rmatharu commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
rmatharu commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r473227782



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/SamzaApplicationState.java
##########
@@ -115,6 +115,13 @@
    */
   public final ConcurrentHashMap<String, SamzaResourceStatus> failedProcessors = new ConcurrentHashMap<>(0);
 
+
+  /**
+   *  Map of the Samza processor ID to the count of failed attempts
+   *  Modified by AMRMCallbackThread
+   */
+  public final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>(0);
+

Review comment:
       If this information is only useful for metric-emission, does it need to be stored in "state" ?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] Sanil15 commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
Sanil15 commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r474340146



##########
File path: docs/learn/documentation/versioned/operations/monitoring.md
##########
@@ -369,6 +369,7 @@ All \<system\>, \<stream\>, \<partition\>, \<store-name\>, \<topic\>, are popula
 | | expired-preferred-host-requests | Number of expired resource-requests-for -preferred-host received by the cluster manager. |
 | | expired-any-host-requests | Number of expired resource-requests-for -any-host received by the cluster manager. |
 | | host-affinity-match-pct | Percentage of non-expired preferred host requests. This measures the % of resource-requests for which host-affinity provided the preferred host. |
+| | \<containerId\>-failure-count | Number of times a container identified by containerId has failed |

Review comment:
       that lingo is used internally in code as the naming conventions for javadocs, this is public-facing metrics page where we do not need to have context between processorId and containerId




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] Sanil15 commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
Sanil15 commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r474342587



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/SamzaApplicationState.java
##########
@@ -115,6 +115,13 @@
    */
   public final ConcurrentHashMap<String, SamzaResourceStatus> failedProcessors = new ConcurrentHashMap<>(0);
 
+
+  /**
+   *  Map of the Samza processor ID to the count of failed attempts
+   *  Modified by AMRMCallbackThread
+   */
+  public final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>(0);
+

Review comment:
       That is correct, we can directly wire metrics registry in ContainerManager, ContainerAllocator and instantiate new guage and counters in the code but all metrics related to AM are under this ContainerProcessManagerMetrics class which holds MetricsRegistry and SamzaApplicationState, so once does not need to wire MetricsRegistry individually to each AM class ContainerManager, ContainerAllocator. This is the justification for maintained this state variable to wire metrics, I feel this approach is cleaner 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [samza] rmatharu commented on a change in pull request #1417: SAMZA-2582: Add a metric to track container failure tracking metric for Samza

Posted by GitBox <gi...@apache.org>.
rmatharu commented on a change in pull request #1417:
URL: https://github.com/apache/samza/pull/1417#discussion_r473226673



##########
File path: samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
##########
@@ -472,6 +479,9 @@ void onResourceCompletedWithUnknownStatus(SamzaResourceStatus resourceStatus, St
     LOG.info("Container ID: {} for Processor ID: {} failed with exit code: {}.", containerId, processorId, exitStatus);
     Instant now = Instant.now();
     state.failedContainers.incrementAndGet();
+    if (state.perProcessorFailureCount.get(processorId) != null) {
+      state.perProcessorFailureCount.get(processorId).incrementAndGet();
+    }

Review comment:
       else {
   Log.error("Unknown/orphan container") ??
   }




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org