You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "mcvsubbu (via GitHub)" <gi...@apache.org> on 2023/11/23 00:31:42 UTC

[PR] Fix incorrect handling of consumer creation errors [pinot]

mcvsubbu opened a new pull request, #12045:
URL: https://github.com/apache/pinot/pull/12045

   The current handling of exceptions during creation of a consumer is incorrect, since the ExternalView for the segment remains in ERROR state while we specify the IdealState to be OFFLINE. This happens for one replica, while other replicas may consume fine and reach ONLINE state eventually. At that time, however, the particular segment that had problem consuming is not able to transition to ONLINE since a transition from ERROR to ONLINE is not suppored by Helix.
   
   A partition state of ERROR is a special state in Helix. Helix does not work the same way moving from ERROR to other states. Instead, Helix provides an admin API to reset the state of a partition from ERROR to its starting state (which in our case is OFFLINE). When this reset API is invoked, a state transition message of ERROR to StartingState is sent to the specific instance that hosts the partition in question. If the participant's currentstate is not ERROR, then this message is discarded automatically (and Pinot will never see it). Otherwise, it is passed on to Pinot and we have a transition from ERROR to OFFLINE.
   
   Tested by manually inserting an exception in the consumer creation code ad observing that externalview changes to ERROR, and then later onto OFFLINE.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "jtao15 (via GitHub)" <gi...@apache.org>.
jtao15 commented on code in PR #12045:
URL: https://github.com/apache/pinot/pull/12045#discussion_r1406815062


##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java:
##########
@@ -821,6 +821,19 @@ public void segmentStoppedConsuming(LLCSegmentName llcSegmentName, String instan
       _controllerMetrics.addMeteredTableValue(realtimeTableName, ControllerMeter.LLC_ZOOKEEPER_UPDATE_FAILURES, 1L);
       throw e;
     }
+    // We know that we have successfully set the idealstate to be OFFLINE.
+    // We can now do a best effort to reset the externalview to be OFFLINE if it is in ERROR state.
+    // If the externalview is not in error state, then this reset will be ignored by the helix participant
+    // in the server when it receives the ERROR to OFFLINE state transition.
+    // Helix throws an exception if we try to reset state of a partition that is NOT in ERROR state in EV,
+    // So, if any exceptions are thrown, ignore it here.
+    try {
+      _helixAdmin.resetPartition(_helixManager.getClusterName(), instanceName,
+          TableNameBuilder.REALTIME.tableNameWithType(llcSegmentName.getTableName()),
+          Collections.singletonList(segmentName));
+    } catch (Exception e) {
+      // Ignore

Review Comment:
   (nit) Good to log the error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu commented on code in PR #12045:
URL: https://github.com/apache/pinot/pull/12045#discussion_r1406864843


##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java:
##########
@@ -821,6 +821,19 @@ public void segmentStoppedConsuming(LLCSegmentName llcSegmentName, String instan
       _controllerMetrics.addMeteredTableValue(realtimeTableName, ControllerMeter.LLC_ZOOKEEPER_UPDATE_FAILURES, 1L);
       throw e;
     }
+    // We know that we have successfully set the idealstate to be OFFLINE.
+    // We can now do a best effort to reset the externalview to be OFFLINE if it is in ERROR state.
+    // If the externalview is not in error state, then this reset will be ignored by the helix participant
+    // in the server when it receives the ERROR to OFFLINE state transition.
+    // Helix throws an exception if we try to reset state of a partition that is NOT in ERROR state in EV,
+    // So, if any exceptions are thrown, ignore it here.
+    try {
+      _helixAdmin.resetPartition(_helixManager.getClusterName(), instanceName,
+          TableNameBuilder.REALTIME.tableNameWithType(llcSegmentName.getTableName()),
+          Collections.singletonList(segmentName));
+    } catch (Exception e) {
+      // Ignore

Review Comment:
   So, as I described in the comments, this call from the server may come in one of two situations:
   - The server could not start consumption at all (could not construct the consumer), so the state transition from OFFLINE to CONSUMING failed. The externalview will be set to ERROR by Helix. 
   - The server succeeded the OFFLINE to CONSUMING state transition, but could not consume rows from the stream at some point after the state transition. There could be zero or any number of rows already consumed. In this case, the externalview is in CONSUMING state.
   
   If we call the reset API in the second case, helix throws an exception. If we call in the first case, helix simply  does its best to reset the state and send an ERROR to OFFLINE state transition to the server.
   
   Logging an error in this case will make the operator think that we did not reguister the offline call from server correctly. I am therefore not in favor of logging an error. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on PR #12045:
URL: https://github.com/apache/pinot/pull/12045#issuecomment-1830672663

   I'm good with the change, and want to start a discussion on how we want to handle consumption error at high level.
   
   Error can occur when:
   1. Consuming segment is created (with state transition)
   2. During consumption, including segment commit (no state transition)
   3. Post segment commit, switching from CONSUMING to ONLINE (with state transition)
   
   Currently:
   - Server notifies controller when 1 or 2 happens, and controller changes IS to OFFLINE
   - When 1 or 3 happens, segment moves to ERROR state in EV
   
   The current behavior can leave segment in 3 conditions, and IMO quite hard to manage:
   1. IS: OFFLINE, EV: ERROR
   2. IS: OFFLINE, EV: CONSUMING -> OFFLINE
   3. IS: ONLINE, EV: ERROR
   
   What this PR is trying to fix is when 1 happens: IS: OFFLINE, EV: ERROR -> OFFLINE (by segment reset). A easier way to fix this might be to not throw exception after notifying the controller, and handle 1 & 2 using the same mechanism: IS: OFFLINE, EV: CONSUMING -> OFFLINE. This way we do not need to wait for 30 seconds before issuing the segment reset.
   
   I also want to discuss if we want to mark IS as OFFLINE so that we can auto-recover. If we can make 2 changing EV to ERROR, then we may use segment reset to fix the consumption error (this can also apply to OFFLINE table). This approach has 2 benefits:
   - More general, and can fix both CONSUMING and ONLINE segments
   - IS only contains ONLINE/CONSUMING state, which makes other code much error-prune. Contributors usually miss the handling of OFFLINE state in IS because they won't know it can exist


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu commented on PR #12045:
URL: https://github.com/apache/pinot/pull/12045#issuecomment-1823693872

   Fixes problem in PR #11166 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "jackjlli (via GitHub)" <gi...@apache.org>.
jackjlli commented on code in PR #12045:
URL: https://github.com/apache/pinot/pull/12045#discussion_r1408175585


##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java:
##########
@@ -1491,9 +1515,26 @@ public RealtimeSegmentDataManager(SegmentZKMetadata segmentZKMetadata, TableConf
       _realtimeTableDataManager.addSegmentError(_segmentNameStr, new SegmentErrorInfo(now(),
           "Failed to initialize segment data manager", e));
       _segmentLogger.warn(
-          "Calling controller to mark the segment as OFFLINE in Ideal State because of initialization error: '{}'",
+          "Scheduling task to call controller to mark the segment as OFFLINE in Ideal State due"
+           + "to initialization error: '{}'",

Review Comment:
   nit: missing one space between `due` and `to`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "sajjad-moradi (via GitHub)" <gi...@apache.org>.
sajjad-moradi commented on code in PR #12045:
URL: https://github.com/apache/pinot/pull/12045#discussion_r1406823873


##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java:
##########
@@ -821,6 +821,19 @@ public void segmentStoppedConsuming(LLCSegmentName llcSegmentName, String instan
       _controllerMetrics.addMeteredTableValue(realtimeTableName, ControllerMeter.LLC_ZOOKEEPER_UPDATE_FAILURES, 1L);
       throw e;
     }
+    // We know that we have successfully set the idealstate to be OFFLINE.
+    // We can now do a best effort to reset the externalview to be OFFLINE if it is in ERROR state.
+    // If the externalview is not in error state, then this reset will be ignored by the helix participant
+    // in the server when it receives the ERROR to OFFLINE state transition.
+    // Helix throws an exception if we try to reset state of a partition that is NOT in ERROR state in EV,
+    // So, if any exceptions are thrown, ignore it here.
+    try {
+      _helixAdmin.resetPartition(_helixManager.getClusterName(), instanceName,
+          TableNameBuilder.REALTIME.tableNameWithType(llcSegmentName.getTableName()),

Review Comment:
   ```suggestion
             realtimeTableName,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #12045:
URL: https://github.com/apache/pinot/pull/12045#issuecomment-1823712665

   ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   Attention: `24 lines` in your changes are missing coverage. Please review.
   > Comparison is base [(`22403e5`)](https://app.codecov.io/gh/apache/pinot/commit/22403e567510fabf6016f01865721bea324ac21c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 27.55% compared to head [(`daacffd`)](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 27.58%.
   > Report is 1 commits behind head on master.
   
   | [Files](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [...a/manager/realtime/RealtimeSegmentDataManager.java](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL21hbmFnZXIvcmVhbHRpbWUvUmVhbHRpbWVTZWdtZW50RGF0YU1hbmFnZXIuamF2YQ==) | 0.00% | [21 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   | [.../core/realtime/PinotLLCRealtimeSegmentManager.java](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtY29udHJvbGxlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29udHJvbGxlci9oZWxpeC9jb3JlL3JlYWx0aW1lL1Bpbm90TExDUmVhbHRpbWVTZWdtZW50TWFuYWdlci5qYXZh) | 0.00% | [3 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   <details><summary>Additional details and impacted files</summary>
   
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #12045      +/-   ##
   ============================================
   + Coverage     27.55%   27.58%   +0.03%     
   + Complexity      207      201       -6     
   ============================================
     Files          2385     2385              
     Lines        129519   129541      +22     
     Branches      20043    20044       +1     
   ============================================
   + Hits          35683    35730      +47     
   + Misses        91141    91105      -36     
   - Partials       2695     2706      +11     
   ```
   
   | [Flag](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [integration](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [integration1](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [java-11](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [java-21](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.58% <0.00%> (+0.03%)` | :arrow_up: |
   | [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |
   | [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.58% <0.00%> (+0.03%)` | :arrow_up: |
   | [temurin](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.58% <0.00%> (+0.03%)` | :arrow_up: |
   | [unittests](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.58% <0.00%> (+0.03%)` | :arrow_up: |
   | [unittests2](https://app.codecov.io/gh/apache/pinot/pull/12045/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.58% <0.00%> (+0.03%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   
   </details>
   
   [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/pinot/pull/12045?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).   
   :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu commented on PR #12045:
URL: https://github.com/apache/pinot/pull/12045#issuecomment-1830742824

   > I'm good with the change, and want to start a discussion on how we want to handle consumption error at high level.
   > 
   > Error can occur when:
   > 
   >     1. Consuming segment is created (with state transition)
   > 
   >     2. During consumption, including segment commit (no state transition)
   > 
   >     3. Post segment commit, switching from CONSUMING to ONLINE (with state transition)
   > 
   > 
   > Currently:
   > 
   >     * Server notifies controller when 1 or 2 happens, and controller changes IS to OFFLINE
   > 
   >     * When 1 or 3 happens, segment moves to ERROR state in EV
   > 
   > 
   > The current behavior can leave segment in 3 conditions, and IMO quite hard to manage:
   > 
   >     1. IS: OFFLINE, EV: ERROR
   > 
   >     2. IS: OFFLINE, EV: CONSUMING -> OFFLINE
   > 
   >     3. IS: ONLINE, EV: ERROR
   > 
   > 
   > What this PR is trying to fix is when 1 happens: IS: OFFLINE, EV: ERROR -> OFFLINE (by segment reset). A easier way to fix this might be to not throw exception after notifying the controller, and handle 1 & 2 using the same mechanism: IS: OFFLINE, EV: CONSUMING -> OFFLINE. This way we do not need to wait for 30 seconds before issuing the segment reset.
   
   Sure, but that could result in queries going to the server when the segment does not exist. The earlier fix did not change the EV to OFFLINE all the time. We found instances when EV was in ERROR state, and the IS was in OFFLINE state.  Anyways, I have added integration tests so that any other fix we adopt can still be tested locally at least.
   
   > 
   > I also want to discuss if we want to mark IS as OFFLINE so that we can auto-recover. If we can make 2 changing EV to ERROR, then we may use segment reset to fix the consumption error (this can also apply to OFFLINE table). This approach has 2 benefits:
   > 
   >     * More general, and can fix both CONSUMING and ONLINE segments
   > 
   >     * IS only contains ONLINE/CONSUMING state, which makes other code much error-prune. Contributors usually miss the handling of OFFLINE state in IS because they won't know it can exist
   
   Open to discussions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu commented on code in PR #12045:
URL: https://github.com/apache/pinot/pull/12045#discussion_r1406907002


##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java:
##########
@@ -821,6 +821,19 @@ public void segmentStoppedConsuming(LLCSegmentName llcSegmentName, String instan
       _controllerMetrics.addMeteredTableValue(realtimeTableName, ControllerMeter.LLC_ZOOKEEPER_UPDATE_FAILURES, 1L);
       throw e;
     }
+    // We know that we have successfully set the idealstate to be OFFLINE.
+    // We can now do a best effort to reset the externalview to be OFFLINE if it is in ERROR state.
+    // If the externalview is not in error state, then this reset will be ignored by the helix participant
+    // in the server when it receives the ERROR to OFFLINE state transition.
+    // Helix throws an exception if we try to reset state of a partition that is NOT in ERROR state in EV,
+    // So, if any exceptions are thrown, ignore it here.
+    try {
+      _helixAdmin.resetPartition(_helixManager.getClusterName(), instanceName,
+          TableNameBuilder.REALTIME.tableNameWithType(llcSegmentName.getTableName()),
+          Collections.singletonList(segmentName));
+    } catch (Exception e) {
+      // Ignore

Review Comment:
   @sajjad-moradi  and I discussed this, and we think the ideal solution is to:
   - Create specific reason codes (not strings) that we can exchange between the controller and server when the server posts any event to the controller (not just to indicate stop consumption)
   - Use the reason code to decide whether the reset API should be invoked (e.g. a reason code that indicates that the consuming segment was never created, will end up invoking the helix reset API). I will file an issue for this as a cleanup task.
   
   Another alternative can be to look at the externalview in this same method before invoking the API. I had considered that alternative, and discarded it since it involves another Helix API call that may come in with its own errors and exceptions to be handled. It seemed to me to better to just call the reset API once and for all. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu commented on PR #12045:
URL: https://github.com/apache/pinot/pull/12045#issuecomment-1831045947

   Also, @Jackie-Jiang , it used to be the case originally that the consumer was created in the new thread that is spawned. I think @sajjad-moradi  (or, maybe @navina ) moved it to be created in the constructor, which is causing us to handle a new case (I don't remember the reason, but either of them may know). We can also think of moving it back to be created in the consuming thread.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu merged PR #12045:
URL: https://github.com/apache/pinot/pull/12045


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [PR] Fix incorrect handling of consumer creation errors [pinot]

Posted by "mcvsubbu (via GitHub)" <gi...@apache.org>.
mcvsubbu commented on code in PR #12045:
URL: https://github.com/apache/pinot/pull/12045#discussion_r1406964545


##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java:
##########
@@ -821,6 +821,19 @@ public void segmentStoppedConsuming(LLCSegmentName llcSegmentName, String instan
       _controllerMetrics.addMeteredTableValue(realtimeTableName, ControllerMeter.LLC_ZOOKEEPER_UPDATE_FAILURES, 1L);
       throw e;
     }
+    // We know that we have successfully set the idealstate to be OFFLINE.
+    // We can now do a best effort to reset the externalview to be OFFLINE if it is in ERROR state.
+    // If the externalview is not in error state, then this reset will be ignored by the helix participant
+    // in the server when it receives the ERROR to OFFLINE state transition.
+    // Helix throws an exception if we try to reset state of a partition that is NOT in ERROR state in EV,
+    // So, if any exceptions are thrown, ignore it here.
+    try {
+      _helixAdmin.resetPartition(_helixManager.getClusterName(), instanceName,
+          TableNameBuilder.REALTIME.tableNameWithType(llcSegmentName.getTableName()),

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org