You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "eaugene (via GitHub)" <gi...@apache.org> on 2024/02/09 17:31:35 UTC

[I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

eaugene opened a new issue, #12390:
URL: https://github.com/apache/pinot/issues/12390

   We are encountering the following issue when OFFLINE-> ONLINE transition of the segment 
   
   ```
   Caught exception in state transition from OFFLINE -> ONLINE for resource: <TableName>, partition:<Segment> 
   java.lang.RuntimeException: InterruptedException when acquiring the partitionConsumerSemaphore for segment: rta_eater_rtp_wav_baf_prod__0__4985__20240209T0153Z
   	at o.a.p.c.d.m.r.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1432)
   	at o.a.p.c.d.m.r.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:380)
   	at o.a.p.s.s.h.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:221)
   	at o.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:161)
   	at o.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:83)
   	at j.i.r.GeneratedMethodAccessor1558.invoke(Unknown Source)
   	at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:566)
   	at o.a.h.m.h.HelixStateTransitionHandler...
   ```
   Pinot Version : 0.12
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1957777881

   @ankitsultana Can you check why the server received `OFFLINE` to `ONLINE` state transition instead of `CONSUMING` to `ONLINE` state transition? Our current state model doesn't consider this scenario, so we should try to find out why Helix sent this state transition when the segment is actually in `CONSUMING` state. Did it receive a `CONSUMING` to `OFFLINE` state transition before `OFFLINE` to `ONLINE`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1957808371

   Ah my bad, I wasn't clear in the description above. Soon after the segment receives a DISCARD from the controller, there's a Helix disconnection. And a few seconds later the server reconnects and is back online.
   
   My understanding is that after Helix reconnects, the segment will show up as OFFLINE and hence trigger the OFFLINE to ONLINE transition.
   
   Sampled some logs which show this:
   
   ```
   {"@timestamp":"2024-02-12T11:17:02.203+00:00","message":"Controller response {\"buildTimeSec\":-1,\"isSplitCommitType\":false,\"streamPartitionMsgOffset\":null,\"offset\":-1,\"status\":\"DISCARD\"} for http://redacted:1234/segmentConsumed?
   reason=timeLimit&streamPartitionMsgOffset=368526984&instance=redacted&offset=-1&name=redacted__44__3696__20240212T1016Z&rowCount=48937&memoryUsedBytes=80637706","logger_name":"org.apache.pinot.server.realtime
   .ServerSegmentCompletionProtocolHandler","thread_name":"redacted__44__3696__20240212T1016Z","level":"INFO"}
   ....
   {"@timestamp":"2024-02-12T11:17:03.801+00:00","message":"Unable to reconnect to ZooKeeper service, session 0x500e8e168947a76 has expired","logger_name":"org.apache.zookeeper.ClientCnxn","thread_name":"main-SendThread(redacted:2181)","level":"WARN"}
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1963528138

   I am not well versed with Helix, but I am kinda sure that the server started from OFFLINE state.
   
   There's this log coming from `HelixStateMachineEngine`: `Resetting HelixStateMachineEngine`.
   
   This is logged in the `reset` method, which does this:
   
   ```
       loopStateModelFactories(stateModel -> {
         stateModel.reset();
         String initialState = _stateModelParser.getInitialState(stateModel.getClass());
         stateModel.updateState(initialState);
       });
   ```
   
   The initial state is taken from the annotation of the class. Initial state for all Pinot state models is OFFLINE.
   
   Moreover, after this is logged, there's a ton of messages with format: `Operating on table_name:segment_name` (one for each segment).
   
   @Jackie-Jiang : lmk if this is sufficient.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1936819688

   The exact issue is described below (all of these are confirmed via logs):
   
   * Server has a full GC which leads to ZK and Helix disconnection.
   * When ZK reconnects, a bunch of OFFLINE to CONSUMING messages are sent
   * We see the exception above.
   
   **Current Theory**: When ZK disconnection happens, the PartitionConsumer thread is still alive and holding the semaphore, and so when Helix sends OFFLINE to CONSUMING transition again for that segment, the Segment Data Manager fails to init.
   
   I am low on time right now so can't dig deeper. Wondering if anyone can hint at some potential solutions.
   
   I had also seen this somewhat related issue from a few years ago: https://github.com/apache/pinot/issues/7874
   
   cc: @Jackie-Jiang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1960810712

   Can you check the log and see if server started from OFFLINE state after re-connecting to Helix? I think it is very possible because when re-connecting to Helix, server will generate a new hash, and start from a new current state. If that is the case, that explains why we observed server not able to consume after re-connecting to Helix. We can probably add a test to reproduce this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "ankitsultana (via GitHub)" <gi...@apache.org>.
ankitsultana commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1945473584

   Took a deep-dive today and found the root-cause. Here's the sequence of events which triggers this:
   
   * We have a consuming segment on a server. On time of commit, it receives a DISCARD from the controller.
   * There's a Huge GC
   * Instead of receiving a CONSUMING to ONLINE transition, the segment receives a OFFLINE to ONLINE transition.
   * Since the TDM already has a Segment Data Manager (SDM) of the consuming segment, we end up skipping the `addSegment` call and mark it succeeded anyways. [RealtimeSegmentDataManager.java#L387](https://github.com/apache/pinot/blob/38d86b0a6432e9a7249f1692ace36b6e34171b0a/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java#L387)
   
   So the `RealtimeSegmentDataManager` is never destroyed and the `_partitionGroupConsumerSemaphore` is never released.
   
   As more and more segments for this partition are committed on other servers, we receive many OFFLINE to CONSUMING transitions, which pile up.
   
   When there's another big GC, and say Helix reconnects, then the `HelixTaskExecutor` will be shutdown, leading to interruption of all the semaphore acquire calls and the original error message above. (note `onBecomeConsumingFromOffline` in the stack trace).
   
   @Jackie-Jiang : this is an additional case where `RealtimeTableDataManager#addSegment` can be called. Do you have any recommendations on how to handle this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] InterruptedException when acquiring partitonConsumerSemaphore [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1970186099

   This is great finding! Let me create an issue to track this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org