Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/12/03 01:20:41 UTC

[GitHub] [incubator-pinot] npawar opened a new issue #6308: Errors in consuming segment completely stops consumption

npawar opened a new issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308


   Just in the last day, we had 2 reports of consumption stopping completely for a partition, because of some error occurrences in consumption.
   
   Example 1:
   User saw this error, and reported that consumption is totally stopped for that partition.
   ```
   ERROR [LLRealtimeSegmentDataManager_spanEventView__0__15__20201201T0448Z] [spanEventView__0__15__20201201T0448Z] Could not build segment
   java.lang.IllegalStateException: Cannot create output dir: /var/pinot/server/data/index/spanEventView_REALTIME/_tmp/tmp-spanEventView__0__15__20201201T0448Z-160688315943
   ```
   This is coming from LLRealtimeSegmentDataManager#buildSegmentInternal(). As a result of this, we reach this in LLRealtimeSegmentDataManager
   ```
   // We could not build the segment. Go into error state.
                   _state = State.ERROR;
   ```
   Problem: After that, the consumer thread exits. I believe an assumption is made here that some other replica will have built the segment, and this errored replica can simply download the segment. **But this assumption fails if the user has only 1 replica.**
   Lack of detection mechanism: Because the ideal state and external view match, the Cluster Manager continues to report GOOD. Users find out about the problem because they notice data lag in their application. Most users who are just starting out have not set up metrics monitoring, so asking them to look for the LLC_PARTITION_CONSUMING metric is also not an option.
   Lack of easy recovery mechanism: The validation manager does nothing to fix this, because the ideal state is still CONSUMING and segment metadata is still IN_PROGRESS. The only way to recover is to restart that server.
   
   
   Example 2:
   Again, user reported that consumption is totally stopped on the partition, and reported this error.
   ```
   Exception while executing a state transition task realtimesolines__234__4__20201127T1841Z
   java.lang.reflect.InvocationTargetException: null
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
   	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
   	at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_77]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
   	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
   Caused by: java.lang.OutOfMemoryError: Direct buffer memory
   	at java.nio.Bits.reserveMemory(Bits.java:693) ~[?:1.8.0_77]
   	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_77]
   	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_77]
   	at org.apache.pinot.core.segment.memory.PinotByteBuffer.allocateDirect(PinotByteBuffer.java:38) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.segment.memory.PinotDataBuffer.allocateDirect(PinotDataBuffer.java:116) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.io.writer.impl.DirectMemoryManager.allocateInternal(DirectMemoryManager.java:53) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.io.readerwriter.RealtimeIndexOffHeapMemoryManager.allocate(RealtimeIndexOffHeapMemoryManager.java:79) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.realtime.impl.forward.FixedByteSVMutableForwardIndex.addBuffer(FixedByteSVMutableForwardIndex.java:208) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.realtime.impl.forward.FixedByteSVMutableForwardIndex.<init>(FixedByteSVMutableForwardIndex.java:77) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.indexsegment.mutable.MutableSegmentImpl.<init>(MutableSegmentImpl.java:294) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1245) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:312) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:133) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:88) ~[pinot-all-0.6.0-SNAPSHOT-jar-with-dependencies.jar:0.6.0-SNAPSHOT-fb96cb36efb3e4dbceb10461e73c0d6465e81493]
   	... 12 more
   ```
   This appears to have happened while the state transition from OFFLINE to CONSUMING was being processed, during memory allocation for beginning consumption of a new segment; the consumption loop had not been reached yet. The state transition method `onBecomeConsumingFromOffline` (through its call to `onBecomeOnlineFromOffline`, per the stack trace) threw the exception, and the segment must've gone into ERROR state in the external view. Ideal state will still be CONSUMING, and segment metadata will be IN_PROGRESS.
   Problem: Similar to the above case, the ideal state and segment metadata indicate that nothing is wrong with the system. But consumption is not happening; the consumer thread never even started.
   Lack of detection mechanism: This should've been flagged on the Cluster Manager as BAD, as there is a mismatch between IS and EV.
   Lack of recovery mechanism: Again, the validation manager won't fix this, because it only looks at the Ideal State, and according to the IS everything looks fine. The only way to recover will be to restart the server.
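
The IS/EV mismatch from Example 2 can be checked mechanically. Below is a minimal sketch, assuming the ideal state and external view are available as plain maps of segment name to (instance name to state). The class and method names are hypothetical; this is not Pinot or Helix API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Pinot or Helix.
final class ConsumingSegmentCheck {
    // Returns "segment@instance" entries where the ideal state says CONSUMING
    // but the external view reports a different state (e.g. ERROR) or nothing at all.
    static List<String> findStalledReplicas(
            Map<String, Map<String, String>> idealState,
            Map<String, Map<String, String>> externalView) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> segment : idealState.entrySet()) {
            Map<String, String> reported =
                    externalView.getOrDefault(segment.getKey(), Collections.emptyMap());
            for (Map.Entry<String, String> replica : segment.getValue().entrySet()) {
                if (!"CONSUMING".equals(replica.getValue())) {
                    continue; // only CONSUMING replicas are of interest here
                }
                if (!"CONSUMING".equals(reported.get(replica.getKey()))) {
                    stalled.add(segment.getKey() + "@" + replica.getKey());
                }
            }
        }
        Collections.sort(stalled);
        return stalled;
    }
}
```

A check like this only covers Example 2. In Example 1 the external view still shows CONSUMING, so the dead consumer would have to be reported by the server itself.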
   
   
   We need:
   1. A way to restart consumption automatically in both cases. Possibly make the Validation Manager handle both of them.
   2. A way to track the state of consumers for every CONSUMING segment. An API on the controller would be ideal.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308#issuecomment-738956959


   I am fine with adding an admin API, but let us not do anything via the validation manager. It can lead to timing issues (depending on what we are trying to do) and not work correctly in production. If you do want to change automatic validation, please put forward a proposal, and we can discuss that.
   
   Btw, here are some ideas for an API: https://github.com/apache/incubator-pinot/issues/4035
   




[GitHub] [incubator-pinot] mcvsubbu commented on issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308#issuecomment-738148942


   I disagree that the ERROR state should trigger a consumption stop. The ERROR state is set (perhaps not the best name) when the segment build fails. We can change the name to SEGMENT_BUILD_FAILED if you like, and it is an end state.
   
   In a production use case, there are always multiple replicas, and we should attempt to download if the build fails, for whatever reason. We could check the number of replicas, and if it is one, we know that on a build failure, there will be nothing to download, and we can post the consumptionStopped message.
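
The replica check could be as small as the sketch below. The class, enum, and method names are hypothetical and only illustrate the decision; how the replica count is obtained and how the message is posted are left out.

```java
// Hypothetical policy helper for illustration only; not Pinot API.
final class BuildFailurePolicy {
    enum Action { TRY_DOWNLOAD, POST_CONSUMPTION_STOPPED }

    // With a single replica nobody else can have built the segment,
    // so there is nothing to download and the failure should be reported.
    static Action onBuildFailure(int numReplicas) {
        if (numReplicas < 1) {
            throw new IllegalArgumentException("numReplicas must be >= 1, got " + numReplicas);
        }
        return numReplicas == 1 ? Action.POST_CONSUMPTION_STOPPED : Action.TRY_DOWNLOAD;
    }
}
```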
   
   Another way to solve this whole problem is to introduce special code to monitor the consumption-stopped metric. If consumption stays stopped for more than (say) 2 mins, then kill the process (call `System.exit()`). This can be done in service manager code.
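
A rough sketch of such a monitor, assuming something polls the consuming gauge periodically (1 while consuming, 0 once stopped) and hands the value to `observe`. `ConsumptionWatchdog` and its parameters are hypothetical names, not existing service manager code; the clock and the timeout action are injected so the logic can be exercised without actually exiting.

```java
import java.util.function.LongSupplier;

// Hypothetical watchdog for illustration only; not Pinot code.
final class ConsumptionWatchdog {
    private final long maxStoppedMillis;
    private final LongSupplier clock;      // e.g. System::currentTimeMillis
    private final Runnable onTimeout;      // e.g. () -> System.exit(1)
    private long stoppedSinceMillis = -1;  // -1 means "currently consuming"

    ConsumptionWatchdog(long maxStoppedMillis, LongSupplier clock, Runnable onTimeout) {
        this.maxStoppedMillis = maxStoppedMillis;
        this.clock = clock;
        this.onTimeout = onTimeout;
    }

    // Call periodically with the latest gauge value.
    synchronized void observe(long consumingGaugeValue) {
        long now = clock.getAsLong();
        if (consumingGaugeValue > 0) {
            stoppedSinceMillis = -1;   // consumption is healthy, reset the timer
            return;
        }
        if (stoppedSinceMillis < 0) {
            stoppedSinceMillis = now;  // first observation of a stopped consumer
        } else if (now - stoppedSinceMillis >= maxStoppedMillis) {
            onTimeout.run();           // stopped for too long
        }
    }
}
```

In a real process this would be wired up as something like `new ConsumptionWatchdog(120_000L, System::currentTimeMillis, () -> System.exit(1))`, with `observe` called from a scheduled task.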




[GitHub] [incubator-pinot] mcvsubbu commented on issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308#issuecomment-737607128


   The key here is that there is no alert set on lack of consumption, and there is no guarantee that restarting will even fix the issue. 
   
   We do have a mechanism by which servers can report problems to the controller. We use that currently for problems that happen during consumption.
   
   We may want to extend it (in case of 1 replica, maybe) to transition the segment to OFFLINE state if segment build fails (just throwing out the idea here, not sure if this is the best). In that case, the validator will come back and create a new segment. Not sure if customers are ok waiting until the validator runs, or if you are looking for instant gratification. Doing something the same instant an error is reported will generally be hard if there is more than one controller, or if the validation manager starts around the same time, etc. I think we should avoid that.
   
   Validation manager can be extended to handle ERROR state in all replicas, I suppose. It needs to change the idealstate to be OFFLINE for this segment, and invoke the same logic that we have now.
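
The "ERROR in all replicas" rule can be sketched as a pure function over the two views. This assumes both are available as maps of segment name to (instance name to state); `ErroredSegmentReset` is a hypothetical name, and the actual ideal state update and the follow-up segment creation are not shown.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not Pinot's validation manager.
final class ErroredSegmentReset {
    // Returns a copy of the ideal state in which every segment whose replicas
    // are ALL in ERROR in the external view has been switched to OFFLINE.
    static Map<String, Map<String, String>> markAllErrorSegmentsOffline(
            Map<String, Map<String, String>> idealState,
            Map<String, Map<String, String>> externalView) {
        Map<String, Map<String, String>> updated = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> segment : idealState.entrySet()) {
            Map<String, String> replicas = new TreeMap<>(segment.getValue());
            Map<String, String> reported =
                    externalView.getOrDefault(segment.getKey(), Collections.emptyMap());
            boolean allError = !replicas.isEmpty();
            for (String instance : replicas.keySet()) {
                if (!"ERROR".equals(reported.get(instance))) {
                    allError = false; // at least one replica is not in ERROR
                    break;
                }
            }
            if (allError) {
                replicas.replaceAll((instance, state) -> "OFFLINE");
            }
            updated.put(segment.getKey(), replicas);
        }
        return updated;
    }
}
```

Returning a new map rather than mutating the input keeps the decision separate from the write, which would still have to go through whatever path the validation manager uses to update the ideal state.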




[GitHub] [incubator-pinot] npawar edited a comment on issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
npawar edited a comment on issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308#issuecomment-737632162


   @mcvsubbu 
   
   For example 1, what you suggest `We may want to extend it (in case of 1 replica, maybe) to transition the segment to OFFLINE state if segment build fails` would work. Letting the ValidationManager fix it should be fine. 
   However, I think we should transition the segment to OFFLINE state, any time the while loop in the consumer thread exits with `ERROR` state (and not just for failed segment build or single replica).
   ```
   try {
       while (!_state.isFinal()) {
           ...
       }
       // Route a loop exit in ERROR state through the same handling as an exception.
       if (_state == State.ERROR) {
           throw new IllegalStateException("Exited with ERROR state");
       }
   } catch (Exception e) {
       segmentLogger.error("Exception while in work", e);
       // Report that consumption has stopped, then mark the partition as not consuming.
       postStopConsumedMsg(e.getClass().getName());
       _state = State.ERROR;
       _serverMetrics.setValueOfTableGauge(_metricKeyName, ServerGauge.LLC_PARTITION_CONSUMING, 0);
       return;
   }
   ```
   As long as we reach the `postStopConsumedMsg` for all error conditions, I think we should be fine.
   
   
   -------------------------------------------------------------------------------
   
   I'm assuming this is for example 2: `Validation manager can be extended to handle ERROR state in all replicas` ? So we start looking at the External View in the Validation Manager?
   
   -------------------------------------------------------------------------------
   
   Regarding `We do have a mechanism by which servers can report problems to the controller. We use that currently for problems that happen during consumption.` - Where is this mechanism? Can we use that to create a Controller API that reports the status of all consumers? 
   This API will help in two ways:
   1. If users notice a lag, they can call this API and see consumer health. Right now they see that segment metadata is IN_PROGRESS and the segment is CONSUMING in the ideal state, which causes confusion about why there is a lag.
   2. We could also use that API in the Validation Manager to restart consumption if any consumers are dead.
   




[GitHub] [incubator-pinot] npawar closed issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
npawar closed issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308


   






[GitHub] [incubator-pinot] npawar commented on issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
npawar commented on issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308#issuecomment-737598198


   @kishoreg @mcvsubbu @fx19880617 @icefury71 




[GitHub] [incubator-pinot] kishoreg commented on issue #6308: Errors in consuming segment completely stops consumption

Posted by GitBox <gi...@apache.org>.
kishoreg commented on issue #6308:
URL: https://github.com/apache/incubator-pinot/issues/6308#issuecomment-738194604


   We can do this instead:
   1. Add an API to get the status of the consuming segment across all servers (including offset, if possible).
   2. If the status is bad/error for whatever reason, disable and enable the consuming segment.
   
   Once we have controller APIs for these two, the invocation can be manual or via the validation manager.
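
The two steps could be shaped roughly like the sketch below. `ConsumerStatus` and `ConsumerStatusReport` are hypothetical types, not an existing controller API, and the disable/enable call itself is left out. The sketch assumes each server reports, per consuming segment, whether its consumer is alive and the last offset it consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-server status record for illustration only.
final class ConsumerStatus {
    final String segment;
    final String server;
    final boolean consuming; // false once the consumer thread has exited
    final long offset;       // last consumed offset, -1 if unknown

    ConsumerStatus(String segment, String server, boolean consuming, long offset) {
        this.segment = segment;
        this.server = server;
        this.consuming = consuming;
        this.offset = offset;
    }
}

// Hypothetical aggregation on the controller side.
final class ConsumerStatusReport {
    // Step 1: group the per-server statuses by segment.
    static Map<String, List<ConsumerStatus>> bySegment(List<ConsumerStatus> statuses) {
        Map<String, List<ConsumerStatus>> grouped = new TreeMap<>();
        for (ConsumerStatus status : statuses) {
            grouped.computeIfAbsent(status.segment, k -> new ArrayList<>()).add(status);
        }
        return grouped;
    }

    // Step 2: segments with at least one dead consumer are candidates for disable + enable.
    static List<String> resetCandidates(List<ConsumerStatus> statuses) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, List<ConsumerStatus>> entry : bySegment(statuses).entrySet()) {
            boolean anyDead = false;
            for (ConsumerStatus status : entry.getValue()) {
                anyDead |= !status.consuming;
            }
            if (anyDead) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }
}
```

Because the status comes from the servers rather than from the ideal state or external view, this shape would also cover Example 1, where both views still say CONSUMING.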
   

