Posted to issues@ignite.apache.org by "Pavel Tupitsyn (Jira)" <ji...@apache.org> on 2020/05/29 07:33:00 UTC

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

    [ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119356#comment-17119356 ] 

Pavel Tupitsyn commented on IGNITE-12845:
-----------------------------------------

[~alex_pl] [~ivandasch] any updates on this issue?

> GridNioServer can infinitely lose some events 
> ----------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, the default), {{GridNioServer}} can lose some events for a channel (depending on JDK version and OS). This can cause connected applications to hang. Reproducer: 
> {code:java}
>     public void testConcurrentLoad() throws Exception {
>         startGrid(0);
>         try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>             ClientCache<Integer, Integer> cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>             GridTestUtils.runMultiThreaded(
>                 () -> {
>                     for (int i = 0; i < 1000; i++)
>                         cache.put(i, i);
>                 }, 5, "run-async");
>         }
>     }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 14) and on some Linux environments (for example, it passed more than 100 times on a desktop Linux system with JDK 8, but hangs on TeamCity agents with JDK 8 and 11). It never hung on a Windows system (passed more than 100 times). It passes on all systems and JDK versions when the system property {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns {{false}} from its {{contains()}} method. The {{contains()}} method is used by the {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already contained in the selected keys set, its ready operations flags are updated. With {{SelectedSelectionKeySet}}, however, the ready operations flags are always overwritten and the selection key is added again even if it is already in the set. Some {{SelectorImpl}} implementations can pass several events for one selection key to the {{processReadyEvents}} method (for example, the macOS implementation, {{KQueueSelectorImpl}}, works this way). In this case, duplicate selection keys are added to {{selectedKeys}} and all events except the last are lost.
> Two bad things happen in {{GridNioServer}} as a result:
>  # Some event flags are lost and the worker doesn't perform the corresponding action (in the attached reproducer, the "channel is ready for reading" event is lost and the workers never read the channel after some point in time).
>  # Selection keys are duplicated with the same event flags (in the attached reproducer, it's the "channel is ready for writing" event). This duplication leads to incorrect handling of the {{GridSelectorNioSessionImpl#procWrite}} flag, which will be {{false}} in some cases while the selection key's {{interestedOps}} still contains {{OP_WRITE}}, and this operation is never removed.
> Possible solutions:
>  * Fair implementation of the {{SelectedSelectionKeySet.contains}} method (this solves all the problems but can be resource-consuming).
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when adding {{OP_WRITE}} to {{interestedOps}} (for example, in the {{AbstractNioClientWorker.registerWrite()}} method). In this case, some "channel is ready for reading" events (but not data) can still be lost, but not infinitely, and the data will eventually be read. If events are reordered (first "channel is ready for writing", then "channel is ready for reading"), the write to the channel will be processed only after all data has been read.
>  * Remove {{OP_WRITE}} from {{interestedOps}} even if {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write requests in the queue (see the {{GridNioServer.stopPollingForWrite()}} method). This solution has the same shortcomings as the previous one.
>  * Hybrid approach: use a probabilistic implementation of the {{contains}} method (a Bloom filter, or just checking the last element) and use one of the two previous solutions as a workaround for cases where {{contains}} incorrectly returns {{false}}.
>  
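The key-set behaviour described in the root-cause paragraph can be illustrated without Ignite or a real selector. The following is a minimal sketch, not the Ignite or JDK code: {{SelectedKeysDemo}}, {{Key}}, {{AlwaysFalseKeys}}, {{LastElementKeys}}, {{processReadyEvent}} and {{deliverReadThenWrite}} are hypothetical names, the interest-ops check from the quoted JDK snippet is omitted, and the "read, then write" delivery order for one key is an assumption made for the example. {{LastElementKeys}} only sketches the "just check the last element" idea from the hybrid approach.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;

class SelectedKeysDemo {
    static final int OP_READ = 1;   // Same value as SelectionKey.OP_READ.
    static final int OP_WRITE = 4;  // Same value as SelectionKey.OP_WRITE.

    /** Hypothetical stand-in for a selection key: holds only the ready-ops bit mask. */
    static final class Key {
        int readyOps;
    }

    /** Models the optimized set: array-backed, contains() always returns false. */
    static final class AlwaysFalseKeys extends ArrayList<Key> {
        @Override public boolean contains(Object o) {
            return false;
        }
    }

    /** Sketch of the "check the last element" idea: cheap, but can still miss older entries. */
    static final class LastElementKeys extends ArrayList<Key> {
        @Override public boolean contains(Object o) {
            return !isEmpty() && get(size() - 1) == o;
        }
    }

    /** Simplified version of the branch quoted from processReadyEvents(). */
    static void processReadyEvent(Collection<Key> selectedKeys, Key key, int rOps) {
        if (selectedKeys.contains(key))
            key.readyOps |= rOps;   // Already selected: merge the new flags.
        else {
            key.readyOps = rOps;    // Treated as new: overwrite the flags.
            selectedKeys.add(key);
        }
    }

    /** Delivers OP_READ and then OP_WRITE for one key; returns {number of entries, readyOps}. */
    static int[] deliverReadThenWrite(Collection<Key> selectedKeys) {
        Key key = new Key();

        processReadyEvent(selectedKeys, key, OP_READ);
        processReadyEvent(selectedKeys, key, OP_WRITE);

        return new int[] {selectedKeys.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = deliverReadThenWrite(new LinkedHashSet<>());
        int[] optimized = deliverReadThenWrite(new AlwaysFalseKeys());
        int[] lastElem = deliverReadThenWrite(new LastElementKeys());

        System.out.println("fair:         entries=" + fair[0] + ", readyOps=" + fair[1]);
        System.out.println("always-false: entries=" + optimized[0] + ", readyOps=" + optimized[1]);
        System.out.println("last-element: entries=" + lastElem[0] + ", readyOps=" + lastElem[1]);
    }
}
```

With the fair set, the key is stored once and keeps both flags. With the always-false variant, the key is stored twice and only {{OP_WRITE}} survives, which corresponds to the lost "channel is ready for reading" event and the duplicated key described in the issue. The last-element variant handles this back-to-back case but would still miss a duplicate if events for another key arrived in between, which is why the issue pairs it with one of the workarounds.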



--
This message was sent by Atlassian Jira
(v8.3.4#803005)