Posted to issues@ignite.apache.org by "Mikhail Petrov (Jira)" <ji...@apache.org> on 2020/09/08 10:23:00 UTC

[jira] [Commented] (IGNITE-13361) Sending of communication messages can hang infinitely.

    [ https://issues.apache.org/jira/browse/IGNITE-13361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192117#comment-17192117 ] 

Mikhail Petrov commented on IGNITE-13361:
-----------------------------------------

[~alex_pl] Thanks a lot for the review.

> Sending of communication messages can hang infinitely.
> ------------------------------------------------------
>
>                 Key: IGNITE-13361
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13361
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Assignee: Mikhail Petrov
>            Priority: Major
>             Fix For: 2.10
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The following test hangs intermittently (about once every 4-5 runs) on my laptop (Ubuntu 20.04, i7-8565U, 16 GB RAM, JDK 1.8.0_251). The cursor iteration randomly hangs while waiting for the next page from the remote node.
> {code:java}
>     /** */
>     public static final int NODES_CNT = 2;
>     /** */
>     public static final int TABLE_POPULATION = 2000;
>     /** */
>     public static final int SELECT_RANGE = 1000;
>     /** */
>     public static final int QRY_PAGE_SIZE = 5;
>     /** */
>     @Test
>     public void test() throws Exception {
>         for (int i = 0; i < NODES_CNT; i++)
>             startGrid(i, false);
>         IgniteEx cli = startGrid(NODES_CNT, true);
>         GridQueryProcessor qryProc = cli.context().query();
>         qryProc.querySqlFields(
>             new SqlFieldsQuery("CREATE TABLE test_table (id LONG PRIMARY KEY, val LONG)"), false);
>         qryProc.querySqlFields(new SqlFieldsQuery("CREATE INDEX val_idx ON test_table (val)"), false);
>         for (long l = 0; l < TABLE_POPULATION; ++l) {
>             qryProc.querySqlFields(
>                 new SqlFieldsQuery("INSERT INTO test_table (id, val) VALUES (?, ?)").setArgs(l, l),
>                 true
>             );
>         }
>         for (int i = 0; i < 10000 ; i++) {
>             long lowId = ThreadLocalRandom.current().nextLong(TABLE_POPULATION - SELECT_RANGE);
>             long highId = lowId + SELECT_RANGE;
>             try (
>                 FieldsQueryCursor<List<?>> cursor = cli
>                     .context().query().querySqlFields(
>                         new SqlFieldsQuery("SELECT id, val FROM test_table WHERE id BETWEEN ? and ?")
>                             .setArgs(lowId, highId)
>                             .setPageSize(QRY_PAGE_SIZE),
>                         false
>                     )
>             ) {
>                 cursor.iterator().forEachRemaining(val -> {});
>             }
>         }
>     }
>     /** */
>     private IgniteEx startGrid(int idx, boolean clientMode) throws Exception {
>         return (IgniteEx) Ignition.start(new IgniteConfiguration()
>             .setIgniteInstanceName("node-" + idx)
>             .setGridLogger(new Log4JLogger("modules/core/src/test/config/log4j-test.xml"))
>             .setClientMode(clientMode));
>     }
> {code}
> UPD It seems that IGNITE-12845 is responsible for the behavior described above. The commit for that ticket is the first one at which the code above starts to hang.
> Cursor iteration hangs because, in some cases, GridQueryNextPageRequest messages are not sent correctly from the client node.
> UPD Simplified reproducer of the problem described above: 
> {code:java}
>     @Test
>     public void test() throws Exception {
>         IgniteEx srv = startGrid(0);
>         IgniteEx cli = startClientGrid(1);
>         GridQueryNextPageRequest msg = new GridQueryNextPageRequest(0, 0, 0, 0, (byte)0);
>         CyclicBarrier barrier = new CyclicBarrier(2);
>         srv.context().io().addMessageListener(GridTopic.TOPIC_QUERY, new GridMessageListener() {
>             @Override public void onMessage(UUID nodeId, Object msg, byte plc) {
>                 try {
>                     if (msg instanceof GridQueryNextPageRequest)
>                         barrier.await();
>                 }
>                 catch (InterruptedException | BrokenBarrierException e) {
>                     throw new RuntimeException(e);
>                 }
>             }
>         });
>         for (int i = 0; i < 1000; i++) {
>             barrier.reset();
>             cli.context().io().sendToGridTopic(srv.context().discovery().localNode(), GridTopic.TOPIC_QUERY, msg, GridIoPolicy.QUERY_POOL);
>             try {
>                 barrier.await(1, TimeUnit.SECONDS);
>             }
>             catch (InterruptedException | BrokenBarrierException | TimeoutException e) {
>                 fail();
>             }
>         }
>     }
> {code}
> The root cause of the hang is a lack of synchronization between the org.apache.ignite.internal.util.nio.GridNioServer#stopPollingForWrite and org.apache.ignite.internal.util.nio.GridNioServer#send0 methods. The following situation is possible:
> 1. In the stopPollingForWrite method, the worker thread checks whether the queue is empty:
> {code:java}
> if (ses.writeQueue().isEmpty()) {
> {code}
> and the condition evaluates to true. The worker thread is then paused at this point.
> 2. The message sender thread calls the send0 method, which returns. org.apache.ignite.internal.util.nio.GridSelectorNioSessionImpl#procWrite has not been set to false yet, so the message being sent is not added to the worker queue because of the following condition:
> {code:java}
> else if (!ses.procWrite.get() && ses.procWrite.compareAndSet(false, true)) {
> {code}
> 3. The worker thread resumes stopPollingForWrite and clears the OP_WRITE flag, which means that socket write events are no longer listened for.
> So the message remains unsent.
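
The three steps above can be replayed deterministically in a toy model. The sketch below is not Ignite code: the class LostWriteDemo, its fields and both replay methods are hypothetical names introduced only for illustration. It assumes the message does reach the session write queue and that only the wake-up of the worker is skipped, and it assumes the worker clears procWrite after the sender has already looked at it. The second method shows one common way to close such a window (re-checking the queue after clearing the flag); it is not claimed to be the fix applied in the ticket.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the race described above. NOT Ignite code; all names are hypothetical. */
public class LostWriteDemo {
    /** Messages waiting to be written to the socket. */
    final Queue<String> writeQueue = new ConcurrentLinkedQueue<>();

    /** True while the worker is considered to be processing writes. */
    final AtomicBoolean procWrite = new AtomicBoolean(true);

    /** Stand-in for the OP_WRITE interest flag of the selection key. */
    boolean opWrite = true;

    /** Number of times a sender asked the worker to resume writing. */
    int wakeUps;

    /** Sender side: enqueue, then wake the worker only if it is not already processing writes. */
    void send(String msg) {
        writeQueue.offer(msg);

        if (!procWrite.get() && procWrite.compareAndSet(false, true))
            wakeUps++;
    }

    /** Replays steps 1-3 in order; returns true if the message ends up stranded. */
    static boolean replayRace() {
        LostWriteDemo s = new LostWriteDemo();

        boolean empty = s.writeQueue.isEmpty(); // Step 1: worker sees an empty queue.

        s.send("msg");                          // Step 2: sender sees procWrite == true, so no wake-up.

        if (empty) {                            // Step 3: worker acts on the stale check.
            s.procWrite.set(false);
            s.opWrite = false;
        }

        return !s.writeQueue.isEmpty() && !s.opWrite && s.wakeUps == 0;
    }

    /** Same interleaving, but the worker re-checks the queue after clearing the flag. */
    static boolean replayWithRecheck() {
        LostWriteDemo s = new LostWriteDemo();

        boolean empty = s.writeQueue.isEmpty();

        s.send("msg");

        if (empty) {
            s.procWrite.set(false);
            s.opWrite = false;

            // Re-check: a message may have arrived between the first check and the flag update.
            if (!s.writeQueue.isEmpty() && s.procWrite.compareAndSet(false, true))
                s.opWrite = true;
        }

        return !s.writeQueue.isEmpty() && !s.opWrite && s.wakeUps == 0;
    }

    public static void main(String[] args) {
        System.out.println("stranded without re-check: " + replayRace());
        System.out.println("stranded with re-check: " + replayWithRecheck());
    }
}
```

In this model the first replay leaves one queued message with write polling disabled and no wake-up requested, which corresponds to the hang seen in the reproducer. The second replay re-enables write polling because the re-check runs after the flag has been cleared.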


