You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Mikhail Petrov (Jira)" <ji...@apache.org> on 2020/09/08 10:23:00 UTC
[jira] [Commented] (IGNITE-13361) Sending of communication messages
can hang infinitely.
[ https://issues.apache.org/jira/browse/IGNITE-13361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192117#comment-17192117 ]
Mikhail Petrov commented on IGNITE-13361:
-----------------------------------------
[~alex_pl] Thanks a lot for the review.
> Sending of communication messages can hang infinitely.
> ------------------------------------------------------
>
> Key: IGNITE-13361
> URL: https://issues.apache.org/jira/browse/IGNITE-13361
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Fix For: 2.10
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The following test hangs intermittently (once for 4-5 runs) on my laptop (Ubuntu 20.04, i7-8565u, 16gb RAM, JDK 1.8.0_251). The cursor iteration randomly hangs on the stage of waiting for the next page from the remote node.
> {code:java}
> /** */
> public static final int NODES_CNT = 2;
> /** */
> public static final int TABLE_POPULATION = 2000;
> /** */
> public static final int SELECT_RANGE = 1000;
> /** */
> public static final int QRY_PAGE_SIZE = 5;
> /** */
> @Test
> public void test() throws Exception {
> for (int i = 0; i < NODES_CNT; i++)
> startGrid(i, false);
> IgniteEx cli = startGrid(NODES_CNT, true);
> GridQueryProcessor qryProc = cli.context().query();
> qryProc.querySqlFields(
> new SqlFieldsQuery("CREATE TABLE test_table (id LONG PRIMARY KEY, val LONG)"), false);
> qryProc.querySqlFields(new SqlFieldsQuery("CREATE INDEX val_idx ON test_table (val)"), false);
> for (long l = 0; l < TABLE_POPULATION; ++l) {
> qryProc.querySqlFields(
> new SqlFieldsQuery("INSERT INTO test_table (id, val) VALUES (?, ?)").setArgs(l, l),
> true
> );
> }
> for (int i = 0; i < 10000 ; i++) {
> long lowId = ThreadLocalRandom.current().nextLong(TABLE_POPULATION - SELECT_RANGE);
> long highId = lowId + SELECT_RANGE;
> try (
> FieldsQueryCursor<List<?>> cursor = cli
> .context().query().querySqlFields(
> new SqlFieldsQuery("SELECT id, val FROM test_table WHERE id BETWEEN ? and ?")
> .setArgs(lowId, highId)
> .setPageSize(QRY_PAGE_SIZE),
> false
> )
> ) {
> cursor.iterator().forEachRemaining(val -> {});
> }
> }
> }
> /** */
> private IgniteEx startGrid(int idx, boolean clientMode) throws Exception {
> return (IgniteEx) Ignition.start(new IgniteConfiguration()
> .setIgniteInstanceName("node-" + idx)
> .setGridLogger(new Log4JLogger("modules/core/src/test/config/log4j-test.xml"))
> .setClientMode(clientMode));
> }
> {code}
> UPD It seems that IGNITE-12845 is responsible for the behavior described above. Commit which is related to this ticket is the first since which the code mentioned above started to hang.
> Cursor iteration hangs due to GridQueryNextPageRequest in some cases are not sent correctly from the client node.
> UPD Simplified reproducer of the problem described above:
> {code:java}
> @Test
> public void test() throws Exception {
> IgniteEx srv = startGrid(0);
> IgniteEx cli = startClientGrid(1);
> GridQueryNextPageRequest msg = new GridQueryNextPageRequest(0, 0, 0, 0, (byte)0);
> CyclicBarrier barrier = new CyclicBarrier(2);
> srv.context().io().addMessageListener(GridTopic.TOPIC_QUERY, new GridMessageListener() {
> @Override public void onMessage(UUID nodeId, Object msg, byte plc) {
> try {
> if (msg instanceof GridQueryNextPageRequest)
> barrier.await();
> }
> catch (InterruptedException | BrokenBarrierException e) {
> throw new RuntimeException(e);
> }
> }
> });
> for (int i = 0; i < 1000; i++) {
> barrier.reset();
> cli.context().io().sendToGridTopic(srv.context().discovery().localNode(), GridTopic.TOPIC_QUERY, msg, GridIoPolicy.QUERY_POOL);
> try {
> barrier.await(1, TimeUnit.SECONDS);
> }
> catch (InterruptedException | BrokenBarrierException | TimeoutException e) {
> fail();
> }
> }
> }
> {code}
> The root cause of the hanging is lack of synchronization between org.apache.ignite.internal.util.nio.GridNioServer#stopPollingForWrite and org.apache.ignite.internal.util.nio.GridNioServer#send0 methods. The following situation is possible:
> 1. In stopPollingForWrite method worker thread checks that the the queue is empty:
> {code:java}
> if (ses.writeQueue().isEmpty()) {
> {code}
> and this condition appears true. The worker thread stops its execution.
> 2. Message sender thread calls send0 method and it returns. org.apache.ignite.internal.util.nio.GridSelectorNioSessionImpl#procWrite was not set to false yet, so sending message isn't added to worker queue due to:
> {code:java}
> else if (!ses.procWrite.get() && ses.procWrite.compareAndSet(false, true)) {
> {code}
> 3. The worker thread continues stopPollingForWrite execution and disables OP_WRITE flag, which means that socket write events are no longer listened.
> So the message remains unsent.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)