You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@ignite.apache.org by Vyacheslav Daradur <da...@gmail.com> on 2017/08/04 08:41:29 UTC

Nodes which started in separate JVM couldn't stop properly (in tests)

Hi Igniters,

Working on my task I found a bug at call the method #stopGrid(name),
it produced ClassCastException. I created a ticket[1].

After it was fixed[2] I saw that nodes which was started in a separate JVM
could stay in process of operation system.
It was fixed too, but not sure is it fixed in proper way or not.

Could someone review it?

[1] https://issues.apache.org/jira/browse/IGNITE-5910
[2] https://github.com/apache/ignite/pull/2382

-- 
Best Regards, Vyacheslav D.

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Vyacheslav Daradur <da...@gmail.com>.

Dmitry, Nikolay, thanks for your help!

Since the changes are merged we are able to implement this feature in
Compatibility Testing Framework.

It provides us to work flexibly with multi-version cluster and
implement new testing scenarios (I know that Ignite doesn't have such
feature yet).

Does it make sense to create a ticket?


On Wed, Mar 21, 2018 at 8:16 PM, Dmitry Pavlov <dp...@gmail.com> wrote:
> Hi Nickolay,
>
> it seems we have lazy consesus here.
>
> Failures:  tests 11 suites 1, all these tests are failed in master.
>
> Could you merge?
>
> Sincerely,
> Dmitriy Pavlov
>
> пт, 16 мар. 2018 г. в 18:29, Nikolay Izhikov <ni...@apache.org>:
>
>> Hello, Guys.
>>
>> I'm reviewed changes and it looks good to me.
>> There is a simple reproducer for a bug in test framework, see below.
>>
>> It fails in master and works in branch.
>>
>> I'm planning to merge the fix [1] if Run All will be OK.
>>
>> Please, write to me if you have any objections.
>>
>> [1] https://github.com/apache/ignite/pull/2382
>>
>> ```
>> public class MultiJvmSelfTest extends GridCommonAbstractTest {
>>     @Override protected boolean isMultiJvm() { return true; }
>>
>>     public void testGrid() throws Exception {
>>         final IgniteInternalFuture fut = GridTestUtils.runAsync(new
>> RunnableX() {
>>             @Override public void runx() throws Exception {
>>                 try {
>>                     startGrid(0);
>>                     startGrid(1);
>>                 }
>>                 finally {
>>                     stopGrid(1);
>>                     stopGrid(0);
>>                 }
>>             }
>>         });
>>
>>         try {
>>             fut.get(20_000L);
>>         } finally {
>>             stopAllGrids(true);
>>         }
>>     }
>> }
>> ```
>>
>> В Чт, 15/03/2018 в 15:59 +0000, Dmitry Pavlov пишет:
>> > I see now. Thank you.
>> >
>> > Nikolay, could you please merge this change?
>> >
>> > чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <da...@gmail.com>:
>> >
>> > > In brief:
>> > > Nodes in *separate* JVMs are shutting down by the computing task
>> > > *StopGridTask* which has sent from *local* JVM *synchronously* that
>> > > means *local* node must wait for task's finish.
>> > >
>> > > At the same time when a node in *separate* JVM executes the received
>> > > *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
>> > > FALSE)* which is waiting for all computing task's finish, including
>> > > *StopGridTask* which has invoked it.
>> > >
>> > > We have some kind of deadlock:
>> > > *Local* node is waiting for the computing task's finish which is
>> > > waiting for finish of execution *G.stop* which is waiting for all
>> > > computing tasks finish including *StopGridTask*.
>> > >
>> > > We have not noticed that before because we use only stopAllGrids() in
>> > > out tests which stop local JVM without waiting for nodes in other
>> > > JVMs.
>> > >
>> > >
>> > >
>> > > On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dp...@gmail.com>
>> > > wrote:
>> > > > Please address comments in PR.
>> > > >
>> > > > I did not fully understood why sync GridStopMessage message was
>> lost, but
>> > > > async will be successfull. Probably we need discuss it briefly.
>> > > >
>> > > > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <daradurvs@gmail.com
>> >:
>> > > > >
>> > > > > Thank you, Dmitry!
>> > > > >
>> > > > > I'll join this review soon.
>> > > > >
>> > > > > On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <
>> dpavlov.spb@gmail.com>
>> > > > > wrote:
>> > > > > > Hi Vyacheslav,
>> > > > > >
>> > > > > > I will take a look, but first of all I am going to review
>> > > > > > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  -
>> it is
>> > > > > > impact
>> > > > > > change in testing framework. Hope you also will join to this
>> review .
>> > > > > >
>> > > > > > Sincerely,
>> > > > > > Dmitiry Pavlov
>> > > > > >
>> > > > > >
>> > > > > > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <
>> daradurvs@gmail.com>:
>> > > > > > >
>> > > > > > > Hi, Dmitry, could you please review it, because you are one of
>> the
>> > > > > > > most experienced people in the testing framework.
>> > > > > > >
>> > > > > > > Please see comment in Jira, because it is in pretty-format
>> there.
>> > > > > > >
>> > > > > > > On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
>> > > > > > > <da...@gmail.com> wrote:
>> > > > > > > > Hi Igniters!
>> > > > > > > >
>> > > > > > > > I have investigated the issue [1] and found that stopping
>> node in
>> > > > > > > > separate JVM may stuck thread or leave system process alive
>> after
>> > > > > > > > test
>> > > > > > > > finished.
>> > > > > > > > The main reason is *StopGridTask* that we send from node in
>> local
>> > >
>> > > JVM
>> > > > > > > > to node in separate JVM via remote computing.
>> > > > > > > > We send job synchronously to be sure that node will be
>> stopped, but
>> > > > > > > > job calls synchronously *G.stop(igniteInstanceName,
>> cancel))* with
>> > > > > > > > *cancel = false*, that means node must wait to compute jobs
>> before
>> > >
>> > > it
>> > > > > > > > goes down what leads to some kind of deadlock. Using of
>> *cancel =
>> > > > > > > > true* would solve the issue but may break some tests’ logic,
>> for
>> > >
>> > > this
>> > > > > > > > reason, I've reworked the method’s synchronization logic [2].
>> > > > > > > >
>> > > > > > > > We have not noticed that before because we use only
>> > >
>> > > *stopAllGrids()*
>> > > > > > > > in out tests which stop local JVM without waiting for nodes
>> in
>> > >
>> > > other
>> > > > > > > > JVMs.
>> > > > > > > > I believe this fix should reduce the number of flaky tests on
>> > > > > > > > TeamCity, especially which fails because of a cluster from
>> the
>> > > > > > > > previous test has not been stopped properly.
>> > > > > > > >
>> > > > > > > > Ci.tests [3] look a bit better than in master.
>> > > > > > > > Please review prepared PR [2] and share your thoughts.
>> > > > > > > >
>> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> > > > > > > > [2] https://github.com/apache/ignite/pull/2382
>> > > > > > > > [3]
>> https://ci.ignite.apache.org/viewLog.html?buildId=1105939
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
>> > > > > > > > <da...@gmail.com> wrote:
>> > > > > > > > > Hi Igniters,
>> > > > > > > > >
>> > > > > > > > > Working on my task I found a bug at call the method
>> > >
>> > > #stopGrid(name),
>> > > > > > > > > it produced ClassCastException. I created a ticket[1].
>> > > > > > > > >
>> > > > > > > > > After it was fixed[2] I saw that nodes which was started
>> in a
>> > > > > > > > > separate
>> > > > > > > > > JVM
>> > > > > > > > > could stay in process of operation system.
>> > > > > > > > > It was fixed too, but not sure is it fixed in proper way
>> or not.
>> > > > > > > > >
>> > > > > > > > > Could someone review it?
>> > > > > > > > >
>> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> > > > > > > > > [2] https://github.com/apache/ignite/pull/2382
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Best Regards, Vyacheslav D.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Best Regards, Vyacheslav D.
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Best Regards, Vyacheslav D.
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best Regards, Vyacheslav D.
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards, Vyacheslav D.
>> > >



-- 
Best Regards, Vyacheslav D.

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Hi Nickolay,

it seems we have lazy consesus here.

Failures:  tests 11 suites 1, all these tests are failed in master.

Could you merge?

Sincerely,
Dmitriy Pavlov

пт, 16 мар. 2018 г. в 18:29, Nikolay Izhikov <ni...@apache.org>:

> Hello, Guys.
>
> I'm reviewed changes and it looks good to me.
> There is a simple reproducer for a bug in test framework, see below.
>
> It fails in master and works in branch.
>
> I'm planning to merge the fix [1] if Run All will be OK.
>
> Please, write to me if you have any objections.
>
> [1] https://github.com/apache/ignite/pull/2382
>
> ```
> public class MultiJvmSelfTest extends GridCommonAbstractTest {
>     @Override protected boolean isMultiJvm() { return true; }
>
>     public void testGrid() throws Exception {
>         final IgniteInternalFuture fut = GridTestUtils.runAsync(new
> RunnableX() {
>             @Override public void runx() throws Exception {
>                 try {
>                     startGrid(0);
>                     startGrid(1);
>                 }
>                 finally {
>                     stopGrid(1);
>                     stopGrid(0);
>                 }
>             }
>         });
>
>         try {
>             fut.get(20_000L);
>         } finally {
>             stopAllGrids(true);
>         }
>     }
> }
> ```
>
> В Чт, 15/03/2018 в 15:59 +0000, Dmitry Pavlov пишет:
> > I see now. Thank you.
> >
> > Nikolay, could you please merge this change?
> >
> > чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <da...@gmail.com>:
> >
> > > In brief:
> > > Nodes in *separate* JVMs are shutting down by the computing task
> > > *StopGridTask* which has sent from *local* JVM *synchronously* that
> > > means *local* node must wait for task's finish.
> > >
> > > At the same time when a node in *separate* JVM executes the received
> > > *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
> > > FALSE)* which is waiting for all computing task's finish, including
> > > *StopGridTask* which has invoked it.
> > >
> > > We have some kind of deadlock:
> > > *Local* node is waiting for the computing task's finish which is
> > > waiting for finish of execution *G.stop* which is waiting for all
> > > computing tasks finish including *StopGridTask*.
> > >
> > > We have not noticed that before because we use only stopAllGrids() in
> > > out tests which stop local JVM without waiting for nodes in other
> > > JVMs.
> > >
> > >
> > >
> > > On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dp...@gmail.com>
> > > wrote:
> > > > Please address comments in PR.
> > > >
> > > > I did not fully understood why sync GridStopMessage message was
> lost, but
> > > > async will be successfull. Probably we need discuss it briefly.
> > > >
> > > > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <daradurvs@gmail.com
> >:
> > > > >
> > > > > Thank you, Dmitry!
> > > > >
> > > > > I'll join this review soon.
> > > > >
> > > > > On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <
> dpavlov.spb@gmail.com>
> > > > > wrote:
> > > > > > Hi Vyacheslav,
> > > > > >
> > > > > > I will take a look, but first of all I am going to review
> > > > > > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  -
> it is
> > > > > > impact
> > > > > > change in testing framework. Hope you also will join to this
> review .
> > > > > >
> > > > > > Sincerely,
> > > > > > Dmitiry Pavlov
> > > > > >
> > > > > >
> > > > > > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <
> daradurvs@gmail.com>:
> > > > > > >
> > > > > > > Hi, Dmitry, could you please review it, because you are one of
> the
> > > > > > > most experienced people in the testing framework.
> > > > > > >
> > > > > > > Please see comment in Jira, because it is in pretty-format
> there.
> > > > > > >
> > > > > > > On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> > > > > > > <da...@gmail.com> wrote:
> > > > > > > > Hi Igniters!
> > > > > > > >
> > > > > > > > I have investigated the issue [1] and found that stopping
> node in
> > > > > > > > separate JVM may stuck thread or leave system process alive
> after
> > > > > > > > test
> > > > > > > > finished.
> > > > > > > > The main reason is *StopGridTask* that we send from node in
> local
> > >
> > > JVM
> > > > > > > > to node in separate JVM via remote computing.
> > > > > > > > We send job synchronously to be sure that node will be
> stopped, but
> > > > > > > > job calls synchronously *G.stop(igniteInstanceName,
> cancel))* with
> > > > > > > > *cancel = false*, that means node must wait to compute jobs
> before
> > >
> > > it
> > > > > > > > goes down what leads to some kind of deadlock. Using of
> *cancel =
> > > > > > > > true* would solve the issue but may break some tests’ logic,
> for
> > >
> > > this
> > > > > > > > reason, I've reworked the method’s synchronization logic [2].
> > > > > > > >
> > > > > > > > We have not noticed that before because we use only
> > >
> > > *stopAllGrids()*
> > > > > > > > in out tests which stop local JVM without waiting for nodes
> in
> > >
> > > other
> > > > > > > > JVMs.
> > > > > > > > I believe this fix should reduce the number of flaky tests on
> > > > > > > > TeamCity, especially which fails because of a cluster from
> the
> > > > > > > > previous test has not been stopped properly.
> > > > > > > >
> > > > > > > > Ci.tests [3] look a bit better than in master.
> > > > > > > > Please review prepared PR [2] and share your thoughts.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > > > > > > > [2] https://github.com/apache/ignite/pull/2382
> > > > > > > > [3]
> https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
> > > > > > > > <da...@gmail.com> wrote:
> > > > > > > > > Hi Igniters,
> > > > > > > > >
> > > > > > > > > Working on my task I found a bug at call the method
> > >
> > > #stopGrid(name),
> > > > > > > > > it produced ClassCastException. I created a ticket[1].
> > > > > > > > >
> > > > > > > > > After it was fixed[2] I saw that nodes which was started
> in a
> > > > > > > > > separate
> > > > > > > > > JVM
> > > > > > > > > could stay in process of operation system.
> > > > > > > > > It was fixed too, but not sure is it fixed in proper way
> or not.
> > > > > > > > >
> > > > > > > > > Could someone review it?
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > > > > > > > > [2] https://github.com/apache/ignite/pull/2382
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards, Vyacheslav D.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best Regards, Vyacheslav D.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards, Vyacheslav D.
> > >
> > >
> > >
> > > --
> > > Best Regards, Vyacheslav D.
> > >

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Nikolay Izhikov <ni...@apache.org>.

Hello, Guys.

I'm reviewed changes and it looks good to me.
There is a simple reproducer for a bug in test framework, see below.

It fails in master and works in branch.

I'm planning to merge the fix [1] if Run All will be OK.

Please, write to me if you have any objections.

[1] https://github.com/apache/ignite/pull/2382

```
public class MultiJvmSelfTest extends GridCommonAbstractTest {
    @Override protected boolean isMultiJvm() { return true; }

    public void testGrid() throws Exception {
        final IgniteInternalFuture fut = GridTestUtils.runAsync(new RunnableX() {
            @Override public void runx() throws Exception {
                try {
                    startGrid(0);
                    startGrid(1);
                }
                finally {
                    stopGrid(1);
                    stopGrid(0);
                }
            }
        });

        try {
            fut.get(20_000L);
        } finally {
            stopAllGrids(true);
        }
    }
}
```

В Чт, 15/03/2018 в 15:59 +0000, Dmitry Pavlov пишет:
> I see now. Thank you.
> 
> Nikolay, could you please merge this change?
> 
> чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <da...@gmail.com>:
> 
> > In brief:
> > Nodes in *separate* JVMs are shutting down by the computing task
> > *StopGridTask* which has sent from *local* JVM *synchronously* that
> > means *local* node must wait for task's finish.
> > 
> > At the same time when a node in *separate* JVM executes the received
> > *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
> > FALSE)* which is waiting for all computing task's finish, including
> > *StopGridTask* which has invoked it.
> > 
> > We have some kind of deadlock:
> > *Local* node is waiting for the computing task's finish which is
> > waiting for finish of execution *G.stop* which is waiting for all
> > computing tasks finish including *StopGridTask*.
> > 
> > We have not noticed that before because we use only stopAllGrids() in
> > out tests which stop local JVM without waiting for nodes in other
> > JVMs.
> > 
> > 
> > 
> > On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dp...@gmail.com>
> > wrote:
> > > Please address comments in PR.
> > > 
> > > I did not fully understood why sync GridStopMessage message was lost, but
> > > async will be successfull. Probably we need discuss it briefly.
> > > 
> > > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <da...@gmail.com>:
> > > > 
> > > > Thank you, Dmitry!
> > > > 
> > > > I'll join this review soon.
> > > > 
> > > > On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dp...@gmail.com>
> > > > wrote:
> > > > > Hi Vyacheslav,
> > > > > 
> > > > > I will take a look, but first of all I am going to review
> > > > > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is
> > > > > impact
> > > > > change in testing framework. Hope you also will join to this review .
> > > > > 
> > > > > Sincerely,
> > > > > Dmitiry Pavlov
> > > > > 
> > > > > 
> > > > > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <da...@gmail.com>:
> > > > > > 
> > > > > > Hi, Dmitry, could you please review it, because you are one of the
> > > > > > most experienced people in the testing framework.
> > > > > > 
> > > > > > Please see comment in Jira, because it is in pretty-format there.
> > > > > > 
> > > > > > On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> > > > > > <da...@gmail.com> wrote:
> > > > > > > Hi Igniters!
> > > > > > > 
> > > > > > > I have investigated the issue [1] and found that stopping node in
> > > > > > > separate JVM may stuck thread or leave system process alive after
> > > > > > > test
> > > > > > > finished.
> > > > > > > The main reason is *StopGridTask* that we send from node in local
> > 
> > JVM
> > > > > > > to node in separate JVM via remote computing.
> > > > > > > We send job synchronously to be sure that node will be stopped, but
> > > > > > > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
> > > > > > > *cancel = false*, that means node must wait to compute jobs before
> > 
> > it
> > > > > > > goes down what leads to some kind of deadlock. Using of *cancel =
> > > > > > > true* would solve the issue but may break some tests’ logic, for
> > 
> > this
> > > > > > > reason, I've reworked the method’s synchronization logic [2].
> > > > > > > 
> > > > > > > We have not noticed that before because we use only
> > 
> > *stopAllGrids()*
> > > > > > > in out tests which stop local JVM without waiting for nodes in
> > 
> > other
> > > > > > > JVMs.
> > > > > > > I believe this fix should reduce the number of flaky tests on
> > > > > > > TeamCity, especially which fails because of a cluster from the
> > > > > > > previous test has not been stopped properly.
> > > > > > > 
> > > > > > > Ci.tests [3] look a bit better than in master.
> > > > > > > Please review prepared PR [2] and share your thoughts.
> > > > > > > 
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > > > > > > [2] https://github.com/apache/ignite/pull/2382
> > > > > > > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> > > > > > > 
> > > > > > > 
> > > > > > > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
> > > > > > > <da...@gmail.com> wrote:
> > > > > > > > Hi Igniters,
> > > > > > > > 
> > > > > > > > Working on my task I found a bug at call the method
> > 
> > #stopGrid(name),
> > > > > > > > it produced ClassCastException. I created a ticket[1].
> > > > > > > > 
> > > > > > > > After it was fixed[2] I saw that nodes which was started in a
> > > > > > > > separate
> > > > > > > > JVM
> > > > > > > > could stay in process of operation system.
> > > > > > > > It was fixed too, but not sure is it fixed in proper way or not.
> > > > > > > > 
> > > > > > > > Could someone review it?
> > > > > > > > 
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > > > > > > > [2] https://github.com/apache/ignite/pull/2382
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Best Regards, Vyacheslav D.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Best Regards, Vyacheslav D.
> > > > 
> > > > 
> > > > 
> > > > --
> > > > Best Regards, Vyacheslav D.
> > 
> > 
> > 
> > --
> > Best Regards, Vyacheslav D.
> >

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Dmitry Pavlov <dp...@gmail.com>.

I see now. Thank you.

Nikolay, could you please merge this change?

чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <da...@gmail.com>:

> In brief:
> Nodes in *separate* JVMs are shutting down by the computing task
> *StopGridTask* which has sent from *local* JVM *synchronously* that
> means *local* node must wait for task's finish.
>
> At the same time when a node in *separate* JVM executes the received
> *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
> FALSE)* which is waiting for all computing task's finish, including
> *StopGridTask* which has invoked it.
>
> We have some kind of deadlock:
> *Local* node is waiting for the computing task's finish which is
> waiting for finish of execution *G.stop* which is waiting for all
> computing tasks finish including *StopGridTask*.
>
> We have not noticed that before because we use only stopAllGrids() in
> out tests which stop local JVM without waiting for nodes in other
> JVMs.
>
>
>
> On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dp...@gmail.com>
> wrote:
> > Please address comments in PR.
> >
> > I did not fully understood why sync GridStopMessage message was lost, but
> > async will be successfull. Probably we need discuss it briefly.
> >
> > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <da...@gmail.com>:
> >>
> >> Thank you, Dmitry!
> >>
> >> I'll join this review soon.
> >>
> >> On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dp...@gmail.com>
> >> wrote:
> >> > Hi Vyacheslav,
> >> >
> >> > I will take a look, but first of all I am going to review
> >> > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is
> >> > impact
> >> > change in testing framework. Hope you also will join to this review .
> >> >
> >> > Sincerely,
> >> > Dmitiry Pavlov
> >> >
> >> >
> >> > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <da...@gmail.com>:
> >> >>
> >> >> Hi, Dmitry, could you please review it, because you are one of the
> >> >> most experienced people in the testing framework.
> >> >>
> >> >> Please see comment in Jira, because it is in pretty-format there.
> >> >>
> >> >> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> >> >> <da...@gmail.com> wrote:
> >> >> > Hi Igniters!
> >> >> >
> >> >> > I have investigated the issue [1] and found that stopping node in
> >> >> > separate JVM may stuck thread or leave system process alive after
> >> >> > test
> >> >> > finished.
> >> >> > The main reason is *StopGridTask* that we send from node in local
> JVM
> >> >> > to node in separate JVM via remote computing.
> >> >> > We send job synchronously to be sure that node will be stopped, but
> >> >> > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
> >> >> > *cancel = false*, that means node must wait to compute jobs before
> it
> >> >> > goes down what leads to some kind of deadlock. Using of *cancel =
> >> >> > true* would solve the issue but may break some tests’ logic, for
> this
> >> >> > reason, I've reworked the method’s synchronization logic [2].
> >> >> >
> >> >> > We have not noticed that before because we use only
> *stopAllGrids()*
> >> >> > in out tests which stop local JVM without waiting for nodes in
> other
> >> >> > JVMs.
> >> >> > I believe this fix should reduce the number of flaky tests on
> >> >> > TeamCity, especially which fails because of a cluster from the
> >> >> > previous test has not been stopped properly.
> >> >> >
> >> >> > Ci.tests [3] look a bit better than in master.
> >> >> > Please review prepared PR [2] and share your thoughts.
> >> >> >
> >> >> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> >> > [2] https://github.com/apache/ignite/pull/2382
> >> >> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> >> >> >
> >> >> >
> >> >> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
> >> >> > <da...@gmail.com> wrote:
> >> >> >> Hi Igniters,
> >> >> >>
> >> >> >> Working on my task I found a bug at call the method
> #stopGrid(name),
> >> >> >> it produced ClassCastException. I created a ticket[1].
> >> >> >>
> >> >> >> After it was fixed[2] I saw that nodes which was started in a
> >> >> >> separate
> >> >> >> JVM
> >> >> >> could stay in process of operation system.
> >> >> >> It was fixed too, but not sure is it fixed in proper way or not.
> >> >> >>
> >> >> >> Could someone review it?
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> >> >> [2] https://github.com/apache/ignite/pull/2382
> >> >> >>
> >> >> >> --
> >> >> >> Best Regards, Vyacheslav D.
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Best Regards, Vyacheslav D.
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Vyacheslav D.
> >>
> >>
> >>
> >> --
> >> Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.
>

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Vyacheslav Daradur <da...@gmail.com>.

In brief:
Nodes in *separate* JVMs are shutting down by the computing task
*StopGridTask* which has sent from *local* JVM *synchronously* that
means *local* node must wait for task's finish.

At the same time when a node in *separate* JVM executes the received
*StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
FALSE)* which is waiting for all computing task's finish, including
*StopGridTask* which has invoked it.

We have some kind of deadlock:
*Local* node is waiting for the computing task's finish which is
waiting for finish of execution *G.stop* which is waiting for all
computing tasks finish including *StopGridTask*.

We have not noticed that before because we use only stopAllGrids() in
out tests which stop local JVM without waiting for nodes in other
JVMs.



On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dp...@gmail.com> wrote:
> Please address comments in PR.
>
> I did not fully understood why sync GridStopMessage message was lost, but
> async will be successfull. Probably we need discuss it briefly.
>
> чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <da...@gmail.com>:
>>
>> Thank you, Dmitry!
>>
>> I'll join this review soon.
>>
>> On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dp...@gmail.com>
>> wrote:
>> > Hi Vyacheslav,
>> >
>> > I will take a look, but first of all I am going to review
>> > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is
>> > impact
>> > change in testing framework. Hope you also will join to this review .
>> >
>> > Sincerely,
>> > Dmitiry Pavlov
>> >
>> >
>> > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <da...@gmail.com>:
>> >>
>> >> Hi, Dmitry, could you please review it, because you are one of the
>> >> most experienced people in the testing framework.
>> >>
>> >> Please see comment in Jira, because it is in pretty-format there.
>> >>
>> >> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
>> >> <da...@gmail.com> wrote:
>> >> > Hi Igniters!
>> >> >
>> >> > I have investigated the issue [1] and found that stopping node in
>> >> > separate JVM may stuck thread or leave system process alive after
>> >> > test
>> >> > finished.
>> >> > The main reason is *StopGridTask* that we send from node in local JVM
>> >> > to node in separate JVM via remote computing.
>> >> > We send job synchronously to be sure that node will be stopped, but
>> >> > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
>> >> > *cancel = false*, that means node must wait to compute jobs before it
>> >> > goes down what leads to some kind of deadlock. Using of *cancel =
>> >> > true* would solve the issue but may break some tests’ logic, for this
>> >> > reason, I've reworked the method’s synchronization logic [2].
>> >> >
>> >> > We have not noticed that before because we use only *stopAllGrids()*
>> >> > in out tests which stop local JVM without waiting for nodes in other
>> >> > JVMs.
>> >> > I believe this fix should reduce the number of flaky tests on
>> >> > TeamCity, especially which fails because of a cluster from the
>> >> > previous test has not been stopped properly.
>> >> >
>> >> > Ci.tests [3] look a bit better than in master.
>> >> > Please review prepared PR [2] and share your thoughts.
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> >> > [2] https://github.com/apache/ignite/pull/2382
>> >> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
>> >> >
>> >> >
>> >> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
>> >> > <da...@gmail.com> wrote:
>> >> >> Hi Igniters,
>> >> >>
>> >> >> Working on my task I found a bug at call the method #stopGrid(name),
>> >> >> it produced ClassCastException. I created a ticket[1].
>> >> >>
>> >> >> After it was fixed[2] I saw that nodes which was started in a
>> >> >> separate
>> >> >> JVM
>> >> >> could stay in process of operation system.
>> >> >> It was fixed too, but not sure is it fixed in proper way or not.
>> >> >>
>> >> >> Could someone review it?
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> >> >> [2] https://github.com/apache/ignite/pull/2382
>> >> >>
>> >> >> --
>> >> >> Best Regards, Vyacheslav D.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best Regards, Vyacheslav D.
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Vyacheslav D.
>>
>>
>>
>> --
>> Best Regards, Vyacheslav D.



-- 
Best Regards, Vyacheslav D.

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Please address comments in PR.

I did not fully understood why sync GridStopMessage message was lost, but
async will be successfull. Probably we need discuss it briefly.

чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <da...@gmail.com>:

> Thank you, Dmitry!
>
> I'll join this review soon.
>
> On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dp...@gmail.com>
> wrote:
> > Hi Vyacheslav,
> >
> > I will take a look, but first of all I am going to review
> > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is
> impact
> > change in testing framework. Hope you also will join to this review .
> >
> > Sincerely,
> > Dmitiry Pavlov
> >
> >
> > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <da...@gmail.com>:
> >>
> >> Hi, Dmitry, could you please review it, because you are one of the
> >> most experienced people in the testing framework.
> >>
> >> Please see comment in Jira, because it is in pretty-format there.
> >>
> >> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> >> <da...@gmail.com> wrote:
> >> > Hi Igniters!
> >> >
> >> > I have investigated the issue [1] and found that stopping node in
> >> > separate JVM may stuck thread or leave system process alive after test
> >> > finished.
> >> > The main reason is *StopGridTask* that we send from node in local JVM
> >> > to node in separate JVM via remote computing.
> >> > We send job synchronously to be sure that node will be stopped, but
> >> > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
> >> > *cancel = false*, that means node must wait to compute jobs before it
> >> > goes down what leads to some kind of deadlock. Using of *cancel =
> >> > true* would solve the issue but may break some tests’ logic, for this
> >> > reason, I've reworked the method’s synchronization logic [2].
> >> >
> >> > We have not noticed that before because we use only *stopAllGrids()*
> >> > in out tests which stop local JVM without waiting for nodes in other
> >> > JVMs.
> >> > I believe this fix should reduce the number of flaky tests on
> >> > TeamCity, especially which fails because of a cluster from the
> >> > previous test has not been stopped properly.
> >> >
> >> > Ci.tests [3] look a bit better than in master.
> >> > Please review prepared PR [2] and share your thoughts.
> >> >
> >> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> > [2] https://github.com/apache/ignite/pull/2382
> >> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> >> >
> >> >
> >> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
> >> > <da...@gmail.com> wrote:
> >> >> Hi Igniters,
> >> >>
> >> >> Working on my task I found a bug at call the method #stopGrid(name),
> >> >> it produced ClassCastException. I created a ticket[1].
> >> >>
> >> >> After it was fixed[2] I saw that nodes which was started in a
> separate
> >> >> JVM
> >> >> could stay in process of operation system.
> >> >> It was fixed too, but not sure is it fixed in proper way or not.
> >> >>
> >> >> Could someone review it?
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> >> [2] https://github.com/apache/ignite/pull/2382
> >> >>
> >> >> --
> >> >> Best Regards, Vyacheslav D.
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards, Vyacheslav D.
> >>
> >>
> >>
> >> --
> >> Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.
>

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Vyacheslav Daradur <da...@gmail.com>.

Thank you, Dmitry!

I'll join this review soon.

On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dp...@gmail.com> wrote:
> Hi Vyacheslav,
>
> I will take a look, but first of all I am going to review
> https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is impact
> change in testing framework. Hope you also will join to this review .
>
> Sincerely,
> Dmitiry Pavlov
>
>
> чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <da...@gmail.com>:
>>
>> Hi, Dmitry, could you please review it, because you are one of the
>> most experienced people in the testing framework.
>>
>> Please see comment in Jira, because it is in pretty-format there.
>>
>> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
>> <da...@gmail.com> wrote:
>> > Hi Igniters!
>> >
>> > I have investigated the issue [1] and found that stopping node in
>> > separate JVM may stuck thread or leave system process alive after test
>> > finished.
>> > The main reason is *StopGridTask* that we send from node in local JVM
>> > to node in separate JVM via remote computing.
>> > We send job synchronously to be sure that node will be stopped, but
>> > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
>> > *cancel = false*, that means node must wait to compute jobs before it
>> > goes down what leads to some kind of deadlock. Using of *cancel =
>> > true* would solve the issue but may break some tests’ logic, for this
>> > reason, I've reworked the method’s synchronization logic [2].
>> >
>> > We have not noticed that before because we use only *stopAllGrids()*
>> > in out tests which stop local JVM without waiting for nodes in other
>> > JVMs.
>> > I believe this fix should reduce the number of flaky tests on
>> > TeamCity, especially which fails because of a cluster from the
>> > previous test has not been stopped properly.
>> >
>> > Ci.tests [3] look a bit better than in master.
>> > Please review prepared PR [2] and share your thoughts.
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> > [2] https://github.com/apache/ignite/pull/2382
>> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
>> >
>> >
>> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
>> > <da...@gmail.com> wrote:
>> >> Hi Igniters,
>> >>
>> >> Working on my task I found a bug at call the method #stopGrid(name),
>> >> it produced ClassCastException. I created a ticket[1].
>> >>
>> >> After it was fixed[2] I saw that nodes which was started in a separate
>> >> JVM
>> >> could stay in process of operation system.
>> >> It was fixed too, but not sure is it fixed in proper way or not.
>> >>
>> >> Could someone review it?
>> >>
>> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> >> [2] https://github.com/apache/ignite/pull/2382
>> >>
>> >> --
>> >> Best Regards, Vyacheslav D.
>> >
>> >
>> >
>> > --
>> > Best Regards, Vyacheslav D.
>>
>>
>>
>> --
>> Best Regards, Vyacheslav D.



-- 
Best Regards, Vyacheslav D.

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Hi Vyacheslav,

I will take a look, but first of all I am going to review
https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is impact
change in testing framework. Hope you also will join to this review .

Sincerely,
Dmitiry Pavlov


чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <da...@gmail.com>:

> Hi, Dmitry, could you please review it, because you are one of the
> most experienced people in the testing framework.
>
> Please see comment in Jira, because it is in pretty-format there.
>
> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> <da...@gmail.com> wrote:
> > Hi Igniters!
> >
> > I have investigated the issue [1] and found that stopping node in
> > separate JVM may stuck thread or leave system process alive after test
> > finished.
> > The main reason is *StopGridTask* that we send from node in local JVM
> > to node in separate JVM via remote computing.
> > We send job synchronously to be sure that node will be stopped, but
> > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
> > *cancel = false*, that means node must wait to compute jobs before it
> > goes down what leads to some kind of deadlock. Using of *cancel =
> > true* would solve the issue but may break some tests’ logic, for this
> > reason, I've reworked the method’s synchronization logic [2].
> >
> > We have not noticed that before because we use only *stopAllGrids()*
> > in out tests which stop local JVM without waiting for nodes in other
> > JVMs.
> > I believe this fix should reduce the number of flaky tests on
> > TeamCity, especially which fails because of a cluster from the
> > previous test has not been stopped properly.
> >
> > Ci.tests [3] look a bit better than in master.
> > Please review prepared PR [2] and share your thoughts.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > [2] https://github.com/apache/ignite/pull/2382
> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> >
> >
> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur <da...@gmail.com>
> wrote:
> >> Hi Igniters,
> >>
> >> Working on my task I found a bug at call the method #stopGrid(name),
> >> it produced ClassCastException. I created a ticket[1].
> >>
> >> After it was fixed[2] I saw that nodes which was started in a separate
> JVM
> >> could stay in process of operation system.
> >> It was fixed too, but not sure is it fixed in proper way or not.
> >>
> >> Could someone review it?
> >>
> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> [2] https://github.com/apache/ignite/pull/2382
> >>
> >> --
> >> Best Regards, Vyacheslav D.
> >
> >
> >
> > --
> > Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.
>

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Vyacheslav Daradur <da...@gmail.com>.

Hi, Dmitry, could you please review it, because you are one of the
most experienced people in the testing framework.

Please see comment in Jira, because it is in pretty-format there.

On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
<da...@gmail.com> wrote:
> Hi Igniters!
>
> I have investigated the issue [1] and found that stopping node in
> separate JVM may stuck thread or leave system process alive after test
> finished.
> The main reason is *StopGridTask* that we send from node in local JVM
> to node in separate JVM via remote computing.
> We send job synchronously to be sure that node will be stopped, but
> job calls synchronously *G.stop(igniteInstanceName, cancel))* with
> *cancel = false*, that means node must wait to compute jobs before it
> goes down what leads to some kind of deadlock. Using of *cancel =
> true* would solve the issue but may break some tests’ logic, for this
> reason, I've reworked the method’s synchronization logic [2].
>
> We have not noticed that before because we use only *stopAllGrids()*
> in out tests which stop local JVM without waiting for nodes in other
> JVMs.
> I believe this fix should reduce the number of flaky tests on
> TeamCity, especially which fails because of a cluster from the
> previous test has not been stopped properly.
>
> Ci.tests [3] look a bit better than in master.
> Please review prepared PR [2] and share your thoughts.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-5910
> [2] https://github.com/apache/ignite/pull/2382
> [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
>
>
> On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur <da...@gmail.com> wrote:
>> Hi Igniters,
>>
>> Working on my task I found a bug at call the method #stopGrid(name),
>> it produced ClassCastException. I created a ticket[1].
>>
>> After it was fixed[2] I saw that nodes which was started in a separate JVM
>> could stay in process of operation system.
>> It was fixed too, but not sure is it fixed in proper way or not.
>>
>> Could someone review it?
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> [2] https://github.com/apache/ignite/pull/2382
>>
>> --
>> Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.



-- 
Best Regards, Vyacheslav D.

Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Posted by Vyacheslav Daradur <da...@gmail.com>.

Hi Igniters!

I have investigated the issue [1] and found that stopping node in
separate JVM may stuck thread or leave system process alive after test
finished.
The main reason is *StopGridTask* that we send from node in local JVM
to node in separate JVM via remote computing.
We send job synchronously to be sure that node will be stopped, but
job calls synchronously *G.stop(igniteInstanceName, cancel))* with
*cancel = false*, that means node must wait to compute jobs before it
goes down what leads to some kind of deadlock. Using of *cancel =
true* would solve the issue but may break some tests’ logic, for this
reason, I've reworked the method’s synchronization logic [2].

We have not noticed that before because we use only *stopAllGrids()*
in out tests which stop local JVM without waiting for nodes in other
JVMs.
I believe this fix should reduce the number of flaky tests on
TeamCity, especially which fails because of a cluster from the
previous test has not been stopped properly.

Ci.tests [3] look a bit better than in master.
Please review prepared PR [2] and share your thoughts.

[1] https://issues.apache.org/jira/browse/IGNITE-5910
[2] https://github.com/apache/ignite/pull/2382
[3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939

On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur <da...@gmail.com> wrote:
> Hi Igniters,
>
> Working on my task I found a bug at call the method #stopGrid(name),
> it produced ClassCastException. I created a ticket[1].
>
> After it was fixed[2] I saw that nodes which was started in a separate JVM
> could stay in process of operation system.
> It was fixed too, but not sure is it fixed in proper way or not.
>
> Could someone review it?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-5910
> [2] https://github.com/apache/ignite/pull/2382
>
> --
> Best Regards, Vyacheslav D.

-- 
Best Regards, Vyacheslav D.