Posted to dev@ignite.apache.org by Sergey Chugunov <se...@gmail.com> on 2016/11/08 09:37:26 UTC

IGNITE-4155 IgniteSemaphoreExample unexpected behavior

 Hello folks,

I found the reason why *IgniteSemaphoreExample* hangs when it is started twice
without restarting the cluster, and it no longer seems minor to me.

From here on I'll be referring to the example's code, so please have it open.

So, when the first node running the example code finishes and leaves the
cluster, the synchronization semaphore named "IgniteSemaphoreExample" goes
into a broken state on all other cluster nodes.
If I then restart the example without restarting the rest of the cluster, the
final *acquire* call on the semaphore on the client side hangs, because all
other nodes treat the semaphore as broken and don't increase its permits with
their *release* calls.
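
For reference, here is roughly the pattern the example follows - a minimal
sketch of my own, not a copy of *IgniteSemaphoreExample*; the config path and
the zero initial permit count are assumptions:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteSemaphore;
    import org.apache.ignite.Ignition;

    public class SemaphoreHangSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("examples/config/example-ignite.xml")) {
                // Create or get the cluster-wide semaphore with zero initial permits;
                // failoverSafe = false (third argument) mirrors the non-failover-safe
                // mode the example uses.
                IgniteSemaphore sem = ignite.semaphore("IgniteSemaphoreExample", 0, false, true);

                // Jobs on the server nodes are expected to call sem.release();
                // on a second run against a non-restarted cluster this acquire()
                // hangs, because those nodes already consider the semaphore broken.
                sem.acquire();
            }
        }
    }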

There is an interesting comment inside its *tryReleaseShared* implementation
(BTW, the semaphore is implemented in *GridCacheSemaphoreImpl*):

"// If broken, return immediately, exception will be thrown anyway.
 if (broken)
   return true;"

It seems that no exception is thrown either on the client side calling
*acquire* or on the server side calling *release* on a broken semaphore.
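
To illustrate why *release* can appear to succeed silently, here is a generic
AQS-style sketch - my illustration of the idea only, not the actual
*GridCacheSemaphoreImpl* code: if the broken flag merely short-circuits
*tryReleaseShared*, the exception mentioned in that comment has to be thrown
somewhere else, and on this path it apparently never is.

    import java.util.concurrent.locks.AbstractQueuedSynchronizer;

    // Generic illustration; the class and field names are made up for the sketch.
    class BrokenAwareSync extends AbstractQueuedSynchronizer {
        private volatile boolean broken;

        @Override protected boolean tryReleaseShared(int releases) {
            // If broken, return immediately, exception will be thrown anyway.
            if (broken)
                return true; // release() "succeeds" without adding any permits...

            for (;;) {
                int cur = getState();
                if (compareAndSetState(cur, cur + releases))
                    return true;
            }
        }

        @Override protected int tryAcquireShared(int acquires) {
            // ...but unless the acquire path also checks 'broken' and throws,
            // a waiting acquirer simply parks forever, which matches the hang.
            for (;;) {
                int available = getState();
                int remaining = available - acquires;
                if (remaining < 0 || compareAndSetState(available, remaining))
                    return remaining;
            }
        }
    }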

Does anybody know why it behaves this way? Is this expected behavior at all,
and if so, where is it documented?

Thanks,
Sergey Chugunov.

Re: IGNITE-4155 IgniteSemaphoreExample unexpected behavior

Posted by Vladisav Jelisavcic <vl...@gmail.com>.
Hi Sergey,

thanks for finding and submitting this bug!

Best regards,
Vladisav


Re: IGNITE-4155 IgniteSemaphoreExample unexpected behavior

Posted by Sergey Chugunov <se...@gmail.com>.
Hello Vladisav,

Thanks for confirmation!

I created a JIRA issue <https://issues.apache.org/jira/browse/IGNITE-4209> to
track this; feel free to edit it if it isn't descriptive enough.

Thank you,
Sergey.


Re: IGNITE-4155 IgniteSemaphoreExample unexpected behavior

Posted by Vladisav Jelisavcic <vl...@gmail.com>.
Hi Sergey,

you are right - I can reproduce this as well.
It seems to me that this is caused by treating EVT_NODE_LEFT and
EVT_NODE_FAILED events the same way.
In this case the node leaves the topology without failing, but does not
manage to release the semaphore before the EVT_NODE_LEFT event is observed on
the other nodes; this really is a bug.
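
Just to illustrate the idea, a sketch of how the two events could be told
apart on the listener side - this is not the actual listener inside
*GridCacheSemaphoreImpl*, only an outline of the distinction:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.events.EventType;

    public class DepartureListenerSketch {
        public static void listen(Ignite ignite) {
            ignite.events().localListen(evt -> {
                if (evt.type() == EventType.EVT_NODE_FAILED) {
                    // The node crashed, possibly while holding permits: breaking a
                    // non-failover-safe semaphore here matches the documented semantics.
                }
                else {
                    // EVT_NODE_LEFT: a graceful shutdown; its permits should already
                    // have been released, so the semaphore should stay usable.
                }
                return true; // keep the listener registered
            }, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
        }
    }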

Thanks!
Vladisav


Re: IGNITE-4155 IgniteSemaphoreExample unexpected behavior

Posted by Sergey Chugunov <se...@gmail.com>.
Hello Vladisav,

I ran into this behavior in a very simple environment: two nodes on my local
machine started with the *ExampleNodeStartup* class, plus another node
started with the *IgniteSemaphoreExample* class.

No modifications were made to any code or configuration, and I used the
latest code available in the master branch.
No node failures occurred during the test execution either.

As far as I understood from a short investigation, the synchronization
semaphore named "IgniteSemaphoreExample" goes into a broken state when the
*IgniteSemaphoreExample* node finishes normally and disconnects from the
cluster.
After that the semaphore can no longer be reused, and any new node that tries
to do so hangs.

Can you reproduce this? If so, I will submit a ticket and share it with you.

Thank you,
Sergey.





-- 
Best regards,
Sergey Chugunov.

Re: IGNITE-4155 IgniteSemaphoreExample unexpected behavior

Posted by Vladisav Jelisavcic <vl...@gmail.com>.
Hi Sergey,

could you please provide more information?
Have you changed the example (if so, can you share the changes you made)?
Did the example execute normally (without node failures)?

In the example the semaphore is created in non-failover-safe mode, which
means it is not safe to use once it is broken (similar to a CyclicBarrier in
java.util.concurrent).
The semaphore itself is preserved even when the first node fails (provided
backups are configured), so if the first node failed, the (broken) semaphore
with the same name should still be in the cache, and this is expected
behavior.
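
To make the two modes concrete, here is a small sketch of creating the
semaphore both ways (the semaphore names and the zero permit count are just
placeholders):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteSemaphore;

    public class SemaphoreModesSketch {
        public static void create(Ignite ignite) {
            // Non-failover-safe, as in the example: if a node owning permits fails,
            // the semaphore is marked broken and, like a broken CyclicBarrier,
            // is no longer safe to use.
            IgniteSemaphore strict = ignite.semaphore("IgniteSemaphoreExample", 0, false, true);

            // Failover-safe: permits owned by a failed node are returned to the
            // semaphore and the surviving nodes can keep using it.
            IgniteSemaphore lenient = ignite.semaphore("failoverSafeSemaphore", 0, true, true);
        }
    }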

If this is not the case (the test executed normally), then please submit a
ticket describing your setup in more detail: how many nodes, how many backups
were configured, etc.

Thanks!
Vladisav
