Posted to dev@ignite.apache.org by Denis Magda <dm...@gridgain.com> on 2015/07/24 14:37:53 UTC

Stopped working on IGNITE-752 (speed up failure detection)

Igniters,

Have just back merged the changes into the main development branch. Thanks Yakov and Dmitriy for spending your time on review!

From now on it’s possible to detect failures at the cluster nodes' discovery/communication/network levels by altering a single parameter - IgniteConfiguration.failureDetectionTimeout.

By setting the failure detection timeout for a server node, failed nodes in the cluster topology can be detected within a time equal to the timeout's value, and the cluster switches to, and keeps working with, only the alive nodes.
Setting the timeout for a client node lets us detect failures between the client and its router node (a server node that is part of the topology).
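
For those who want to try it, here is a rough programmatic sketch (the class name and the 500 ms value are purely illustrative, not recommendations):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureDetectionConfigExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // One knob covering failure detection at the discovery/communication/network levels.
        // 500 ms is an illustrative value; tune it to your network latencies.
        cfg.setFailureDetectionTimeout(500);

        try (Ignite ignite = Ignition.start(cfg)) {
            // A failed node should now be detected within roughly 500 ms.
        }
    }
}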

In addition, a bunch of other improvements and simplifications were made at the level of TcpDiscoverySpi and TcpCommunicationSpi. The changes are aggregated here:
https://issues.apache.org/jira/browse/IGNITE-752

—
Denis

Re: Stopped working on IGNITE-752 (speed up failure detection)

Posted by Denis Magda <dm...@gridgain.com>.
Thanks Dmitriy, please see below

> On 24 July 2015, at 19:15, Dmitriy Setrakyan <ds...@apache.org> wrote:
> 
> Thanks Denis!
> 
> This feature significantly simplifies failure detection configuration in
> Ignite - just one configuration flag now vs. don't even remember how many.
> 
> Have you run a yardstick test on Amazon EC2 with this new configuration
> flag? If we kill a node in the middle, then drop should be insignificant.
> 
I haven’t. Will play with AWS next week and share the results with the community. It definitely shouldn’t be worse than before.

—
Denis

> Also, I want to note your excellent handling of Jira communication. The
> ticket has been thoroughly updated every step of the way.
> 
> D.
> 
> On Fri, Jul 24, 2015 at 5:37 AM, Denis Magda <dmagda@gridgain.com> wrote:
> 
>> Igniters,
>> 
>> Have just back merged the changes into the main development branch. Thanks
>> Yakov and Dmitriy for spending your time on review!
>> 
>> From now on it’s possible to detect failures at the cluster nodes'
>> discovery/communication/network levels by altering a single parameter -
>> IgniteConfiguration.failureDetectionTimeout.
>> 
>> By setting the failure detection timeout for a server node, failed nodes
>> in the cluster topology can be detected within a time equal to the
>> timeout's value, and the cluster switches to, and keeps working with,
>> only the alive nodes.
>> Setting the timeout for a client node lets us detect failures between the
>> client and its router node (a server node that is part of the topology).
>> 
>> In addition, a bunch of other improvements and simplifications were made
>> at the level of TcpDiscoverySpi and TcpCommunicationSpi. The changes are
>> aggregated here:
>> https://issues.apache.org/jira/browse/IGNITE-752
>> 
>> —
>> Denis


Re: Stopped working on IGNITE-752 (speed up failure detection)

Posted by Denis Magda <dm...@gridgain.com>.
Sorry, forgot that attachments are not allowed.

Attachment to public URL mapping:
1) ignite-results-failure-detection.zip -> https://goo.gl/5mitfS
2) ignite-results-no-failure-detection-explicit-timeouts.zip -> 
https://goo.gl/as4qph
3) ignite-results-1.3.0.zip -> https://goo.gl/m8lbiR

--
Denis

On 7/27/2015 4:54 PM, Denis Magda wrote:
> Dmitriy, Igniters,
>
> I've got the first yardstick benchmarking results on Amazon EC2. 
> Thanks Nikolay for guidance and ready to use yardstick docker image.
>
> Used configuration is the following - c4.xlarge, 5 server nodes, 1 
> backup, running put/get benchmark, manually stopping one instance 
> during the execution.
> Time to warmup 60 seconds, execution time 150 seconds, 64 threads.
>
> 1) Failure detection timeout set to 300 ms.
> Unfortunately, the drop while one of the server nodes is being killed is
> significant. Please see the resulting plot in
> ignite-results-failure-detection.zip.
>
> Making the timeout lower doesn't improve the situation.
>
> Right after that I decided to run the same benchmark with the failure
> detection timeout ignored, by setting several network-related timeouts
> explicitly (these are the timeouts that were used before, when we got an
> insignificant drop):
> TcpCommunicationSpi.setSocketWriteTimeout(200)
> TcpDiscoverySpi.setAckTimeout(50)
> TcpDiscoverySpi.setNetworkTimeout(5000)
> TcpDiscoverySpi.setHeartbeatFrequency(100)
>
> 2) Explicitly set the timeouts above, run against the latest changes,
> including mine.
> Here I saw pretty much the same result - the drop is again significant.
> Have a look at the plot in
> ignite-results-no-failure-detection-explicit-timeouts.zip.
>
> 3) Well, the final sanity check was done against the latest release -
> ignite-1.3.0-incubating, which does NOT contain my changes. The timeouts
> were the same as in 2).
> Unfortunately, here I see the same drop as well. Look into
> ignite-results-1.3.0.zip.
>
> It seems that we got that drop even before my 'failure detection timeout'
> changes were merged, judging by 3). Will try to debug all this in more
> detail tomorrow.
>
> --
> Denis
>
> On 7/24/2015 7:15 PM, Dmitriy Setrakyan wrote:
>> Thanks Denis!
>>
>> This feature significantly simplifies failure detection configuration in
>> Ignite - just one configuration flag now vs. don't even remember how many.
>>
>> Have you run a yardstick test on Amazon EC2 with this new configuration
>> flag? If we kill a node in the middle, then drop should be insignificant.
>>
>> Also, I want to note your excellent handling of Jira communication. The
>> ticket has been thoroughly updated every step of the way.
>>
>> D.
>>
>> On Fri, Jul 24, 2015 at 5:37 AM, Denis Magda <dm...@gridgain.com> wrote:
>>
>>> Igniters,
>>>
>>> Have just back merged the changes into the main development branch. Thanks
>>> Yakov and Dmitriy for spending your time on review!
>>>
>>> From now on it’s possible to detect failures at the cluster nodes'
>>> discovery/communication/network levels by altering a single parameter -
>>> IgniteConfiguration.failureDetectionTimeout.
>>>
>>> By setting the failure detection timeout for a server node, failed nodes
>>> in the cluster topology can be detected within a time equal to the
>>> timeout's value, and the cluster switches to, and keeps working with,
>>> only the alive nodes.
>>> Setting the timeout for a client node lets us detect failures between the
>>> client and its router node (a server node that is part of the topology).
>>>
>>> In addition, a bunch of other improvements and simplifications were made
>>> at the level of TcpDiscoverySpi and TcpCommunicationSpi. The changes are
>>> aggregated here:
>>> https://issues.apache.org/jira/browse/IGNITE-752
>>>
>>> —
>>> Denis
>


Re: Stopped working on IGNITE-752 (speed up failure detection)

Posted by Denis Magda <dm...@gridgain.com>.
Dmitriy, Igniters,

I've got the first yardstick benchmarking results on Amazon EC2. Thanks 
Nikolay for the guidance and the ready-to-use yardstick Docker image.

The configuration used is the following: c4.xlarge instances, 5 server nodes, 
1 backup, running the put/get benchmark, with one instance stopped manually 
during the execution.
Warmup time 60 seconds, execution time 150 seconds, 64 threads.

1) Failure detection timeout set to 300 ms.
Unfortunately, the drop while one of the server nodes is being killed is 
significant. Please see the resulting plot in ignite-results-failure-detection.zip.

Making the timeout lower doesn't improve the situation.

Right after that I decided to run the same benchmark with the failure 
detection timeout ignored, by setting several network-related timeouts 
explicitly (these are the timeouts that were used before, when we got an 
insignificant drop):
TcpCommunicationSpi.setSocketWriteTimeout(200)
TcpDiscoverySpi.setAckTimeout(50)
TcpDiscoverySpi.setNetworkTimeout(5000)
TcpDiscoverySpi.setHeartbeatFrequency(100)
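
To make this easier to reproduce, the setters above translate to roughly the following programmatic configuration (a hypothetical sketch; benchmark-specific settings such as caches, backups and IP finder addresses are omitted):

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class ExplicitTimeoutsConfig {
    public static IgniteConfiguration configure() {
        // Communication SPI: fail a connection if a socket write blocks too long.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(200);

        // Discovery SPI: the aggressive ack/heartbeat settings used in the runs above.
        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setAckTimeout(50);
        discoSpi.setNetworkTimeout(5000);
        discoSpi.setHeartbeatFrequency(100);

        // With these timeouts set explicitly, failureDetectionTimeout is ignored.
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        cfg.setDiscoverySpi(discoSpi);
        return cfg;
    }
}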

2) Explicitly set the timeouts above, run against the latest changes, 
including mine.
Here I saw pretty much the same result - the drop is again significant. 
Have a look at the plot in 
ignite-results-no-failure-detection-explicit-timeouts.zip.

3) Well, the final sanity check was done against the latest release - 
ignite-1.3.0-incubating, which does NOT contain my changes. The timeouts 
were the same as in 2).
Unfortunately, here I see the same drop as well. Look into 
ignite-results-1.3.0.zip.

It seems that we got that drop even before my 'failure detection timeout' 
changes were merged, judging by 3). Will try to debug all this in more 
detail tomorrow.

--
Denis

On 7/24/2015 7:15 PM, Dmitriy Setrakyan wrote:
> Thanks Denis!
>
> This feature significantly simplifies failure detection configuration in
> Ignite - just one configuration flag now vs. don't even remember how many.
>
> Have you run a yardstick test on Amazon EC2 with this new configuration
> flag? If we kill a node in the middle, then drop should be insignificant.
>
> Also, I want to note your excellent handling of Jira communication. The
> ticket has been thoroughly updated every step of the way.
>
> D.
>
> On Fri, Jul 24, 2015 at 5:37 AM, Denis Magda <dm...@gridgain.com> wrote:
>
>> Igniters,
>>
>> Have just back merged the changes into the main development branch. Thanks
>> Yakov and Dmitriy for spending your time on review!
>>
>> From now on it’s possible to detect failures at the cluster nodes'
>> discovery/communication/network levels by altering a single parameter -
>> IgniteConfiguration.failureDetectionTimeout.
>>
>> By setting the failure detection timeout for a server node, failed nodes
>> in the cluster topology can be detected within a time equal to the
>> timeout's value, and the cluster switches to, and keeps working with,
>> only the alive nodes.
>> Setting the timeout for a client node lets us detect failures between the
>> client and its router node (a server node that is part of the topology).
>>
>> In addition, a bunch of other improvements and simplifications were made
>> at the level of TcpDiscoverySpi and TcpCommunicationSpi. The changes are
>> aggregated here:
>> https://issues.apache.org/jira/browse/IGNITE-752
>>
>> —
>> Denis


Re: Stopped working on IGNITE-752 (speed up failure detection)

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Thanks Denis!

This feature significantly simplifies failure detection configuration in
Ignite - just one configuration flag now vs. don't even remember how many.

Have you run a yardstick test on Amazon EC2 with this new configuration
flag? If we kill a node in the middle, then the drop should be insignificant.

Also, I want to note your excellent handling of Jira communication. The
ticket has been thoroughly updated every step of the way.

D.

On Fri, Jul 24, 2015 at 5:37 AM, Denis Magda <dm...@gridgain.com> wrote:

> Igniters,
>
> Have just back merged the changes into the main development branch. Thanks
> Yakov and Dmitriy for spending your time on review!
>
> From now on it’s possible to detect failures at the cluster nodes'
> discovery/communication/network levels by altering a single parameter -
> IgniteConfiguration.failureDetectionTimeout.
>
> By setting the failure detection timeout for a server node, failed nodes
> in the cluster topology can be detected within a time equal to the
> timeout's value, and the cluster switches to, and keeps working with,
> only the alive nodes.
> Setting the timeout for a client node lets us detect failures between the
> client and its router node (a server node that is part of the topology).
>
> In addition, a bunch of other improvements and simplifications were made
> at the level of TcpDiscoverySpi and TcpCommunicationSpi. The changes are
> aggregated here:
> https://issues.apache.org/jira/browse/IGNITE-752
>
> —
> Denis