Posted to dev@geode.apache.org by Alberto Gomez <al...@est.tech> on 2019/05/10 10:22:36 UTC

Geode self-protection about overload

Hi Geode community!

I'd like to know if Geode implements any kind of self-protection against overload. By this I mean some mechanism that allows Geode servers (and possibly locators) to reject incoming operations before processing them when they detect that they cannot handle the volume of operations received in a reasonable way (i.e., with reasonable latency and without processes crashing).

The goal would be to make sure that Geode (or parts of it) does not crash under excessive load, and that latency stays under control, at least for the amount of traffic the Geode cluster is dimensioned to support.

If Geode does not offer such a mechanism, I would also like your opinion on this possible feature (if you find it interesting) and on how it could be implemented. One possible approach could be to take some measure of the current CPU consumption and decide whether a given operation should be processed by comparing that value against an overload threshold.
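To make the CPU-threshold idea concrete, here is a minimal, self-contained sketch of the kind of admission check I have in mind. This is not Geode code; the class name and threshold are invented for illustration, and the probe just uses the JDK's OperatingSystemMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OverloadGuard {
    private final double threshold; // e.g. 0.85 = shed load above 85% CPU

    public OverloadGuard(double threshold) {
        this.threshold = threshold;
    }

    /** Pure decision: reject when the measured load exceeds the threshold. */
    public boolean shouldReject(double currentCpuLoad) {
        return currentCpuLoad > threshold;
    }

    /** Probe the JVM-visible system load average, normalized per core. */
    public boolean shouldRejectNow() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double perCore = os.getSystemLoadAverage() / os.getAvailableProcessors();
        // getSystemLoadAverage() returns a negative value where unavailable;
        // in that case, fail open and process the operation.
        return perCore >= 0 && shouldReject(perCore);
    }

    public static void main(String[] args) {
        OverloadGuard guard = new OverloadGuard(0.85);
        System.out.println(guard.shouldReject(0.95)); // true: shed this operation
        System.out.println(guard.shouldReject(0.50)); // false: process normally
    }
}
```

A real implementation would of course need smoothing over a time window and a faster probe than the one-minute load average, which reacts slowly to bursts; this only sketches where such a check could sit.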

Thanks in advance for your answers,

-Alberto

Re: Geode self-protection about overload

Posted by Alberto Gomez <al...@est.tech>.
Hi again,

I finally figured out why I was not getting a "ServerConnectivityException" when executing a large number of functions in Geode, while I did get the exception when running lots of gets/puts/queries.

The reason is that ConnectionImpl::execute(Op op) does not use the timeout set by PoolFactory::setReadTimeout(int timeout) when the operation is a function. Instead, it uses the timeout set by the gemfire.CLIENT_FUNCTION_TIMEOUT system property.
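For anyone hitting the same thing, the only knob I have found so far is that system property (in milliseconds), which, as far as I can tell, is read once when the client classes load, so it must be set before the client cache is created. A trivial sketch:

```java
public class FunctionTimeoutConfig {
    public static void main(String[] args) {
        // Equivalent to starting the JVM with
        //   -Dgemfire.CLIENT_FUNCTION_TIMEOUT=15000
        // Must be set before the client cache/pool is created.
        System.setProperty("gemfire.CLIENT_FUNCTION_TIMEOUT", "15000");
        System.out.println(System.getProperty("gemfire.CLIENT_FUNCTION_TIMEOUT"));
    }
}
```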

Do you see value in adding a method to the PoolFactory, as well as to the ClientCacheFactory, to set this timeout for functions?

How about also being able to override this timeout on each function invocation by adding a setReadTimeout method to the FunctionService interface?
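To make the proposal concrete, here is a purely hypothetical sketch of what a per-invocation override might look like. None of these methods exist in Geode today; the interface here is a stand-in stub, not Geode's real Execution/FunctionService API:

```java
public class TimeoutSketch {
    // Hypothetical fluent override; Geode's real API has no withReadTimeout.
    interface Execution {
        Execution withReadTimeout(int millis);
        String execute(String functionId);
    }

    static class StubExecution implements Execution {
        // Would default to the pool's read-timeout in a real implementation.
        private int readTimeoutMillis = 10_000;

        public Execution withReadTimeout(int millis) {
            this.readTimeoutMillis = millis;
            return this;
        }

        public String execute(String functionId) {
            // A real client would pass the timeout down to the connection.
            return functionId + " (timeout=" + readTimeoutMillis + "ms)";
        }
    }

    public static void main(String[] args) {
        Execution exec = new StubExecution().withReadTimeout(15_000);
        System.out.println(exec.execute("myFunction"));
    }
}
```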

/Alberto


On 22/5/19 18:03, Alberto Gomez wrote:

Re: Geode self-protection about overload

Posted by Alberto Gomez <al...@est.tech>.
Hi Anthony,

Thanks again for the information.

I have played a bit with the client timeouts and retries and have seen operations being rejected when the load is high due to get or put operations. Nevertheless, I have not seen that happen when the server load is high due to invoked functions. Is there a reason for not seeing errors with functions, or was my test simply not good enough to hit the limits? What about queries sent with OQL? Do the timeout and retries apply? Is there similar protection in the native C++ API?

I'd be willing to contribute to the improvements you mention. Do you already have ideas? Anything written down?

/Alberto


On 14/5/19 17:01, Anthony Baker wrote:


Re: Geode self-protection about overload

Posted by Anthony Baker <ab...@pivotal.io>.
The primary load limiter between the client tier and the Geode servers is the max connections limit, as noted in this writeup:

https://cwiki.apache.org/confluence/display/GEODE/Resource+Management+in+Geode

When the load is sufficiently high, operations may time out and a Geode client will fail over to less loaded servers.  You can limit the number of retries the client will attempt (each gated by a read timeout) and thus slow down incoming operations.
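For readers who want to try this, the client-side knobs above can be set on the pool. A sketch of a client cache.xml follows; names and values are illustrative, and the schema declaration is omitted for brevity:

```xml
<!-- Illustrative client pool settings: each operation attempt waits up to
     read-timeout ms; retry-attempts caps how many servers are tried on
     failover; max-connections bounds this client's connections. -->
<client-cache>
  <pool name="serverPool"
        read-timeout="10000"
        retry-attempts="2"
        max-connections="100">
    <locator host="locator-host" port="10334"/>
  </pool>
</client-cache>
```

The same settings are available programmatically on PoolFactory (setReadTimeout, setRetryAttempts, setMaxConnections).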

We’re looking into some improvements in the client connection pool to improve both performance and behaviors at the ragged edge when resources are saturated.  Contributions welcome!

Anthony


> On May 13, 2019, at 9:02 AM, Alberto Gomez <al...@est.tech> wrote:


Re: Geode self-protection about overload

Posted by Alberto Gomez <al...@est.tech>.
Hi Anthony!

Thanks a lot for your prompt answer.

I think it is great that, by means of the GMS, Geode can preserve the availability and predictable low latency of the cluster when some members are unresponsive.

My question was more targeted at situations in which the load received by the cluster is so high that all members struggle to offer low latency. Under such circumstances, does Geode take any action to back off some of the incoming load?

Thanks in advance,

Alberto


On 10/5/19 17:52, Anthony Baker wrote:








Re: Geode self-protection about overload

Posted by Anthony Baker <ab...@pivotal.io>.
Hi Alberto!

Great questions.  One of the fundamental characteristics of Geode is its Group Membership System (GMS).  You can read more about it here [1].  The membership system ensures that failures due to unresponsive members and/or network partitions are detected quickly.  Given that we use synchronous replication for consistent updates, the GMS algorithms fence off unresponsive members to preserve the availability (and predictable low latency) of the cluster as a whole.

Another factor of resilience is memory load.  Regions can be configured to automatically evict data to disk based on heap usage.  In addition, when a Region exceeds a critical memory usage threshold, further updates are blocked until the overload is resolved.

Geode clients route operations to cluster members based on connection load.  This helps balance CPU load across the entire cluster.  Cluster members can set connection maximums to prevent overrunning the available capacity of an individual server.
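As a concrete illustration of the server-side knobs from the last two paragraphs, a server cache.xml might include something like the following; the values are examples, not recommendations, and the schema declaration is omitted:

```xml
<!-- Illustrative server-side limits: block region updates once heap use
     passes the critical threshold, start heap eviction above the eviction
     threshold, and cap client connections on the cache server. -->
<cache>
  <cache-server port="40404" max-connections="800"/>
  <resource-manager critical-heap-percentage="90" eviction-heap-percentage="80"/>
</cache>
```

Note that eviction-heap-percentage only has an effect on regions configured with heap-LRU eviction (e.g. overflow to disk).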

I hope this helps and feel free to keep asking questions :-)

Anthony

[1] https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts

