You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@geode.apache.org by Alberto Bustamante Reyes <al...@est.tech> on 2020/09/17 14:42:07 UTC

Clean C++ client metadata in timeouts

Hi geode-dev,

I have a question about the c++ client.

Some months ago we merged GEODE-8231 to solve a problem we observed regarding the native client was trying to connect to stopped server.
GEODE-8231 solution consists on remove the client metadata when an "IO error in handshake" exception is received. This fix solved most of our problems, but it has been observed that sometimes when a server is stopped the errors received in the client are not the same and this "IO error in handshake" takes up to a minute to appear. So during that time, the client is still trying to connect to the offline server.

As the error received during that time is "timeout in handshake", we have tested modyfing the solution of GEODE-8213 to make the client to remove the metadata once a timeout error is received (here is a draft with the code: https://github.com/apache/geode-native/pull/651). With this change in place, the behavior is ok.


But I would like to check your opinion about this check, because this will cause that a single timeout will cause the removal of the client metadata, which maybe its not the best solution. I thought about different alternatives:

- Wait until a given number of timeouts in a row have been received from the same server to remove the metadata
- Make this "remove-metadata-after-timeout" something optional that could be configured if needed

As this will misalign the behavior of Java and C++ clients, making this an optional configuration will be more appropriate, to keep the default c++ client behavior as the Java client.

BR/

Alberto B.

RE: Clean C++ client metadata in timeouts

Posted by Alberto Bustamante Reyes <al...@est.tech>.
Hi,

Just for clarification. When there is an "IO error in handshake" what is deleted is just the information of the failing server, not the whole metadata of the client.

BR/

Alberto B.
________________________________
De: Jacob Barrett <ja...@vmware.com>
Enviado: viernes, 18 de septiembre de 2020 21:32
Para: dev@geode.apache.org <de...@geode.apache.org>
Asunto: Re: Clean C++ client metadata in timeouts

+1 to what Anthony is asking.

Rather the “fixing” the current behavior let’s just implement a behavior that better achieves the goal of single hop optimization.

From what I recall for both the Java and C++ code is that we throw away all metadata on a region whenever there is any triggering event. We should keep the old metadata until we have new metadata. Even if the data is partially correct its better than random server selection.

I don’t recall all the triggering events, but I think anything that cased a server to be removed form the pool triggered this. The silly thing is that pretty much any exception on the connection cased not only that connection to be closed but all other connections to the same server to be closed. This pre-mature termination of connections is probably not ideal. And again, throwing out all the metadata for what probably only effects a subset of the metadata is bad.

Metadata is completely asynchronous and only triggers after “failure” events, like those above or a good response with he metadata update flag set. Is there a way to get this metadata to the client more quickly? I suspect not easily, maybe sending something in ping messages, but this assume mostly idle clients.

I suspect what we would find is that just avoiding the complete dismissal of metadata should suffice. We could start with that and then optimize from there.

As for the original post about the client trying to connect to a stopped server. Is this scenario where the client is going to perform a put, or otherwise mutation operation, so the primary server is necessary but the pool doesn’t have any available connections so it tries to create one. On read only ops it should be going to the locator I think for a “balanced” connection to a server hosting the bucket (though my recollection of the code is old). I think it would be perfectly fine in this scenario to assume the metadata could be incorrect and to fetch it. I would not throw out the current metadata. We could even fetch this metadata synchronously so the current operation doesn’t waste time continually trying to connect to the wrong server, though it is possible this error is transient.

So, yeah, I think it just makes sense for us to look at this behavior from the single hop optimization perspective and make that feature behave like it should and not worry about options to enable different behavior. The end goal is to have optimal single hop operations, if the current implementation doesn’t do that then we fix that. No configuration options necessary.

-Jake


On Sep 18, 2020, at 8:54 AM, Anthony Baker <ba...@vmware.com>> wrote:

I’m not sure I have answers so I’ll just ask more questions :-)

When a server is killed, does that provoke an asynchronous metadata update to clients?  I could be wrong about that but if it IS true, then perhaps we should focus on optimizing that path. The sooner that a client can get accurate bucket location data the faster it can service requests.

I suggest this because I know that wiping out *all* bucket metadata on the client means that we’ve now destroyed the ability of the client to do single-hop operations until the metadata is refreshed. This has the cost of additional latency on each client request and the hidden cost of additional sockets and threads within the cluster to service the extra hop by forwarding requests to the appropriate server.  This is an important because many users test and size their geode cluster based on single-hop resource consumption and it’s a very steep step up when this is not possible.  If there’s insufficient headroom to handle the additional load it can tip a bad situation (single node failing) into a much worse cascading condition (multiple nodes failing).

So I guess my questions are:
- What triggers a metadata refresh and how can we make that faster?
- Can we very selectively identify that some metadata is out of date and invalidate that information only?


Anthony


On Sep 18, 2020, at 3:50 AM, Alberto Bustamante Reyes <al...@est.tech>> wrote:

Hi,

Thanks for you messages, here you are some answers:

Dave:
Are there cases in which one or two timeouts are followed by a successful
retry? Or does one timeout *always* end with more timeouts and, ultimately,
an IO error?
Not in our use case, which is kiling a server. In this case, timeouts will end up on an IO error.

If a straight-up change solves a constant headache, as you suggest, Alberto, and as Blake concurs, that sounds like the way to go. Why introduce a new option or property if the user will always prefer one behavior over the other?

The fix works fine for our use case, I suggested the alternatives to make it something optional in case there were concerns about it. In other projects I have been involved in the past, we had to deal with temporary network problems. So most of the times, if a timeout had a consequence (so to say), that was not applied after just one timeout.

But its true that in this use case, a timeout always ends up on an IO error, as I said. So if you dont see any problem with cleaning the metadata just after one timeout, then we dont need any control mechanism for it.



Blake:
Given that attempts to retrieve metadata after the C++ cache is closed are a constant headache for Geode Native development, I am generally in favor of anything that potentially reduces the number of times/places this happens.  If we've failed the handshake, it's very unlikely things will correct themselves without outside intervention, so this fix is probably goodness. I'd go ahead and submit a PR when you think it's solid.

Good to hear that. The code changes in the draft PR are ready, I just need to figure out the testing part. Im not sure how I will add a test because it would be the same test as the one added for GEODE-8231...


BR/

Alberto B.


________________________________
De: Ernie Burghardt <bu...@vmware.com>>
Enviado: jueves, 17 de septiembre de 2020 22:08
Para: dev@geode.apache.org<ma...@geode.apache.org> <de...@geode.apache.org>>
Asunto: Re: Clean C++ client metadata in timeouts

Let's please consider how this would controlled and look for ways other than YetAnotherProperty

Thanks,
EB

On 9/17/20, 12:59 PM, "Dave Barnes" <db...@apache.org>> wrote:

  If a straight-up change solves a constant headache, as you suggest,
  Alberto, and as Blake concurs, that sounds like the way to go.
  Why introduce a new option or property if the user will always prefer one
  behavior over the other? (And from a docs perspective, who needs another
  optional property, anyway?)

  On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com>> wrote:

Given that attempts to retrieve metadata after the C++ cache is closed are
a constant headache for Geode Native development, I am generally in favor
of anything that potentially reduces the number of times/places this
happens.  If we've failed the handshake, it's very unlikely things will
correct themselves without outside intervention, so this fix is probably
goodness.  I'd go ahead and submit a PR when you think it's solid.

Thanks,

Blake


On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org>> wrote:

  Alberto,
  Are there cases in which one or two timeouts are followed by a
successful
  retry? Or does one timeout *always* end with more timeouts and,
ultimately,
  an IO error?
  If timeouts can sometimes be followed by successful retries, and
re-trying
  is the current default behavior, then I agree that introducing a
setting
  that effectively eliminates re-tries should be the developer's choice.
  In that case, I suggest that the option should not be a low-level
choice of
  "handle the metadata in a way that eliminates retries" but should be
higher
  level, like "when attempting to connect, try only once, instead of
  re-trying (the default behavior)."
  -Dave

  On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
  <al...@est.tech>> wrote:

Hi geode-dev,

I have a question about the c++ client.

Some months ago we merged GEODE-8231 to solve a problem we observed
regarding the native client was trying to connect to stopped server.
GEODE-8231 solution consists on remove the client metadata when an
"IO
error in handshake" exception is received. This fix solved most of
our
problems, but it has been observed that sometimes when a server is
stopped
the errors received in the client are not the same and this "IO
error in
handshake" takes up to a minute to appear. So during that time, the
client
is still trying to connect to the offline server.

As the error received during that time is "timeout in handshake", we
have
tested modyfing the solution of GEODE-8213 to make the client to
remove the
metadata once a timeout error is received (here is a draft with the
code:

https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cjabarrett%40vmware.com%7C36b8c078998049579f0708d85beb1d97%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637360412664338261&amp;sdata=xidoP5UN5BckTbajlCb%2BmT8kWVa3h%2FOqbrernxYIjbo%3D&amp;reserved=0).
With this change in
place, the behavior is ok.


But I would like to check your opinion about this check, because
this will
cause that a single timeout will cause the removal of the client
metadata,
which maybe its not the best solution. I thought about different
alternatives:

- Wait until a given number of timeouts in a row have been received
from
the same server to remove the metadata
- Make this "remove-metadata-after-timeout" something optional that
could
be configured if needed

As this will misalign the behavior of Java and C++ clients, making
this an
optional configuration will be more appropriate, to keep the default
c++
client behavior as the Java client.

BR/

Alberto B.


Re: Clean C++ client metadata in timeouts

Posted by Jacob Barrett <ja...@vmware.com>.
+1 to what Anthony is asking.

Rather the “fixing” the current behavior let’s just implement a behavior that better achieves the goal of single hop optimization.

From what I recall for both the Java and C++ code is that we throw away all metadata on a region whenever there is any triggering event. We should keep the old metadata until we have new metadata. Even if the data is partially correct its better than random server selection.

I don’t recall all the triggering events, but I think anything that cased a server to be removed form the pool triggered this. The silly thing is that pretty much any exception on the connection cased not only that connection to be closed but all other connections to the same server to be closed. This pre-mature termination of connections is probably not ideal. And again, throwing out all the metadata for what probably only effects a subset of the metadata is bad.

Metadata is completely asynchronous and only triggers after “failure” events, like those above or a good response with he metadata update flag set. Is there a way to get this metadata to the client more quickly? I suspect not easily, maybe sending something in ping messages, but this assume mostly idle clients.

I suspect what we would find is that just avoiding the complete dismissal of metadata should suffice. We could start with that and then optimize from there.

As for the original post about the client trying to connect to a stopped server. Is this scenario where the client is going to perform a put, or otherwise mutation operation, so the primary server is necessary but the pool doesn’t have any available connections so it tries to create one. On read only ops it should be going to the locator I think for a “balanced” connection to a server hosting the bucket (though my recollection of the code is old). I think it would be perfectly fine in this scenario to assume the metadata could be incorrect and to fetch it. I would not throw out the current metadata. We could even fetch this metadata synchronously so the current operation doesn’t waste time continually trying to connect to the wrong server, though it is possible this error is transient.

So, yeah, I think it just makes sense for us to look at this behavior from the single hop optimization perspective and make that feature behave like it should and not worry about options to enable different behavior. The end goal is to have optimal single hop operations, if the current implementation doesn’t do that then we fix that. No configuration options necessary.

-Jake


On Sep 18, 2020, at 8:54 AM, Anthony Baker <ba...@vmware.com>> wrote:

I’m not sure I have answers so I’ll just ask more questions :-)

When a server is killed, does that provoke an asynchronous metadata update to clients?  I could be wrong about that but if it IS true, then perhaps we should focus on optimizing that path. The sooner that a client can get accurate bucket location data the faster it can service requests.

I suggest this because I know that wiping out *all* bucket metadata on the client means that we’ve now destroyed the ability of the client to do single-hop operations until the metadata is refreshed. This has the cost of additional latency on each client request and the hidden cost of additional sockets and threads within the cluster to service the extra hop by forwarding requests to the appropriate server.  This is an important because many users test and size their geode cluster based on single-hop resource consumption and it’s a very steep step up when this is not possible.  If there’s insufficient headroom to handle the additional load it can tip a bad situation (single node failing) into a much worse cascading condition (multiple nodes failing).

So I guess my questions are:
- What triggers a metadata refresh and how can we make that faster?
- Can we very selectively identify that some metadata is out of date and invalidate that information only?


Anthony


On Sep 18, 2020, at 3:50 AM, Alberto Bustamante Reyes <al...@est.tech>> wrote:

Hi,

Thanks for you messages, here you are some answers:

Dave:
Are there cases in which one or two timeouts are followed by a successful
retry? Or does one timeout *always* end with more timeouts and, ultimately,
an IO error?
Not in our use case, which is kiling a server. In this case, timeouts will end up on an IO error.

If a straight-up change solves a constant headache, as you suggest, Alberto, and as Blake concurs, that sounds like the way to go. Why introduce a new option or property if the user will always prefer one behavior over the other?

The fix works fine for our use case, I suggested the alternatives to make it something optional in case there were concerns about it. In other projects I have been involved in the past, we had to deal with temporary network problems. So most of the times, if a timeout had a consequence (so to say), that was not applied after just one timeout.

But its true that in this use case, a timeout always ends up on an IO error, as I said. So if you dont see any problem with cleaning the metadata just after one timeout, then we dont need any control mechanism for it.



Blake:
Given that attempts to retrieve metadata after the C++ cache is closed are a constant headache for Geode Native development, I am generally in favor of anything that potentially reduces the number of times/places this happens.  If we've failed the handshake, it's very unlikely things will correct themselves without outside intervention, so this fix is probably goodness. I'd go ahead and submit a PR when you think it's solid.

Good to hear that. The code changes in the draft PR are ready, I just need to figure out the testing part. Im not sure how I will add a test because it would be the same test as the one added for GEODE-8231...


BR/

Alberto B.


________________________________
De: Ernie Burghardt <bu...@vmware.com>>
Enviado: jueves, 17 de septiembre de 2020 22:08
Para: dev@geode.apache.org<ma...@geode.apache.org> <de...@geode.apache.org>>
Asunto: Re: Clean C++ client metadata in timeouts

Let's please consider how this would controlled and look for ways other than YetAnotherProperty

Thanks,
EB

On 9/17/20, 12:59 PM, "Dave Barnes" <db...@apache.org>> wrote:

  If a straight-up change solves a constant headache, as you suggest,
  Alberto, and as Blake concurs, that sounds like the way to go.
  Why introduce a new option or property if the user will always prefer one
  behavior over the other? (And from a docs perspective, who needs another
  optional property, anyway?)

  On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com>> wrote:

Given that attempts to retrieve metadata after the C++ cache is closed are
a constant headache for Geode Native development, I am generally in favor
of anything that potentially reduces the number of times/places this
happens.  If we've failed the handshake, it's very unlikely things will
correct themselves without outside intervention, so this fix is probably
goodness.  I'd go ahead and submit a PR when you think it's solid.

Thanks,

Blake


On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org>> wrote:

  Alberto,
  Are there cases in which one or two timeouts are followed by a
successful
  retry? Or does one timeout *always* end with more timeouts and,
ultimately,
  an IO error?
  If timeouts can sometimes be followed by successful retries, and
re-trying
  is the current default behavior, then I agree that introducing a
setting
  that effectively eliminates re-tries should be the developer's choice.
  In that case, I suggest that the option should not be a low-level
choice of
  "handle the metadata in a way that eliminates retries" but should be
higher
  level, like "when attempting to connect, try only once, instead of
  re-trying (the default behavior)."
  -Dave

  On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
  <al...@est.tech>> wrote:

Hi geode-dev,

I have a question about the c++ client.

Some months ago we merged GEODE-8231 to solve a problem we observed
regarding the native client was trying to connect to stopped server.
GEODE-8231 solution consists on remove the client metadata when an
"IO
error in handshake" exception is received. This fix solved most of
our
problems, but it has been observed that sometimes when a server is
stopped
the errors received in the client are not the same and this "IO
error in
handshake" takes up to a minute to appear. So during that time, the
client
is still trying to connect to the offline server.

As the error received during that time is "timeout in handshake", we
have
tested modyfing the solution of GEODE-8213 to make the client to
remove the
metadata once a timeout error is received (here is a draft with the
code:

https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cjabarrett%40vmware.com%7C36b8c078998049579f0708d85beb1d97%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637360412664338261&amp;sdata=xidoP5UN5BckTbajlCb%2BmT8kWVa3h%2FOqbrernxYIjbo%3D&amp;reserved=0).
With this change in
place, the behavior is ok.


But I would like to check your opinion about this check, because
this will
cause that a single timeout will cause the removal of the client
metadata,
which maybe its not the best solution. I thought about different
alternatives:

- Wait until a given number of timeouts in a row have been received
from
the same server to remove the metadata
- Make this "remove-metadata-after-timeout" something optional that
could
be configured if needed

As this will misalign the behavior of Java and C++ clients, making
this an
optional configuration will be more appropriate, to keep the default
c++
client behavior as the Java client.

BR/

Alberto B.


Re: Clean C++ client metadata in timeouts

Posted by Anthony Baker <ba...@vmware.com>.
I’m not sure I have answers so I’ll just ask more questions :-)

When a server is killed, does that provoke an asynchronous metadata update to clients?  I could be wrong about that but if it IS true, then perhaps we should focus on optimizing that path. The sooner that a client can get accurate bucket location data the faster it can service requests.

I suggest this because I know that wiping out *all* bucket metadata on the client means that we’ve now destroyed the ability of the client to do single-hop operations until the metadata is refreshed.  This has the cost of additional latency on each client request and the hidden cost of additional sockets and threads within the cluster to service the extra hop by forwarding requests to the appropriate server.  This is an important because many users test and size their geode cluster based on single-hop resource consumption and it’s a very steep step up when this is not possible.  If there’s insufficient headroom to handle the additional load it can tip a bad situation (single node failing) into a much worse cascading condition (multiple nodes failing).

So I guess my questions are:
- What triggers a metadata refresh and how can we make that faster?
- Can we very selectively identify that some metadata is out of date and invalidate that information only?


Anthony


On Sep 18, 2020, at 3:50 AM, Alberto Bustamante Reyes <al...@est.tech>> wrote:

Hi,

Thanks for you messages, here you are some answers:

Dave:
Are there cases in which one or two timeouts are followed by a successful
retry? Or does one timeout *always* end with more timeouts and, ultimately,
an IO error?
Not in our use case, which is kiling a server. In this case, timeouts will end up on an IO error.

If a straight-up change solves a constant headache, as you suggest, Alberto, and as Blake concurs, that sounds like the way to go. Why introduce a new option or property if the user will always prefer one behavior over the other?

The fix works fine for our use case, I suggested the alternatives to make it something optional in case there were concerns about it. In other projects I have been involved in the past, we had to deal with temporary network problems. So most of the times, if a timeout had a consequence (so to say), that was not applied after just one timeout.

But its true that in this use case, a timeout always ends up on an IO error, as I said. So if you dont see any problem with cleaning the metadata just after one timeout, then we dont need any control mechanism for it.



Blake:
Given that attempts to retrieve metadata after the C++ cache is closed are a constant headache for Geode Native development, I am generally in favor of anything that potentially reduces the number of times/places this happens.  If we've failed the handshake, it's very unlikely things will correct themselves without outside intervention, so this fix is probably goodness. I'd go ahead and submit a PR when you think it's solid.

Good to hear that. The code changes in the draft PR are ready, I just need to figure out the testing part. Im not sure how I will add a test because it would be the same test as the one added for GEODE-8231...


BR/

Alberto B.


________________________________
De: Ernie Burghardt <bu...@vmware.com>>
Enviado: jueves, 17 de septiembre de 2020 22:08
Para: dev@geode.apache.org<ma...@geode.apache.org> <de...@geode.apache.org>>
Asunto: Re: Clean C++ client metadata in timeouts

Let's please consider how this would controlled and look for ways other than YetAnotherProperty

Thanks,
EB

On 9/17/20, 12:59 PM, "Dave Barnes" <db...@apache.org>> wrote:

   If a straight-up change solves a constant headache, as you suggest,
   Alberto, and as Blake concurs, that sounds like the way to go.
   Why introduce a new option or property if the user will always prefer one
   behavior over the other? (And from a docs perspective, who needs another
   optional property, anyway?)

   On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com>> wrote:

Given that attempts to retrieve metadata after the C++ cache is closed are
a constant headache for Geode Native development, I am generally in favor
of anything that potentially reduces the number of times/places this
happens.  If we've failed the handshake, it's very unlikely things will
correct themselves without outside intervention, so this fix is probably
goodness.  I'd go ahead and submit a PR when you think it's solid.

Thanks,

Blake


On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org>> wrote:

   Alberto,
   Are there cases in which one or two timeouts are followed by a
successful
   retry? Or does one timeout *always* end with more timeouts and,
ultimately,
   an IO error?
   If timeouts can sometimes be followed by successful retries, and
re-trying
   is the current default behavior, then I agree that introducing a
setting
   that effectively eliminates re-tries should be the developer's choice.
   In that case, I suggest that the option should not be a low-level
choice of
   "handle the metadata in a way that eliminates retries" but should be
higher
   level, like "when attempting to connect, try only once, instead of
   re-trying (the default behavior)."
   -Dave

   On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
   <al...@est.tech>> wrote:

Hi geode-dev,

I have a question about the c++ client.

Some months ago we merged GEODE-8231 to solve a problem we observed
regarding the native client was trying to connect to stopped server.
GEODE-8231 solution consists on remove the client metadata when an
"IO
error in handshake" exception is received. This fix solved most of
our
problems, but it has been observed that sometimes when a server is
stopped
the errors received in the client are not the same and this "IO
error in
handshake" takes up to a minute to appear. So during that time, the
client
is still trying to connect to the offline server.

As the error received during that time is "timeout in handshake", we
have
tested modyfing the solution of GEODE-8213 to make the client to
remove the
metadata once a timeout error is received (here is a draft with the
code:

https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cbakera%40vmware.com%7C8c821f31ff6f4be1390008d85bc0bb0d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637360230616136951&amp;sdata=%2F6BRXhW1xXPB5VXgCdHVx%2Fb976JFg%2BzTrtKHoi9WA9E%3D&amp;reserved=0).
With this change in
place, the behavior is ok.


But I would like to check your opinion about this check, because
this will
cause that a single timeout will cause the removal of the client
metadata,
which maybe its not the best solution. I thought about different
alternatives:

- Wait until a given number of timeouts in a row have been received
from
the same server to remove the metadata
- Make this "remove-metadata-after-timeout" something optional that
could
be configured if needed

As this will misalign the behavior of Java and C++ clients, making
this an
optional configuration will be more appropriate, to keep the default
c++
client behavior as the Java client.

BR/

Alberto B.


Re: Clean C++ client metadata in timeouts

Posted by Ernie Burghardt <bu...@vmware.com>.
Alberto, give us a mention on the draft PR and we can think about how it might be tested too... might get more ideas from the group...

Thanks,
EB

On 9/18/20, 3:51 AM, "Alberto Bustamante Reyes" <al...@est.tech> wrote:

    Hi,

    Thanks for you messages, here you are some answers:

    Dave:
    Are there cases in which one or two timeouts are followed by a successful
    retry? Or does one timeout *always* end with more timeouts and, ultimately,
    an IO error?
    Not in our use case, which is kiling a server. In this case, timeouts will end up on an IO error.

    If a straight-up change solves a constant headache, as you suggest, Alberto, and as Blake concurs, that sounds like the way to go. Why introduce a new option or property if the user will always prefer one behavior over the other?

    The fix works fine for our use case, I suggested the alternatives to make it something optional in case there were concerns about it. In other projects I have been involved in the past, we had to deal with temporary network problems. So most of the times, if a timeout had a consequence (so to say), that was not applied after just one timeout.

    But its true that in this use case, a timeout always ends up on an IO error, as I said. So if you dont see any problem with cleaning the metadata just after one timeout, then we dont need any control mechanism for it.



    Blake:
    Given that attempts to retrieve metadata after the C++ cache is closed are a constant headache for Geode Native development, I am generally in favor of anything that potentially reduces the number of times/places this happens.  If we've failed the handshake, it's very unlikely things will correct themselves without outside intervention, so this fix is probably goodness. I'd go ahead and submit a PR when you think it's solid.

    Good to hear that. The code changes in the draft PR are ready, I just need to figure out the testing part. Im not sure how I will add a test because it would be the same test as the one added for GEODE-8231...


    BR/

    Alberto B.


    ________________________________
    De: Ernie Burghardt <bu...@vmware.com>
    Enviado: jueves, 17 de septiembre de 2020 22:08
    Para: dev@geode.apache.org <de...@geode.apache.org>
    Asunto: Re: Clean C++ client metadata in timeouts

    Let's please consider how this would controlled and look for ways other than YetAnotherProperty

    Thanks,
    EB

    On 9/17/20, 12:59 PM, "Dave Barnes" <db...@apache.org> wrote:

        If a straight-up change solves a constant headache, as you suggest,
        Alberto, and as Blake concurs, that sounds like the way to go.
        Why introduce a new option or property if the user will always prefer one
        behavior over the other? (And from a docs perspective, who needs another
        optional property, anyway?)

        On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com> wrote:

        > Given that attempts to retrieve metadata after the C++ cache is closed are
        > a constant headache for Geode Native development, I am generally in favor
        > of anything that potentially reduces the number of times/places this
        > happens.  If we've failed the handshake, it's very unlikely things will
        > correct themselves without outside intervention, so this fix is probably
        > goodness.  I'd go ahead and submit a PR when you think it's solid.
        >
        > Thanks,
        >
        > Blake
        >
        >
        > On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org> wrote:
        >
        >     Alberto,
        >     Are there cases in which one or two timeouts are followed by a
        > successful
        >     retry? Or does one timeout *always* end with more timeouts and,
        > ultimately,
        >     an IO error?
        >     If timeouts can sometimes be followed by successful retries, and
        > re-trying
        >     is the current default behavior, then I agree that introducing a
        > setting
        >     that effectively eliminates re-tries should be the developer's choice.
        >     In that case, I suggest that the option should not be a low-level
        > choice of
        >     "handle the metadata in a way that eliminates retries" but should be
        > higher
        >     level, like "when attempting to connect, try only once, instead of
        >     re-trying (the default behavior)."
        >     -Dave
        >
        >     On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
        >     <al...@est.tech> wrote:
        >
        >     > Hi geode-dev,
        >     >
        >     > I have a question about the c++ client.
        >     >
        >     > Some months ago we merged GEODE-8231 to solve a problem we observed
        >     > regarding the native client was trying to connect to stopped server.
        >     > GEODE-8231 solution consists on remove the client metadata when an
        > "IO
        >     > error in handshake" exception is received. This fix solved most of
        > our
        >     > problems, but it has been observed that sometimes when a server is
        > stopped
        >     > the errors received in the client are not the same and this "IO
        > error in
        >     > handshake" takes up to a minute to appear. So during that time, the
        > client
        >     > is still trying to connect to the offline server.
        >     >
        >     > As the error received during that time is "timeout in handshake", we
        > have
        >     > tested modyfing the solution of GEODE-8213 to make the client to
        > remove the
        >     > metadata once a timeout error is received (here is a draft with the
        > code:
        >     >
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cburghardte%40vmware.com%7C81149957da084848bb4308d85bc0bd0f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637360230711375298&amp;sdata=3%2FYLIk8yPJdhYTJ2mrw65JHxQYxNyVy4sdQkAAvMlF8%3D&amp;reserved=0).
        > With this change in
        >     > place, the behavior is ok.
        >     >
        >     >
        >     > But I would like to check your opinion about this check, because
        > this will
        >     > cause that a single timeout will cause the removal of the client
        > metadata,
        >     > which maybe its not the best solution. I thought about different
        >     > alternatives:
        >     >
        >     > - Wait until a given number of timeouts in a row have been received
        > from
        >     > the same server to remove the metadata
        >     > - Make this "remove-metadata-after-timeout" something optional that
        > could
        >     > be configured if needed
        >     >
        >     > As this will misalign the behavior of Java and C++ clients, making
        > this an
        >     > optional configuration will be more appropriate, to keep the default
        > c++
        >     > client behavior as the Java client.
        >     >
        >     > BR/
        >     >
        >     > Alberto B.
        >     >
        >
        >



RE: Clean C++ client metadata in timeouts

Posted by Alberto Bustamante Reyes <al...@est.tech>.
Hi,

Thanks for you messages, here you are some answers:

Dave:
Are there cases in which one or two timeouts are followed by a successful
retry? Or does one timeout *always* end with more timeouts and, ultimately,
an IO error?
Not in our use case, which is kiling a server. In this case, timeouts will end up on an IO error.

If a straight-up change solves a constant headache, as you suggest, Alberto, and as Blake concurs, that sounds like the way to go. Why introduce a new option or property if the user will always prefer one behavior over the other?

The fix works fine for our use case, I suggested the alternatives to make it something optional in case there were concerns about it. In other projects I have been involved in the past, we had to deal with temporary network problems. So most of the times, if a timeout had a consequence (so to say), that was not applied after just one timeout.

But its true that in this use case, a timeout always ends up on an IO error, as I said. So if you dont see any problem with cleaning the metadata just after one timeout, then we dont need any control mechanism for it.



Blake:
Given that attempts to retrieve metadata after the C++ cache is closed are a constant headache for Geode Native development, I am generally in favor of anything that potentially reduces the number of times/places this happens.  If we've failed the handshake, it's very unlikely things will correct themselves without outside intervention, so this fix is probably goodness. I'd go ahead and submit a PR when you think it's solid.

Good to hear that. The code changes in the draft PR are ready, I just need to figure out the testing part. Im not sure how I will add a test because it would be the same test as the one added for GEODE-8231...


BR/

Alberto B.


________________________________
De: Ernie Burghardt <bu...@vmware.com>
Enviado: jueves, 17 de septiembre de 2020 22:08
Para: dev@geode.apache.org <de...@geode.apache.org>
Asunto: Re: Clean C++ client metadata in timeouts

Let's please consider how this would controlled and look for ways other than YetAnotherProperty

Thanks,
EB

On 9/17/20, 12:59 PM, "Dave Barnes" <db...@apache.org> wrote:

    If a straight-up change solves a constant headache, as you suggest,
    Alberto, and as Blake concurs, that sounds like the way to go.
    Why introduce a new option or property if the user will always prefer one
    behavior over the other? (And from a docs perspective, who needs another
    optional property, anyway?)

    On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com> wrote:

    > Given that attempts to retrieve metadata after the C++ cache is closed are
    > a constant headache for Geode Native development, I am generally in favor
    > of anything that potentially reduces the number of times/places this
    > happens.  If we've failed the handshake, it's very unlikely things will
    > correct themselves without outside intervention, so this fix is probably
    > goodness.  I'd go ahead and submit a PR when you think it's solid.
    >
    > Thanks,
    >
    > Blake
    >
    >
    > On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org> wrote:
    >
    >     Alberto,
    >     Are there cases in which one or two timeouts are followed by a
    > successful
    >     retry? Or does one timeout *always* end with more timeouts and,
    > ultimately,
    >     an IO error?
    >     If timeouts can sometimes be followed by successful retries, and
    > re-trying
    >     is the current default behavior, then I agree that introducing a
    > setting
    >     that effectively eliminates re-tries should be the developer's choice.
    >     In that case, I suggest that the option should not be a low-level
    > choice of
    >     "handle the metadata in a way that eliminates retries" but should be
    > higher
    >     level, like "when attempting to connect, try only once, instead of
    >     re-trying (the default behavior)."
    >     -Dave
    >
    >     On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
    >     <al...@est.tech> wrote:
    >
    >     > Hi geode-dev,
    >     >
    >     > I have a question about the c++ client.
    >     >
    >     > Some months ago we merged GEODE-8231 to solve a problem we observed
    >     > regarding the native client was trying to connect to stopped server.
    >     > GEODE-8231 solution consists on remove the client metadata when an
    > "IO
    >     > error in handshake" exception is received. This fix solved most of
    > our
    >     > problems, but it has been observed that sometimes when a server is
    > stopped
    >     > the errors received in the client are not the same and this "IO
    > error in
    >     > handshake" takes up to a minute to appear. So during that time, the
    > client
    >     > is still trying to connect to the offline server.
    >     >
    >     > As the error received during that time is "timeout in handshake", we
    > have
    >     > tested modyfing the solution of GEODE-8213 to make the client to
    > remove the
    >     > metadata once a timeout error is received (here is a draft with the
    > code:
    >     >
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cburghardte%40vmware.com%7Cd73403fcd2df4b9d1d0a08d85b443413%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637359695795955165&amp;sdata=QeXlk3XdqPn5l0jytgNYja%2Fykvm%2FFz5PySvCv8WXa2E%3D&amp;reserved=0).
    > With this change in
    >     > place, the behavior is ok.
    >     >
    >     >
    >     > But I would like to check your opinion about this check, because
    > this will
    >     > cause that a single timeout will cause the removal of the client
    > metadata,
    >     > which maybe its not the best solution. I thought about different
    >     > alternatives:
    >     >
    >     > - Wait until a given number of timeouts in a row have been received
    > from
    >     > the same server to remove the metadata
    >     > - Make this "remove-metadata-after-timeout" something optional that
    > could
    >     > be configured if needed
    >     >
    >     > As this will misalign the behavior of Java and C++ clients, making
    > this an
    >     > optional configuration will be more appropriate, to keep the default
    > c++
    >     > client behavior as the Java client.
    >     >
    >     > BR/
    >     >
    >     > Alberto B.
    >     >
    >
    >


Re: Clean C++ client metadata in timeouts

Posted by Ernie Burghardt <bu...@vmware.com>.
Let's please consider how this would controlled and look for ways other than YetAnotherProperty

Thanks,
EB

On 9/17/20, 12:59 PM, "Dave Barnes" <db...@apache.org> wrote:

    If a straight-up change solves a constant headache, as you suggest,
    Alberto, and as Blake concurs, that sounds like the way to go.
    Why introduce a new option or property if the user will always prefer one
    behavior over the other? (And from a docs perspective, who needs another
    optional property, anyway?)

    On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com> wrote:

    > Given that attempts to retrieve metadata after the C++ cache is closed are
    > a constant headache for Geode Native development, I am generally in favor
    > of anything that potentially reduces the number of times/places this
    > happens.  If we've failed the handshake, it's very unlikely things will
    > correct themselves without outside intervention, so this fix is probably
    > goodness.  I'd go ahead and submit a PR when you think it's solid.
    >
    > Thanks,
    >
    > Blake
    >
    >
    > On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org> wrote:
    >
    >     Alberto,
    >     Are there cases in which one or two timeouts are followed by a
    > successful
    >     retry? Or does one timeout *always* end with more timeouts and,
    > ultimately,
    >     an IO error?
    >     If timeouts can sometimes be followed by successful retries, and
    > re-trying
    >     is the current default behavior, then I agree that introducing a
    > setting
    >     that effectively eliminates re-tries should be the developer's choice.
    >     In that case, I suggest that the option should not be a low-level
    > choice of
    >     "handle the metadata in a way that eliminates retries" but should be
    > higher
    >     level, like "when attempting to connect, try only once, instead of
    >     re-trying (the default behavior)."
    >     -Dave
    >
    >     On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
    >     <al...@est.tech> wrote:
    >
    >     > Hi geode-dev,
    >     >
    >     > I have a question about the c++ client.
    >     >
    >     > Some months ago we merged GEODE-8231 to solve a problem we observed
    >     > regarding the native client was trying to connect to stopped server.
    >     > GEODE-8231 solution consists on remove the client metadata when an
    > "IO
    >     > error in handshake" exception is received. This fix solved most of
    > our
    >     > problems, but it has been observed that sometimes when a server is
    > stopped
    >     > the errors received in the client are not the same and this "IO
    > error in
    >     > handshake" takes up to a minute to appear. So during that time, the
    > client
    >     > is still trying to connect to the offline server.
    >     >
    >     > As the error received during that time is "timeout in handshake", we
    > have
    >     > tested modyfing the solution of GEODE-8213 to make the client to
    > remove the
    >     > metadata once a timeout error is received (here is a draft with the
    > code:
    >     >
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cburghardte%40vmware.com%7Cd73403fcd2df4b9d1d0a08d85b443413%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637359695795955165&amp;sdata=QeXlk3XdqPn5l0jytgNYja%2Fykvm%2FFz5PySvCv8WXa2E%3D&amp;reserved=0).
    > With this change in
    >     > place, the behavior is ok.
    >     >
    >     >
    >     > But I would like to check your opinion about this check, because
    > this will
    >     > cause that a single timeout will cause the removal of the client
    > metadata,
    >     > which maybe its not the best solution. I thought about different
    >     > alternatives:
    >     >
    >     > - Wait until a given number of timeouts in a row have been received
    > from
    >     > the same server to remove the metadata
    >     > - Make this "remove-metadata-after-timeout" something optional that
    > could
    >     > be configured if needed
    >     >
    >     > As this will misalign the behavior of Java and C++ clients, making
    > this an
    >     > optional configuration will be more appropriate, to keep the default
    > c++
    >     > client behavior as the Java client.
    >     >
    >     > BR/
    >     >
    >     > Alberto B.
    >     >
    >
    >


Re: Clean C++ client metadata in timeouts

Posted by Dave Barnes <db...@apache.org>.
If a straight-up change solves a constant headache, as you suggest,
Alberto, and as Blake concurs, that sounds like the way to go.
Why introduce a new option or property if the user will always prefer one
behavior over the other? (And from a docs perspective, who needs another
optional property, anyway?)

On Thu, Sep 17, 2020 at 10:32 AM Blake Bender <bb...@vmware.com> wrote:

> Given that attempts to retrieve metadata after the C++ cache is closed are
> a constant headache for Geode Native development, I am generally in favor
> of anything that potentially reduces the number of times/places this
> happens.  If we've failed the handshake, it's very unlikely things will
> correct themselves without outside intervention, so this fix is probably
> goodness.  I'd go ahead and submit a PR when you think it's solid.
>
> Thanks,
>
> Blake
>
>
> On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org> wrote:
>
>     Alberto,
>     Are there cases in which one or two timeouts are followed by a
> successful
>     retry? Or does one timeout *always* end with more timeouts and,
> ultimately,
>     an IO error?
>     If timeouts can sometimes be followed by successful retries, and
> re-trying
>     is the current default behavior, then I agree that introducing a
> setting
>     that effectively eliminates re-tries should be the developer's choice.
>     In that case, I suggest that the option should not be a low-level
> choice of
>     "handle the metadata in a way that eliminates retries" but should be
> higher
>     level, like "when attempting to connect, try only once, instead of
>     re-trying (the default behavior)."
>     -Dave
>
>     On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
>     <al...@est.tech> wrote:
>
>     > Hi geode-dev,
>     >
>     > I have a question about the c++ client.
>     >
>     > Some months ago we merged GEODE-8231 to solve a problem we observed
>     > regarding the native client was trying to connect to stopped server.
>     > GEODE-8231 solution consists on remove the client metadata when an
> "IO
>     > error in handshake" exception is received. This fix solved most of
> our
>     > problems, but it has been observed that sometimes when a server is
> stopped
>     > the errors received in the client are not the same and this "IO
> error in
>     > handshake" takes up to a minute to appear. So during that time, the
> client
>     > is still trying to connect to the offline server.
>     >
>     > As the error received during that time is "timeout in handshake", we
> have
>     > tested modyfing the solution of GEODE-8213 to make the client to
> remove the
>     > metadata once a timeout error is received (here is a draft with the
> code:
>     >
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cbblake%40vmware.com%7Cee9cfd61173047c7247808d85b27c3c8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637359573636742453&amp;sdata=FUhQIAalNs0PK4vFvgnVZPV55cLPykD2cvDRwgRrNj0%3D&amp;reserved=0).
> With this change in
>     > place, the behavior is ok.
>     >
>     >
>     > But I would like to check your opinion about this check, because
> this will
>     > cause that a single timeout will cause the removal of the client
> metadata,
>     > which maybe its not the best solution. I thought about different
>     > alternatives:
>     >
>     > - Wait until a given number of timeouts in a row have been received
> from
>     > the same server to remove the metadata
>     > - Make this "remove-metadata-after-timeout" something optional that
> could
>     > be configured if needed
>     >
>     > As this will misalign the behavior of Java and C++ clients, making
> this an
>     > optional configuration will be more appropriate, to keep the default
> c++
>     > client behavior as the Java client.
>     >
>     > BR/
>     >
>     > Alberto B.
>     >
>
>

Re: Clean C++ client metadata in timeouts

Posted by Blake Bender <bb...@vmware.com>.
Given that attempts to retrieve metadata after the C++ cache is closed are a constant headache for Geode Native development, I am generally in favor of anything that potentially reduces the number of times/places this happens.  If we've failed the handshake, it's very unlikely things will correct themselves without outside intervention, so this fix is probably goodness.  I'd go ahead and submit a PR when you think it's solid.

Thanks,

Blake


On 9/17/20, 9:36 AM, "Dave Barnes" <db...@apache.org> wrote:

    Alberto,
    Are there cases in which one or two timeouts are followed by a successful
    retry? Or does one timeout *always* end with more timeouts and, ultimately,
    an IO error?
    If timeouts can sometimes be followed by successful retries, and re-trying
    is the current default behavior, then I agree that introducing a setting
    that effectively eliminates re-tries should be the developer's choice.
    In that case, I suggest that the option should not be a low-level choice of
    "handle the metadata in a way that eliminates retries" but should be higher
    level, like "when attempting to connect, try only once, instead of
    re-trying (the default behavior)."
    -Dave

    On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
    <al...@est.tech> wrote:

    > Hi geode-dev,
    >
    > I have a question about the c++ client.
    >
    > Some months ago we merged GEODE-8231 to solve a problem we observed
    > regarding the native client was trying to connect to stopped server.
    > GEODE-8231 solution consists on remove the client metadata when an "IO
    > error in handshake" exception is received. This fix solved most of our
    > problems, but it has been observed that sometimes when a server is stopped
    > the errors received in the client are not the same and this "IO error in
    > handshake" takes up to a minute to appear. So during that time, the client
    > is still trying to connect to the offline server.
    >
    > As the error received during that time is "timeout in handshake", we have
    > tested modyfing the solution of GEODE-8213 to make the client to remove the
    > metadata once a timeout error is received (here is a draft with the code:
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode-native%2Fpull%2F651&amp;data=02%7C01%7Cbblake%40vmware.com%7Cee9cfd61173047c7247808d85b27c3c8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637359573636742453&amp;sdata=FUhQIAalNs0PK4vFvgnVZPV55cLPykD2cvDRwgRrNj0%3D&amp;reserved=0). With this change in
    > place, the behavior is ok.
    >
    >
    > But I would like to check your opinion about this check, because this will
    > cause that a single timeout will cause the removal of the client metadata,
    > which maybe its not the best solution. I thought about different
    > alternatives:
    >
    > - Wait until a given number of timeouts in a row have been received from
    > the same server to remove the metadata
    > - Make this "remove-metadata-after-timeout" something optional that could
    > be configured if needed
    >
    > As this will misalign the behavior of Java and C++ clients, making this an
    > optional configuration will be more appropriate, to keep the default c++
    > client behavior as the Java client.
    >
    > BR/
    >
    > Alberto B.
    >


Re: Clean C++ client metadata in timeouts

Posted by Dave Barnes <db...@apache.org>.
Alberto,
Are there cases in which one or two timeouts are followed by a successful
retry? Or does one timeout *always* end with more timeouts and, ultimately,
an IO error?
If timeouts can sometimes be followed by successful retries, and re-trying
is the current default behavior, then I agree that introducing a setting
that effectively eliminates re-tries should be the developer's choice.
In that case, I suggest that the option should not be a low-level choice of
"handle the metadata in a way that eliminates retries" but should be higher
level, like "when attempting to connect, try only once, instead of
re-trying (the default behavior)."
-Dave

On Thu, Sep 17, 2020 at 7:42 AM Alberto Bustamante Reyes
<al...@est.tech> wrote:

> Hi geode-dev,
>
> I have a question about the c++ client.
>
> Some months ago we merged GEODE-8231 to solve a problem we observed
> regarding the native client was trying to connect to stopped server.
> GEODE-8231 solution consists on remove the client metadata when an "IO
> error in handshake" exception is received. This fix solved most of our
> problems, but it has been observed that sometimes when a server is stopped
> the errors received in the client are not the same and this "IO error in
> handshake" takes up to a minute to appear. So during that time, the client
> is still trying to connect to the offline server.
>
> As the error received during that time is "timeout in handshake", we have
> tested modyfing the solution of GEODE-8213 to make the client to remove the
> metadata once a timeout error is received (here is a draft with the code:
> https://github.com/apache/geode-native/pull/651). With this change in
> place, the behavior is ok.
>
>
> But I would like to check your opinion about this check, because this will
> cause that a single timeout will cause the removal of the client metadata,
> which maybe its not the best solution. I thought about different
> alternatives:
>
> - Wait until a given number of timeouts in a row have been received from
> the same server to remove the metadata
> - Make this "remove-metadata-after-timeout" something optional that could
> be configured if needed
>
> As this will misalign the behavior of Java and C++ clients, making this an
> optional configuration will be more appropriate, to keep the default c++
> client behavior as the Java client.
>
> BR/
>
> Alberto B.
>