Posted to issues@geode.apache.org by "Ernest Burghardt (Jira)" <ji...@apache.org> on 2021/04/13 20:57:00 UTC

[jira] [Updated] (GEODE-9147) Dropped keys in single-hop PUTALL request when one or more servers is unreachable

     [ https://issues.apache.org/jira/browse/GEODE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ernest Burghardt updated GEODE-9147:
------------------------------------
    Description: 
For single-hop PUTALL, Geode native breaks the request from the app up as follows:

i. Each entry's key is hashed to a bucket, the server owning that bucket is looked up in the metadata, and the entry is added to a server-specific list for that server.

ii. Once every entry has been assigned to a list, Geode native spins up a thread for each list and sends a PUTALL to each server (see the sketch below).
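
To make that flow concrete, here is a minimal C++ sketch of the split-and-dispatch. All names here (Key, ServerLocation, findServer, sendPutAll, singleHopPutAll) are illustrative stand-ins, not the real geode-native internals:

{code:cpp}
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Key = std::string;
using Value = std::string;
using Entry = std::pair<Key, Value>;
using ServerLocation = std::string;

// Stand-in for the metadata lookup: key -> bucket -> owning server.
std::optional<ServerLocation> findServer(const Key& key) {
  // Toy mapping: three buckets, one server per bucket.
  return "server" + std::to_string(std::hash<Key>{}(key) % 3);
}

// Stand-in for the wire call; the real client sends a PUTALL message here.
void sendPutAll(const ServerLocation& server, const std::vector<Entry>& entries) {}

void singleHopPutAll(const std::map<Key, Value>& request) {
  // i. Hash each key to a bucket, look up the bucket's server in the
  //    metadata, and add the entry to that server's list.
  std::map<ServerLocation, std::vector<Entry>> perServer;
  for (const auto& [key, value] : request) {
    if (auto server = findServer(key)) {
      perServer[*server].emplace_back(key, value);
    }
    // A failed lookup produces the "leftover keys" discussed below.
  }

  // ii. One thread per server-specific list, each sending its own PUTALL.
  std::vector<std::thread> senders;
  for (const auto& [server, entries] : perServer) {
    senders.emplace_back(sendPutAll, server, entries);
  }
  for (auto& t : senders) t.join();
}
{code}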

 

When a server can't be reached by Geode native, its entries are removed from the metadata and the bucket-to-server lookup fails.  This situation is handled as follows:

i. The size of the "leftover keys" list is divided by the number of servers, then 1 is added to compensate for any fractional piece.

ii. That many keys are added to each remaining list going to a server that is still reachable.

iii. We proceed normally, sending one list to each server, each on its own thread (a sketch of this redistribution follows).
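
A minimal sketch of that redistribution, again with hypothetical names (redistributeLeftovers and the per-server map are illustrative, not the actual client code):

{code:cpp}
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;
using ServerLocation = std::string;

// Spread the leftover entries (whose server lookup failed) across the
// lists that still have a reachable server.
void redistributeLeftovers(
    std::map<ServerLocation, std::vector<Entry>>& perServer,
    const std::vector<Entry>& leftovers) {
  if (perServer.empty() || leftovers.empty()) return;

  // i. leftovers / servers, plus 1 so any fractional piece still fits.
  const std::size_t chunk = leftovers.size() / perServer.size() + 1;

  // ii. Append up to `chunk` leftover entries to each remaining list.
  auto it = leftovers.begin();
  for (auto& [server, entries] : perServer) {
    for (std::size_t i = 0; i < chunk && it != leftovers.end(); ++i) {
      entries.push_back(*it++);
    }
  }
  // iii. The caller then proceeds normally: one PUTALL per list, each
  //      sent on its own thread.
}
{code}

The sketch omits eventId assignment; per the report, each fractional piece carries the same threadId with an incrementing sequenceId, which is what sets up the bug described next.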

 

_Unfortunately_, this scheme can lead to data loss: each of the fractional pieces of the list intended for the unreachable server carries an eventId with the same threadId and an incrementing sequenceId.  If any of our PUTALL threads happen to send out of order, the earlier sequenceIds will be marked as already "seen" on the server and _dropped_.
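
A toy model of a server-side duplicate filter makes the failure easy to reproduce. The class and method names here are invented for illustration, not Geode's actual implementation:

{code:cpp}
#include <cstdint>
#include <iostream>
#include <map>

struct EventId {
  std::int64_t threadId;
  std::int64_t sequenceId;
};

// Toy duplicate filter: remembers the highest sequenceId applied per
// threadId and drops anything at or below it as already "seen".
class SeenFilter {
  std::map<std::int64_t, std::int64_t> highestSeen_;
 public:
  bool accept(const EventId& e) {
    auto [it, inserted] = highestSeen_.try_emplace(e.threadId, e.sequenceId);
    if (!inserted) {
      if (e.sequenceId <= it->second) return false;  // dropped as a duplicate
      it->second = e.sequenceId;
    }
    return true;
  }
};

int main() {
  SeenFilter server;
  // Three leftover chunks share threadId 7 with sequenceIds 1, 2, 3.
  // Suppose the thread carrying chunk 3 happens to reach the server first:
  std::cout << server.accept({7, 3}) << '\n';  // 1 -> applied
  std::cout << server.accept({7, 1}) << '\n';  // 0 -> silently dropped
  std::cout << server.accept({7, 2}) << '\n';  // 0 -> silently dropped
}
{code}

Run as-is, this prints 1, 0, 0: the chunks with sequenceIds 1 and 2 are discarded as duplicates, which is exactly the data loss described above.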

 

 

  was:
For single-hop PUTALL, Geode native breaks the request from the app up as follows:

i. Each entry's key is hashed to a bucket, the server owning that bucket is looked up in the metadata, and the entry is added to a server-specific list for that server.

ii. Once every entry has been assigned to a list, Geode native spins up a thread for each list and sends a PUTALL to each server.

 

When a server can't be reached by Geode native, its entries are removed from the metadata and the bucket-to-server lookup fails.  This situation is handled as follows:

i. The size of the "leftover keys" list is divided by the number of servers, then 1 is added to compensate for any fractional piece.

ii. That many keys are added to each remaining list going to a server that is still reachable.

iii. We proceed normally, sending one list to each server, each on its own thread.

 

_Unfortunately_, this scheme can lead to data loss: each of the fractional pieces of the list intended for the unreachable server carries an eventId with the same threadId and an incrementing sequenceId.  If any of our PUTALL threads happen to send out of order, the earlier sequenceIds will be marked as already "seen" on the server and _dropped_.

 

We have identified three ways to solve this problem:

i. In the "big" PUTALL, tack all the keys for the unreachable server onto a single one of the existing server-specific lists.

ii. Keep the keys for the unreachable server in their own separate list, and send just that list in a PUTALL to a randomly selected server we _can_ reach (sketched below).

iii. Punt completely and drop back to multi-hop, sending _all_ the keys in the "big" PUTALL in a single list.
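
For illustration only, a hedged C++ sketch of option ii; sendLeftoversToRandomServer and sendPutAll are hypothetical names standing in for whatever the real fix would touch:

{code:cpp}
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;
using ServerLocation = std::string;

// Stand-in for the wire call.
void sendPutAll(const ServerLocation& server, const std::vector<Entry>& entries) {}

// Option ii: keep the leftovers in one list (one eventId stream) and
// send it to a single randomly chosen server that is still reachable.
void sendLeftoversToRandomServer(
    const std::map<ServerLocation, std::vector<Entry>>& perServer,
    const std::vector<Entry>& leftovers) {
  if (perServer.empty() || leftovers.empty()) return;

  // Pick one reachable server uniformly at random.
  std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<std::size_t> pick(0, perServer.size() - 1);
  auto it = std::next(perServer.begin(),
                      static_cast<std::ptrdiff_t>(pick(rng)));

  // One PUTALL for the whole leftover list: its sequenceIds stay in
  // order, so none can be mistaken for an already-"seen" event.
  sendPutAll(it->first, leftovers);
}
{code}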

 


> Dropped keys in single-hop PUTALL request when one or more servers is unreachable
> ---------------------------------------------------------------------------------
>
>                 Key: GEODE-9147
>                 URL: https://issues.apache.org/jira/browse/GEODE-9147
>             Project: Geode
>          Issue Type: Bug
>          Components: native client
>            Reporter: Blake Bender
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)