Posted to common-issues@hadoop.apache.org by "Hrishikesh Gadre (JIRA)" <ji...@apache.org> on 2017/02/01 03:21:52 UTC

[jira] [Comment Edited] (HADOOP-14044) Synchronization issue in delegation token cancel functionality

    [ https://issues.apache.org/jira/browse/HADOOP-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847916#comment-15847916 ] 

Hrishikesh Gadre edited comment on HADOOP-14044 at 2/1/17 3:21 AM:
-------------------------------------------------------------------

[~xiaochen] Thanks for the feedback. Patch-1 was just a starting point for the discussion (it doesn't fully compile, as you can see from the build output :)).

bq. So it seems there's no good way to satisfy your request of 'when 2 callers are cancelling, exactly 1 should see success'. I guess this may end up inline with zookeeper's documentation - client has to handle it.

I don't think that is quite accurate. I understand that if two callers invoke the cancel operation *concurrently*, it may not be possible to figure out which client actually deleted the token. But in my case there is just a single caller invoking the cancel operation one call after the other. Hence I think this is a case of *application level* cache inconsistency being exposed to the client.
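
To make the scenario concrete, here is a minimal sketch of the stale-cache race. The class and method names are made up for illustration; the real logic lives in ZKDelegationTokenSecretManager, and plain sets stand in for the ZK-backed store and the per-server cache:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class StaleCacheSketch {
    // Per-server in-memory cache, normally kept in sync via a ZK watch.
    private final Set<String> localCache = ConcurrentHashMap.newKeySet();
    // Stand-in for the shared token store backed by ZooKeeper.
    private final Set<String> zkStore = ConcurrentHashMap.newKeySet();

    void seedCache(String token) { localCache.add(token); }
    void seedStore(String token) { zkStore.add(token); }

    int cancel(String token) {
        if (!localCache.contains(token)) {
            return 404; // cache already in sync: the expected response
        }
        // The check above passed against a stale cache; the znode may be gone.
        boolean deleted = zkStore.remove(token);
        localCache.remove(token);
        if (!deleted) {
            // Corresponds to the ERROR logged by the server in this scenario.
            System.err.println("Attempted to remove a non-existing znode");
        }
        return 200; // success reported for an already-cancelled token
    }

    public static void main(String[] args) {
        StaleCacheSketch s1 = new StaleCacheSketch();
        // S2 already removed the token from ZK, but S1's watch has not
        // fired yet, so S1's cache still contains it.
        s1.seedCache("DT_XYZ");
        System.out.println(s1.cancel("DT_XYZ")); // prints 200
    }
}
```

Depending on whether the watch fires before the second cancel request, the same client sees either 404 or 200, which is exactly the inconsistency described above.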

If we don't want to break binary compatibility at the API level, would it be OK to change the API semantics? For example, an alternative would be for the server to return HTTP 200 in all cases (instead of sending 404 for a nonexistent token). This would ensure that the server provides a consistent response to the client regardless of any cache inconsistency on the server side. The [REST API semantics for the DELETE method|http://restcookbook.com/HTTP%20Methods/idempotency/] also seem to favor this approach.
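
As a sketch of the proposed semantics (names are hypothetical, not actual Hadoop APIs): the handler ignores whether the token was actually present, because the post-condition "token no longer exists" holds either way, making repeated cancels indistinguishable to the client:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentCancelSketch {
    // Stand-in for the server-side token store.
    private final Set<String> tokens = ConcurrentHashMap.newKeySet();

    public IdempotentCancelSketch(String... initial) {
        for (String t : initial) {
            tokens.add(t);
        }
    }

    /**
     * Returns the HTTP status for a cancel request. Always 200: whether or
     * not the token existed, it does not exist afterwards, so the response
     * is consistent across servers and across repeated requests.
     */
    public int cancel(String token) {
        tokens.remove(token); // result intentionally ignored
        return 200;
    }

    public static void main(String[] args) {
        IdempotentCancelSketch s = new IdempotentCancelSketch("DT_1");
        System.out.println(s.cancel("DT_1")); // prints 200
        System.out.println(s.cancel("DT_1")); // prints 200 again
    }
}
```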

What do you think?






> Synchronization issue in delegation token cancel functionality
> --------------------------------------------------------------
>
>                 Key: HADOOP-14044
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14044
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Hrishikesh Gadre
>            Assignee: Hrishikesh Gadre
>         Attachments: dt_fail.log, dt_success.log, HADOOP-14044-001.patch
>
>
> We are using Hadoop delegation token authentication functionality in Apache Solr. As part of the integration testing, I found following issue with the delegation token cancelation functionality.
> Consider a setup with 2 Solr servers (S1 and S2) which are configured to use delegation token functionality backed by Zookeeper. Now invoke following steps,
> [Step 1] Send a request to S1 to create a delegation token.
>   (Delegation token DT is created successfully)
> [Step 2] Send a request to cancel DT to S2
>   (DT is canceled successfully. client receives HTTP 200 response)
> [Step 3] Send a request to cancel DT to S2 again
>   (DT cancelation fails. client receives HTTP 404 response)
> [Step 4] Send a request to cancel DT to S1
> At this point we get two different responses.
> - DT cancelation fails. client receives HTTP 404 response
> - DT cancelation succeeds. client receives HTTP 200 response
> Also, as per the current implementation, each server maintains an in-memory cache of current tokens which is updated using the ZK watch mechanism, e.g. the ZK watch on S1 will ensure that its in-memory cache is synchronized after step 2.
> After investigation, I found that the root cause of this behavior is a race condition between step 4 and the firing of the ZK watch on S1. Whenever the watch fires before step 4, we get an HTTP 404 response (as expected). When that is not the case, we get an HTTP 200 response along with the following ERROR message in the log,
> {noformat}
> Attempted to remove a non-existing znode /ZKDTSMTokensRoot/DT_XYZ
> {noformat}
> From the client's perspective, the server *should* return an HTTP 404 error when a cancel request is sent for an invalid token.
> Here is the relevant Solr unit test for reference:
> https://github.com/apache/lucene-solr/blob/746786636404cdb8ce505ed0ed02b8d9144ab6c4/solr/core/src/test/org/apache/solr/cloud/TestSolrCloudWithDelegationTokens.java#L285



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
