You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org> on 2021/04/22 10:39:04 UTC

[Impala-ASF-CR] IMPALA-9155: Add recovery mechanism to admission service

Bikramjeet Vig has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17332


Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................

IMPALA-9155: Add recovery mechanism to admission service

Major changes:
- Leverages the admission heartbeat mechanism to signal the
coordinator to send its complete admission state
- No RPCs are serviced by a coordinator unless it sends its complete
admission state. This is to prevent making admission decisions till
admission service has built its view of the cluster
- The complete admission state consists of the states of all queries
that have successfully been admitted, that is, received a valid
schedule from the admission controller and have marked its admission
as complete (for remote admission it means its pending admit status
has transitioned from true to false)
- This helps prevent sending incomplete/inconsistent state to the
admission controller
- Queries that have not started admission get a chance to send their
request to the new service
- Queries that are queued restart the admission process by sending
the request again. This re-try is now also marked in the query profile
- Other RPCs like ReleaseBackend, ReleaseQuery, CancelQuery that
don't get serviced (till initial admission state is sent) can result
in inconsistent state. This state will be rectified in the admission
heartbeats
- AdmitQuery and GetQueryStatus just retry again if they notice a
network failure(assuming admissiond might be down/restarting) or
received the error message that they cannot be serviced yet
admissiond is waiting on initial state from this coordinator)

Limitations:
- Rebuilding the state can not ensure that queued queries will
maintain their spot in the queue.
- Queries can be admitted before all coordinators get a chance to
send their state. This can result in a brief period of over-admission
We cannot rely completely on the statestore membership update and
wait for all coordinators there to send admission state because
that membership is also dynamic which makes it difficult to decide
when to assume that the admission state is complete.

Testing:
- Added end to end tests

Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
---
M be/src/runtime/coordinator-backend-resource-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/scheduling/admission-control-client.cc
M be/src/scheduling/admission-control-client.h
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller-test.cc
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/local-admission-control-client.cc
M be/src/scheduling/local-admission-control-client.h
M be/src/scheduling/remote-admission-control-client.cc
M be/src/scheduling/remote-admission-control-client.h
M be/src/scheduling/schedule-state.cc
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/protobuf/admission_control_service.proto
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_admission_controller.py
23 files changed, 674 insertions(+), 65 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/17332/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17332/2/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/17332/2/tests/custom_cluster/test_admission_controller.py@1615
PS2, Line 1615: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/17332/2/tests/custom_cluster/test_admission_controller.py@1628
PS2, Line 1628: "
flake8: E126 continuation line over-indented for hanging indent



-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 21:48:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9155: Add recovery mechanism to admission service

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................


Patch Set 1:

(10 comments)

Some more comments

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@9
PS1, Line 9: Major changes:
I wrote my own version of the commit message to see that I understand the change. Is this correct? Feel free to use this

All heartbeats from the coordinator already contain a unique id for the coordinator.
The admissiond uses this unique id to track the coordinators that have sent their full admission state. When the admissiond starts up (in particular when it restarts) it will not not service any requests from coordinators until it has received the full admission state of that coordinator. The trigger for coordinators to resend their full admission state is a new flag set in replies to the AdmissionHeartbeat messages that every coordinator periodically sends to the admissiond. If the new flag is set in the reply that the coordinator receives then the full admission state is sent to the admissiond.

Note that when the admissiond restarts it will service requests from any coordinator that has sent its full admission state. This means that over-admission is possible during the period when not every coordinator has sent its full admission state.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc
File be/src/scheduling/admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc@32
PS1, Line 32:     "Re-submit for admission due to a possible admission service restart";
Is there a reason it is only a "possible" restart?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc
File be/src/scheduling/admission-control-service.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@304
PS1, Line 304:     LOG(WARNING) << "Received heartbeat from unrecognized coord_id=" << req->coord_id();
Maybe WARNING is too high, we do expect to see this after a restart, maybe just make the message say that?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@332
PS1, Line 332:       LOG(INFO) << "Query id already exists. Can be lingering from old coord_id="
Is it possible that the admissionstate we have just received is more up-to-date than that in admission_state_map_ ?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@449
PS1, Line 449:   LOG(INFO) << "Adding to list of known coordinators, coord_id=" << coord_id;
The size of known_coord_ids_ is interesting, maybe add it to the message.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h
File be/src/scheduling/remote-admission-control-client.h:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h@54
PS1, Line 54:   /// TODO: add info on what this does? here or in the class comments
Yes it seems like these methods don't have descriptions


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc
File be/src/scheduling/remote-admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@131
PS1, Line 131:         retry_admission = true;
I think this was already set to true?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@158
PS1, Line 158:     //is network error or admisiond did not recognize this coord.
Nit: "admissiond"


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1604
PS1, Line 1604:   def __compare_admission_json(self, json1, json2):
This is a clever way to test this change.


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1606
PS1, Line 1606:     # that are updated immediately and admission changes are compared.
"and admission changes" is slightly unclear, maybe "is changed by admissiond"



-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 May 2021 19:46:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9155: Add recovery mechanism to admission service

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8620/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Apr 2021 10:59:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9412/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 22:09:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration
......................................................................


Patch Set 2:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@9
PS1, Line 9: 
> I wrote my own version of the commit message to see that I understand the c
Pretty much. The only thing I would add is that if a backend is marked as down by the statestore, it will also have to register(send full admission state) again with the admissiond to be able to be serviced


http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@12
PS1, Line 12: - Leverages the admission heartbeat mechanism to signal the
> Nit" should this be "No RPCs are serviced from a coordinator until it has s
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc
File be/src/scheduling/admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc@32
PS1, Line 32:     "Re-submit for admission due to a possible admission service restart or network "
> Is there a reason it is only a "possible" restart?
as this can also be due to a generic network error that prevents the Admit RPC to go through.
Added this to the error msg. Open to suggestion for a better message text


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc
File be/src/scheduling/admission-control-service.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@304
PS1, Line 304:     LOG(INFO) << "Received heartbeat from unrecognized coord_id=" << req->coord_id();
> Maybe WARNING is too high, we do expect to see this after a restart, maybe 
Just a log line at the INFO level should suffice in that case


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@332
PS1, Line 332:   bool registered_on_ac_service =
> Is it possible that the admissionstate we have just received is more up-to-
I have made both the RebuildAdmissionState and  CancelQueriesOnFailedCoordinators atomic operations so there should be no inconsistency anymore. Also if the coord is already registered it would return an OK status. In that case as well there should be no inconsistency as all RPCs after the first successful call to RebuildAdmissionState would be serviced and ideally any subsequent calls for RebuildAdmissionState should just be previous retries.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@449
PS1, Line 449:     lock_guard<mutex> l(admission_state->lock);
> The size of known_coord_ids_ is interesting, maybe add it to the message.
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc@2103
PS1, Line 2103:   DCHECK(
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc@2104
PS1, Line 2104:       num_backends_to_release_.find(state->query_id()) == num_backends_to_release_.end());
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h
File be/src/scheduling/remote-admission-control-client.h:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h@54
PS1, Line 54:   /// TODO: add info on what this does? here or in the class comments
> Yes it seems like these methods don't have descriptions
will add these in the next update


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc
File be/src/scheduling/remote-admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@131
PS1, Line 131:         // retry_admission can be false if AttemptAdmissionAndWait succeeded but a
> I think this was already set to true?
This might not be set to true if the RPC succeeded but the impala-server initiated a ResetPendingAdmit. This can happen in the following case:
- the RPCs (both admit and getQueryStatus) succeeded on a previous instance of the admissiond
- The coordinator re-registered with an ongoing (old or a new restarted) instance of the admissiond.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@158
PS1, Line 158:   while (admit_rpc_status.IsNetworkError()
> Nit: "admissiond"
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto
File common/protobuf/admission_control_service.proto:

http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto@280
PS1, Line 280:   /// quickly, eg. if the query is rejected. We should evaluate the benefits of saving a
> Move new comment to before the TODO to clarify that it is not part of the T
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto@301
PS1, Line 301:   /// admissiond started. TODO: we can save an rpc if we combine the release of the final
> Move new comment to before the TODO to clarify that it is not part of the T
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1578
PS1, Line 1578: :
> flake8: E231 missing whitespace after ':'
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1585
PS1, Line 1585:  
> flake8: E226 missing whitespace around arithmetic operator
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1589
PS1, Line 1589:  
> flake8: E226 missing whitespace around arithmetic operator
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1606
PS1, Line 1606:     # that are updated immediately are compared.
> "and admission changes" is slightly unclear, maybe "is changed by admission
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1614
PS1, Line 1614: 
> flake8: E501 line too long (99 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1626
PS1, Line 1626: 
> flake8: E501 line too long (97 > 90 characters)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 21:48:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9155: Add recovery mechanism to admission service

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc@2103
PS1, Line 2103:   DCHECK(num_backends_to_release_.find(state->query_id()) == num_backends_to_release_.end());
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc@2104
PS1, Line 2104:   num_backends_to_release_[state->query_id()] = state->per_backend_schedule_states().size();
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1578
PS1, Line 1578: :
flake8: E231 missing whitespace after ':'


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1585
PS1, Line 1585: -
flake8: E226 missing whitespace around arithmetic operator


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1589
PS1, Line 1589: /
flake8: E226 missing whitespace around arithmetic operator


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1614
PS1, Line 1614: 2
flake8: E501 line too long (99 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1626
PS1, Line 1626: k
flake8: E501 line too long (97 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Apr 2021 10:40:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9155: Add recovery mechanism to admission service

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................


Patch Set 1:

(3 comments)

I read through the code once, saving my nits here

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@12
PS1, Line 12: - No RPCs are serviced by a coordinator unless it sends its complete
Nit" should this be "No RPCs are serviced from a coordinator until it has sent"?


http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto
File common/protobuf/admission_control_service.proto:

http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto@280
PS1, Line 280:   /// those situations. An error response is sent if the coordinator has not yet sent a
Move new comment to before the TODO to clarify that it is not part of the TODO


http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto@301
PS1, Line 301:   /// ReleaseQuery. An error response is sent if the coordinator has not yet sent a
Move new comment to before the TODO to clarify that it is not part of the TODO



-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 May 2021 21:28:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Hello Andrew Sherman, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17332

to look at the new patch set (#2).

Change subject: IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service and fix consistency between coord failure detection and registration
......................................................................

IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service
and fix consistency between coord failure detection and registration

Major changes:
IMPALA-9976:
- Leverages the admission heartbeat mechanism to signal the
coordinator to send its complete admission state
- No RPCs are serviced by a coordinator until it has sent its complete
admission state. This is to prevent making admission decisions till
admission service has built its view of the cluster
- The complete admission state consists of the states of all queries
that have successfully been admitted, that is, received a valid
schedule from the admission controller and have marked its admission
as complete (for remote admission it means its pending admit status
has transitioned from true to false)
- This helps prevent sending incomplete/inconsistent state to the
admission controller
- Queries that have not started admission get a chance to send their
request to the new service
- Queries that are queued restart the admission process by sending
the request again. This re-try is now also marked in the query profile
- Other RPCs like ReleaseBackend, ReleaseQuery, CancelQuery that
don't get serviced (till initial admission state is sent) can result
in inconsistent state. This state will be rectified in the admission
heartbeats
- AdmitQuery and GetQueryStatus just retry again if they notice a
network failure(assuming admissiond might be down/restarting) or
received the error message that they cannot be serviced yet
admissiond is waiting on initial state from this coordinator)

IMPALA-10866:
- Made sure that admission state removal on failure detection and
admission state rebuilding on coordinator registration are atomic
operations.
- Leverage statestore's membership view to detect failure and
allow coordinator registration.

Limitations:
- Rebuilding the state can not ensure that queued queries will
maintain their spot in the queue.
- Queries can be admitted before all coordinators get a chance to
send their state. This can result in a brief period of over-admission
We cannot rely completely on the statestore membership update and
wait for all coordinators there to send admission state because
that membership is also dynamic which makes it difficult to decide
when to assume that the admission state is complete.
- The functionalities for coordinator failure detection and
registration rely completely on the statestore.

Testing:
- Added end to end tests

Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
---
M be/src/runtime/coordinator-backend-resource-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/scheduling/admission-control-client.cc
M be/src/scheduling/admission-control-client.h
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller-test.cc
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/admissiond-env.cc
M be/src/scheduling/local-admission-control-client.cc
M be/src/scheduling/local-admission-control-client.h
M be/src/scheduling/remote-admission-control-client.cc
M be/src/scheduling/remote-admission-control-client.h
M be/src/scheduling/schedule-state.cc
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/protobuf/admission_control_service.proto
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_admission_controller.py
24 files changed, 756 insertions(+), 93 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/17332/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>