You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/07/13 10:30:12 UTC

[GitHub] [flink] gaborgsomogyi opened a new pull request, #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

gaborgsomogyi opened a new pull request, #20265:
URL: https://github.com/apache/flink/pull/20265

   ## What is the purpose of the change
   
   At the moment the delegation token framework only obtains/re-obtains tokens but not used anywhere. In this PR I've added a functionality to propagate the tokens from JobManager to TaskManagers.
   
   ## Brief change log
   
   * Send current tokens to a TaskManager in case of successful registration
   * Send newly obtained tokens to all registered TaskManagers
   * Added `DelegationTokenUpdater` to update tokens for the current user
   * Added new unit tests
   
   ## Verifying this change
   
   * Additional unit tests
   * Manually on YARN and K8S (please see an example [here](https://gist.github.com/gaborgsomogyi/ac4f71ead8494da2f5c35265bcb1e885))
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? All documentation is intended to be added in [FLINK-25911](https://issues.apache.org/jira/browse/FLINK-25911) when everything works as a whole
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] mbalassi merged pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
mbalassi merged PR #20265:
URL: https://github.com/apache/flink/pull/20265


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1278709659

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1183176067

   Please be aware that this PR is intended to propagate tokens when TMs are registering themself in the JM.
   Unregistered TM token propagation is intended to be handled in separate jira.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1228604965

   Just resolved the conflicts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1183174122

   I've tested it on real workload and here are the TM/JM logs.
   [jm.log](https://github.com/apache/flink/files/9102158/jm.log)
   [tm.log](https://github.com/apache/flink/files/9102163/tm.log)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1183205067

   Just to make it clear since DT framework is sensitive feature and more testing is needed this need to go into 1.17 to be on the safe side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] mbalassi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
mbalassi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1277722261

   I am starting the review to merge this into the current master, 1.17-SNAPSHOT. Just to make sure that we are good I am running @gaborgsomogyi's [instructions](https://gist.github.com/gaborgsomogyi/ac4f71ead8494da2f5c35265bcb1e885) on top of the new snapshot version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] flinkbot commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
flinkbot commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1183057726

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "39e90dad7ff55fe01d64a13e6cfaf848f345e356",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "39e90dad7ff55fe01d64a13e6cfaf848f345e356",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 39e90dad7ff55fe01d64a13e6cfaf848f345e356 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on a diff in pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on code in PR #20265:
URL: https://github.com/apache/flink/pull/20265#discussion_r957334174


##########
flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java:
##########
@@ -157,6 +160,8 @@
 
     private final DelegationTokenManager delegationTokenManager;
 
+    private AtomicReference<byte[]> latestTokens = new AtomicReference<>();

Review Comment:
   Fixed.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/security/token/DelegationTokenUpdater.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.security.token;
+
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+/** Delegation token updater functionality. */
+public class DelegationTokenUpdater {

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1237174092

   @ferenc-csaky I think you're right in general and I would give support if this general question would pop-up somewhere in the community. I think the solution need to be considered well since this pattern exists for example in `sendOperatorEventToTask` (amongst other places) which is heavily used.
   
   @mbalassi @gyfora 1.16 has been cut, could you plz have a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] ferenc-csaky commented on a diff in pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
ferenc-csaky commented on code in PR #20265:
URL: https://github.com/apache/flink/pull/20265#discussion_r925688938


##########
flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java:
##########
@@ -157,6 +160,8 @@
 
     private final DelegationTokenManager delegationTokenManager;
 
+    private AtomicReference<byte[]> latestTokens = new AtomicReference<>();

Review Comment:
   This field could be `final` and I would move the init to the ctor, because every other field is initialized there.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/security/token/DelegationTokenUpdater.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.security.token;
+
+import org.apache.hadoop.security.Credentials;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+/** Delegation token updater functionality. */
+public class DelegationTokenUpdater {

Review Comment:
   This seems like a static helper class, which could be `final` and have a `private` default ctor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1277739294

   Thanks, made a push rebased to the latest master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1230321242

   > One general question: is catching Throwable a good thing here? I see that there are rethrowIfFatalError calls and it is a common pattern in the existing code, but that method only cares about a subset of errors (e.g. it ignores OOM or StackOverlflow).
   
   Not sure what you in-depth mean but the original and my current intention here is clear. Catch and log everything because it's always good if it blows up with log entry. On the other hand non-fatal exceptions can be intermittent, retry comes at later timepoint where it could be successful. Fatal exceptions however must stop the workload.
   
   If you think there is a bug in `rethrowIfFatalError` then a separate jira would be better to discuss. The reasons are simple from my perspective:
   * Several execution paths are involved since it's widely used
   * Rewriting rethrowIfFatalError` logic because it's maybe faulty would be overkill
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1231555986

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] gaborgsomogyi commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
gaborgsomogyi commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1183174467

   cc @mbalassi @gyfora 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] ferenc-csaky commented on pull request #20265: [FLINK-25910][runtime][security] Propagate obtained delegation tokens to TaskManagers

Posted by GitBox <gi...@apache.org>.
ferenc-csaky commented on PR #20265:
URL: https://github.com/apache/flink/pull/20265#issuecomment-1233134897

   > > One general question: is catching Throwable a good thing here? I see that there are rethrowIfFatalError calls and it is a common pattern in the existing code, but that method only cares about a subset of errors (e.g. it ignores OOM or StackOverlflow).
   > 
   > Not sure what you in-depth mean but the original and my current intention here is clear. Catch and log everything because it's always good if it blows up with log entry. On the other hand non-fatal exceptions can be intermittent, retry comes at later timepoint where it could be successful. Fatal exceptions however must stop the workload.
   > 
   > If you think there is a bug in `rethrowIfFatalError` then a separate jira would be better to discuss. The reasons are simple from my perspective:
   > 
   >     * Several execution paths are involved since it's widely used
   > 
   >     * Rewriting rethrowIfFatalError` logic because it's maybe faulty would be overkill
   
   Me neither, I do not have in-depth knowledge, I just wanted to mention this, because IMO catching `Throwable` in general is not a good practice, cause `Error`s are non-recoverable events and the stack trace will be printed at anyways. I'm sure this is not really part of this PR's context and following the already existing practice is the way to go, I just started to wonder is it right in the first place or not. :)
   
   The logic itself LGTM for me!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org