You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by yanghua <gi...@git.apache.org> on 2018/05/04 08:58:52 UTC

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

GitHub user yanghua opened a pull request:

    https://github.com/apache/flink/pull/5954

    [FLINK-9276] Improve error message when TaskManager fails

    ## What is the purpose of the change
    
    *This pull request improves error message when TaskManager fails*
    
    
    ## Brief change log
    
      - *add a Exception param from method `SlotPoolGateway#releaseTaskManager`*
      - *refactor the usage of method `SlotPoolGateway#releaseTaskManager`*
    
    ## Verifying this change
    
    
    This change is already covered by existing tests, such as *SchedulerTestBase/SchedulerIsolatedTasks*.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (yes / **no**)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
      - The S3 file system connector: (yes / **no** / don't know)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented? (not applicable / docs / JavaDocs / **not documented**)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanghua/flink FLINK-9276

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5954.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5954
    
----
commit 904f762e6b495ea6b4ffd445b9168f001f74be26
Author: yanghua <ya...@...>
Date:   2018-05-04T08:52:38Z

    [FLINK-9276] Improve error message when TaskManager fails

----


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    @aljoscha @kl0u 


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    cc @StephanEwen refactored code base on your suggestion, please review again, thanks.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    This looks fine from my side.
    
    Would like to get a +1 from @tillrohrmann before merging this...


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    Thanks for opening this PR @yanghua and sorry for my late reply. 
    
    I think we accidentally addressed this problem already with another PR (#6202). This was an oversight because you clearly opened the PR before. Since it basically covers the same functionality, I think we can close this PR. 
    
    Sorry for the bad PR management. This won't happen again.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    hi @StephanEwen this PR also reviewed by you, yesterday it has a conflicts with master branch, I have fixed, would you please review it again, thanks.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    cc @StephanEwen @zentol 


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    @tillrohrmann can you merge this?


---

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua closed the pull request at:

    https://github.com/apache/flink/pull/5954


---

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5954#discussion_r186375841
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
    @@ -1050,11 +1050,12 @@ else if (availableSlots.tryRemove(allocationID)) {
     	 * when we find some TaskManager becomes "dead" or "abnormal", and we decide to not using slots from it anymore.
     	 *
     	 * @param resourceId The id of the TaskManager
    +	 * @param cause for the release the TaskManager
     	 */
     	@Override
    -	public CompletableFuture<Acknowledge> releaseTaskManager(final ResourceID resourceId) {
    +	public CompletableFuture<Acknowledge> releaseTaskManager(final ResourceID resourceId, final Exception cause) {
    --- End diff --
    
    I would use `Throwable` in the signatures. It may always be that some Error is the cause (class not found, etc.)


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    @tillrohrmann That's all right. I know you are very busy. Just a little question, I have reviewed PR(#6202). I saw it used `Exception`, and there is a suggestion from Stephan in May 7 : 
    
    ```
    I would use Throwable in the signatures. It may always be that some Error is the cause (class not found, etc.)
    ```
    
    So I replaced the `Exception ` to `Throwable` in this PR, do you think it can be consider? If not, I would close this PR. 


---

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5954#discussion_r186376182
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPoolGateway.java ---
    @@ -86,9 +86,10 @@
     	 * Releases a TaskExecutor with the given {@link ResourceID} from the {@link SlotPool}.
     	 *
     	 * @param resourceId identifying the TaskExecutor which shall be released from the SlotPool
    +	 * @param cause for the release the TaskManager
     	 * @return Future acknowledge which is completed after the TaskExecutor has been released
     	 */
    -	CompletableFuture<Acknowledge> releaseTaskManager(final ResourceID resourceId);
    +	CompletableFuture<Acknowledge> releaseTaskManager(final ResourceID resourceId, final Exception cause);
    --- End diff --
    
    Throwable, see above.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    I think this is not super critical and since we only pass in `Exceptions` at the moment I would say let's close this PR.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    @zentol @tillrohrmann @aljoscha  this PR takes a long time and has conflicts (I have rebased recently), hope for it could be merged as soon as possible.


---

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5954#discussion_r186376326
  
    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/jobmanager/scheduler/SchedulerTestBase.java ---
    @@ -296,7 +296,7 @@ public TaskManagerLocation addTaskManager(int numberSlots) {
     		@Override
     		public void releaseTaskManager(ResourceID resourceId) {
     			try {
    -				slotPool.releaseTaskManager(resourceId).get();
    +				slotPool.releaseTaskManager(resourceId, null).get();
    --- End diff --
    
    I would avoid null exceptions.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    @zentol @tillrohrmann @aljoscha  this PR takes a long time and has conflicts (I have rebased recently), hope for it could be merged as soon as possible.


---

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5954#discussion_r186376122
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
    @@ -1070,13 +1071,22 @@ protected void timeoutPendingSlotRequest(SlotRequestId slotRequestId) {
     		removePendingRequest(slotRequestId);
     	}
     
    -	private void releaseTaskManagerInternal(final ResourceID resourceId) {
    -		final FlinkException cause = new FlinkException("Releasing TaskManager " + resourceId + '.');
    +	private void releaseTaskManagerInternal(final ResourceID resourceId, final Exception cause) {
    --- End diff --
    
    I think we can just use the `cause` exception and do not need to re-wrap it into another exception.
    
    If re-wrapping is needed for some reason, we need to add the original exception as the cause.


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    @zentol can this be merged?


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    OK, closing this PR...


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    cc @StephanEwen 


---

[GitHub] flink pull request #5954: [FLINK-9276] Improve error message when TaskManage...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5954#discussion_r186375693
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
    @@ -212,7 +212,7 @@ public void start(JobMasterId jobMasterId, String newJobManagerAddress) throws E
     
     		// release all registered slots by releasing the corresponding TaskExecutors
     		for (ResourceID taskManagerResourceId : registeredTaskManagers) {
    -			releaseTaskManagerInternal(taskManagerResourceId);
    +			releaseTaskManagerInternal(taskManagerResourceId, null);
    --- End diff --
    
    I would try to avoid null here. It is still helpful to have a message that says "SlotPool is shutting down".


---

[GitHub] flink issue #5954: [FLINK-9276] Improve error message when TaskManager fails

Posted by yanghua <gi...@git.apache.org>.
Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/5954
  
    cc @StephanEwen 


---