Posted to issues@flink.apache.org by "Robert Metzger (Jira)" <ji...@apache.org> on 2021/01/15 13:42:00 UTC
[jira] [Commented] (FLINK-19067) resource_manager and dispatcher register on different nodes in HA mode will cause FileNotFoundException

    [ https://issues.apache.org/jira/browse/FLINK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266048#comment-17266048 ] 

Robert Metzger commented on FLINK-19067:
----------------------------------------

Thank you for your detailed analysis. I had a look at the code too, and I agree that you can run into this situation.

Let's assume we have the following setup:
Node1
lead Dispatcher
BlobServer1

Node2
lead ResourceManager
BlobServer2

Node3
TaskManager

Job submissions will go to the Dispatcher on Node1. The JobSubmitHandler will call DispatcherGateway.getBlobServerPort, which returns the address of BlobServer1.

During job execution, the TaskManager on Node3 will execute a task from the job. As part of the TaskExecutorRegistrationSuccess, we include the ClusterInformation, which contains the BlobServer address. The TaskExecutorRegistrationSuccess comes from the ResourceManager on Node2, which returns its local ClusterInformation with the address of BlobServer2. The TaskManager therefore requests the job's files from BlobServer2, although they were uploaded to BlobServer1.
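
To make the mismatch concrete, here is a minimal toy model of the setup above. It is only a sketch: ToyBlobServer and taskCanFetch are hypothetical names and not Flink's classes, and it assumes that each BlobServer can only see the files stored on its own node (no shared storage between Node1 and Node2).

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the three-node setup; none of these types are Flink's real classes. */
public class BlobMismatchSketch {

    /** Hypothetical stand-in for a BlobServer whose storage is visible only on its own node. */
    static final class ToyBlobServer {
        final String name;
        private final Map<String, byte[]> localStore = new HashMap<>();

        ToyBlobServer(String name) {
            this.name = name;
        }

        void put(String key, byte[] data) {
            localStore.put(key, data);
        }

        byte[] get(String key) throws FileNotFoundException {
            byte[] data = localStore.get(key);
            if (data == null) {
                throw new FileNotFoundException(key + " not found on " + name);
            }
            return data;
        }
    }

    /**
     * Uploads a job file to uploadTo (the server the Dispatcher advertises) and then
     * fetches it from fetchFrom (the server the ResourceManager advertises).
     * Returns true if the fetch succeeds.
     */
    static boolean taskCanFetch(ToyBlobServer uploadTo, ToyBlobServer fetchFrom) {
        String key = "job_1/blob_1"; // hypothetical blob key
        uploadTo.put(key, new byte[] {1, 2, 3});
        try {
            fetchFrom.get(key);
            return true;
        } catch (FileNotFoundException e) {
            System.out.println("Fetch failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        ToyBlobServer blobServer1 = new ToyBlobServer("BlobServer1"); // Node1, lead Dispatcher
        ToyBlobServer blobServer2 = new ToyBlobServer("BlobServer2"); // Node2, lead ResourceManager

        // Leaders on different nodes: upload goes to BlobServer1, the task asks BlobServer2.
        System.out.println(taskCanFetch(blobServer1, blobServer2)); // false

        // Leaders on the same node: upload and fetch hit the same server.
        System.out.println(taskCanFetch(blobServer1, blobServer1)); // true
    }
}
```

Under that assumption, the first call fails in the same way as the reported FileNotFoundException, while the second one succeeds because the upload and the fetch use the same server.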

[~trohrmann] Could you take a quick look at this to verify or correct my thinking?

> resource_manager and dispatcher register on different nodes in HA mode will cause FileNotFoundException
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19067
>                 URL: https://issues.apache.org/jira/browse/FLINK-19067
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.1
>            Reporter: JieFang.He
>            Priority: Major
>         Attachments: flink-jobmanager-deployer-hejiefang01.log, flink-jobmanager-deployer-hejiefang02.log, flink-taskmanager-deployer-hejiefang01.log, flink-taskmanager-deployer-hejiefang02.log
>
>
> When running examples/batch/WordCount.jar, the job fails with the following exception:
> {code:java}
> Caused by: java.io.FileNotFoundException: /data2/flink/storageDir/default/blob/job_d29414828f614d5466e239be4d3889ac/blob_p-a2ebe1c5aa160595f214b4bd0f39d80e42ee2e93-f458f1c12dc023e78d25f191de1d7c4b (No such file or directory)
>  at java.io.FileInputStream.open0(Native Method)
>  at java.io.FileInputStream.open(FileInputStream.java:195)
>  at java.io.FileInputStream.<init>(FileInputStream.java:138)
>  at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>  at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
>  at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:105)
>  at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:87)
>  at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:501)
>  at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:231)
>  at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
> {code}
>  
> I think the reason is that the job files are uploaded to the dispatcher node, but the task fetches them from the resource_manager node. So in HA mode, we need to ensure that the dispatcher and the resource_manager are on the same node.
>  


