Posted to issues@flink.apache.org by "Sayat Satybaldiyev (JIRA)" <ji...@apache.org> on 2018/09/05 14:30:00 UTC

[jira] [Updated] (FLINK-10286) Flink Persist Invalid Job Graph in Zookeeper

     [ https://issues.apache.org/jira/browse/FLINK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sayat Satybaldiyev updated FLINK-10286:
---------------------------------------
    Description: 
In HA mode, Flink 1.6 persists the job graph in ZooKeeper even if the job was not accepted by the JobManager (JM). This is particularly bad because, if the JM later dies and restarts, it tries to recover the job, fails, and dies completely.
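
To make the expected behavior concrete, here is a toy model of the submit path. It is not Flink's actual submission code: all names (SubmitRollbackSketch, store, setUpJobManager, submit) are made up for illustration, and an in-memory map stands in for the ZooKeeper job graph store. The only point is the ordering: if setup fails after the graph was persisted, the entry is removed again, so a restarted JM has nothing invalid to recover.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the submit path; hypothetical names, not Flink's real classes. */
final class SubmitRollbackSketch {

    // Stands in for the ZooKeeper-backed job graph store (assumption for this sketch).
    static final Map<String, String> store = new HashMap<>();

    /** Stands in for JobManager setup; fails for the invalid scheme used in this report. */
    static void setUpJobManager(String checkpointUri) {
        if (checkpointUri.startsWith("hddd:")) {
            throw new RuntimeException(
                    "Could not find a file system implementation for scheme 'hddd'");
        }
    }

    /** Persists the job graph, then removes it again if setup fails. */
    static boolean submit(String jobId, String checkpointUri) {
        store.put(jobId, "job-graph-bytes");
        try {
            setUpJobManager(checkpointUri);
            return true;
        } catch (RuntimeException e) {
            // Roll back so that a later recovery never sees the rejected job.
            store.remove(jobId);
            return false;
        }
    }

    public static void main(String[] args) {
        String jobId = "9680f02ae2f3806c3b4da25bfacd0749";
        boolean accepted = submit(jobId, "hddd:///tmp/flink/rocksdb");
        System.out.println("accepted=" + accepted + ", persisted=" + store.containsKey(jobId));
    }
}
```

In this model the rejected job leaves no entry behind, which is what I would expect from the real job graph store as well.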

How to reproduce:

1. Set up an HA Flink 1.6 cluster.

2. Submit an invalid job. In my case, I used an invalid file system scheme for the RocksDB state backend:

{code:java}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
env.enableCheckpointing(5000);
// 'hddd' is not a scheme Flink can resolve; this is what makes the job invalid.
RocksDBStateBackend backend = new RocksDBStateBackend("hddd:///tmp/flink/rocksdb");
backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
env.setStateBackend(backend);
{code}

Client returns:

{code:java}
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: 9680f02ae2f3806c3b4da25bfacd0749)
{code}

The JM does not accept the job. This is the truncated error log from the JM:

{code:java}
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
... 24 more
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
 
Caused by: java.lang.RuntimeException: Failed to start checkpoint ID counter: Could not find a file system implementation for scheme 'hddd'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
{code}

3. Go to ZooKeeper (ZK) and observe that the JM has saved the job graph there:

ls /flink/flink_ns/jobgraphs/9680f02ae2f3806c3b4da25bfacd0749
 [7f392fd9-cedc-4978-9186-1f54b98eeeb7]
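
To check other failed submissions the same way, a small helper can derive the znode to look at from the client error message. This is only a sketch. It assumes the layout seen in the ls command above (root /flink, namespace flink_ns, then jobgraphs/<JobID>) and that job IDs are printed as 32 hex characters, as in the client output. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; derives the job graph znode path from a client error message. */
final class JobGraphZnodePath {

    // Assumption: the client prints the job ID as "JobID: <32 lowercase hex characters>".
    private static final Pattern JOB_ID = Pattern.compile("JobID: ([0-9a-f]{32})");

    /** Returns the job ID found in the message, or null if there is none. */
    static String extractJobId(String clientMessage) {
        Matcher m = JOB_ID.matcher(clientMessage);
        return m.find() ? m.group(1) : null;
    }

    /** Builds <root>/<namespace>/jobgraphs/<jobId>, the layout observed in this report. */
    static String znodePath(String root, String namespace, String jobId) {
        return root + "/" + namespace + "/jobgraphs/" + jobId;
    }

    public static void main(String[] args) {
        String message = "org.apache.flink.client.program.ProgramInvocationException: "
                + "Could not submit job (JobID: 9680f02ae2f3806c3b4da25bfacd0749)";
        String jobId = extractJobId(message);
        // Prints the path that was listed with ls in the ZooKeeper CLI above.
        System.out.println(znodePath("/flink", "flink_ns", jobId));
    }
}
```

If the printed path still exists in ZK after the client reported a failed submission, the cluster is in the state described in this issue.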


> Flink Persist Invalid Job Graph in Zookeeper
> --------------------------------------------
>
>                 Key: FLINK-10286
>                 URL: https://issues.apache.org/jira/browse/FLINK-10286
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>            Reporter: Sayat Satybaldiyev
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)