You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by HarshithBolar <hk...@arity.com> on 2018/05/28 12:01:23 UTC

Job execution fails when parallelism is increased beyond 1

I'm submitting a Flink job to a cluster that has three Task Managers via the
Flink dashboard. When I set `Parallelism` to 1 (which is default),
everything runs as expected. But when I increase `Parallelism` to anything
more than 1, the job fails with the exception,

    /java.io.FileNotFoundException:
/tmp/flink-io-f91d7812-a411-4b58-a891-c9be1cde91da/08caeac37d6b8351daf6a3eb123a473106c56381b101f3e5d9704df9f78406a2.0.buffer
(No such file or directory)
	at java.io.RandomAccessFile.open0(Native Method)
	at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
	at
org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
	at
org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
	at
org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
	at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.<init>(StreamInputProcessor.java:129)
	at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:56)
	at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
	at java.lang.Thread.run(Thread.java:745)/

I've enabled checkpoints on my Flink job with "Exactly Once" retry strategy
every 10 seconds. Here's my checkpoint configuration,

   / env.setStateBackend(new
RocksDBStateBackend(props.getFlinkCheckpointDataUri(), true));
    env.enableCheckpointing(10000, EXACTLY_ONCE); //10 seconds
    CheckpointConfig config = env.getCheckpointConfig();
   
config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);/

Should I configure something else other than simply entering `2` while
submitting the job in the dashboard?

EDIT: If I disable checkpoints and upload the job, it runs without any
error.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Job execution fails when parallelism is increased beyond 1

Posted by HarshithBolar <hk...@arity.com>.

Just a heads up. I haven't found the root cause for this issue yet but
restarting all the nodes seems to have solved this issue. 

  



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Job execution fails when parallelism is increased beyond 1

Posted by HarshithBolar <hk...@arity.com>.

I'm using Flink 1.4.2



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Job execution fails when parallelism is increased beyond 1

Posted by Chesnay Schepler <ch...@apache.org>.

Could you tell us which Flink version you are using?

On 28.05.2018 14:01, HarshithBolar wrote:
> I'm submitting a Flink job to a cluster that has three Task Managers via the
> Flink dashboard. When I set `Parallelism` to 1 (which is default),
> everything runs as expected. But when I increase `Parallelism` to anything
> more than 1, the job fails with the exception,
>
>      /java.io.FileNotFoundException:
> /tmp/flink-io-f91d7812-a411-4b58-a891-c9be1cde91da/08caeac37d6b8351daf6a3eb123a473106c56381b101f3e5d9704df9f78406a2.0.buffer
> (No such file or directory)
> 	at java.io.RandomAccessFile.open0(Native Method)
> 	at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> 	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> 	at
> org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
> 	at
> org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
> 	at
> org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
> 	at
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.<init>(StreamInputProcessor.java:129)
> 	at
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:56)
> 	at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> 	at java.lang.Thread.run(Thread.java:745)/
>
> I've enabled checkpoints on my Flink job with "Exactly Once" retry strategy
> every 10 seconds. Here's my checkpoint configuration,
>
>     / env.setStateBackend(new
> RocksDBStateBackend(props.getFlinkCheckpointDataUri(), true));
>      env.enableCheckpointing(10000, EXACTLY_ONCE); //10 seconds
>      CheckpointConfig config = env.getCheckpointConfig();
>     
> config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);/
>
> Should I configure something else other than simply entering `2` while
> submitting the job in the dashboard?
>
> EDIT: If I disable checkpoints and upload the job, it runs without any
> error.
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>