You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by zavalit <za...@gmail.com> on 2018/11/12 16:12:09 UTC

Flink with parallelism 3 is running locally but not on cluster

Hi,
may be i just missing smth, but i just have no more ideas where to look.

here is an screen of the failed state
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1383/Bildschirmfoto_2018-11-12_um_16.png> 

i read messages from 2 sources, make a join based on a common key and sink
it all in a kafka.

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(3)
  ...
  source1
     .keyBy(_.searchId)
     .connect(source2.keyBy(_.searchId))
     .process(new SearchResultsJoinFunction)
     .addSink(KafkaSink.sink)

so it perfectly works when launch it locally. when i deploy it to 1 job
manager and 3 taskmanagers and get every Task in "RUNNING" state, after 2
minutes (when nothing is comming to sink) one of the taskmanagers gets
following in log:

 Flat Map (1/3) (9598c11996f4b52a2e2f9f532f91ff66) switched from RUNNING to
FAILED.
java.io.IOException: Connecting the channel failed: Connecting to remote
task manager + 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed.
This might indicate that the remote task manager has been lost.
	at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
	at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:133)
	at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
	at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
	at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:166)
	at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:494)
	at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:525)
	at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
	at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
	at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
	at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
	at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:748)
Caused by:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connecting to remote task manager +
'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed. This might
indicate that the remote task manager has been lost.
	at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
	at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:133)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
	at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:125)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	... 1 more
Caused by:
org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
connection timed out: flink-taskmanager-11-dn9cj/10.81.27.84:37708
	at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
	... 7 more
2018-11-12 15:47:57,198 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Flat Map (1/3) (171e84d98f94a83e1f3e7cd598c7dbbc) switched from RUNNING to
FAILED.


i would appreciate any hint. 

thx a lot.


 





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Flink with parallelism 3 is running locally but not on cluster

Posted by zavalit <za...@gmail.com>.
Hey, Dominik,
tnx for getting back.
i've posted also by stackoverflow and David Anderson gave a good tipp where
to look.
https://stackoverflow.com/questions/53282967/run-flink-with-parallelism-more-than-1/53289840
issues is resolved, everything is running.

thx. again



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Flink with parallelism 3 is running locally but not on cluster

Posted by Dominik Wosiński <wo...@gmail.com>.
PS.
Could You also post the whole log for the application run ??

Best Regards,
Dom.

czw., 15 lis 2018 o 11:04 Dominik Wosiński <wo...@gmail.com> napisał(a):

> Hey,
>
> DId You try to run any other job on your setup? Also, could You please
> tell what are the sources you are trying to use, do all messages come from
> Kafka??
> From the first look, it seems that the JobManager can't connect to one of
> the TaskManagers.
>
>
> Best Regards,
> Dom.
>
> pon., 12 lis 2018 o 17:12 zavalit <za...@gmail.com> napisał(a):
>
>> Hi,
>> may be i just missing smth, but i just have no more ideas where to look.
>>
>> here is an screen of the failed state
>> <
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1383/Bildschirmfoto_2018-11-12_um_16.png>
>>
>>
>> i read messages from 2 sources, make a join based on a common key and sink
>> it all in a kafka.
>>
>>   val env = StreamExecutionEnvironment.getExecutionEnvironment
>>   env.setParallelism(3)
>>   ...
>>   source1
>>      .keyBy(_.searchId)
>>      .connect(source2.keyBy(_.searchId))
>>      .process(new SearchResultsJoinFunction)
>>      .addSink(KafkaSink.sink)
>>
>> so it perfectly works when launch it locally. when i deploy it to 1 job
>> manager and 3 taskmanagers and get every Task in "RUNNING" state, after 2
>> minutes (when nothing is comming to sink) one of the taskmanagers gets
>> following in log:
>>
>>  Flat Map (1/3) (9598c11996f4b52a2e2f9f532f91ff66) switched from RUNNING
>> to
>> FAILED.
>> java.io.IOException: Connecting the channel failed: Connecting to remote
>> task manager + 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed.
>> This might indicate that the remote task manager has been lost.
>>         at
>> org.apache.flink.runtime.io
>> .network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
>>         at
>> org.apache.flink.runtime.io
>> .network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:133)
>>         at
>> org.apache.flink.runtime.io
>> .network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
>>         at
>> org.apache.flink.runtime.io
>> .network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
>>         at
>> org.apache.flink.runtime.io
>> .network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:166)
>>         at
>> org.apache.flink.runtime.io
>> .network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:494)
>>         at
>> org.apache.flink.runtime.io
>> .network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:525)
>>         at
>> org.apache.flink.runtime.io
>> .network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
>>         at
>>
>> org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
>>         at
>> org.apache.flink.streaming.runtime.io
>> .StreamInputProcessor.processInput(StreamInputProcessor.java:209)
>>         at
>>
>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>>         at
>>
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>>         at java.lang.Thread.run(Thread.java:748)
>> Caused by:
>> org.apache.flink.runtime.io
>> .network.netty.exception.RemoteTransportException:
>> Connecting to remote task manager +
>> 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed. This might
>> indicate that the remote task manager has been lost.
>>         at
>> org.apache.flink.runtime.io
>> .network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
>>         at
>> org.apache.flink.runtime.io
>> .network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:133)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:125)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>>         ... 1 more
>> Caused by:
>> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
>> connection timed out: flink-taskmanager-11-dn9cj/10.81.27.84:37708
>>         at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
>>         ... 7 more
>> 2018-11-12 15:47:57,198 INFO  org.apache.flink.runtime.taskmanager.Task
>>
>> - Flat Map (1/3) (171e84d98f94a83e1f3e7cd598c7dbbc) switched from RUNNING
>> to
>> FAILED.
>>
>>
>> i would appreciate any hint.
>>
>> thx a lot.
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>
>

Re: Flink with parallelism 3 is running locally but not on cluster

Posted by Dominik Wosiński <wo...@gmail.com>.
Hey,

DId You try to run any other job on your setup? Also, could You please tell
what are the sources you are trying to use, do all messages come from
Kafka??
From the first look, it seems that the JobManager can't connect to one of
the TaskManagers.


Best Regards,
Dom.

pon., 12 lis 2018 o 17:12 zavalit <za...@gmail.com> napisał(a):

> Hi,
> may be i just missing smth, but i just have no more ideas where to look.
>
> here is an screen of the failed state
> <
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1383/Bildschirmfoto_2018-11-12_um_16.png>
>
>
> i read messages from 2 sources, make a join based on a common key and sink
> it all in a kafka.
>
>   val env = StreamExecutionEnvironment.getExecutionEnvironment
>   env.setParallelism(3)
>   ...
>   source1
>      .keyBy(_.searchId)
>      .connect(source2.keyBy(_.searchId))
>      .process(new SearchResultsJoinFunction)
>      .addSink(KafkaSink.sink)
>
> so it perfectly works when launch it locally. when i deploy it to 1 job
> manager and 3 taskmanagers and get every Task in "RUNNING" state, after 2
> minutes (when nothing is comming to sink) one of the taskmanagers gets
> following in log:
>
>  Flat Map (1/3) (9598c11996f4b52a2e2f9f532f91ff66) switched from RUNNING to
> FAILED.
> java.io.IOException: Connecting the channel failed: Connecting to remote
> task manager + 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed.
> This might indicate that the remote task manager has been lost.
>         at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
>         at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:133)
>         at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
>         at
> org.apache.flink.runtime.io
> .network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
>         at
> org.apache.flink.runtime.io
> .network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:166)
>         at
> org.apache.flink.runtime.io
> .network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:494)
>         at
> org.apache.flink.runtime.io
> .network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:525)
>         at
> org.apache.flink.runtime.io
> .network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
>         at
>
> org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
>         at
> org.apache.flink.streaming.runtime.io
> .StreamInputProcessor.processInput(StreamInputProcessor.java:209)
>         at
>
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>         at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by:
> org.apache.flink.runtime.io
> .network.netty.exception.RemoteTransportException:
> Connecting to remote task manager +
> 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed. This might
> indicate that the remote task manager has been lost.
>         at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
>         at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:133)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:125)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
>         at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>         ... 1 more
> Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
> connection timed out: flink-taskmanager-11-dn9cj/10.81.27.84:37708
>         at
>
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
>         ... 7 more
> 2018-11-12 15:47:57,198 INFO  org.apache.flink.runtime.taskmanager.Task
>
> - Flat Map (1/3) (171e84d98f94a83e1f3e7cd598c7dbbc) switched from RUNNING
> to
> FAILED.
>
>
> i would appreciate any hint.
>
> thx a lot.
>
>
>
>
>
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>