Posted to user@flink.apache.org by Attila Bernáth <be...@gmail.com> on 2014/10/21 11:25:52 UTC

flink on my cluster gets stuck

Dear Developers,

I am running some experiments on my cluster. I submit the same job a couple of
times; it finishes on the first 5-6 occasions, but the next one
fails and gets stuck (the web dashboard stops updating).

I use Flink 0.7, compiled from source.

In the log file of one of my task managers I find the following
(a similar message is written every second; I only copy the last two):

10:58:21,540 WARN  io.netty.channel.DefaultChannelPipeline
          - An exceptionCaught() event was fired, and it reached at
the tail of the pipeline. It usually means the last handler in the
pipeline did not handle the exception.
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
        at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
        at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)
10:58:22,541 WARN  io.netty.channel.DefaultChannelPipeline
          - An exceptionCaught() event was fired, and it reached at
the tail of the pipeline. It usually means the last handler in the
pipeline did not handle the exception.
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
        at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
        at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:68)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)

Any ideas what this can be?

Attila

Re: flink on my cluster gets stuck

Posted by Attila Bernáth <be...@gmail.com>.
Dear Robert,

I have not had this problem recently. If I run into it again I will
get back to it.

2014-10-31 6:41 GMT+01:00 Robert Metzger <rm...@apache.org>:
> Were you able to increase the number of file handles in your cluster?
Do you think that is the solution? 8192 is already quite a lot.

Attila

Re: flink on my cluster gets stuck

Posted by Robert Metzger <rm...@apache.org>.
Were you able to increase the number of file handles in your cluster?

I think the TaskManager is not reporting any heartbeats because it
basically crashed once the "Too many open files" exception occurred.
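
To confirm this, you could check how many file descriptors the TaskManager
process actually holds open and what they are. A rough sketch (the PID is a
placeholder, and jps, procfs and lsof are assumed to be available on the node):

# find the TaskManager JVM (adjust the grep pattern to the actual main class)
jps -l | grep -i taskmanager

# count its open file descriptors (replace 12345 with the PID from above)
ls /proc/12345/fd | wc -l

# break the descriptors down by type to see whether sockets or files dominate
lsof -p 12345 | awk 'NR > 1 {print $5}' | sort | uniq -c | sort -rn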

On Tue, Oct 21, 2014 at 3:56 AM, Attila Bernáth <be...@gmail.com>
wrote:

> Dear Ufuk,
>
> ulimit -n
> says
> 8192
>
> It seems that some of the task managers do not report a heartbeat
> (this is what I find in the job manager's log), and the job manager
> fails to cancel the job.
>
> Attila
>
>
> 2014-10-21 12:05 GMT+02:00 Ufuk Celebi <uc...@apache.org>:
> > Hey Attila,
> >
> > this means that your system is running out of file handles. Can you
> > execute "ulimit -n" on your machines and report the value back? You will
> > have to increase that value.
> >
> > We actually multiplex multiple logical channels over the same TCP
> > connection in order to reduce the number of concurrently open file
> > handles. The problem, which leads to "too many open files", is that
> > channels are not closed. Let me look into that and get back to you.
> >
> > – Ufuk

Re: flink on my cluster gets stuck

Posted by Attila Bernáth <be...@gmail.com>.
Dear Ufuk,

ulimit -n
says
8192

It seems that some of the task managers do not report a heartbeat
(this is what I find in the job manager's log), and the job manager
fails to cancel the job.

Attila


2014-10-21 12:05 GMT+02:00 Ufuk Celebi <uc...@apache.org>:
> Hey Attila,
>
> this means that your system is running out of file handles. Can you execute "ulimit -n" on your machines and report the value back? You will have to increase that value.
>
> We actually multiplex multiple logical channels over the same TCP connection in order to reduce the number of concurrently open file handles. The problem, which leads to "too many open files", is that channels are not closed. Let me look into that and get back to you.
>
> – Ufuk

Re: flink on my cluster gets stuck

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Attila,

this means that your system is running out of file handles. Can you execute "ulimit -n" on your machines and report the value back? You will have to increase that value.

We actually multiplex multiple logical channels over the same TCP connection in order to reduce the number of concurrently open file handles. The problem, which leads to "too many open files", is that channels are not closed. Let me look into that and get back to you.
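
In the meantime, raising the limit should at least keep the TaskManagers alive.
A sketch of how that could look on a typical Linux setup (the user name "flink"
and the value 32768 are only examples; adjust them to your cluster, and note
that /etc/security/limits.conf only takes effect for new logins with pam_limits
enabled):

# current limit for the user that runs the TaskManager
ulimit -n

# raise it for the current shell and anything started from it
ulimit -n 32768

# make it persistent by adding to /etc/security/limits.conf:
#   flink  soft  nofile  32768
#   flink  hard  nofile  32768
# then log in again and restart the TaskManagers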

– Ufuk
