Posted to user@flume.apache.org by Frank Grimes <fr...@yahoo.com> on 2012/01/26 16:51:26 UTC

Collector node failing with java.net.SocketException: Too many open files

Hi All,

We are using flume-0.9.5 (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and occasionally our Collector node accumulates too many open TCP connections and starts madly logging the following errors:

WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
       at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
       at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
       at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
Caused by: java.net.SocketException: Too many open files
       at java.net.PlainSocketImpl.socketAccept(Native Method)
       at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
       at java.net.ServerSocket.implAccept(ServerSocket.java:462)
       at java.net.ServerSocket.accept(ServerSocket.java:430)
       at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
       ... 2 more

This quickly fills up the disk as the log file grows to multiple gigabytes in size.

After some investigation, it appears that even though the Agent nodes show single open connections to the Collector, the Collector node appears to have a bunch of zombie TCP connections open back to the Agent nodes.
i.e.
"lsof -n | grep PORT" on the Agent node shows 1 established connection
However, the Collector node shows hundreds of established connections for that same port which don't seem to tie up to any connections I can find on the Agent node.

So we're concluding that the Collector node is somehow leaking connections.
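
For what it's worth, the collector's descriptor usage can also be watched from inside the JVM rather than with lsof. The following is only a rough sketch (it assumes a HotSpot/OpenJDK JVM on a Unix-like host, and the class name is invented); run standalone it only watches its own process, so it would need to be hooked into the collector JVM to be useful:

  import java.lang.management.ManagementFactory;
  import java.lang.management.OperatingSystemMXBean;

  import com.sun.management.UnixOperatingSystemMXBean;

  /** Logs open vs. max file descriptors once a minute. */
  public class FdWatcher implements Runnable {

    public void run() {
      OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
      if (!(os instanceof UnixOperatingSystemMXBean)) {
        return; // the counts are only exposed on Unix-like platforms
      }
      UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
      while (!Thread.currentThread().isInterrupted()) {
        // A count that keeps climbing while each agent holds steady at one
        // connection points at a server-side leak.
        System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount()
            + " / max: " + unixOs.getMaxFileDescriptorCount());
        try {
          Thread.sleep(60 * 1000L);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }

    public static void main(String[] args) {
      // Standalone this only reports on its own JVM; the useful place to
      // run it is as a background thread inside the collector process.
      new Thread(new FdWatcher()).start();
    }
  }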

Has anyone seen this kind of thing before?

Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
Or could this be a Thrift bug that could be avoided by switching to Avro sources/sinks?

Any hints/tips are most welcome.

Thanks,

Frank Grimes

Re: Collector node failing with java.net.SocketException: Too many open files

Posted by alo alt <wg...@googlemail.com>.
Hi,

# cat /etc/security/limits.conf
 flume            soft     nofile         5000
 flume            hard     nofile         5000
 

# cat /etc/sysctl.conf
 fs.file-max=200000

Can you try those settings?

The default max open files limit of 1024 is sized for small servers / PCs.

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 26, 2012, at 6:04 PM, Frank Grimes wrote:

> It's 1024, but we really shouldn't need to up that value... doing so would just delay the failure.
> 
> 
> On 2012-01-26, at 11:57 AM, Zijad Purkovic wrote:
> 
>> Hi Frank,
>> 
>> Can you show output of ulimit -n from your collector node?
>> 
>> On Thu, Jan 26, 2012 at 4:51 PM, Frank Grimes <fr...@yahoo.com> wrote:
>>> Hi All,
>>> 
>>> We are using flume-0.9.5
>>> (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275)
>>> and occasionally our Collector node accumulates too many open TCP
>>> connections and starts madly logging the following errors:
>>> 
>>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error
>>> occurred during acceptance of message.
>>> org.apache.thrift.transport.TTransportException: java.net.SocketException:
>>> Too many open files
>>>       at
>>> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>>       at
>>> org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>>       at
>>> org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>>> Caused by: java.net.SocketException: Too many open files
>>>       at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>       at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>>       at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>>       at java.net.ServerSocket.accept(ServerSocket.java:430)
>>>       at
>>> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>>       ... 2 more
>>> 
>>> 
>>> This quickly fills up the disk as the log file grows to multiple gigabytes
>>> in size.
>>> 
>>> After some investigation, it appears that even though the Agent nodes show
>>> single open connections to the Collector, the Collector node appears to have
>>> a bunch of zombie TCP connections open back to the Agent nodes.
>>> i.e.
>>> "lsof -n | grep PORT" on the Agent node shows 1 established connection
>>> However, the Collector node shows hundreds of established connections for
>>> that same port which don't seem to tie up to any connections I can find on
>>> the Agent node.
>>> 
>>> So we're concluding that the Collector node is somehow leaking connections.
>>> 
>>> Has anyone seen this kind of thing before?
>>> 
>>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>>> Or could this be a Thrift bug that could be avoided by switching to Avro
>>> sources/sinks?
>>> 
>>> Any hints/tips are most welcome.
>>> 
>>> Thanks,
>>> 
>>> Frank Grimes
>> 
>> 
>> 
>> -- 
>> Zijad Purković
>> Dobrovoljnih davalaca krvi 3/19, Zavidovići
>> 061/ 690 - 241
> 


Re: Collector node failing with java.net.SocketException: Too many open files

Posted by Frank Grimes <fr...@yahoo.com>.
It's 1024, but we really shouldn't need to up that value... doing so would just delay the failure.


On 2012-01-26, at 11:57 AM, Zijad Purkovic wrote:

> Hi Frank,
> 
> Can you show output of ulimit -n from your collector node?
> 
> On Thu, Jan 26, 2012 at 4:51 PM, Frank Grimes <fr...@yahoo.com> wrote:
>> Hi All,
>> 
>> We are using flume-0.9.5
>> (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275)
>> and occasionally our Collector node accumulates too many open TCP
>> connections and starts madly logging the following errors:
>> 
>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error
>> occurred during acceptance of message.
>> org.apache.thrift.transport.TTransportException: java.net.SocketException:
>> Too many open files
>>        at
>> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>        at
>> org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>        at
>> org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>> Caused by: java.net.SocketException: Too many open files
>>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>>        at
>> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>        ... 2 more
>> 
>> 
>> This quickly fills up the disk as the log file grows to multiple gigabytes
>> in size.
>> 
>> After some investigation, it appears that even though the Agent nodes show
>> single open connections to the Collector, the Collector node appears to have
>> a bunch of zombie TCP connections open back to the Agent nodes.
>> i.e.
>> "lsof -n | grep PORT" on the Agent node shows 1 established connection
>> However, the Collector node shows hundreds of established connections for
>> that same port which don't seem to tie up to any connections I can find on
>> the Agent node.
>> 
>> So we're concluding that the Collector node is somehow leaking connections.
>> 
>> Has anyone seen this kind of thing before?
>> 
>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>> Or could this be a Thrift bug that could be avoided by switching to Avro
>> sources/sinks?
>> 
>> Any hints/tips are most welcome.
>> 
>> Thanks,
>> 
>> Frank Grimes
> 
> 
> 
> -- 
> Zijad Purković
> Dobrovoljnih davalaca krvi 3/19, Zavidovići
> 061/ 690 - 241


Re: Collector node failing with java.net.SocketException: Too many open files

Posted by Zijad Purkovic <zi...@gmail.com>.
Hi Frank,

Can you show output of ulimit -n from your collector node?

On Thu, Jan 26, 2012 at 4:51 PM, Frank Grimes <fr...@yahoo.com> wrote:
> Hi All,
>
> We are using flume-0.9.5
> (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275)
> and occasionally our Collector node accumulates too many open TCP
> connections and starts madly logging the following errors:
>
> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error
> occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException:
> Too many open files
>        at
> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>        at
> org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>        at
> org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
> Caused by: java.net.SocketException: Too many open files
>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>        at
> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>        ... 2 more
>
>
> This quickly fills up the disk as the log file grows to multiple gigabytes
> in size.
>
> After some investigation, it appears that even though the Agent nodes show
> single open connections to the Collector, the Collector node appears to have
> a bunch of zombie TCP connections open back to the Agent nodes.
> i.e.
> "lsof -n | grep PORT" on the Agent node shows 1 established connection
> However, the Collector node shows hundreds of established connections for
> that same port which don't seem to tie up to any connections I can find on
> the Agent node.
>
> So we're concluding that the Collector node is somehow leaking connections.
>
> Has anyone seen this kind of thing before?
>
> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
> Or could this be a Thrift bug that could be avoided by switching to Avro
> sources/sinks?
>
> Any hints/tips are most welcome.
>
> Thanks,
>
> Frank Grimes



-- 
Zijad Purković
Dobrovoljnih davalaca krvi 3/19, Zavidovići
061/ 690 - 241

Re: Collector node failing with java.net.SocketException: Too many open files

Posted by Frank Grimes <fr...@yahoo.com>.
https://issues.apache.org/jira/browse/FLUME-943


On 2012-01-30, at 10:49 AM, Frank Grimes wrote:

> I think this bug might be addressed by making use of TCP keepalive on the Thrift server socket. e.g.
> 
> Index: flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java
> ===================================================================
> --- flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java	(revision 1237721)
> +++ flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java	(working copy)
> @@ -132,6 +132,7 @@
>      }
>      try {
>        Socket result = serverSocket_.accept();
> +      result.setKeepAlive(true); 
>        TSocket result2 = new TBufferedSocket(result);
>        result2.setTimeout(clientTimeout_);
>        return result2;
> 
> I believe that on Linux that would force the connections to be closed/cleaned up after 2 hours by default.
> This is likely good enough to prevent the "java.net.SocketException: Too many open files" from occurring in our case.
> Note that it's also configurable as per http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive.
> 
> Shall I open up a JIRA case for this and submit a patch?
> Should the keepalive be configurable or is it desirable to always have the Flume collector protected from these kinds of killed connections?
> I can't think of any downsides to always having it on...
> 
> Cheers,
> 
> Frank Grimes
> 
> 
> On 2012-01-28, at 12:20 PM, Frank Grimes wrote:
> 
>> We believe that we've made some progress in identifying the problem.
>> 
>> It appears that we have a slow socket connection leak on the Collector node due to sparse data coming in on some Thrift RPC sources.
>> Turns out we're going through a firewall, and we believe that it is killing those inactive connections.
>> 
>> The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket timeout on a subsequent append, but the Collector still has its socket connections open and they don't appear to ever be timing out and closing.
>> 
>> I found the following which seems to describe the problem:
>> 
>>   http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3C1311642202.14311.2155844361@webmail.messagingengine.com%3E
>> 
>> However, because presumably some other disconnect conditions could trigger the problem as well, we are still looking for a solution that doesn't require fiddling with firewall settings.
>> 
>> Is there a way to configure the Collector node to drop/close these inactive connections? 
>> i.e. either at the Linux network layer or through Java socket APIs within Flume?
>> 
>> Thanks,
>> 
>> Frank Grimes
>> 
>> 
>> On 2012-01-26, at 10:51 AM, Frank Grimes wrote:
>> 
>>> Hi All,
>>> 
>>> We are using flume-0.9.5 (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and occasionally our Collector node accumulates too many open TCP connections and starts madly logging the following errors:
>>> 
>>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
>>> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>>        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>>        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>>> Caused by: java.net.SocketException: Too many open files
>>>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>>        ... 2 more
>>> 
>>> This quickly fills up the disk as the log file grows to multiple gigabytes in size.
>>> 
>>> After some investigation, it appears that even though the Agent nodes show single open connections to the Collector, the Collector node appears to have a bunch of zombie TCP connections open back to the Agent nodes.
>>> i.e.
>>> "lsof -n | grep PORT" on the Agent node shows 1 established connection
>>> However, the Collector node shows hundreds of established connections for that same port which don't seem to tie up to any connections I can find on the Agent node.
>>> 
>>> So we're concluding that the Collector node is somehow leaking connections.
>>> 
>>> Has anyone seen this kind of thing before?
>>> 
>>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>>> Or could this be a Thrift bug that could be avoided by switching to Avro sources/sinks?
>>> 
>>> Any hints/tips are most welcome.
>>> 
>>> Thanks,
>>> 
>>> Frank Grimes
>> 
> 


Re: Collector node failing with java.net.SocketException: Too many open files

Posted by Frank Grimes <fr...@yahoo.com>.
I think this bug might be addressed by making use of TCP keepalive on the Thrift server socket. e.g.

Index: flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java
===================================================================
--- flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java	(revision 1237721)
+++ flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java	(working copy)
@@ -132,6 +132,7 @@
     }
     try {
       Socket result = serverSocket_.accept();
+      result.setKeepAlive(true); 
       TSocket result2 = new TBufferedSocket(result);
       result2.setTimeout(clientTimeout_);
       return result2;

I believe that on Linux that would force the connections to be closed/cleaned up after 2 hours by default.
This is likely good enough to prevent the "java.net.SocketException: Too many open files" from occurring in our case.
Note that it's also configurable as per http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive.
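
If configurability is the way to go, one possible shape for it (only a sketch; the system property name below is invented, not an existing Flume setting) is to gate the call on a property that defaults to on:

  import java.io.IOException;
  import java.net.ServerSocket;
  import java.net.Socket;

  /** Accept helper that applies keepalive unless explicitly disabled. */
  public class KeepAliveAccept {

    // Invented property name, purely for illustration.
    private static final String KEEPALIVE_PROP = "flume.thrift.socket.keepalive";

    public static Socket accept(ServerSocket serverSocket) throws IOException {
      Socket socket = serverSocket.accept();
      boolean keepAlive =
          Boolean.parseBoolean(System.getProperty(KEEPALIVE_PROP, "true"));
      // With keepalive on, the kernel probes idle peers and eventually
      // closes connections whose remote end has silently disappeared.
      socket.setKeepAlive(keepAlive);
      return socket;
    }
  }

On Linux the probe timing itself would still come from the kernel (net.ipv4.tcp_keepalive_time and friends), per the HOWTO above.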

Shall I open up a JIRA case for this and submit a patch?
Should the keepalive be configurable or is it desirable to always have the Flume collector protected from these kinds of killed connections?
I can't think of any downsides to always having it on...

Cheers,

Frank Grimes


On 2012-01-28, at 12:20 PM, Frank Grimes wrote:

> We believe that we've made some progress in identifying the problem.
> 
> It appears that we have a slow socket connection leak on the Collector node due to sparse data coming in on some Thrift RPC sources.
> Turns out we're going through a firewall, and we believe that it is killing those inactive connections.
> 
> The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket timeout on a subsequent append, but the Collector still has its socket connections open and they don't appear to ever be timing out and closing.
> 
> I found the following which seems to describe the problem:
> 
>   http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3C1311642202.14311.2155844361@webmail.messagingengine.com%3E
> 
> However, because presumably some other disconnect conditions could trigger the problem as well, we are still looking for a solution that doesn't require fiddling with firewall settings.
> 
> Is there a way to configure the Collector node to drop/close these inactive connections? 
> i.e. either at the Linux network layer or through Java socket APIs within Flume?
> 
> Thanks,
> 
> Frank Grimes
> 
> 
> On 2012-01-26, at 10:51 AM, Frank Grimes wrote:
> 
>> Hi All,
>> 
>> We are using flume-0.9.5 (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and occasionally our Collector node accumulates too many open TCP connections and starts madly logging the following errors:
>> 
>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
>> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>> Caused by: java.net.SocketException: Too many open files
>>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>        ... 2 more
>> 
>> This quickly fills up the disk as the log file grows to multiple gigabytes in size.
>> 
>> After some investigation, it appears that even though the Agent nodes show single open connections to the Collector, the Collector node appears to have a bunch of zombie TCP connections open back to the Agent nodes.
>> i.e.
>> "lsof -n | grep PORT" on the Agent node shows 1 established connection
>> However, the Collector node shows hundreds of established connections for that same port which don't seem to tie up to any connections I can find on the Agent node.
>> 
>> So we're concluding that the Collector node is somehow leaking connections.
>> 
>> Has anyone seen this kind of thing before?
>> 
>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>> Or could this be a Thrift bug that could be avoided by switching to Avro sources/sinks?
>> 
>> Any hints/tips are most welcome.
>> 
>> Thanks,
>> 
>> Frank Grimes
> 


Re: Collector node failing with java.net.SocketException: Too many open files

Posted by Frank Grimes <fr...@yahoo.com>.
We believe that we've made some progress in identifying the problem.

It appears that we have a slow socket connection leak on the Collector node due to sparse data coming in on some Thrift RPC sources.
Turns out we're going through a firewall, and we believe that it is killing those inactive connections.

The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket timeout on a subsequent append, but the Collector still has its socket connections open and they don't appear to ever be timing out and closing.

I found the following which seems to describe the problem:

  http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3C1311642202.14311.2155844361@webmail.messagingengine.com%3E

However, because presumably some other disconnect conditions could trigger the problem as well, we are still looking for a solution that doesn't require fiddling with firewall settings.

Is there a way to configure the Collector node to drop/close these inactive connections? 
i.e. either at the Linux network layer or through Java socket APIs within Flume?
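
On the Java side, the general shape of what we're after would be something like the following standalone sketch (not Flume code; the port and the 10-minute timeout are placeholders): give every accepted socket a read timeout so a connection that goes silent eventually raises SocketTimeoutException and can be closed, freeing the descriptor.

  import java.io.IOException;
  import java.net.ServerSocket;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  public class IdleDroppingServer {

    public static void main(String[] args) throws IOException {
      ServerSocket server = new ServerSocket(35853); // placeholder port
      while (true) {
        final Socket client = server.accept();
        // If the peer (or a firewall in between) goes silent, the blocked
        // read gives up after 10 minutes instead of holding the fd forever.
        client.setSoTimeout(10 * 60 * 1000);
        new Thread(new Runnable() {
          public void run() {
            try {
              byte[] buf = new byte[4096];
              while (client.getInputStream().read(buf) != -1) {
                // hand the bytes off to the real handler here
              }
            } catch (SocketTimeoutException idle) {
              // idle too long: fall through and close
            } catch (IOException ignored) {
              // peer reset, etc.
            } finally {
              try { client.close(); } catch (IOException e) { /* ignore */ }
            }
          }
        }).start();
      }
    }
  }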

Thanks,

Frank Grimes


On 2012-01-26, at 10:51 AM, Frank Grimes wrote:

> Hi All,
> 
> We are using flume-0.9.5 (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and occasionally our Collector node accumulates too many open TCP connections and starts madly logging the following errors:
> 
> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
> Caused by: java.net.SocketException: Too many open files
>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>        ... 2 more
> 
> This quickly fills up the disk as the log file grows to multiple gigabytes in size.
> 
> After some investigation, it appears that even though the Agent nodes show single open connections to the Collector, the Collector node appears to have a bunch of zombie TCP connections open back to the Agent nodes.
> i.e.
> "lsof -n | grep PORT" on the Agent node shows 1 established connection
> However, the Collector node shows hundreds of established connections for that same port which don't seem to tie up to any connections I can find on the Agent node.
> 
> So we're concluding that the Collector node is somehow leaking connections.
> 
> Has anyone seen this kind of thing before?
> 
> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
> Or could this be a Thrift bug that could be avoided by switching to Avro sources/sinks?
> 
> Any hints/tips are most welcome.
> 
> Thanks,
> 
> Frank Grimes