Posted to user@accumulo.apache.org by "Adam J. Shook" <ad...@gmail.com> on 2018/01/24 20:45:25 UTC

Large number of used ports from tserver

Hello all,

Has anyone come across an issue with a TabletServer occupying a large
number of ports in a CLOSE_WAIT state?  The 'normal' number of used ports
on a 12-node cluster is around 12,000 to 20,000.  In one instance, there
were over 68k, which prevented other applications from getting a free port
and caused them to fail to start (which is how we found this in the first
place).
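
In case it's useful, this is roughly how I've been counting them, with
<tserver-pid> standing in for the TabletServer's process ID:

    # count the tserver's TCP sockets, grouped by state
    lsof -nP -iTCP -a -p <tserver-pid> | awk 'NR > 1 {print $NF}' | sort | uniq -c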

Thank you,
--Adam

Re: Large number of used ports from tserver

Posted by "Adam J. Shook" <ad...@gmail.com>.
I checked all tablet servers across all six of our environments and it
seems to be present in all of them, with some having upwards of 73k
connections.

I disabled replication in our dev cluster and restarted the tablet
servers.  I left it running overnight and checked the connections -- a
reasonable number in the single or double digits.  Enabling replication
again led to a quick climb in CLOSE_WAIT connections to a couple thousand,
leading me to think these are lingering connections left over from reading
WAL files in HDFS.
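
(For anyone who wants to reproduce the check, I was just watching the count
after flipping replication on, along the lines of the following, with
<tserver-pid> again being the tserver's process ID:)

    # re-check the tserver's CLOSE_WAIT count every minute
    watch -n 60 "lsof -nP -iTCP -a -p <tserver-pid> | grep -c CLOSE_WAIT"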

I've opened ACCUMULO-4787
<https://issues.apache.org/jira/browse/ACCUMULO-4787> to track this and we
can move discussion over there.

--Adam

On Thu, Jan 25, 2018 at 12:23 PM, Christopher <ct...@apache.org> wrote:

> Interesting. It's possible we're mishandling an IOException from DFSClient
> or something... but it's also possible there's a bug in DFSClient
> somewhere. I found a few similar issues from the past... some might still
> not be fully resolved:
>
> https://issues.apache.org/jira/browse/HDFS-1836
> https://issues.apache.org/jira/browse/HDFS-2028
> https://issues.apache.org/jira/browse/HDFS-6973
> https://issues.apache.org/jira/browse/HBASE-9393
>
> The HBASE issue is interesting, because it indicates a new HDFS feature in
> 2.6.4 to clear readahead buffers/sockets
> (https://issues.apache.org/jira/browse/HDFS-7694). That might be a feature
> we're not yet utilizing,
> but it would only work on a newer version of HDFS.
>
> I would probably also try to grab some jstacks of the tserver, to try to
> figure out what HDFS client code paths are being taken to see where the
> leak might be occurring. Also, if you have any debug logs for the tserver,
> that might help. There might be some DEBUG or WARN items that indicate
> retries or other failures that are occurring but are perhaps being handled
> improperly.
>
> It's probably less likely, but it could also be a Java or Linux issue. I
> wouldn't even know where to begin debugging at that level, though, other
> than to check for OS updates.  What JVM are you running?
>
> It's possible it's not a leak... and these are just getting cleaned up too
> slowly. That might be something that can be tuned with sysctl.
>
> On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <ad...@gmail.com>
> wrote:
>
>> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
>> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
>> the connections.  Just now there are ~25k connections for this one tserver,
>> of which 99.9% are writing to various DataNodes on port 50010.  It's split
>> about 50/50 between connections in CLOSE_WAIT and ones that are
>> ESTABLISHED.  No special RPC configuration.
>>
>> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <jo...@gmail.com> wrote:
>>
>>> +1 to looking at the remote end of the sockets to see where they're
>>> going to/coming from. I've seen a few HDFS JIRA issues filed about sockets
>>> left in CLOSE_WAIT.
>>>
>>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>>
>>> (https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>>> covers some of the technical details)
>>>
>>> On 1/24/18 6:37 PM, Christopher wrote:
>>>
>>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>>> Accumulo version you're running. I'm assuming you verified that it was the
>>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>>> confirm based on the port number that these were Thrift connections or
>>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>>> non-default Accumulo RPC configuration (SSL or SASL)?
>>>>
>>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>
>>>>     Hello all,
>>>>
>>>>     Has anyone come across an issue with a TabletServer occupying a
>>>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number
>>>>     of used ports on a 12-node cluster is around 12,000 to 20,000.  In
>>>>     one instance, there were over 68k, which prevented other
>>>>     applications from getting a free port and caused them to fail to
>>>>     start (which is how we found this in the first place).
>>>>
>>>>     Thank you,
>>>>     --Adam
>>>>
>>>>
>>

Re: Large number of used ports from tserver

Posted by Christopher <ct...@apache.org>.
Interesting. It's possible we're mishandling an IOException from DFSClient
or something... but it's also possible there's a bug in DFSClient
somewhere. I found a few similar issues from the past... some might still
not be fully resolved:

https://issues.apache.org/jira/browse/HDFS-1836
https://issues.apache.org/jira/browse/HDFS-2028
https://issues.apache.org/jira/browse/HDFS-6973
https://issues.apache.org/jira/browse/HBASE-9393

The HBASE issue is interesting, because it indicates a new HDFS feature in
2.6.4 to clear readahead buffers/sockets (
https://issues.apache.org/jira/browse/HDFS-7694). That might be a feature
we're not yet utilizing, but it would only work on a newer version of HDFS.

I would probably also try to grab some jstacks of the tserver, to try to
figure out what HDFS client code paths are being taken to see where the
leak might be occurring. Also, if you have any debug logs for the tserver,
that might help. There might be some DEBUG or WARN items that indicate
retries or other failures that are occurring but are perhaps being handled
improperly.
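
Something along these lines is what I have in mind -- a few thread dumps
spaced a minute apart, then grep for the HDFS client frames (<tserver-pid>
being the TabletServer's process ID):

    # take a few thread dumps of the tserver, a minute apart
    for i in 1 2 3; do jstack <tserver-pid> > tserver-jstack-$i.txt; sleep 60; done
    # then see where the DFS client threads are sitting
    grep -A 5 'org.apache.hadoop.hdfs' tserver-jstack-*.txt | less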

It's probably less likely, but it could also be a Java or Linux issue. I
wouldn't even know where to begin debugging at that level, though, other
than to check for OS updates.  What JVM are you running?

It's possible it's not a leak... and these are just getting cleaned up too
slowly. That might be something that can be tuned with sysctl.
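
(If it comes to that, these are the settings I'd look at first, with the
caveat that, as far as I know, a socket in CLOSE_WAIT only goes away once
the owning process closes it, so the kernel timers mostly affect the other
states:)

    # show the current TCP timeout/keepalive settings, just for inspection
    sysctl net.ipv4.tcp_fin_timeout \
           net.ipv4.tcp_keepalive_time \
           net.ipv4.tcp_keepalive_intvl \
           net.ipv4.tcp_keepalive_probes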

On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <ad...@gmail.com> wrote:

> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
> the connections.  Just now there are ~25k connections for this one tserver,
> of which 99.9% are writing to various DataNodes on port 50010.  It's split
> about 50/50 between connections in CLOSE_WAIT and ones that are
> ESTABLISHED.  No special RPC configuration.
>
> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <jo...@gmail.com> wrote:
>
>> +1 to looking at the remote end of the sockets to see where they're
>> going to/coming from. I've seen a few HDFS JIRA issues filed about sockets
>> left in CLOSE_WAIT.
>>
>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>
>> (
>> https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>> covers some of the technical details)
>>
>> On 1/24/18 6:37 PM, Christopher wrote:
>>
>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>> Accumulo version you're running. I'm assuming you verified that it was the
>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>> confirm based on the port number that these were Thrift connections or
>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>> non-default Accumulo RPC configuration (SSL or SASL)?
>>>
>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Hello all,
>>>
>>>     Has anyone come across an issue with a TabletServer occupying a
>>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number
>>>     of used ports on a 12-node cluster is around 12,000 to 20,000.  In
>>>     one instance, there were over 68k, which prevented other
>>>     applications from getting a free port and caused them to fail to
>>>     start (which is how we found this in the first place).
>>>
>>>     Thank you,
>>>     --Adam
>>>
>>>
>

Re: Large number of used ports from tserver

Posted by Michael Wall <mj...@gmail.com>.
What tables/tablets are on that tserver?

On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <ad...@gmail.com> wrote:

> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
> the connections.  Just now there are ~25k connections for this one tserver,
> of which 99.9% are writing to various DataNodes on port 50010.  It's split
> about 50/50 between connections in CLOSE_WAIT and ones that are
> ESTABLISHED.  No special RPC configuration.
>
> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <jo...@gmail.com> wrote:
>
>> +1 to looking at the remote end of the sockets to see where they're
>> going to/coming from. I've seen a few HDFS JIRA issues filed about sockets
>> left in CLOSE_WAIT.
>>
>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>
>> (
>> https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>> covers some of the technical details)
>>
>> On 1/24/18 6:37 PM, Christopher wrote:
>>
>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>> Accumulo version you're running. I'm assuming you verified that it was the
>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>> confirm based on the port number that these were Thrift connections or
>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>> non-default Accumulo RPC configuration (SSL or SASL)?
>>>
>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Hello all,
>>>
>>>     Has anyone come across an issue with a TabletServer occupying a
>>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number
>>>     of used ports on a 12-node cluster is around 12,000 to 20,000.  In
>>>     one instance, there were over 68k, which prevented other
>>>     applications from getting a free port and caused them to fail to
>>>     start (which is how we found this in the first place).
>>>
>>>     Thank you,
>>>     --Adam
>>>
>>>
>

Re: Large number of used ports from tserver

Posted by "Adam J. Shook" <ad...@gmail.com>.
We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
the connections.  Just now there are ~25k connections for this one tserver,
of which 99.9% are writing to various DataNodes on port 50010.  It's split
about 50/50 between connections in CLOSE_WAIT and ones that are
ESTABLISHED.  No special RPC configuration.
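
For what it's worth, the breakdown comes from something like the following
(with the actual PID substituted for <tserver-pid>):

    # tserver connections grouped by TCP state and remote port
    lsof -nP -iTCP -a -p <tserver-pid> \
        | awk 'NR > 1 {n = split($(NF-1), a, ":"); print $NF, a[n]}' \
        | sort | uniq -c | sort -rn | head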

On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <jo...@gmail.com> wrote:

> +1 to looking at the remote end of the sockets to see where they're
> going to/coming from. I've seen a few HDFS JIRA issues filed about sockets
> left in CLOSE_WAIT.
>
> Lucky you, this is a fun Linux rabbit hole to go down :)
>
> (https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
> covers some of the technical details)
>
> On 1/24/18 6:37 PM, Christopher wrote:
>
>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>> Accumulo version you're running. I'm assuming you verified that it was the
>> TabletServer process holding these TCP sockets open using `netstat -p` and
>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>> confirm based on the port number that these were Thrift connections or
>> could they be ZooKeeper or Hadoop connections? Do you have any special
>> non-default Accumulo RPC configuration (SSL or SASL)?
>>
>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com
>> <ma...@gmail.com>> wrote:
>>
>>     Hello all,
>>
>>     Has anyone come across an issue with a TabletServer occupying a
>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number
>>     of used ports on a 12-node cluster is around 12,000 to 20,000.  In
>>     one instance, there were over 68k, which prevented other
>>     applications from getting a free port and caused them to fail to
>>     start (which is how we found this in the first place).
>>
>>     Thank you,
>>     --Adam
>>
>>

Re: Large number of used ports from tserver

Posted by Josh Elser <jo...@gmail.com>.
+1 to looking at the remote end of the sockets to see where they're
going to/coming from. I've seen a few HDFS JIRA issues filed about
sockets left in CLOSE_WAIT.
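
A quick way to get that picture -- roughly, and with <tserver-pid> as a
placeholder -- is to group the stuck sockets by their remote host:

    # remote ends of the sockets sitting in CLOSE_WAIT for one process
    lsof -nP -iTCP -a -p <tserver-pid> | grep CLOSE_WAIT \
        | awk '{split($(NF-1), a, "->"); sub(/:[0-9]+$/, "", a[2]); print a[2]}' \
        | sort | uniq -c | sort -rn | head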

Lucky you, this is a fun Linux rabbit hole to go down :)

(https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/ 
covers some of the technical details)

On 1/24/18 6:37 PM, Christopher wrote:
> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and 
> Accumulo version you're running. I'm assuming you verified that it was 
> the TabletServer process holding these TCP sockets open using `netstat 
> -p` and cross-referencing the PID with `jps -ml` (or similar)? Are you 
> able to confirm based on the port number that these were Thrift 
> connections or could they be ZooKeeper or Hadoop connections? Do you 
> have any special non-default Accumulo RPC configuration (SSL or SASL)?
> 
> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com 
> <ma...@gmail.com>> wrote:
> 
>     Hello all,
> 
>     Has anyone come across an issue with a TabletServer occupying a
>     large number of ports in a CLOSE_WAIT state?  The 'normal' number
>     of used ports on a 12-node cluster is around 12,000 to 20,000.  In
>     one instance, there were over 68k, which prevented other
>     applications from getting a free port and caused them to fail to
>     start (which is how we found this in the first place).
> 
>     Thank you,
>     --Adam
> 

Re: Large number of used ports from tserver

Posted by Christopher <ct...@apache.org>.
I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
Accumulo version you're running. I'm assuming you verified that it was the
TabletServer process holding these TCP sockets open using `netstat -p` and
cross-referencing the PID with `jps -ml` (or similar)? Are you able to
confirm based on the port number that these were Thrift connections or
could they be ZooKeeper or Hadoop connections? Do you have any special
non-default Accumulo RPC configuration (SSL or SASL)?
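
Roughly what I mean, as a sketch:

    # see which local process owns the piled-up sockets (last column is PID/program)
    sudo netstat -tnp | grep CLOSE_WAIT | awk '{print $7}' | sort | uniq -c | sort -rn
    # confirm that PID really is the TabletServer
    jps -ml | grep -i tserver
    # the remote port hints at the protocol: 2181 -> ZooKeeper,
    # 50010 -> HDFS DataNode, 9997 -> default tserver Thrift port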

On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <ad...@gmail.com> wrote:

> Hello all,
>
> Has anyone come across an issue with a TabletServer occupying a large
> number of ports in a CLOSE_WAIT state?  The 'normal' number of used ports
> on a 12-node cluster is around 12,000 to 20,000.  In one instance, there
> were over 68k, which prevented other applications from getting a free port
> and caused them to fail to start (which is how we found this in the first
> place).
>
> Thank you,
> --Adam
>