Posted to notifications@accumulo.apache.org by "Adam J Shook (JIRA)" <ji...@apache.org> on 2018/01/26 17:08:00 UTC

[jira] [Commented] (ACCUMULO-4787) Numerous leaked CLOSE_WAIT threads from TabletServer

    [ https://issues.apache.org/jira/browse/ACCUMULO-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341291#comment-16341291 ] 

Adam J Shook commented on ACCUMULO-4787:
----------------------------------------

From the users list:

I checked all tablet servers across all six of our environments and it
seems to be present in all of them, with some having upwards of 73k
connections.

I disabled replication in our dev cluster and restarted the tablet
servers.  Left it running overnight and checked the connections -- a
reasonable number in the single or double digits.  Enabling replication
led to a quick climb in the CLOSE_WAIT connections to a couple thousand,
leading me to think it is some lingering connection reading a WAL file from
HDFS.
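
For reference, this is roughly how I'm counting them (just a sketch -- exact
lsof/awk flags may vary by version, and <tserver-pid> is a placeholder):

    # break the tserver's TCP sockets down by state
    lsof -nP -i -a -p <tserver-pid> | awk '{print $NF}' | sort | uniq -c

    # for the CLOSE_WAIT ones, group by remote port (expecting DataNode 50010)
    lsof -nP -i -a -p <tserver-pid> | grep CLOSE_WAIT | \
        awk '{print $(NF-1)}' | awk -F: '{print $NF}' | sort | uniq -c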

I've opened ACCUMULO-4787
<https://issues.apache.org/jira/browse/ACCUMULO-4787> to track this and we
can move discussion over there.

--Adam

On Thu, Jan 25, 2018 at 12:23 PM, Christopher <ct...@apache.org> wrote:

> Interesting. It's possible we're mishandling an IOException from DFSClient
> or something... but it's also possible there's a bug in DFSClient
> somewhere. I found a few similar issues from the past... some might still
> be not fully resolved:
>
> https://issues.apache.org/jira/browse/HDFS-1836
> https://issues.apache.org/jira/browse/HDFS-2028
> https://issues.apache.org/jira/browse/HDFS-6973
> https://issues.apache.org/jira/browse/HBASE-9393
>
> The HBASE issue is interesting, because it indicates a new HDFS feature in
> 2.6.4 to clear readahead buffers/sockets
> (https://issues.apache.org/jira/browse/HDFS-7694). That might be a
> feature we're not yet utilizing,
> but it would only work on a newer version of HDFS.
>
> I would probably also try to grab some jstacks of the tserver, to try to
> figure out what HDFS client code paths are being taken to see where the
> leak might be occurring. Also, if you have any debug logs for the tserver,
> that might help. There might be some DEBUG or WARN items that indicate
> retries or other failures that are occurring, but perhaps handled
> improperly.
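>
> For example, roughly (the grep pattern is only illustrative):
>
>     # take a few thread dumps, a minute or so apart
>     jstack -l <tserver-pid> > tserver-threads.$(date +%s).txt
>
>     # see whether HDFS client threads/frames keep accumulating across dumps
>     grep -c 'org.apache.hadoop.hdfs' tserver-threads.*.txt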
>
> It's probably less likely, but it could also be a Java or Linux issue. I
> wouldn't even know where to begin debugging at that level, though, other
> than to check for OS updates.  What JVM are you running?
>
> It's possible it's not a leak... and these are just getting cleaned up too
> slowly. That might be something that can be tuned with sysctl.
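>
> For example (these are the standard Linux sysctl names; defaults vary by
> distro, so inspect before changing anything):
>
>     sysctl net.ipv4.tcp_fin_timeout \
>            net.ipv4.tcp_keepalive_time \
>            net.ipv4.tcp_keepalive_intvl \
>            net.ipv4.tcp_keepalive_probes
>
> Although, since CLOSE_WAIT only clears when the owning process closes the
> socket, tuning would mostly help the other states.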
>
> On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <ad...@gmail.com>
> wrote:
>
>> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
>> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
>> the connections.  Just now there are ~25k connections for this one tserver,
>> of which 99.9% of them are all writing to various DataNodes on port 50010.
>> It's split about 50/50 for connections that are CLOSE_WAIT and ones that
>> are ESTABLISHED.  No special RPC configuration.
>>
>> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <jo...@gmail.com> wrote:
>>
>>> +1 to looking at the remote end of the socket and see where they're
>>> going/coming to/from. I've seen a few HDFS JIRA issues filed about sockets
>>> left in CLOSED_WAIT.
>>>
>>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>>
>>> (https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>>> covers some of the technical details)
>>>
>>> On 1/24/18 6:37 PM, Christopher wrote:
>>>
>>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>>> Accumulo version you're running. I'm assuming you verified that it was the
>>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>>> confirm based on the port number that these were Thrift connections or
>>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>>> non-default Accumulo RPC configuration (SSL or SASL)?
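>>>>
>>>> For example, something along these lines (netstat flags vary a bit by
>>>> version, and the grep pattern is only illustrative):
>>>>
>>>>     # map the CLOSE_WAIT sockets to their owning PID/program
>>>>     sudo netstat -antp | grep CLOSE_WAIT | awk '{print $NF}' | sort | uniq -c
>>>>
>>>>     # confirm which PID is the TabletServer
>>>>     jps -ml | grep -i tserver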
>>>>
>>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjshook@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>
>>>>     Hello all,
>>>>
>>>>     Has anyone come across an issue with a TabletServer occupying a
>>>>     large number of ports in a CLOSE_WAIT state?  The 'normal' number
>>>>     of used ports on a 12-node cluster is around 12,000 to 20,000.
>>>>     In one instance, there were over 68k, which prevented other
>>>>     applications from getting a free port, so they would fail to start
>>>>     (which is how we found this in the first place).
>>>>
>>>>     Thank you,
>>>>     --Adam
>>>>
>>>>
>>

> Numerous leaked CLOSE_WAIT threads from TabletServer
> ----------------------------------------------------
>
>                 Key: ACCUMULO-4787
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4787
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.8.1
>         Environment: * Ubuntu 14.04
> * HDFS 2.6.0 and 2.5.0 (in the middle of an upgrade cycle)
> * ZooKeeper 3.4.6
> * Accumulo 1.8.1
> * HotSpot 1.8.0_121
>            Reporter: Adam J Shook
>            Assignee: Adam J Shook
>            Priority: Major
>
> I'm running into an issue across all environments where TabletServers are occupying a large number of ports in a CLOSE_WAIT state writing to a DataNode at port 50010.  I'm seeing numbers from around 12,000 to 20,000 ports.  In some instances, there were over 68k, which prevented other applications from getting a free port, so they would fail to start (which is how we found this in the first place).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)