Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2019/03/07 20:49:00 UTC

[jira] [Commented] (KUDU-2738) linked_list-test occasionally fails with webserver port bind failure: address already in use

    [ https://issues.apache.org/jira/browse/KUDU-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787182#comment-16787182 ] 

Adar Dembo commented on KUDU-2738:
----------------------------------

I don't think it was a separate process because, like you said, we're using a unique loopback IP address and the odds of another application electing to use a non-standard loopback IP address, choosing the same address as ours, and also using the same port are impossibly low.

More likely, I think we've run afoul of some SO_REUSEADDR behavior w.r.t. TIME_WAIT sockets. According to the test output you attached, the tserver that failed was only started twice. I've listed the parameters involved in choosing the webserver bind socket below.

Start:
{noformat}
--webserver_interface=127.14.25.194
--webserver_port=0
{noformat}

Restart:
{noformat}
--webserver_interface=127.14.25.194
--webserver_port=49008
{noformat}

I double checked and squeasel does set SO_REUSEADDR on its socket before binding to it (see {{set_ports_option}} in squeasel.c). I don't know exactly how SO_REUSEADDR works, but there's a very informative [SO post|https://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t] about it. I suspect that we're being bitten by one of the corner cases it describes. Maybe it's exacerbated because linked_list-test starts up a PeriodicWebUIChecker which accesses the web UI repeatedly, unlike most other tests. Maybe it's also exacerbated because squeasel sometimes tinkers with the "linger time" of its sockets:
{noformat}
static void close_socket_gracefully(struct sq_connection *conn) {
  struct linger linger;

  // Set linger option to avoid socket hanging out after close. This prevent
  // ephemeral port exhaust problem under high QPS.
  linger.l_onoff = 1;
  linger.l_linger = 1;
  setsockopt(conn->client.sock, SOL_SOCKET, SO_LINGER,
             (char *) &linger, sizeof(linger));

  // Send FIN to the client
  shutdown(conn->client.sock, SHUT_WR);
  set_non_blocking_mode(conn->client.sock);

  // Now we know that our FIN is ACK-ed, safe to close
  closesocket(conn->client.sock);
}
{noformat}
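
To make the SO_REUSEADDR discussion above a bit more concrete, here is a minimal sketch of the start/restart pattern: bind to port 0, read back the port the kernel picked, then try to bind that same port again. This is not squeasel's or Kudu's code; {{bind_listener}} and {{bound_port}} are hypothetical helpers, and 127.0.0.1 stands in for the test's unique loopback address. It only illustrates one well-documented rule: setting SO_REUSEADDR does not let a bind succeed while another socket is actively listening on the same address and port. It does not reproduce the failure in this ticket.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not from squeasel or Kudu): create a TCP socket, set
 * SO_REUSEADDR, bind it to ip:port, and start listening.
 * Returns the socket fd on success or -errno on failure. */
static int bind_listener(const char *ip, unsigned short port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -errno;

  /* Same option squeasel is said to set before binding. */
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) != 0) {
    int err = errno;
    close(fd);
    return -err;
  }

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port); /* port 0 asks the kernel to pick a port */
  if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
    close(fd);
    return -EINVAL;
  }

  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
      listen(fd, 16) != 0) {
    int err = errno;
    close(fd);
    return -err;
  }
  return fd;
}

/* Hypothetical helper: return the local port a bound socket ended up with,
 * or -errno on failure. Useful after binding with port 0. */
static int bound_port(int fd) {
  assert(fd >= 0);
  struct sockaddr_in addr;
  socklen_t len = sizeof(addr);
  if (getsockname(fd, (struct sockaddr *)&addr, &len) != 0) return -errno;
  return ntohs(addr.sin_port);
}
```

Calling {{bind_listener("127.0.0.1", 0)}}, reading the port with {{bound_port}}, and then calling {{bind_listener}} again with that port while the first socket is still open should return {{-EADDRINUSE}} on Linux. After the first socket is closed, the same call is expected to succeed as long as nothing else has claimed the port in the meantime.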

> linked_list-test occasionally fails with webserver port bind failure: address already in use
> --------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2738
>                 URL: https://issues.apache.org/jira/browse/KUDU-2738
>             Project: Kudu
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.9.0
>            Reporter: Mike Percy
>            Priority: Trivial
>         Attachments: jenkins_output.txt.gz
>
>
> Occasionally I see linked_list-test fail with the following error on Linux in an automated test environment:
> {code:java}
> E0306 23:35:25.207222 19523 webserver.cc:369] Webserver: set_ports_option: cannot bind to 127.14.25.194:49008: 98 (Address already in use)
> W0306 23:35:25.207244 19523 net_util.cc:457] Trying to use lsof to find any processes listening on 0.0.0.0:49008
> I0306 23:35:25.207249 19523 net_util.cc:460] $ export PATH=$PATH:/usr/sbin ; lsof -n -i 'TCP:49008' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 'TCP:49008' -sTCP:LISTEN | grep p | cut -f 2 -dp) ; do while [ $pid -gt 1 ] ; do ps h -fp $pid ; stat=($(</proc/$pid/stat)) ; pid=${stat[3]} ; done ; done
> ...
> W0306 23:35:25.583075 19523 net_util.cc:467]
> F0306 23:35:25.583206 19523 tablet_server_main.cc:89] Check failed: _s.ok() Bad status: Runtime error: Webserver: could not start on address 127.14.25.194:49008: set_ports_option: cannot bind to 127.14.25.194:49008: 98 (Address already in use){code}
> I am not sure what would have bound to 0.0.0.0:49008 for a short period of time, or used 127.14.25.194:49008 as an ephemeral address / port pair since it's such a unique loopback IP address.


