You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@thrift.apache.org by "Jens Geyer (Jira)" <ji...@apache.org> on 2020/06/09 21:12:00 UTC

[jira] [Commented] (THRIFT-5186) AI_ADDRCONFIG: Thrift libraries crash with localhost-only network.

    [ https://issues.apache.org/jira/browse/THRIFT-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129807#comment-17129807 ] 

Jens Geyer commented on THRIFT-5186:
------------------------------------


Original comment from https://github.com/apache/thrift/pull/2151

My previous patch (9b9567b23378c821b460cfe54b70b9d189bf194d) has exposed
an issue shared by TServerSocket and TNonblockingServerSocket:
the results of getaddrinfo() call aren't used "in full" -- just a single
address is picked, and then bind() is retried in a loop on that single address.

This leads to poor results if we start varying the network conditions.
Like I show in the Jira issue, this is normal:

    [root@04dd07b70038 /]# ping -6 localhost
    ping: connect: Cannot assign requested address

-- same with `ping -6 $(hostname)` -- a hostname may resolve to IPv6 address
which we might not be able to connect to; and vice-versa, connecting to
IPv4 addresses may fail on IPv6-only systems.

The solution is to iterate over what getaddrinfo() returns while retrying
failed bind(). This is what this patch implements. Server-side analogue
of this behavior of curl:

    > curl -v localhost:8001/index.do
    * Trying ::1:8001...
    * connect to ::1 port 8001 failed: Connection refused
    * Trying 127.0.0.1:8001...
    * Connected to localhost (127.0.0.1) port 8001 (#0)
    > GET /index.do HTTP/1.1
    [...]

To achieve that, I throw away the TGetAddrInfoWrapper, and roll a proper one.

The bind-retrying loop had to be restructured: since sa_family will vary
from retry to retry, the socket() call has to be within the loop.

All the setsockopt() business factored away into side-methods: ~hundred
repetitive #ifdef-rich lines in the middle of important loop do get in the way.

This patch has quite a lot of code movement; git config diff.colormoved zebra
is strongly advised to the reader.

Tested manually on Linux, by running CMake's `make test` in 3 Docker environments:
 1) loopback-only (--net=none)
 2) classic IPv4-only bridge
 3) dual IPv6-IPv4 bridge
The only failing test was 88 - PythonTestSSLSocket, which currently also
fails on master for me (due to NULL cipher being unexpectedly accepted).


> AI_ADDRCONFIG: Thrift libraries crash with localhost-only network.
> ------------------------------------------------------------------
>
>                 Key: THRIFT-5186
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5186
>             Project: Thrift
>          Issue Type: Bug
>          Components: C++ - Library, Delphi - Library, Python - Library
>    Affects Versions: 0.13.0
>         Environment: Red Hat Enterprise Linux 8.0
>            Reporter: Max
>            Assignee: Max
>            Priority: Major
>              Labels: getaddrinfo, localhost, sockets
>             Fix For: 0.14.0
>
>         Attachments: 0001-THRIFT-5186-Dont-pass-AI_ADDRCONFIG-to-getaddrinfo.patch
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> THRIFT-2539 has been reported, and fixed — but for win32 only, for no apparent reason. The exact same problem reproduces on POSIX.
> Namely, when no network interfaces besides {{lo}} (the 127.0.0.1 loopback interface) are up, C++ and Python apps linked with Thrift-generated code, both clients and servers — *crash by throwing an exception*. Even when the intention is exactly to run them on localhost only.
> This happens because Thrift library code for TSocket, TServerSocket, TNonblockingServerSocket calls [{{getaddrinfo()}}|http://man7.org/linux/man-pages/man3/getaddrinfo.3.html] to resolve target hostname to connect to/listen on, into concrete IP address (v4 or v6, whichever the system is configured for). To that call, it *passes the {{AI_ADDRCONFIG}} hint* which effectively turns a localhost-only situation into:
> {quote}{{Could not resolve host for client socket.}}
> {quote}
> and into this (server-side):
> {code:java}
> гру 23 13:52:13 localhost.localdomain systemd[1]: db_cache.service: Main process exited, code=dumped, status=6/ABRT
> гру 23 13:52:13 localhost.localdomain systemd[1]: db_cache.service: Failed with result 'core-dump'.
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 TSocket::open() getaddrinfo() <Host: 127.0.0.1 Port: 1302>Address family for hostname not supported
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 TSocket::open() getaddrinfo() <Host: 127.0.0.1 Port: 8345>Address family for hostname not supported
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 TNonblocking: using dedicated listener thread, io threads: 16
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 getaddrinfo -9: Address family for hostname not supported
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: terminate called after throwing an instance of 'apache::thrift::transport::TTransportException'
> гру 23 13:52:17 localhost.localdomain db_cache[12912]:   what():  Could not resolve host for server socket.
> {code}
> I fail to understand the original reason to pass that {{AI_ADDRCONFIG}} hint. It shouldn't be there as I see it.
> Further, since Thrift 0.9.2, windows builds of thrift apps don't pass that hint anymore (see THRIFT-2539), and it seems to be okay.
> For comprehension, I'm attaching a sample patch to remove {{AI_ADDRCONFIG}} from {{lib/cpp}} and {{lib/py}}. The main change will be landing via GitHub, per Thrift's contribution process, so please follow there too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)