You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@brooklyn.apache.org by neykov <gi...@git.apache.org> on 2015/05/27 16:53:14 UTC

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

GitHub user neykov opened a pull request:

    https://github.com/apache/incubator-brooklyn/pull/662

    Fix for the "unable to find a free port" Jenkins error

    The fix
    ----------
    If the free port check fails for a specific interface, do an additional check if able to bind to an OS picked port on the same IP. If it fails, then there's something wrong with the interface, not just a port already bound error.
    
    Chasing the problem
    -----------------
    Used https://github.com/apache/incubator-brooklyn/pull/657 as a test bed to figure out what could be wrong when the error occurs. After lost of rebuilds (sorry if you get the jenkins emails) got lucky and hit a case wherre `isPortAvailable` [was failing [1]](https://builds.apache.org/job/incubator-brooklyn-pull-requests/1238/console):
    ```
    java.net.SocketException: No such device
    	at java.net.PlainSocketImpl.socketBind(Native Method) ~[na:1.6.0_45]
    ```
    when trying to bind to IP `fe80:0:0:0:e833:4cff:fed4:1108%181`.  The test was being executed continuously for half an hour during which the error was consistent (because `invocationCount=1000` and `nextAvailablePort` enumeration all ports up to 65535). This shows that it's not just a momentary glitch but a persistent state of the network stack.
    
    Digging in
    -------------------
    As the message is not really helpful and it was not clear what is causing it, I went to the Java sources to find out. As seen from the above stack trace, Java eventually calls into the POSIX API when dealing with sockets. [This [2]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/PlainSocketImpl.c#l643) is where the exception is thrown from. It will essentially use `strerror` after `bind` failure to build the exception message (see [[3]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/share/native/common/jni_util.c#l160), [[4]](http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/share/vm/prims/jvm.cpp#l658),  [[5]](http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/os/linux/vm/os_linux.cpp#l1608)).
    
    Now we are getting somewhere, "No such device" [is equivalent [6]](https://github.com/lattera/glibc/blob/a7a93d5086103f52367d3e9776976eb0b0bc6c7b/sysdeps/gnu/errlist.c#L220) to `ENODEV` error, which is returned from `bind`. But what does it actually mean? From the [man page on IPv6 [7]](http://man7.org/linux/man-pages/man7/ipv6.7.html):
    ```
           ENODEV The user tried to bind(2) to a link-local IPv6 address, but
                  the sin6_scope_id in the supplied sockaddr_in6 structure is
                  not a valid interface index.
    ```
    And indeed [this is the case [8]](https://github.com/torvalds/linux/blob/master/net/ipv6/af_inet6.c#L335). Looking for what `sin6_scope_id` means I found out that local-link IPv6 addresses (beginning with `fe80::`) have an additional identifier to distinguish between IPs on different interfaces. All link-local IPs of the interfaces live in the same subnet which makes it impossible to distinguish to which interface it belongs just by the IP. To solve this one can add the interface index (or on some newer distributions the interface name) separated by `%` to the IP. Looking at the failing IP, it has this identifier - `fe80:0:0:0:e833:4cff:fed4:1108%181` - `181`, which means that it belongs to interface with index 181, unusually high.
    
    Now it's time to find out where does Java get the IP+index from. On one side attempts to bind to the IP fail, but on the other Java includes it in the interface list. We get the interface list and assigned IPs from `NetworkInterface.getNetworkInterfaces()`. Going back to the jdk sources eventually (and not to bog you down with even more details) we get to [this piece of code [9]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/NetworkInterface.c#l1101). It's parsing the contents of `/proc/net/if_inet6` to get the list. On my system it looks like this:
    ```
    00000000000000000000000000000001 01 80 10 80       lo
    fe800000000000000a0027fffeb8efed 02 40 20 80     eth0
    fe800000000000000a0027fffea3192c 03 40 20 80     eth1
    ```
    
    So `/proc/net/if_inet6` contains a record for `fe80:0:0:0:e833:4cff:fed4:1108%181`, but `bind` says that interface `181` doesn't exist (and this goes for at least half an hour, before the test is aborted). Weird. Could be some misconfiguration? Probably caused by some other tests running in parallel doing something to the system.
    
    Comparing different build runs I noticed something interesting. In [some of the logs, i.e. [10]](https://builds.apache.org/job/incubator-brooklyn-pull-requests/1247/console) `ifconfig` returned (amongst the standard stuff):
    
    ```
    docker0   Link encap:Ethernet  HWaddr 56:84:7a:fe:97:99  
              inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
              inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:183134 errors:0 dropped:0 overruns:0 frame:0
              TX packets:324280 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0 
              RX bytes:39149132 (39.1 MB)  TX bytes:277258190 (277.2 MB)
    
    ...
    
    vethc78eeb0 Link encap:Ethernet  HWaddr d6:88:a9:b6:be:5d  
              inet6 addr: fe80::d488:a9ff:feb6:be5d/64 Scope:Link
              UP BROADCAST RUNNING  MTU:1500  Metric:1
              RX packets:40283 errors:0 dropped:0 overruns:0 frame:0
              TX packets:193760 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0 
              RX bytes:3764512 (3.7 MB)  TX bytes:256855395 (256.8 MB)
    ```
    
    
    `vethc78eeb0` looks interesting - like it's being auto generated. This could explain the large interface index we have. This run was successful though, so something more going on there. Time to find out what vethc78eeb0 means and who is creating it. The docker interface gives some hints - LXC has a type of network called [`veth` (Virtual Ethernet) [11]](https://wiki.debian.org/LXC/SimpleBridge). One can create a pair of interface with LXC and the convention is to call them vethXXX. Apparently the interface above has the id `c78eeb0` auto generated. So looks like some test is creating a ton of virtual interfaces and then destroying them (or more likely putting at least one in a strange state). There was a mesos test bing run around the same time which supports this.
    
    I wanted to replicate the exact same error locally to be sure that the changes will solve the problem. Followed [this tutorial [12]](http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/)  to create a veth pair:
    ```
    sudo ip netns add blue
    sudo ip link add veth0 type veth peer name veth1
    sudo ip link set veth1 netns blue
    sudo ip netns exec blue ifconfig veth1 10.1.1.1/24 up
    ```
    
    Tried to give it an IPv6 address as well, but eventually it generated its own link-local one. Binding to the new IP works great, everything in order in `/proc/net/if_inet6`. Then I let the interface go on an up-down loop with the following and started the `isPortAvailable` test. 
    ```
    while true; do sudo ifconfig veth0 down; sleep 5; sudo ifconfig veth0 up; sleep 5; done
    ```
    
    I started getting `java.net.BindException: Cannot assign requested address` exceptions for a second after the interface is upped. Once it settles the test is working, but there is a short period after up where the interface exists with an IP, but denies binds. It's not quite what I was after, but close enough for tests.
    
    First tried to work around it by using `NetworkInterface.isUp` but didn't help. That's why I settled on trying a second bind but this time letting the OS choose a free port. If the normal bind fails and the second one succeeds, then obviously the port is taken. But if both fail then there's something fishy going on.
    
    And finally one caveat taken from the commit comments:
    ```
         * If there's port exhaustion on a single interface we won't catch it
         * and declare the port is free. Doesn't matter really because the
         * subsequent bind of the caller will fail anyway and nextAvailablePort
         * wouldn't be able to find a free one either.
    ```
    
    
    PS: Looks like this commit https://github.com/pushtechnology/incubator-brooklyn/commit/34439f4a2ac0c13781a8485c4d6d308fb744de47 removed part of the check wrongly assuming it's unnecessary. Not sure if worth adding back...
    
    
    [1] https://builds.apache.org/job/incubator-brooklyn-pull-requests/1238/console (will be gone soon)
    [2] http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/PlainSocketImpl.c#l643
    [3] http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/share/native/common/jni_util.c#l160
    [4] http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/share/vm/prims/jvm.cpp#l658
    [5] http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/os/linux/vm/os_linux.cpp#l1608
    [6] https://github.com/lattera/glibc/blob/a7a93d5086103f52367d3e9776976eb0b0bc6c7b/sysdeps/gnu/errlist.c#L220
    [7] http://man7.org/linux/man-pages/man7/ipv6.7.html
    [8] https://github.com/torvalds/linux/blob/master/net/ipv6/af_inet6.c#L335
    [9] http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/NetworkInterface.c#l1101
    [10] https://builds.apache.org/job/incubator-brooklyn-pull-requests/1247/console
    [11] https://wiki.debian.org/LXC/SimpleBridge
    [12] http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/neykov/incubator-brooklyn fix-port-not-available

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-brooklyn/pull/662.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #662
    
----
commit 72571f2c65bf2cca16520cc525e6b61812e51158
Author: Svetoslav Neykov <sv...@cloudsoftcorp.com>
Date:   2015-05-27T11:51:27Z

    Fix for the "unable to find a free port" error that has been plaguing the jenkins builds
    
    If the free port check fails for a specific interface, do an additional check if able to bind to an OS picked port on the same IP. If it fails, then there's something wrong with the interface, not just a port already bound error.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-brooklyn/pull/662


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by ahgittin <gi...@git.apache.org>.
Github user ahgittin commented on a diff in the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#discussion_r31166051
  
    --- Diff: utils/common/src/main/java/brooklyn/util/net/Networking.java ---
    @@ -129,24 +127,89 @@ public static boolean isPortAvailable(InetAddress localAddress, int port) {
                     } catch (SocketException e) {
                         throw Exceptions.propagate(e);
                     }
    +                // When using a specific interface saw failures not caused by port already bound:
    +                //   * java.net.SocketException: No such device
    +                //   * java.net.BindException: Cannot assign requested address
    +                //   * probably many more
    +                // Check if the address is still valid before marking the port as not available.
    +                boolean foundAvailableInterface = false;
                     while (nis.hasMoreElements()) {
                         NetworkInterface ni = nis.nextElement();
                         Enumeration<InetAddress> as = ni.getInetAddresses();
                         while (as.hasMoreElements()) {
                             InetAddress a = as.nextElement();
    -                        if (!isPortAvailable(a, port))
    -                            return false;
    +                        if (!isPortAvailable(a, port)) {
    +                            if (isAddressValid(a)) {
    +                                if (log.isTraceEnabled()) log.trace("Port {} : {} @ {} is taken and the address is valid", new Object[] {a, port, nis});
    +                                return false;
    +                            }
    +                        } else {
    +                            foundAvailableInterface = true;
    +                        }
                         }
                     }
    +                if (!foundAvailableInterface) {
    +                    //Aborting with an error, even nextAvailablePort won't be able to find a free port.
    +                    throw new RuntimeException("Unable to bind on any network interface, even when letting the OS pick a port. Possible causes include file handle exhaustion, port exhaustion. Failed on request for " + localAddress + ":" + port + ".");
    +                }
                 }
    -            
    +
                 return true;
             } finally {
                 // Until timeout was added, was taking 1min5secs for /fe80:0:0:0:1cc5:1ff:fe81:a61d%8 : 8081
    -            if (log.isTraceEnabled()) log.trace("Took {} to determine if port {} : {} was available", 
    -                    new Object[] {Time.makeTimeString(watch.elapsed(TimeUnit.MILLISECONDS), true), localAddress, port});
    +            // Svet - Probably caused by the now gone new Socket().connect() call, SO_TIMEOUT doesn't
    +            // influence bind(). Doesn't hurt having it though.
    +            long elapsed = watch.elapsed(TimeUnit.SECONDS);
    +            boolean isDelayed = (elapsed >= 1);
    +            boolean isDelayedByMuch = (elapsed >= 30);
    +            if (isDelayed || log.isTraceEnabled()) {
    +                String msg = "Took {} to determine if port was available for {} : {}";
    +                Object[] args = new Object[] {Time.makeTimeString(watch.elapsed(TimeUnit.MILLISECONDS), true), localAddress, port};
    +                if (isDelayedByMuch) {
    +                    log.warn(msg, args);
    +                } else if (isDelayed) {
    +                    log.debug(msg, args);
    +                } else {
    +                    log.trace(msg, args);
    +                }
    +            }
             }
         }
    +
    +    /**
    +     * Bind to the specified IP, but let the OS pick a port.
    +     * If the operation fails we know it's not because of
    +     * non-available port, the interface could be down.
    +     * 
    +     * If there's port exhaustion on a single interface we won't catch it
    +     * and declare the port is free. Doesn't matter really because the
    +     * subsequent bind of the caller will fail anyway and nextAvailablePort
    +     * wouldn't be able to find a free one either.
    +     */
    +    private static boolean isAddressValid(InetAddress addr) {
    +        ServerSocket ss;
    +        try {
    +            ss = new ServerSocket();
    +            ss.setSoTimeout(250);
    +        } catch (IOException e) {
    +            throw Exceptions.propagate(e);
    +        }
    +        try {
    +            ss.bind(new InetSocketAddress(addr, 0));
    +            return true;
    +        } catch (IOException e) {
    +            Exceptions.propagateIfFatal(e);
    +            if (log.isTraceEnabled()) log.trace("Binding on {} failed, interface could be down, being reconfigured, file handle exhaustion, port exhaustion, etc.", addr);
    +            return false;
    +        } finally {
    +            try {
    +                ss.close();
    +            } catch (IOException e) {
    +                Exceptions.propagateIfFatal(e);
    --- End diff --
    
    as above - maybe a `closeQuietly` method?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by ahgittin <gi...@git.apache.org>.
Github user ahgittin commented on a diff in the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#discussion_r31165901
  
    --- Diff: utils/common/src/main/java/brooklyn/util/net/Networking.java ---
    @@ -115,7 +113,7 @@ public static boolean isPortAvailable(InetAddress localAddress, int port) {
                         try {
                             ss.close();
                         } catch (IOException e) {
    -                        /* should not be thrown */
    +                        Exceptions.propagateIfFatal(e);
    --- End diff --
    
    will `IOException` ever be fatal?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by neykov <gi...@git.apache.org>.
Github user neykov commented on the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#issuecomment-106207292
  
    Addressed comments, thanks @ahgittin, @sjcorbett.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by ahgittin <gi...@git.apache.org>.
Github user ahgittin commented on a diff in the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#discussion_r31165925
  
    --- Diff: utils/common/src/main/java/brooklyn/util/net/Networking.java ---
    @@ -115,7 +113,7 @@ public static boolean isPortAvailable(InetAddress localAddress, int port) {
                         try {
                             ss.close();
                         } catch (IOException e) {
    -                        /* should not be thrown */
    +                        Exceptions.propagateIfFatal(e);
    --- End diff --
    
    worth keeping comment in any case that non-fatal problems closing can be ignored


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by sjcorbett <gi...@git.apache.org>.
Github user sjcorbett commented on the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#issuecomment-105952187
  
    @neykov I really enjoy reading your diagnoses and solutions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by ahgittin <gi...@git.apache.org>.
Github user ahgittin commented on the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#issuecomment-106028277
  
    minor comments, but probably worth addressing.  then looks good to merge.
    
    +1 to sam - excellent write-up and excellent job getting to the bottom of this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-brooklyn pull request: Fix for the "unable to find a fre...

Posted by ahgittin <gi...@git.apache.org>.
Github user ahgittin commented on the pull request:

    https://github.com/apache/incubator-brooklyn/pull/662#issuecomment-106290407
  
    merging.  @neykov close #657 now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---