You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@river.apache.org by Peter Firmstone <ji...@zeus.net.au> on 2013/04/09 21:29:25 UTC

Tests failing due to port in use

This was being discussed in two separate threads:

Greg,

Are you able to find which process is using a port?  Perhaps jvisualvm 
might help determine which test it is?  If you can narrow it down to a 
bunch of test definitions sharing a test class, by identifying the class 
on the stack, you'll find it.

Sim wrote:
> On 09-04-13 11:44, Peter Firmstone wrote:
>> I've never experienced the issue locally (I see it on Jenkins quite a
>> lot), but I suspect a stale registrar process left from another test may
>> be stopping the socket from closing.  Not that registrars are also
>> simulated for discovery tests, so it may not necessarily be Reggie.
>>
>> The code is duplicated in two places in superclasses of the tests that
>> are failing,  the method portInUse(int port) is supposed to check if the
>> ports available, but only selects from a list of LookupLocator's known
>> to have started, so doesn't actually check the port's available.
>>
>> Perhaps you can find the stale test process responsible?
>
> And other jobs get executed on the same machine at the same time. 
> Maybe these consume ports?
>
> The best way to find the problem looks to me like doing a 'lsof -i' in 
> the test as soons as a port turns out used unexpectedly.
>
> Gr. Simon
>
> -- 
> QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
> Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397

Peter wrote:
> It's a problem I've been aware of for some time, many tests don't 
> specify a port, they just want three or four or however many 
> registrar's started and they have to specify it using a LookupLocator, 
> unfortunately you can't specify an ephemeral LookupLocator, it's not 
> allowed, so if you don't specify a particular port, you get 4160 and 
> the portInUse method is supposed to work around that.
>
> When a test finds a port in use, it updates the registrar wanted to an 
> available port.  If it thinks 4160 is available, when it isn't, boom, 
> test failed.
>
> Regards,
>
> Peter.
>
> Greg Trasuk wrote:
>> I hope you're not adding this to try to help my configuration issue,
>> because I'm pretty sure that's not the problem, and I hate to screw
>> around with the regression tests unneccessarily.
>>
>> Evidence-wise, the "lookup started != initial lookups wanted" issue
>> looks like this:
>> When I enable logging at the FINE level, I see three registrars started,
>> with unicast discovery on random ports - this is Reggie's fallback
>> position when it attempts to open its unicast port and it's already in
>> use.  However, the test that's checking for "lookup started == initial
>> lookups wanted" is testing for three separate registrars on the default
>> port.  Obviously that's an impossible proposition.  My working theory is
>> that the test is supposed to be configured to start Reggie on three
>> separate unicast ports (e.e. 7700, 7701, 7702), and the assertion is
>> checking to see if that happened correctly.  This theory is supported by
>> the existence of
>> "qa/src/com/sun/jini/test/share/reggie3_2Ports.properties", which
>> includes properties called 
>> "net.jini.core.lookup.ServiceRegistrar.port.0=7700",
>> "net.jini.core.lookup.ServiceRegistrar.port.1=7701",
>> "net.jini.core.lookup.ServiceRegistrar.port.2=7702", etc. 
>> Hence my suspicion that I don't understand the configuration mechanism
>> well enough yet.  My theory is that the test is supposed to start up
>> Reggie's on three different ports, but there is something about my test
>> environment setup that causes the configuration to be missed.  Anpother
>> possibility is that the actual test configuration got messed up at some
>> point (I did see a comment in one of the properties files about "these
>> entries are commented out because they are now included in the test
>> descriptions".
>>
>> Testing that theory is going to take me a little while to understand the
>> multi-layer test configuration mechanism.  Unfortunately I've got a full
>> morning of other work ahead of me.
>>
>> Anyhow, I don't really think changing the tests is the correct approach
>> at this point (at least in my troubleshooting process).
>>
>> Cheers,
>>
>> Greg.
>>
>>
>>
>> On Tue, 2013-04-09 at 07:05, peter_firmstone@apache.org wrote:
>>  
>>> Author: peter_firmstone
>>> Date: Tue Apr  9 11:05:32 2013
>>> New Revision: 1465969
>>>
>>> URL: http://svn.apache.org/r1465969
>>> Log:
>>> Be more thorough when checking for port availability on localhost 
>>> during testing, don't assume port free if current test isn't using it.
>>>
>>> Modified:
>>>     
>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>>
>>>
>>> Modified: 
>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>>
>>> URL: 
>>> http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java?rev=1465969&r1=1465968&r2=1465969&view=diff 
>>>
>>> ============================================================================== 
>>>
>>> --- 
>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>> (original)
>>> +++ 
>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>> Tue Apr  9 11:05:32 2013
>>> @@ -27,7 +27,15 @@ import com.sun.jini.test.spec.discoverys
>>>  import com.sun.jini.test.spec.discoveryservice.AbstractBaseTest;
>>>  import 
>>> com.sun.jini.test.spec.discoveryservice.AbstractBaseTest.DiscoveryStruct; 
>>>
>>>  import 
>>> com.sun.jini.test.spec.discoveryservice.AbstractBaseTest.RegGroupsPair;
>>> +import java.io.IOException;
>>>  import java.net.InetAddress;
>>> +import java.net.InetSocketAddress;
>>> +import java.net.ServerSocket;
>>> +import java.net.Socket;
>>> +import java.net.SocketAddress;
>>> +import java.net.SocketException;
>>> +import java.net.SocketOptions;
>>> +import java.net.SocketTimeoutException;
>>>  import java.net.UnknownHostException;
>>>  import java.rmi.RemoteException;
>>>  import java.util.ArrayList;
>>> @@ -660,12 +668,47 @@ public class LookupServices {
>>>       * @return true if port in use.       */
>>>      private boolean portInUse(int port) {
>>> +        if (port == 0) return false; // Ephemeral
>>>          for(int i=0;i<lookupsStarted.size();i++) {
>>>              LocatorGroupsPair pair = lookupsStarted.get(i);
>>>              int curPort = (pair.getLocator()).getPort();
>>>              if(port == curPort) return true;
>>>          }//end loop
>>> -        return false;
>>> +        // Open a client ephemeral socket and attempt to connect to
>>> +        // port on localhost to see if someone's listening.
>>> +        Socket sock = null;
>>> +        try {
>>> +            sock = new Socket();
>>> +            if (sock instanceof SocketOptions){
>>> +                // Socket terminates with a RST rather than a FIN, 
>>> so there's no TIME_WAIT
>>> +                try {
>>> +                    ((SocketOptions) 
>>> sock).setOption(SocketOptions.SO_LINGER, Integer.valueOf(0));
>>> +                } catch (SocketException se) {
>>> +                    // Ignore, not supported.
>>> +                    logger.log( Level.FINEST, "SocketOptions set 
>>> SO_LINGER threw an Exception", se);
>>> +                }
>>> +            }
>>> +            SocketAddress add = new InetSocketAddress(port);
>>> +            sock.connect(add, 3000); // Try to connect for up to 
>>> three seconds +            // We were able to connect to a socket 
>>> listening on localhost
>>> +            return true;
>>> +        } catch (SocketTimeoutException e){
>>> +            // There might be a stale process assume in use.
>>> +            logger.log( Level.FINEST, "Socket timed out while 
>>> trying to connect", e);
>>> +            return true;
>>> +        } catch (IOException e){
>>> +            // There was nothing listening on the socket so it's 
>>> probably free.
>>> +            // or it timed out.
>>> +            return false;
>>> +        } finally {
>>> +            if (sock != null){
>>> +                try {
>>> +                    sock.close();
>>> +                } catch (IOException ex){
>>> +                    logger.log( Level.FINEST, "Socket threw 
>>> exception while attempting to close", ex);
>>> +                }// Ignore
>>> +            }
>>> +        }
>>>      }//end portInUse
>>>           private void refreshLookupLocatorListsAt(int index){
>>>
>>>     
>>
>>
>>   
>


Re: Tests failing due to port in use

Posted by Peter Firmstone <ji...@zeus.net.au>.
Not wanting to sound picky, but just so we don't create confusion, we 
are currently discussing work on the integration tests, Greg's running 
these from the 2.2 branch, in preparation for release.  The regression 
tests are not run from the qa suite, they are run separately by jtreg, 
each regression test is associated with a Sun bug number, each checks 
whether that specific bug is present.

To run the regression tests, the ant target is:

$ ant jtreg

, from the trunk/qa build directory.

Yes, I'm guilty of fixing thread visibility issues in the library 
associated with jtreg tests too.

Regards,

Peter.

Peter Firmstone wrote:
> This was being discussed in two separate threads:
>
> Greg,
>
> Are you able to find which process is using a port?  Perhaps jvisualvm 
> might help determine which test it is?  If you can narrow it down to a 
> bunch of test definitions sharing a test class, by identifying the 
> class on the stack, you'll find it.
>
> Sim wrote:
>> On 09-04-13 11:44, Peter Firmstone wrote:
>>> I've never experienced the issue locally (I see it on Jenkins quite a
>>> lot), but I suspect a stale registrar process left from another test 
>>> may
>>> be stopping the socket from closing.  Not that registrars are also
>>> simulated for discovery tests, so it may not necessarily be Reggie.
>>>
>>> The code is duplicated in two places in superclasses of the tests that
>>> are failing,  the method portInUse(int port) is supposed to check if 
>>> the
>>> ports available, but only selects from a list of LookupLocator's known
>>> to have started, so doesn't actually check the port's available.
>>>
>>> Perhaps you can find the stale test process responsible?
>>
>> And other jobs get executed on the same machine at the same time. 
>> Maybe these consume ports?
>>
>> The best way to find the problem looks to me like doing a 'lsof -i' 
>> in the test as soons as a port turns out used unexpectedly.
>>
>> Gr. Simon
>>
>> -- 
>> QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
>> Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397
>
> Peter wrote:
>> It's a problem I've been aware of for some time, many tests don't 
>> specify a port, they just want three or four or however many 
>> registrar's started and they have to specify it using a 
>> LookupLocator, unfortunately you can't specify an ephemeral 
>> LookupLocator, it's not allowed, so if you don't specify a particular 
>> port, you get 4160 and the portInUse method is supposed to work 
>> around that.
>>
>> When a test finds a port in use, it updates the registrar wanted to 
>> an available port.  If it thinks 4160 is available, when it isn't, 
>> boom, test failed.
>>
>> Regards,
>>
>> Peter.
>>
>> Greg Trasuk wrote:
>>> I hope you're not adding this to try to help my configuration issue,
>>> because I'm pretty sure that's not the problem, and I hate to screw
>>> around with the regression tests unneccessarily.
>>>
>>> Evidence-wise, the "lookup started != initial lookups wanted" issue
>>> looks like this:
>>> When I enable logging at the FINE level, I see three registrars 
>>> started,
>>> with unicast discovery on random ports - this is Reggie's fallback
>>> position when it attempts to open its unicast port and it's already in
>>> use.  However, the test that's checking for "lookup started == initial
>>> lookups wanted" is testing for three separate registrars on the default
>>> port.  Obviously that's an impossible proposition.  My working 
>>> theory is
>>> that the test is supposed to be configured to start Reggie on three
>>> separate unicast ports (e.e. 7700, 7701, 7702), and the assertion is
>>> checking to see if that happened correctly.  This theory is 
>>> supported by
>>> the existence of
>>> "qa/src/com/sun/jini/test/share/reggie3_2Ports.properties", which
>>> includes properties called 
>>> "net.jini.core.lookup.ServiceRegistrar.port.0=7700",
>>> "net.jini.core.lookup.ServiceRegistrar.port.1=7701",
>>> "net.jini.core.lookup.ServiceRegistrar.port.2=7702", etc. Hence my 
>>> suspicion that I don't understand the configuration mechanism
>>> well enough yet.  My theory is that the test is supposed to start up
>>> Reggie's on three different ports, but there is something about my test
>>> environment setup that causes the configuration to be missed.  Anpother
>>> possibility is that the actual test configuration got messed up at some
>>> point (I did see a comment in one of the properties files about "these
>>> entries are commented out because they are now included in the test
>>> descriptions".
>>>
>>> Testing that theory is going to take me a little while to understand 
>>> the
>>> multi-layer test configuration mechanism.  Unfortunately I've got a 
>>> full
>>> morning of other work ahead of me.
>>>
>>> Anyhow, I don't really think changing the tests is the correct approach
>>> at this point (at least in my troubleshooting process).
>>>
>>> Cheers,
>>>
>>> Greg.
>>>
>>>
>>>
>>> On Tue, 2013-04-09 at 07:05, peter_firmstone@apache.org wrote:
>>>  
>>>> Author: peter_firmstone
>>>> Date: Tue Apr  9 11:05:32 2013
>>>> New Revision: 1465969
>>>>
>>>> URL: http://svn.apache.org/r1465969
>>>> Log:
>>>> Be more thorough when checking for port availability on localhost 
>>>> during testing, don't assume port free if current test isn't using it.
>>>>
>>>> Modified:
>>>>     
>>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>>>
>>>>
>>>> Modified: 
>>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>>>
>>>> URL: 
>>>> http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java?rev=1465969&r1=1465968&r2=1465969&view=diff 
>>>>
>>>> ============================================================================== 
>>>>
>>>> --- 
>>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>>> (original)
>>>> +++ 
>>>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
>>>> Tue Apr  9 11:05:32 2013
>>>> @@ -27,7 +27,15 @@ import com.sun.jini.test.spec.discoverys
>>>>  import com.sun.jini.test.spec.discoveryservice.AbstractBaseTest;
>>>>  import 
>>>> com.sun.jini.test.spec.discoveryservice.AbstractBaseTest.DiscoveryStruct; 
>>>>
>>>>  import 
>>>> com.sun.jini.test.spec.discoveryservice.AbstractBaseTest.RegGroupsPair; 
>>>>
>>>> +import java.io.IOException;
>>>>  import java.net.InetAddress;
>>>> +import java.net.InetSocketAddress;
>>>> +import java.net.ServerSocket;
>>>> +import java.net.Socket;
>>>> +import java.net.SocketAddress;
>>>> +import java.net.SocketException;
>>>> +import java.net.SocketOptions;
>>>> +import java.net.SocketTimeoutException;
>>>>  import java.net.UnknownHostException;
>>>>  import java.rmi.RemoteException;
>>>>  import java.util.ArrayList;
>>>> @@ -660,12 +668,47 @@ public class LookupServices {
>>>>       * @return true if port in use.       */
>>>>      private boolean portInUse(int port) {
>>>> +        if (port == 0) return false; // Ephemeral
>>>>          for(int i=0;i<lookupsStarted.size();i++) {
>>>>              LocatorGroupsPair pair = lookupsStarted.get(i);
>>>>              int curPort = (pair.getLocator()).getPort();
>>>>              if(port == curPort) return true;
>>>>          }//end loop
>>>> -        return false;
>>>> +        // Open a client ephemeral socket and attempt to connect to
>>>> +        // port on localhost to see if someone's listening.
>>>> +        Socket sock = null;
>>>> +        try {
>>>> +            sock = new Socket();
>>>> +            if (sock instanceof SocketOptions){
>>>> +                // Socket terminates with a RST rather than a FIN, 
>>>> so there's no TIME_WAIT
>>>> +                try {
>>>> +                    ((SocketOptions) 
>>>> sock).setOption(SocketOptions.SO_LINGER, Integer.valueOf(0));
>>>> +                } catch (SocketException se) {
>>>> +                    // Ignore, not supported.
>>>> +                    logger.log( Level.FINEST, "SocketOptions set 
>>>> SO_LINGER threw an Exception", se);
>>>> +                }
>>>> +            }
>>>> +            SocketAddress add = new InetSocketAddress(port);
>>>> +            sock.connect(add, 3000); // Try to connect for up to 
>>>> three seconds +            // We were able to connect to a socket 
>>>> listening on localhost
>>>> +            return true;
>>>> +        } catch (SocketTimeoutException e){
>>>> +            // There might be a stale process assume in use.
>>>> +            logger.log( Level.FINEST, "Socket timed out while 
>>>> trying to connect", e);
>>>> +            return true;
>>>> +        } catch (IOException e){
>>>> +            // There was nothing listening on the socket so it's 
>>>> probably free.
>>>> +            // or it timed out.
>>>> +            return false;
>>>> +        } finally {
>>>> +            if (sock != null){
>>>> +                try {
>>>> +                    sock.close();
>>>> +                } catch (IOException ex){
>>>> +                    logger.log( Level.FINEST, "Socket threw 
>>>> exception while attempting to close", ex);
>>>> +                }// Ignore
>>>> +            }
>>>> +        }
>>>>      }//end portInUse
>>>>           private void refreshLookupLocatorListsAt(int index){
>>>>
>>>>     
>>>
>>>
>>>   
>>
>
>


Re: Tests failing due to port in use

Posted by md...@martinwills.com.
> On 10/04/2013 5:19 PM, Simon IJskes - QCG wrote:
>> From the build list:
>>
>>> Hi,
>>>
>>> The Apache Aries build removed ubuntu2 from its list of possible
>>> machines
>>> several months ago because its ulimit was set to 1024 while the other
>>> ubuntu machines were set to 40000. Would you be able to make it match
>>> the
>>> other machines as part of the rebuild please?
>>>
>>> John
>>
>> I havent checked, but i believe they are talking about: RLIMIT_NOFILE
>> might that also be relevant to failing tests? It limits the number of
>> sockets open. When opening a socket when the filedescriptors are
>> exhausted, all platforms provide exceptions indicating so, would they?
>>
>> Gr. Simon
>>
>
> Hmm, good suggestion, I honestly hadn't considered that.
>
> Presently it appears as though port 4160 is unavailable on one of the
> Arm nodes (arm-16 from memory), changing portInUse to take into account
> whether ports were in use by programs other than the test suite enabled
> another 98 tests to pass, however there are still 182 failing due to
> 4160 being used by another process.  I'm considering making a script
> that checks available ports against those required and change the
> default configured ports for the tests if they're already in use.
>
> I wonder if there's a way to detect RLIMIT_NOFILE from java?
>

Open the unix file "/proc/<pid>/limits" and parse the "Max open files" line:
Limit                     Soft Limit           Hard Limit         Units
Max open files            1024                 4096                 files

works on Linux ??????? 3.2.0-40-generic #64-Ubuntu SMP Mon Mar 25 21:22:10
UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Regards
Martin


> The problems on Jenkins appear unreleated to Greg's test failures, the
> number of failed tests seems to be coincidental.
>
> Regards,
>
> Peter.
>



Re: Tests failing due to port in use

Posted by Peter Firmstone <ji...@zeus.net.au>.
On 10/04/2013 9:16 PM, Simon IJskes - QCG wrote:
> On 10-04-13 13:09, Peter Firmstone wrote:
>>> Where shall i put it? It's a single class.
>>>
>>>
>>
>> The qa suite package com.sun.jini.test.share looks like a good 
>> candidate.
>
> http://svn.apache.org/viewvc/river/jtsk/skunk/sijskes/SimpleJeri/Heapdumper.java?view=log&pathrev=1466426 
>
>
> I've no time to integrate it. Sorry.
>

That's cool, thanks for helping out.

Cheers,

Peter.

Re: Tests failing due to port in use

Posted by Simon IJskes - QCG <si...@qcg.nl>.
On 10-04-13 13:09, Peter Firmstone wrote:
>> Where shall i put it? It's a single class.
>>
>>
>
> The qa suite package com.sun.jini.test.share looks like a good candidate.

http://svn.apache.org/viewvc/river/jtsk/skunk/sijskes/SimpleJeri/Heapdumper.java?view=log&pathrev=1466426

I've no time to integrate it. Sorry.

-- 
QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397

Re: Tests failing due to port in use

Posted by Peter Firmstone <ji...@zeus.net.au>.
On 10/04/2013 9:04 PM, Simon IJskes - QCG wrote:
> On 10-04-13 12:56, Peter Firmstone wrote:
>> On 10/04/2013 7:46 PM, Simon IJskes - QCG wrote:
>>> On 10-04-13 09:44, Peter Firmstone wrote:
>>>
>>>> I wonder if there's a way to detect RLIMIT_NOFILE from java?
>>>
>>> Other then spawning a shell with 'ulimit -n' i can't tell.
>>>
>>> In order to find the open port and do a 'lsof -i :4160' we need root
>>> access to see other userids, if whe don't have it, the result is not
>>> all-inclusive, but i suspect any jenkins build is the most likely
>>> suspect, and these i think would all run under the same userid.
>>>
>>> I also have code to create a heapdump programmatically, so if the
>>> exception occurs whe can inspect the dump later.
>>>
>>
>> That might also be useful for tests failing only occassionally,
>> synchronization issues are suspected, so far it's proven quite difficult
>> to determine causes and I've had to resort to visual code inspection.
>
> Where shall i put it? It's a single class.
>
>

The qa suite package com.sun.jini.test.share looks like a good candidate.

Re: Tests failing due to port in use

Posted by Simon IJskes - QCG <si...@qcg.nl>.
On 10-04-13 12:56, Peter Firmstone wrote:
> On 10/04/2013 7:46 PM, Simon IJskes - QCG wrote:
>> On 10-04-13 09:44, Peter Firmstone wrote:
>>
>>> I wonder if there's a way to detect RLIMIT_NOFILE from java?
>>
>> Other then spawning a shell with 'ulimit -n' i can't tell.
>>
>> In order to find the open port and do a 'lsof -i :4160' we need root
>> access to see other userids, if whe don't have it, the result is not
>> all-inclusive, but i suspect any jenkins build is the most likely
>> suspect, and these i think would all run under the same userid.
>>
>> I also have code to create a heapdump programmatically, so if the
>> exception occurs whe can inspect the dump later.
>>
>
> That might also be useful for tests failing only occassionally,
> synchronization issues are suspected, so far it's proven quite difficult
> to determine causes and I've had to resort to visual code inspection.

Where shall i put it? It's a single class.


-- 
QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397

Re: Tests failing due to port in use

Posted by Peter Firmstone <ji...@zeus.net.au>.
On 10/04/2013 7:46 PM, Simon IJskes - QCG wrote:
> On 10-04-13 09:44, Peter Firmstone wrote:
>
>> I wonder if there's a way to detect RLIMIT_NOFILE from java?
>
> Other then spawning a shell with 'ulimit -n' i can't tell.
>
> In order to find the open port and do a 'lsof -i :4160' we need root 
> access to see other userids, if whe don't have it, the result is not 
> all-inclusive, but i suspect any jenkins build is the most likely 
> suspect, and these i think would all run under the same userid.
>
> I also have code to create a heapdump programmatically, so if the 
> exception occurs whe can inspect the dump later.
>

That might also be useful for tests failing only occassionally, 
synchronization issues are suspected, so far it's proven quite difficult 
to determine causes and I've had to resort to visual code inspection.



Re: Tests failing due to port in use

Posted by Simon IJskes - QCG <si...@qcg.nl>.
On 10-04-13 09:44, Peter Firmstone wrote:

> I wonder if there's a way to detect RLIMIT_NOFILE from java?

Other then spawning a shell with 'ulimit -n' i can't tell.

In order to find the open port and do a 'lsof -i :4160' we need root 
access to see other userids, if whe don't have it, the result is not 
all-inclusive, but i suspect any jenkins build is the most likely 
suspect, and these i think would all run under the same userid.

I also have code to create a heapdump programmatically, so if the 
exception occurs whe can inspect the dump later.

-- 
QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397

Re: Tests failing due to port in use

Posted by Peter Firmstone <ji...@zeus.net.au>.
On 10/04/2013 5:19 PM, Simon IJskes - QCG wrote:
> From the build list:
>
>> Hi,
>>
>> The Apache Aries build removed ubuntu2 from its list of possible 
>> machines
>> several months ago because its ulimit was set to 1024 while the other
>> ubuntu machines were set to 40000. Would you be able to make it match 
>> the
>> other machines as part of the rebuild please?
>>
>> John
>
> I havent checked, but i believe they are talking about: RLIMIT_NOFILE 
> might that also be relevant to failing tests? It limits the number of 
> sockets open. When opening a socket when the filedescriptors are 
> exhausted, all platforms provide exceptions indicating so, would they?
>
> Gr. Simon
>

Hmm, good suggestion, I honestly hadn't considered that.

Presently it appears as though port 4160 is unavailable on one of the 
Arm nodes (arm-16 from memory), changing portInUse to take into account 
whether ports were in use by programs other than the test suite enabled 
another 98 tests to pass, however there are still 182 failing due to 
4160 being used by another process.  I'm considering making a script 
that checks available ports against those required and change the 
default configured ports for the tests if they're already in use.

I wonder if there's a way to detect RLIMIT_NOFILE from java?

The problems on Jenkins appear unreleated to Greg's test failures, the 
number of failed tests seems to be coincidental.

Regards,

Peter.

Re: Tests failing due to port in use

Posted by Simon IJskes - QCG <si...@qcg.nl>.
 From the build list:

> Hi,
>
> The Apache Aries build removed ubuntu2 from its list of possible machines
> several months ago because its ulimit was set to 1024 while the other
> ubuntu machines were set to 40000. Would you be able to make it match the
> other machines as part of the rebuild please?
>
> John

I havent checked, but i believe they are talking about: RLIMIT_NOFILE 
might that also be relevant to failing tests? It limits the number of 
sockets open. When opening a socket when the filedescriptors are 
exhausted, all platforms provide exceptions indicating so, would they?

Gr. Simon

-- 
QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397

Re: Tests failing due to port in use

Posted by Greg Trasuk <tr...@stratuscom.com>.
I'm testing on JDK7 on a mac right now.  When I see a good test run
(started one last night.  Looks promising at the moment) I was going to
fire up tests on JDK6/Solaris 10, JDK7/win7x64 and JDK7/winxp32.

Given that the Levels fix is for a JDK7 issue, it's pretty critical that
we test on JDK7.  Even having said that, I don't know that the Levels
issue is immediately apparent - we should probably write a test that
demonstrates the issue and fix.

Is Dennis there?  Do you have some code that shows the Levels problem? 
If so I'll see about writing a jtreg test for it.

Cheers,

Greg.

On Wed, 2013-04-10 at 07:12, Peter Firmstone wrote:
> Greg,
> 
> Are you testing on Java 7?
> 
> The last release was only tested with Java 5 and 6.
> 
> Can you test with Java 6?
> 
> Regards,
> 
> Peter.
> 
> 
> On 10/04/2013 6:32 AM, Greg Trasuk wrote:
> > On Tue, 2013-04-09 at 15:29, Peter Firmstone wrote:
> >> This was being discussed in two separate threads:
> >>
> >> Greg,
> >>
> >> Are you able to find which process is using a port?  Perhaps jvisualvm
> >> might help determine which test it is?  If you can narrow it down to a
> >> bunch of test definitions sharing a test class, by identifying the class
> >> on the stack, you'll find it.
> >>
> > No, there's no process using the port.  The evidence points to a
> > configuration error.  The problem is not that the registrars don't
> > start.  It's that they start on random ports when the port is supposed
> > to be specified.


Re: Tests failing due to port in use

Posted by Peter Firmstone <ji...@zeus.net.au>.
Greg,

Are you testing on Java 7?

The last release was only tested with Java 5 and 6.

Can you test with Java 6?

Regards,

Peter.


On 10/04/2013 6:32 AM, Greg Trasuk wrote:
> On Tue, 2013-04-09 at 15:29, Peter Firmstone wrote:
>> This was being discussed in two separate threads:
>>
>> Greg,
>>
>> Are you able to find which process is using a port?  Perhaps jvisualvm
>> might help determine which test it is?  If you can narrow it down to a
>> bunch of test definitions sharing a test class, by identifying the class
>> on the stack, you'll find it.
>>
> No, there's no process using the port.  The evidence points to a
> configuration error.  The problem is not that the registrars don't
> start.  It's that they start on random ports when the port is supposed
> to be specified.


Re: Tests failing due to port in use

Posted by Greg Trasuk <tr...@stratuscom.com>.
On Tue, 2013-04-09 at 15:29, Peter Firmstone wrote:
> This was being discussed in two separate threads:
> 
> Greg,
> 
> Are you able to find which process is using a port?  Perhaps jvisualvm 
> might help determine which test it is?  If you can narrow it down to a 
> bunch of test definitions sharing a test class, by identifying the class 
> on the stack, you'll find it.
> 
No, there's no process using the port.  The evidence points to a
configuration error.  The problem is not that the registrars don't
start.  It's that they start on random ports when the port is supposed
to be specified.

If you don't believe that, perhaps you could explain what the properties
file "qa/src/com/sun/jini/test/share/reggie3_2Ports.properties", which
> >> includes properties called 
> >> "net.jini.core.lookup.ServiceRegistrar.port.0=7700",
> >> "net.jini.core.lookup.ServiceRegistrar.port.1=7701",
> >> "net.jini.core.lookup.ServiceRegistrar.port.2=7702", etc. 
and others that are similar, are supposed to be doing?

Randomly changing code is not a troubleshooting technique.  

That probably sounds a little snarky.  Sorry, it's been a long day. 
What I mean is that based on the evidence I'm seeing, I'm pretty sure
that there is a configuration error, that might be local, or might
actually be an error in the config files in the integration tests.  I'm
not going to patch the integration tests in the '2.2' branch without
knowing for sure what's going on.

Relevant part of the trace output for a failed test is below.  Note that
the test is expecting ConstrainableLookupLocator objects, which do in
fact have a constructor that takes a port parameter.

Is it possible that this issue is also present in the Jenkins builds?  

Having asked that, it is certainly a lot more likely (given that the
Jenkins build is on a shared machine) that some other build on the
machine is trying to use one of the ports that the integration tests
want to use for the class server.  I suspect that would cause failures.

Cheers,

Greg.

BaseQATest.startInitLookups FINE:  initial lookups started --
BaseQATest.displayLookupStartInfo FINE:    # of lookups = 3
BaseQATest.displayLookupStartInfo FINE:      locator lookup[0] = ConstrainableLookupLocator[[jini://192.168.1.112:61219/], [null]]
GroupsUtil.displayGroupSet FINE:        group[0] = LLDGroup0_A_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[1] = LLDGroup0_B_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[2] = LLDGroup0_C_Gregs-MacBook-Pro-3.local_1365366343456
BaseQATest.displayLookupStartInfo FINE:      locator lookup[1] = ConstrainableLookupLocator[[jini://192.168.1.112:61228/], [null]]
GroupsUtil.displayGroupSet FINE:        group[0] = LLDGroup1_A_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[1] = LLDGroup1_B_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[2] = LLDGroup1_C_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[3] = LLDGroup1_D_Gregs-MacBook-Pro-3.local_1365366343456
BaseQATest.displayLookupStartInfo FINE:      locator lookup[2] = ConstrainableLookupLocator[[jini://192.168.1.112:61231/], [null]]
GroupsUtil.displayGroupSet FINE:        group[0] = LLDGroup2_A_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[1] = LLDGroup2_B_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[2] = LLDGroup2_C_Gregs-MacBook-Pro-3.local_1365366343456
BaseQATest.startInitLookups FINE:  initial lookups wanted --
BaseQATest.displayLookupStartInfo FINE:    # of lookups = 3
BaseQATest.displayLookupStartInfo FINE:      locator lookup[0] = ConstrainableLookupLocator[[jini://192.168.1.112/], [null]]
GroupsUtil.displayGroupSet FINE:        group[0] = LLDGroup0_A_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[1] = LLDGroup0_B_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[2] = LLDGroup0_C_Gregs-MacBook-Pro-3.local_1365366343456
BaseQATest.displayLookupStartInfo FINE:      locator lookup[1] = ConstrainableLookupLocator[[jini://192.168.1.112/], [null]]
GroupsUtil.displayGroupSet FINE:        group[0] = LLDGroup1_A_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[1] = LLDGroup1_B_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[2] = LLDGroup1_C_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[3] = LLDGroup1_D_Gregs-MacBook-Pro-3.local_1365366343456
BaseQATest.displayLookupStartInfo FINE:      locator lookup[2] = ConstrainableLookupLocator[[jini://192.168.1.112/], [null]]
GroupsUtil.displayGroupSet FINE:        group[0] = LLDGroup2_A_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[1] = LLDGroup2_B_Gregs-MacBook-Pro-3.local_1365366343456
GroupsUtil.displayGroupSet FINE:        group[2] = LLDGroup2_C_Gregs-MacBook-Pro-3.local_1365366343456
BaseQATest.tearDown FINE:  tearDown - terminating lookup service(s)

> Sim wrote:
> > On 09-04-13 11:44, Peter Firmstone wrote:
> >> I've never experienced the issue locally (I see it on Jenkins quite a
> >> lot), but I suspect a stale registrar process left from another test may
> >> be stopping the socket from closing.  Not that registrars are also
> >> simulated for discovery tests, so it may not necessarily be Reggie.
> >>
> >> The code is duplicated in two places in superclasses of the tests that
> >> are failing,  the method portInUse(int port) is supposed to check if the
> >> ports available, but only selects from a list of LookupLocator's known
> >> to have started, so doesn't actually check the port's available.
> >>
> >> Perhaps you can find the stale test process responsible?
> >
> > And other jobs get executed on the same machine at the same time. 
> > Maybe these consume ports?
> >
> > The best way to find the problem looks to me like doing a 'lsof -i' in 
> > the test as soons as a port turns out used unexpectedly.
> >
> > Gr. Simon
> >
> > -- 
> > QCG, Software voor het MKB, 071-5890970, http://www.qcg.nl
> > Quality Consultancy Group b.v., Leiderdorp, Kvk Den Haag: 28088397
> 
> Peter wrote:
> > It's a problem I've been aware of for some time, many tests don't 
> > specify a port, they just want three or four or however many 
> > registrar's started and they have to specify it using a LookupLocator, 
> > unfortunately you can't specify an ephemeral LookupLocator, it's not 
> > allowed, so if you don't specify a particular port, you get 4160 and 
> > the portInUse method is supposed to work around that.
> >
> > When a test finds a port in use, it updates the registrar wanted to an 
> > available port.  If it thinks 4160 is available, when it isn't, boom, 
> > test failed.
> >
> > Regards,
> >
> > Peter.
> >
> > Greg Trasuk wrote:
> >> I hope you're not adding this to try to help my configuration issue,
> >> because I'm pretty sure that's not the problem, and I hate to screw
> >> around with the regression tests unneccessarily.
> >>
> >> Evidence-wise, the "lookup started != initial lookups wanted" issue
> >> looks like this:
> >> When I enable logging at the FINE level, I see three registrars started,
> >> with unicast discovery on random ports - this is Reggie's fallback
> >> position when it attempts to open its unicast port and it's already in
> >> use.  However, the test that's checking for "lookup started == initial
> >> lookups wanted" is testing for three separate registrars on the default
> >> port.  Obviously that's an impossible proposition.  My working theory is
> >> that the test is supposed to be configured to start Reggie on three
> >> separate unicast ports (e.e. 7700, 7701, 7702), and the assertion is
> >> checking to see if that happened correctly.  This theory is supported by
> >> the existence of
> >> "qa/src/com/sun/jini/test/share/reggie3_2Ports.properties", which
> >> includes properties called 
> >> "net.jini.core.lookup.ServiceRegistrar.port.0=7700",
> >> "net.jini.core.lookup.ServiceRegistrar.port.1=7701",
> >> "net.jini.core.lookup.ServiceRegistrar.port.2=7702", etc. 
> >> Hence my suspicion that I don't understand the configuration mechanism
> >> well enough yet.  My theory is that the test is supposed to start up
> >> Reggie's on three different ports, but there is something about my test
> >> environment setup that causes the configuration to be missed.  Anpother
> >> possibility is that the actual test configuration got messed up at some
> >> point (I did see a comment in one of the properties files about "these
> >> entries are commented out because they are now included in the test
> >> descriptions".
> >>
> >> Testing that theory is going to take me a little while to understand the
> >> multi-layer test configuration mechanism.  Unfortunately I've got a full
> >> morning of other work ahead of me.
> >>
> >> Anyhow, I don't really think changing the tests is the correct approach
> >> at this point (at least in my troubleshooting process).
> >>
> >> Cheers,
> >>
> >> Greg.
> >>
> >>
> >>
> >> On Tue, 2013-04-09 at 07:05, peter_firmstone@apache.org wrote:
> >>  
> >>> Author: peter_firmstone
> >>> Date: Tue Apr  9 11:05:32 2013
> >>> New Revision: 1465969
> >>>
> >>> URL: http://svn.apache.org/r1465969
> >>> Log:
> >>> Be more thorough when checking for port availability on localhost 
> >>> during testing, don't assume port free if current test isn't using it.
> >>>
> >>> Modified:
> >>>     
> >>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
> >>>
> >>>
> >>> Modified: 
> >>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
> >>>
> >>> URL: 
> >>> http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java?rev=1465969&r1=1465968&r2=1465969&view=diff
> >>>
> >>> ============================================================================== 
> >>>
> >>> --- 
> >>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
> >>> (original)
> >>> +++ 
> >>> river/jtsk/skunk/qa_refactor/trunk/qa/src/com/sun/jini/test/share/LookupServices.java 
> >>> Tue Apr  9 11:05:32 2013
> >>> @@ -27,7 +27,15 @@ import com.sun.jini.test.spec.discoverys
> >>>  import com.sun.jini.test.spec.discoveryservice.AbstractBaseTest;
> >>>  import 
> >>> com.sun.jini.test.spec.discoveryservice.AbstractBaseTest.DiscoveryStruct; 
> >>>
> >>>  import 
> >>> com.sun.jini.test.spec.discoveryservice.AbstractBaseTest.RegGroupsPair;
> >>> +import java.io.IOException;
> >>>  import java.net.InetAddress;
> >>> +import java.net.InetSocketAddress;
> >>> +import java.net.ServerSocket;
> >>> +import java.net.Socket;
> >>> +import java.net.SocketAddress;
> >>> +import java.net.SocketException;
> >>> +import java.net.SocketOptions;
> >>> +import java.net.SocketTimeoutException;
> >>>  import java.net.UnknownHostException;
> >>>  import java.rmi.RemoteException;
> >>>  import java.util.ArrayList;
> >>> @@ -660,12 +668,47 @@ public class LookupServices {
> >>>       * @return true if port in use.       */
> >>>      private boolean portInUse(int port) {
> >>> +        if (port == 0) return false; // Ephemeral
> >>>          for(int i=0;i<lookupsStarted.size();i++) {
> >>>              LocatorGroupsPair pair = lookupsStarted.get(i);
> >>>              int curPort = (pair.getLocator()).getPort();
> >>>              if(port == curPort) return true;
> >>>          }//end loop
> >>> -        return false;
> >>> +        // Open a client ephemeral socket and attempt to connect to
> >>> +        // port on localhost to see if someone's listening.
> >>> +        Socket sock = null;
> >>> +        try {
> >>> +            sock = new Socket();
> >>> +            if (sock instanceof SocketOptions){
> >>> +                // Socket terminates with a RST rather than a FIN, 
> >>> so there's no TIME_WAIT
> >>> +                try {
> >>> +                    ((SocketOptions) 
> >>> sock).setOption(SocketOptions.SO_LINGER, Integer.valueOf(0));
> >>> +                } catch (SocketException se) {
> >>> +                    // Ignore, not supported.
> >>> +                    logger.log( Level.FINEST, "SocketOptions set 
> >>> SO_LINGER threw an Exception", se);
> >>> +                }
> >>> +            }
> >>> +            SocketAddress add = new InetSocketAddress(port);
> >>> +            sock.connect(add, 3000); // Try to connect for up to 
> >>> three seconds +            // We were able to connect to a socket 
> >>> listening on localhost
> >>> +            return true;
> >>> +        } catch (SocketTimeoutException e){
> >>> +            // There might be a stale process assume in use.
> >>> +            logger.log( Level.FINEST, "Socket timed out while 
> >>> trying to connect", e);
> >>> +            return true;
> >>> +        } catch (IOException e){
> >>> +            // There was nothing listening on the socket so it's 
> >>> probably free.
> >>> +            // or it timed out.
> >>> +            return false;
> >>> +        } finally {
> >>> +            if (sock != null){
> >>> +                try {
> >>> +                    sock.close();
> >>> +                } catch (IOException ex){
> >>> +                    logger.log( Level.FINEST, "Socket threw 
> >>> exception while attempting to close", ex);
> >>> +                }// Ignore
> >>> +            }
> >>> +        }
> >>>      }//end portInUse
> >>>           private void refreshLookupLocatorListsAt(int index){
> >>>
> >>>     
> >>
> >>
> >>   
> >