Posted to dev@river.apache.org by Patricia Shanahan <pa...@acm.org> on 2010/10/15 19:51:06 UTC

Bug fixing

I seem to have reached a level of understanding of River that lets me 
track down some bugs, and bug fixing is an obviously useful activity, so 
I plan to spend some time on it.

As I mentioned in the "Request for testing help" thread, I've 
investigated the GetStateTest hang. The test is spinning waiting for the 
TransactionManager's getState method to throw an exception because it 
has discarded the aborted transaction. As far as I can tell, there is no 
requirement that a TransactionManager discard a transaction, even when 
it is permitted to do so.

I plan to file a Jira for the test, and modify it to spin for a limited 
time. Treat either UnknownTransactionException or continuous return of 
ABORTED status for e.g. one minute as successful test completion.
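The bounded spin could look something like this sketch (the constant value and helper names are illustrative, not the test's actual code; a real version would call the TransactionManager's getState and catch UnknownTransactionException):

```java
import java.util.function.IntSupplier;

public class BoundedAbortCheck {
    // Illustrative constant; the real test would use
    // net.jini.core.transaction.server.TransactionConstants.ABORTED.
    static final int ABORTED = 6;

    /**
     * Poll the transaction's state for up to timeoutMillis. Success is either
     * the manager discarding the transaction (modeled here as a negative
     * status, standing in for UnknownTransactionException) or continuous
     * ABORTED status for the whole window.
     */
    static boolean abortSettled(IntSupplier getState, long timeoutMillis,
                                long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            int state = getState.getAsInt();
            if (state < 0) {
                return true;   // transaction discarded: success
            }
            if (state != ABORTED) {
                return false;  // any other state is a genuine failure
            }
            Thread.sleep(pollMillis);
        }
        return true;           // stayed ABORTED for the whole window: success
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(abortSettled(() -> ABORTED, 200, 20)); // prints true
        System.out.println(abortSettled(() -> -1, 200, 20));      // prints true
    }
}
```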

What is the next bug to work on? Any opinions?

Patricia

Re: Bug fixing

Posted by Peter Firmstone <ji...@zeus.net.au>.
Patricia Shanahan wrote:
> On 10/15/2010 10:51 AM, Patricia Shanahan wrote:
>> I seem to have reached a level of understanding of River that lets me
>> track down some bugs, and bug fixing is an obviously useful activity, so
>> I plan to spend some time on it.
>>
>> As I mentioned in the "Request for testing help" thread, I've
>> investigated the GetStateTest hang. The test is spinning waiting for the
>> TransactionManager's getState method to throw an exception because it
>> has discarded the aborted transaction. As far as I can tell, there is no
>> requirement that a TransactionManager discard a transaction, even when
>> it is permitted to do so.
>>
>> I plan to file a Jira for the test, and modify it to spin for a limited
>> time. Treat either UnknownTransactionException or continuous return of
>> ABORTED status for e.g. one minute as successful test completion.
>
> I've investigated this some more, and the test is revealing a real 
> problem in the transaction abort implementation.
>
> If there is a timeout it passes the problem to a SettlerTask, subclass 
> of RetryTask, which retries the abort. However, 
> com.sun.jini.mahalo.TxnManagerImpl's abort code checks for an attempt 
> to abort a transaction for which abort has already been called, and 
> throws new CannotAbortException("Transaction previously aborted")
>
> The net.jini.core.transaction.server.TransactionManager interface, 
> which it implements, specifies that abort throws CannotAbortException 
> for a transaction that has reached the COMMITTED state, but says 
> nothing about throwing it for a transaction that is in the ABORTED state.
>
> I propose modifying TxnManagerImpl to make it match the interface 
> declaration, and allow an abort to be retried. This may break other 
> tests, if they are assuming the behavior that TxnManagerImpl implemented.
>
> Patricia
>
+1 Peter.
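For reference, the proposed semantics can be sketched like this (a toy model with hypothetical names, not TxnManagerImpl's actual code; the state constants are illustrative stand-ins for net.jini.core.transaction.server.TransactionConstants):

```java
public class IdempotentAbort {
    // Illustrative state constants (TransactionConstants defines the real ones).
    static final int ACTIVE = 1, COMMITTED = 5, ABORTED = 6;

    static class CannotAbortException extends Exception {
        CannotAbortException(String msg) { super(msg); }
    }

    int state = ACTIVE;

    // Per the TransactionManager interface, CannotAbortException is specified
    // only for a transaction that has reached COMMITTED, so retrying an abort
    // of an already-ABORTED transaction simply succeeds.
    void abort() throws CannotAbortException {
        if (state == COMMITTED) {
            throw new CannotAbortException("Transaction has committed");
        }
        state = ABORTED;
    }
}
```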

Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
Could you set up a special Hudson build that would only run the 
Solaris-failing test on Solaris, and would use a skunk branch rather 
than the trunk?

Given that, I could debug on Solaris-Hudson, by editing in logging 
and/or printouts. I use System.err printouts when I want to ask a 
specialized question that is probably only needed for that step of that 
debug effort, and I can't use Eclipse to get the data.

It would be slower than debugging on a machine under my direct control, 
but not that much.

I did get some failures from the QA test of my TxnManagerImpl change, so 
I need to investigate, and may need to modify the fix.

Patricia


Jonathan Costers wrote:
> Hi Patricia
> Great work again.
> I'm going to look into creating a separate Hudson build to run only on
> Solaris nodes today. Likewise, we'll have another one running on Ubuntu
> nodes only.
> The former should fail consistently, the latter should pass consistently.
> At first sight, the issue on Solaris seems to have something to do with
> multicasting, but I haven't been able to spend much time investigating.
> Best
> Jonathan
> 
> 2010/10/31 Patricia Shanahan <pa...@acm.org>
> 
>> I've found a bug fix to com.sun.jini.mahalo.TxnManagerImpl that makes the
>> entire javaspace category pass. I've started a full QA run with the fix, but
>> that will take about a day. If anything fails as a result of the
>> TxnManagerImpl change I'll have to debug that, but until I get a failure
>> there is nothing more to do on javaspace.
>>
>> There is arguably a problem in the txnmanager test category because it did
>> not detect the bug. Maybe I should add the txnmanager category to a couple
>> of the javaspace tests that use transactions.
>>
>> Time to pick my next bug hunt - I can start something else while the QA test
>> is running. Any opinions?
>>
>> I could take a look at some of the skipped tests to see why they are
>> skipped. Maybe some of them would tell us about real bugs if we ran them.
>> There is also the Solaris-only bug. I can build a Solaris VirtualBox and see
>> if it reproduces. If not, I may need to learn how to run things on a Hudson
>> Solaris in order to investigate.
>>
>>
>> Patricia
>>
>>
>> On 10/20/2010 1:44 PM, Jonathan Costers wrote:
>>
>>> Great job Patricia!
>>>
>>> My vote would go to the "javaspace" test category.
>>> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
>>>
>>> If we can get these cleared up, I believe we have a solid test base in
>>> place
>>> to start validating some new developments and experiments.
>>>
>>> Looks like we are really getting some momentum here, I like it a lot.
>>>
>>> Thanks to all for your hard work.
>>> Jonathan
>>>
>>> 2010/10/20 Patricia Shanahan<pa...@acm.org>
>>>
>>>  On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>>>>  I propose modifying TxnManagerImpl to make it match the interface
>>>>> declaration, and allow an abort to be retried. This may break other
>>>>> tests, if they are assuming the behavior that TxnManagerImpl
>>>>> implemented.
>>>>>
>>>>>
>>>> I'm doing a full QA test, including txnmanager. Although the test is
>>>> still
>>>> running, all of the txnmanager tests, including GetStateTest, have
>>>> passed.
>>>> Those tests are the most likely to notice the change.
>>>>
>>>> If the rest of the QA test is clean when it finishes, I'll check in the
>>>> fix
>>>> and we can add txnmanager to the default category list.
>>>>
>>>> Any votes on my next bug hunt? For example, are there bug reports in Jira
>>>> that really need to be fixed before the next release?
>>>>
>>>> Patricia
>>>>
>>>>
>>>>
> 


Re: Bug fixing

Posted by Jonathan Costers <jo...@googlemail.com>.
Hi Patricia
Great work again.
I'm going to look into creating a separate Hudson build to run only on
Solaris nodes today. Likewise, we'll have another one running on Ubuntu
nodes only.
The former should fail consistently, the latter should pass consistently.
At first sight, the issue on Solaris seems to have something to do with
multicasting, but I haven't been able to spend much time investigating.
Best
Jonathan

2010/10/31 Patricia Shanahan <pa...@acm.org>

> I've found a bug fix to com.sun.jini.mahalo.TxnManagerImpl that makes the
> entire javaspace category pass. I've started a full QA run with the fix, but
> that will take about a day. If anything fails as a result of the
> TxnManagerImpl change I'll have to debug that, but until I get a failure
> there is nothing more to do on javaspace.
>
> There is arguably a problem in the txnmanager test category because it did
> not detect the bug. Maybe I should add the txnmanager category to a couple
> of the javaspace tests that use transactions.
>
> Time to pick my next bug hunt - I can start something else while the QA test
> is running. Any opinions?
>
> I could take a look at some of the skipped tests to see why they are
> skipped. Maybe some of them would tell us about real bugs if we ran them.
> There is also the Solaris-only bug. I can build a Solaris VirtualBox and see
> if it reproduces. If not, I may need to learn how to run things on a Hudson
> Solaris in order to investigate.
>
>
> Patricia
>
>
> On 10/20/2010 1:44 PM, Jonathan Costers wrote:
>
>> Great job Patricia!
>>
>> My vote would go to the "javaspace" test category.
>> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
>>
>> If we can get these cleared up, I believe we have a solid test base in
>> place
>> to start validating some new developments and experiments.
>>
>> Looks like we are really getting some momentum here, I like it a lot.
>>
>> Thanks to all for your hard work.
>> Jonathan
>>
>> 2010/10/20 Patricia Shanahan<pa...@acm.org>
>>
>>  On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>>>
>>>  I propose modifying TxnManagerImpl to make it match the interface
>>>> declaration, and allow an abort to be retried. This may break other
>>>> tests, if they are assuming the behavior that TxnManagerImpl
>>>> implemented.
>>>>
>>>>
>>> I'm doing a full QA test, including txnmanager. Although the test is
>>> still
>>> running, all of the txnmanager tests, including GetStateTest, have
>>> passed.
>>> Those tests are the most likely to notice the change.
>>>
>>> If the rest of the QA test is clean when it finishes, I'll check in the
>>> fix
>>> and we can add txnmanager to the default category list.
>>>
>>> Any votes on my next bug hunt? For example, are there bug reports in Jira
>>> that really need to be fixed before the next release?
>>>
>>> Patricia
>>>
>>>
>>>
>>
>

Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
I've found a bug fix to com.sun.jini.mahalo.TxnManagerImpl that makes 
the entire javaspace category pass. I've started a full QA run with the 
fix, but that will take about a day. If anything fails as a result of 
the TxnManagerImpl change I'll have to debug that, but until I get a 
failure there is nothing more to do on javaspace.

There is arguably a problem in the txnmanager test category because it 
did not detect the bug. Maybe I should add the txnmanager category to a 
couple of the javaspace tests that use transactions.

Time to pick my next bug hunt - I can start something else while the QA 
test is running. Any opinions?

I could take a look at some of the skipped tests to see why they are 
skipped. Maybe some of them would tell us about real bugs if we ran 
them. There is also the Solaris-only bug. I can build a Solaris 
VirtualBox and see if it reproduces. If not, I may need to learn how to 
run things on a Hudson Solaris in order to investigate.

Patricia


On 10/20/2010 1:44 PM, Jonathan Costers wrote:
> Great job Patricia!
>
> My vote would go to the "javaspace" test category.
> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
>
> If we can get these cleared up, I believe we have a solid test base in place
> to start validating some new developments and experiments.
>
> Looks like we are really getting some momentum here, I like it a lot.
>
> Thanks to all for your hard work.
> Jonathan
>
> 2010/10/20 Patricia Shanahan<pa...@acm.org>
>
>> On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>>
>>> I propose modifying TxnManagerImpl to make it match the interface
>>> declaration, and allow an abort to be retried. This may break other
>>> tests, if they are assuming the behavior that TxnManagerImpl implemented.
>>>
>>
>> I'm doing a full QA test, including txnmanager. Although the test is still
>> running, all of the txnmanager tests, including GetStateTest, have passed.
>> Those tests are the most likely to notice the change.
>>
>> If the rest of the QA test is clean when it finishes, I'll check in the fix
>> and we can add txnmanager to the default category list.
>>
>> Any votes on my next bug hunt? For example, are there bug reports in Jira
>> that really need to be fixed before the next release?
>>
>> Patricia
>>
>>
>


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
Do you still have a note of the javaspace tests that failed? If not, 
I'll run the category and see what happens.

It is interesting to note that both the test failures I've tackled 
recently turned out to be showing up problems in the code under test, 
not just the tests. It shows the importance of letting the tests speak 
to us by running them and analyzing failures.

Patricia


On 10/20/2010 1:44 PM, Jonathan Costers wrote:
> Great job Patricia!
>
> My vote would go to the "javaspace" test category.
> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
>
> If we can get these cleared up, I believe we have a solid test base in place
> to start validating some new developments and experiments.
>
> Looks like we are really getting some momentum here, I like it a lot.
>
> Thanks to all for your hard work.
> Jonathan
>
> 2010/10/20 Patricia Shanahan<pa...@acm.org>
>
>> On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>>
>>> I propose modifying TxnManagerImpl to make it match the interface
>>> declaration, and allow an abort to be retried. This may break other
>>> tests, if they are assuming the behavior that TxnManagerImpl implemented.
>>>
>>
>> I'm doing a full QA test, including txnmanager. Although the test is still
>> running, all of the txnmanager tests, including GetStateTest, have passed.
>> Those tests are the most likely to notice the change.
>>
>> If the rest of the QA test is clean when it finishes, I'll check in the fix
>> and we can add txnmanager to the default category list.
>>
>> Any votes on my next bug hunt? For example, are there bug reports in Jira
>> that really need to be fixed before the next release?
>>
>> Patricia
>>
>>
>


Re: Bug fixing

Posted by Peter Firmstone <ji...@zeus.net.au>.
Hmm, most admirable.

Cheers,

Peter.

Patricia Shanahan wrote:
> I've confirmed the temporary fix discussed below with the trunk code, 
> having reverted the rest of the changes I made to simplify the problem 
> and extract more data. I'm now rerunning the javaspaces category, 
> because I think several failures have the same general flavor and may 
> be fixed by this one change.
>
> Patricia
>
>
> On 10/26/2010 1:22 PM, Patricia Shanahan wrote:
>> On 10/26/2010 8:06 AM, Patricia Shanahan wrote:
>>> Greg Trasuk wrote:
>>>> The transaction spec says (under "Joining a Transaction") that "The 
>>>> join
>>>> method throws CannotJoinException if the transaction is known to the
>>>> manager but is no longer active."
>>>>
>>>> So it would appear that this behaviour is a bug, if the transaction's
>>>> lease has actually expired.
>>>>
>>>> I had a quick look through TxnManagerImpl and I don't see any
>>>> "auto-renew-on-join" behaviour, so that seems kind of odd. In fact, in
>>>> the join(...) method, it finds a TxnManagerTransaction instance, then
>>>> calls join(..) on it, which appears to check whether the 
>>>> transaction is
>>>> expired, and should throw CannotJoinException if it is expired.
>>>
>>> There is an alternative I need to check. I want to see if the JavaSpace is
>>> caching the transaction information. Maybe it isn't even checking with
>>> the TransactionManager if it thinks it has already joined the
>>> transaction, regardless of the length of the transaction's lease.
>>
>> My guess was correct! (This is the first time I've found a River bug
>> through an intuitive leap, rather than step-by-step brute force debug.)
>>
>> The method enterTxn in OutriggerServerImpl does the dirty work. It uses
>> a table that maps TransactionManager-id pairs to its internal
>> transaction representation, com.sun.jini.outrigger.Txn.
>>
>> It begins by doing a look-up on the table which will either get a Txn or
>> null. That is followed by this comment:
>>
>> "If we find the txn with the probe, we're all done. Just return.
>> Otherwise, we need to join the transaction. Of course, there is a race
>> condition here. While we are attempting to join, others might be
>> attempting to join also. We are careful to ensure that only one of the
>> racing threads is able to update our internal data structures with our
>> internal representation of this transaction. NB: There are better ways
>> of doing this which could ensure we never make an unnecessary remote
>> call that we may want to explore later."
>>
>> if (txn == null) {
>>     final TransactionManager mgr = (TransactionManager)
>>         transactionManagerPreparer.prepareProxy(tr.mgr);
>>     tr = new ServerTransaction(mgr, tr.id);
>>     tr.join(participantProxy, crashCount);
>>     txn = txnTable.put(tr);
>> }
>>
>> Commenting out the "if (txn == null)" test, so that the join is done
>> regardless, makes the test pass.
>>
>> My next problem is how to really fix this. Presumably, the caching was
>> not done for fun. There is at least some chance that it is a fix for a
>> real performance problem, and merely commenting out the test would bring
>> back that problem. On the other hand, I'm not sure of all the
>> consequences, including effects on recovery, of going on blindly using a
>> transaction id after the transaction has really been discarded by its
>> manager.
>>
>> One approach would be to keep the caching, but try to make it correct.
>> Is there any way to find out the scheduled lease termination of an
>> existing transaction? If there were, OutriggerServerImpl could store it
>> in the Txn object, and re-do the join if the current time is past it.
>>
>> On the other hand, I rather feel that if the state of a transaction can
>> be safely cached, with a lease-based time limit, it should be done by
>> the TransactionManager's proxy. It is more tightly coupled to the
>> TransactionManager implementation, and any performance benefit would
>> apply to all TransactionManager uses, not just the outrigger JavaSpace
>> implementation.
>>
>> Thoughts? Suggestions?
>>
>> Patricia
>>
>
>


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
I've confirmed the temporary fix discussed below with the trunk code, 
having reverted the rest of the changes I made to simplify the problem 
and extract more data. I'm now rerunning the javaspaces category, 
because I think several failures have the same general flavor and may be 
fixed by this one change.

Patricia


On 10/26/2010 1:22 PM, Patricia Shanahan wrote:
> On 10/26/2010 8:06 AM, Patricia Shanahan wrote:
>> Greg Trasuk wrote:
>>> The transaction spec says (under "Joining a Transaction") that "The join
>>> method throws CannotJoinException if the transaction is known to the
>>> manager but is no longer active."
>>>
>>> So it would appear that this behaviour is a bug, if the transaction's
>>> lease has actually expired.
>>>
>>> I had a quick look through TxnManagerImpl and I don't see any
>>> "auto-renew-on-join" behaviour, so that seems kind of odd. In fact, in
>>> the join(...) method, it finds a TxnManagerTransaction instance, then
>>> calls join(..) on it, which appears to check whether the transaction is
>>> expired, and should throw CannotJoinException if it is expired.
>>
>> There is an alternative I need to check. I want to see if the JavaSpace is
>> caching the transaction information. Maybe it isn't even checking with
>> the TransactionManager if it thinks it has already joined the
>> transaction, regardless of the length of the transaction's lease.
>
> My guess was correct! (This is the first time I've found a River bug
> through an intuitive leap, rather than step-by-step brute force debug.)
>
> The method enterTxn in OutriggerServerImpl does the dirty work. It uses
> a table that maps TransactionManager-id pairs to its internal
> transaction representation, com.sun.jini.outrigger.Txn.
>
> It begins by doing a look-up on the table which will either get a Txn or
> null. That is followed by this comment:
>
> "If we find the txn with the probe, we're all done. Just return.
> Otherwise, we need to join the transaction. Of course, there is a race
> condition here. While we are attempting to join, others might be
> attempting to join also. We are careful to ensure that only one of the
> racing threads is able to update our internal data structures with our
> internal representation of this transaction. NB: There are better ways
> of doing this which could ensure we never make an unnecessary remote
> call that we may want to explore later."
>
> if (txn == null) {
>     final TransactionManager mgr = (TransactionManager)
>         transactionManagerPreparer.prepareProxy(tr.mgr);
>     tr = new ServerTransaction(mgr, tr.id);
>     tr.join(participantProxy, crashCount);
>     txn = txnTable.put(tr);
> }
>
> Commenting out the "if (txn == null)" test, so that the join is done
> regardless, makes the test pass.
>
> My next problem is how to really fix this. Presumably, the caching was
> not done for fun. There is at least some chance that it is a fix for a
> real performance problem, and merely commenting out the test would bring
> back that problem. On the other hand, I'm not sure of all the
> consequences, including effects on recovery, of going on blindly using a
> transaction id after the transaction has really been discarded by its
> manager.
>
> One approach would be to keep the caching, but try to make it correct.
> Is there any way to find out the scheduled lease termination of an
> existing transaction? If there were, OutriggerServerImpl could store it
> in the Txn object, and re-do the join if the current time is past it.
>
> On the other hand, I rather feel that if the state of a transaction can
> be safely cached, with a lease-based time limit, it should be done by
> the TransactionManager's proxy. It is more tightly coupled to the
> TransactionManager implementation, and any performance benefit would
> apply to all TransactionManager uses, not just the outrigger JavaSpace
> implementation.
>
> Thoughts? Suggestions?
>
> Patricia
>


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
On 10/26/2010 8:06 AM, Patricia Shanahan wrote:
> Greg Trasuk wrote:
>> The transaction spec says (under "Joining a Transaction") that "The join
>> method throws CannotJoinException if the transaction is known to the
>> manager but is no longer active."
>>
>> So it would appear that this behaviour is a bug, if the transaction's
>> lease has actually expired.
>>
>> I had a quick look through TxnManagerImpl and I don't see any
>> "auto-renew-on-join" behaviour, so that seems kind of odd. In fact, in
>> the join(...) method, it finds a TxnManagerTransaction instance, then
>> calls join(..) on it, which appears to check whether the transaction is
>> expired, and should throw CannotJoinException if it is expired.
>
> There is an alternative I need to check. I want to see if the JavaSpace is
> caching the transaction information. Maybe it isn't even checking with
> the TransactionManager if it thinks it has already joined the
> transaction, regardless of the length of the transaction's lease.

My guess was correct! (This is the first time I've found a River bug 
through an intuitive leap, rather than step-by-step brute force debug.)

The method enterTxn in OutriggerServerImpl does the dirty work. It uses 
a table that maps TransactionManager-id pairs to its internal 
transaction representation, com.sun.jini.outrigger.Txn.

It begins by doing a look-up on the table which will either get a Txn or 
null. That is followed by this comment:

"If we find the txn with the probe, we're all done. Just return.
Otherwise, we need to join the transaction. Of course, there is a race
condition here. While we are attempting to join, others might be
attempting to join also. We are careful to ensure that only one of the
racing threads is able to update our internal data structures with our
internal representation of this transaction. NB: There are better ways
of doing this which could ensure we never make an unnecessary remote
call that we may want to explore later."

if (txn == null) {
   final TransactionManager mgr = (TransactionManager)
     transactionManagerPreparer.prepareProxy(tr.mgr);
   tr = new ServerTransaction(mgr, tr.id);
   tr.join(participantProxy, crashCount);
   txn = txnTable.put(tr);
}

Commenting out the "if (txn == null)" test, so that the join is done 
regardless, makes the test pass.

My next problem is how to really fix this. Presumably, the caching was 
not done for fun. There is at least some chance that it is a fix for a 
real performance problem, and merely commenting out the test would bring 
back that problem. On the other hand, I'm not sure of all the 
consequences, including effects on recovery, of going on blindly using a 
transaction id after the transaction has really been discarded by its 
manager.

One approach would be to keep the caching, but try to make it correct. 
Is there any way to find out the scheduled lease termination of an 
existing transaction? If there were, OutriggerServerImpl could store it 
in the Txn object, and re-do the join if the current time is past it.
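That lease-aware variant could be sketched roughly as follows (all names here are hypothetical; Outrigger's real Txn and transaction table are more involved):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache a joined transaction together with its lease
// expiration, and re-do the (simulated) remote join once the lease window
// has passed, instead of trusting the cache forever.
public class LeaseAwareTxnCache {
    static class CachedTxn {
        final long id;
        volatile long leaseExpiration; // absolute time, millis
        CachedTxn(long id, long leaseExpiration) {
            this.id = id;
            this.leaseExpiration = leaseExpiration;
        }
    }

    interface Joiner {
        long join(long txnId); // returns the new lease expiration time
    }

    private final Map<Long, CachedTxn> table = new ConcurrentHashMap<>();
    int remoteJoins = 0; // counts how often we actually went to the manager

    CachedTxn enterTxn(long txnId, Joiner joiner) {
        CachedTxn cached = table.get(txnId);
        long now = System.currentTimeMillis();
        if (cached != null && now < cached.leaseExpiration) {
            return cached; // still inside the lease window: trust the cache
        }
        // Cache miss or lease expired: (re-)join and refresh the entry.
        remoteJoins++;
        CachedTxn fresh = new CachedTxn(txnId, joiner.join(txnId));
        table.put(txnId, fresh);
        return fresh;
    }
}
```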

On the other hand, I rather feel that if the state of a transaction can 
be safely cached, with a lease-based time limit, it should be done by 
the TransactionManager's proxy. It is more tightly coupled to the 
TransactionManager implementation, and any performance benefit would 
apply to all TransactionManager uses, not just the outrigger JavaSpace 
implementation.

Thoughts? Suggestions?

Patricia

Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
Greg Trasuk wrote:
> The transaction spec says (under "Joining a Transaction") that "The join
> method throws CannotJoinException if the transaction is known to the
> manager but is no longer active."
> 
> So it would appear that this behaviour is a bug, if the transaction's
> lease has actually expired.
> 
> I had a quick look through TxnManagerImpl and I don't see any
> "auto-renew-on-join" behaviour, so that seems kind of odd.  In fact, in
> the join(...) method, it finds a TxnManagerTransaction instance, then
> calls join(..) on it, which appears to check whether the transaction is
> expired, and should throw CannotJoinException if it is expired.

There is an alternative I need to check. I want to see if the JavaSpace is 
caching the transaction information. Maybe it isn't even checking with 
the TransactionManager if it thinks it has already joined the 
transaction, regardless of the length of the transaction's lease.

Of course, logging and otherwise investigating the TransactionManager 
will detect that as well, in the form of a missing join call for the 
second write.


> I'm not currently set up to run the tests, but if I were, I'd set the log
> level on TxnManagerImpl to FINER and see if somebody is renewing the
> lease on that transaction, then try to track down who it is.

No need for you to run tests. I'm set up to do that. Your comments and 
suggestions are a valuable contribution to my bug hunt.

Thanks,

Patricia

Re: Bug fixing

Posted by Greg Trasuk <tr...@stratuscom.com>.
Hi Patricia:

See comments interspersed...

On Tue, 2010-10-26 at 00:54, Patricia Shanahan wrote:
> On 10/25/2010 10:50 AM, Greg Trasuk wrote:
> >
> > On Mon, 2010-10-25 at 12:15, Patricia Shanahan wrote:
> >> I've made some progress on one form of failure, a resource remaining
> >> available after its lease was scheduled to expire.
> >>
> >> If the test pauses, e.g. due to an inserted Thread.sleep, between
> >> creating the lease and probing it for the first time, the test passes.
> >> The effect of either of these actions is to delay the probe until after
> >> the lease has expired. Incidentally, hooking up to the Eclipse debugger
> >> was very helpful in discovering this behavior. The first indication was
> >> when I spent a few minutes investigating at a breakpoint, let it
> >> continue, and the test passed.
> >>
> >> If the test gets to the probe point before the lease is due to expire,
> >> it goes into a polling loop. No matter how long it stays in the polling
> >> loop, the lease does not expire. Somehow, the probe loop has the effect
> >> of preventing expiration.
> >>
> > What is the "probe code" ?  How exactly does it determine that the lease
> > is still alive?
> 
> It calls this method:
> 
>      protected boolean isAvailable() throws TestException {
>          try {
>              final Lease x = space.write(aEntry, resource, 1000);
> 			//addOutriggerLease(x, true);
>          } catch (TransactionException e) {
> 
>              // This is ok...probably means it is a bad transaction
>              return false;
>          } catch (Exception e) {
>              throw new TestException("Testing for availability", e);
>          }
>          return true;
>      }
> 
> (I've slightly simplified the code, as part of a general attempt to 
> comment out anything that is not required to reproduce the problem.)
> 
> space is a JavaSpace. The objective seems to be to detect whether the 
> Transaction, resource, remains available. It was originally requested 
> with a one minute lease duration, and it seems to have got it, "aprox 
> duration:59991"
> 

OK, so the JavaSpace will attempt to join the transaction and that join
should fail if the transaction's lease has expired, or if the
transaction has been aborted, or is in the commit process (i.e. not in
ACTIVE state).

> If the isAvailable method is called even once before the scheduled 
> expiration of the resource lease, it will go on returning true. Repeated 
> calls are not needed - a single early call is sufficient. I've tested 
> that through a half hour sleep between two calls.
> 
> If the first call to this method is after the scheduled expiration of 
> the resource lease, it returns false.
> 
> The JavaSpace is implemented by OutriggerServerImpl. This print call in 
> its public long[] write(EntryRep rep, Transaction tr, long lease) method:
> 
> System.err.printf("XXXpats: Write of rep %s, tr %s, lease %d at %d%n",
> 			rep, tr, lease, System.currentTimeMillis());
> 
> reports:
> 
> XXXpats: Write of rep 
> EntryRep[com.sun.jini.test.share.UninterestingEntry], tr 
> net.jini.core.transaction.server.ServerTransaction 
> [manager=com.sun.jini.mahalo.TxnMgrProxy$ConstrainableTxnMgrProxy@2b6ad6ff, 
> id=-6728796543664280839], lease 1000 at 1288065446700
> 
> 
> 
> >
> >> I would like to understand the rules for lease expiration. What actions
> >> should extend a lease?
> >
> > lease.renew(long duration);
> >
> >
> >> Is the lease manager required to expire it on
> >> schedule? (Even if it is not required to expire the lease I still need
> >> to find out whether the non-expiration is deliberate design or due to a
> >> bug.)
> >
> > Do you mean the landlord (i.e. the grantor of the lease), or the
> > LeaseRenewalManager object (which automatically renews leases on behalf
> > of a lessor)?
> >
> > The owner of the resource does not need to expire it on a time schedule;
> > it could expire the lease at some future point that is a convenient run
> > time (like the next time the service happens to be accessed).
> > Conceptually, the lease is entirely for the convenience of the lessor;
> > expiry means that the client is no longer interested in the resource, so
> > the client's resource allocation can be freed.
> >
> > In any case, specs would be at:
> >
> > http://www.jini.org/wiki/Jini_Distributed_Leasing_Specification
> 
> The bottom line seems to me to be that the observed behavior may be 
> permitted, but is weird enough that I need to investigate it to find out 
> if it is an indication of a bug.
> 

The transaction spec says (under "Joining a Transaction") that "The join
method throws CannotJoinException if the transaction is known to the
manager but is no longer active."

So it would appear that this behaviour is a bug, if the transaction's
lease has actually expired.

I had a quick look through TxnManagerImpl and I don't see any
"auto-renew-on-join" behaviour, so that seems kind of odd.  In fact, in
the join(...) method, it finds a TxnManagerTransaction instance, then
calls join(..) on it, which appears to check whether the transaction is
expired, and should throw CannotJoinException if it is expired.

I'm not currently set up to run the tests, but if I were, I'd set the log
level on TxnManagerImpl to FINER and see if somebody is renewing the
lease on that transaction, then try to track down who it is.


Cheers,

Greg.


> A single write call using the transaction protects it for at least half 
> an hour. That is just the longest time I've tested - it may protect it 
> indefinitely, which would be a bug. There are no leases involved longer 
> than one minute.
> 
> Patricia
-- 
Greg Trasuk, President
StratusCom Manufacturing Systems Inc. - We use information technology to
solve business problems on your plant floor.
http://stratuscom.com


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
On 10/25/2010 10:50 AM, Greg Trasuk wrote:
>
> On Mon, 2010-10-25 at 12:15, Patricia Shanahan wrote:
>> I've made some progress on one form of failure, a resource remaining
>> available after its lease was scheduled to expire.
>>
>> If the test pauses, e.g. due to an inserted Thread.sleep, between
>> creating the lease and probing it for the first time, the test passes.
>> The effect of either of these actions is to delay the probe until after
>> the lease has expired. Incidentally, hooking up to the Eclipse debugger
>> was very helpful in discovering this behavior. The first indication was
>> when I spent a few minutes investigating at a breakpoint, let it
>> continue, and the test passed.
>>
>> If the test gets to the probe point before the lease is due to expire,
>> it goes into a polling loop. No matter how long it stays in the polling
>> loop, the lease does not expire. Somehow, the probe loop has the effect
>> of preventing expiration.
>>
> What is the "probe code"? How exactly does it determine that the lease
> is still alive?

It calls this method:

     protected boolean isAvailable() throws TestException {
         try {
             final Lease x = space.write(aEntry, resource, 1000);
             //addOutriggerLease(x, true);
         } catch (TransactionException e) {

             // This is ok...probably means it is a bad transaction
             return false;
         } catch (Exception e) {
             throw new TestException("Testing for availability", e);
         }
         return true;
     }

(I've slightly simplified the code, as part of a general attempt to 
comment out anything that is not required to reproduce the problem.)

space is a JavaSpace. The objective seems to be to detect whether the 
Transaction, resource, remains available. It was originally requested 
with a one-minute lease duration, and it seems to have got it ("aprox 
duration:59991").

If the isAvailable method is called even once before the scheduled 
expiration of the resource lease, it will go on returning true. Repeated 
calls are not needed - a single early call is sufficient. I've tested 
that through a half hour sleep between two calls.

If the first call to this method is after the scheduled expiration of 
the resource lease, it returns false.

The JavaSpace is implemented by OutriggerServerImpl. This print call in 
its public long[] write(EntryRep rep, Transaction tr, long lease) method:

System.err.printf("XXXpats: Write of rep %s, tr %s, lease %d at %d%n",
        rep, tr, lease, System.currentTimeMillis());

reports:

XXXpats: Write of rep 
EntryRep[com.sun.jini.test.share.UninterestingEntry], tr 
net.jini.core.transaction.server.ServerTransaction 
[manager=com.sun.jini.mahalo.TxnMgrProxy$ConstrainableTxnMgrProxy@2b6ad6ff, 
id=-6728796543664280839], lease 1000 at 1288065446700



>
>> I would like to understand the rules for lease expiration. What actions
>> should extend a lease?
>
> lease.renew(long duration);
>
>
>> Is the lease manager required to expire it on
>> schedule? (Even if it is not required to expire the lease I still need
>> to find out whether the non-expiration is deliberate design or due to a
>> bug.)
>
> Do you mean the landlord (i.e. the grantor of the lease), or the
> LeaseRenewalManager object (which automatically renews leases on behalf
> of a lessor)?
>
> The owner of the resource does not need to expire it on a time schedule;
> it could expire the lease at some future point that is a convenient run
> time (like the next time the service happens to be accessed).
> Conceptually, the lease is entirely for the convenience of the lessor;
> expiry means that the client is no longer interested in the resource, so
> the client's resource allocation can be freed.
>
> In any case, specs would be at:
>
> http://www.jini.org/wiki/Jini_Distributed_Leasing_Specification

The bottom line seems to me to be that the observed behavior may be 
permitted, but is weird enough that I need to investigate it to find out 
if it is an indication of a bug.

A single write call using the transaction protects it for at least half 
an hour. That is just the longest time I've tested - it may protect it 
indefinitely, which would be a bug. There are no leases involved longer 
than one minute.

Patricia


Re: Bug fixing

Posted by Greg Trasuk <tr...@stratuscom.com>.
On Mon, 2010-10-25 at 12:15, Patricia Shanahan wrote:
> I've made some progress on one form of failure, a resource remaining 
> available after its lease was scheduled to expire.
> 
> If the test pauses, e.g. due to an inserted Thread.sleep, between 
> creating the lease and probing it for the first time, the test passes. 
> The effect of either of these actions is to delay the probe until after 
> the lease has expired. Incidentally, hooking up to the Eclipse debugger 
> was very helpful in discovering this behavior. The first indication was 
> when I spent a few minutes investigating at a breakpoint, let it 
> continue, and the test passed.
> 
> If the test gets to the probe point before the lease is due to expire, 
> it goes into a polling loop. No matter how long it stays in the polling 
> loop, the lease does not expire. Somehow, the probe loop has the effect 
> of preventing expiration.
> 
What is the "probe code"? How exactly does it determine that the lease
is still alive?

> I would like to understand the rules for lease expiration. What actions 
> should extend a lease? 

lease.renew(long duration);

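A sketch of what that call does, using hypothetical local classes rather
than the real net.jini.core.lease API: renew(duration) asks the grantor
for roughly `duration` more milliseconds from now, and the grantor is free
to grant less (the cap used below is an invented policy, not River's):

```java
// Sketch of Lease.renew semantics: the lessee requests a duration,
// the grantor may grant a shorter one. Hypothetical classes/values.
class RenewSketch {
    static class Lease {
        long expiration;                    // absolute time, ms
        final long maxGrant = 60_000;       // grantor policy: cap at 1 min
        Lease(long expiration) { this.expiration = expiration; }
        void renew(long duration, long now) {
            long granted = Math.min(duration, maxGrant);
            expiration = now + granted;     // extend from "now", not from old expiry
        }
    }
    public static void main(String[] args) {
        Lease l = new Lease(1_000);
        l.renew(120_000, 500);              // ask for 2 min at t=500
        System.out.println(l.expiration);   // capped: 500 + 60000 = 60500
    }
}
```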

> Is the lease manager required to expire it on 
> schedule? (Even if it is not required to expire the lease I still need 
> to find out whether the non-expiration is deliberate design or due to a 
> bug.)

Do you mean the landlord (i.e. the grantor of the lease), or the
LeaseRenewalManager object (which automatically renews leases on behalf
of a lessor)?

The owner of the resource does not need to expire it on a time schedule;
it could expire the lease at some future point that is a convenient run
time (like the next time the service happens to be accessed). 
Conceptually, the lease is entirely for the convenience of the lessor;
expiry means that the client is no longer interested in the resource, so
the client's resource allocation can be freed.
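
That lazy-expiry point can be sketched like this; the landlord below
checks expiration only when the resource is next touched, rather than on a
timer. All names are illustrative, not River code:

```java
// Sketch: a landlord that reaps expired leases lazily, on access.
// Hypothetical classes; only the concept comes from the Leasing spec.
import java.util.HashMap;
import java.util.Map;

class LazyExpiry {
    static class UnknownLeaseException extends Exception {}

    final Map<Long, Long> expirations = new HashMap<>(); // id -> absolute ms

    Object access(long id, long now) throws UnknownLeaseException {
        Long exp = expirations.get(id);
        if (exp == null || now >= exp) {
            expirations.remove(id);          // free the allocation lazily
            throw new UnknownLeaseException();
        }
        return "resource-" + id;             // still leased
    }

    public static void main(String[] args) throws Exception {
        LazyExpiry landlord = new LazyExpiry();
        landlord.expirations.put(7L, 1_000L);
        System.out.println(landlord.access(7L, 900L));  // before expiry: ok
        try {
            landlord.access(7L, 1_100L);                // after expiry: refused
        } catch (UnknownLeaseException e) {
            System.out.println("expired");
        }
    }
}
```

Note that even with lazy reaping, an access *after* the expiration time
must be refused, which is why a probe that keeps succeeding indefinitely
is still suspicious.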

In any case, specs would be at:

http://www.jini.org/wiki/Jini_Distributed_Leasing_Specification

Cheers,

Greg.

> 
> Can anyone point me to documentation or a tutorial that would help?
> 
> Thanks,
> 
> Patricia
> 
> 
> Patricia Shanahan wrote:
> > I've done a preliminary javaspace test:
> > 
> >      [java] # of tests started   = 259
> >      [java] # of tests completed = 259
> >      [java] # of tests skipped   = 3
> >      [java] # of tests passed    = 243
> >      [java] # of tests failed    = 16
> > 
> > The failures seem to fall into a few clusters. I'll start on one of them 
> > tomorrow.
> > 
> > Patricia
> > 
> > On 10/20/2010 1:44 PM, Jonathan Costers wrote:
> >> My vote would go to the "javaspace" test category.
> >> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
> >>> On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
> >>> Any votes on my next bug hunt? For example, are there bug reports in 
> >>> Jira
> >>> that really need to be fixed before the next release?
> > 
-- 
Greg Trasuk, President
StratusCom Manufacturing Systems Inc. - We use information technology to
solve business problems on your plant floor.
http://stratuscom.com


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
I've made some progress on one form of failure, a resource remaining 
available after its lease was scheduled to expire.

If the test pauses, e.g. due to an inserted Thread.sleep, between 
creating the lease and probing it for the first time, the test passes. 
The effect of either of these actions is to delay the probe until after 
the lease has expired. Incidentally, hooking up to the Eclipse debugger 
was very helpful in discovering this behavior. The first indication was 
when I spent a few minutes investigating at a breakpoint, let it 
continue, and the test passed.

If the test gets to the probe point before the lease is due to expire, 
it goes into a polling loop. No matter how long it stays in the polling 
loop, the lease does not expire. Somehow, the probe loop has the effect 
of preventing expiration.

I would like to understand the rules for lease expiration. What actions 
should extend a lease? Is the lease manager required to expire it on 
schedule? (Even if it is not required to expire the lease I still need 
to find out whether the non-expiration is deliberate design or due to a 
bug.)

Can anyone point me to documentation or a tutorial that would help?

Thanks,

Patricia


Patricia Shanahan wrote:
> I've done a preliminary javaspace test:
> 
>      [java] # of tests started   = 259
>      [java] # of tests completed = 259
>      [java] # of tests skipped   = 3
>      [java] # of tests passed    = 243
>      [java] # of tests failed    = 16
> 
> The failures seem to fall into a few clusters. I'll start on one of them 
> tomorrow.
> 
> Patricia
> 
> On 10/20/2010 1:44 PM, Jonathan Costers wrote:
>> My vote would go to the "javaspace" test category.
>> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
>>> On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>>> Any votes on my next bug hunt? For example, are there bug reports in 
>>> Jira
>>> that really need to be fixed before the next release?
> 


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
I've done a preliminary javaspace test:

      [java] # of tests started   = 259
      [java] # of tests completed = 259
      [java] # of tests skipped   = 3
      [java] # of tests passed    = 243
      [java] # of tests failed    = 16

The failures seem to fall into a few clusters. I'll start on one of them 
tomorrow.

Patricia

On 10/20/2010 1:44 PM, Jonathan Costers wrote:
> My vote would go to the "javaspace" test category.
> Last time I ran that one (250 or so tests IIRC) I got 16 failures.
>> On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>> Any votes on my next bug hunt? For example, are there bug reports in Jira
>> that really need to be fixed before the next release?

Re: Bug fixing

Posted by Gregg Wonderly <gr...@wonderly.org>.
Do we need to look at VirtualBox, or some other VM that we can boot onto the 
servers and dispatch multiple machines onto so that we can create the type of 
test scenarios that make sense in these situations?

Gregg Wonderly

On 10/21/2010 4:42 PM, Patricia Shanahan wrote:
> To me, the idea of something as much concerned with remote activity as a Jini
> implementation not having a single QA test that uses more than one computer
> seems implausible and a bit alarming. A developer loading up enough services on
> his workstation for it to play the other computer in some remote tests seems
> quite plausible.
>
> On the other hand, I have not looked at all at the tests in question, so they
> may indeed be tests that *must* run single system and just mention "resendes"
> for historical reasons. In that case, I'll put creating some multi-computer
> tests, and finding a way of running them, on my to-do list, unless there is a
> really convincing argument that River cannot possibly have a remote-only bug.
>
> Patricia
>
>
> On 10/21/2010 11:43 AM, Gregg Wonderly wrote:
>> Resendes was one of the original Sun Jini developers, so the use of
>> his computer would indicate something that he worked on creating, I'd
>> guess.
>>
>> Gregg Wonderly
>>
>> On Oct 21, 2010, at 9:18 AM, Patricia Shanahan wrote:
>>
>>> Jonathan Costers wrote:
>>>> If we can get these cleared up, I believe we have a solid test
>>>> base in place to start validating some new developments and
>>>> experiments.
>>>
>>> I have a few remaining test-related concerns:
>>>
>>> 1. Tests with multiple computers. This may be what "resendes" is
>>> about. Surely many Apache projects need this for testing, so the
>>> infrastructure should have some way of applying multiple virtual
>>> computers for a single test.
>>>
>>> 2. Skipped tests. The category suppression suggests to me that
>>> River, or maybe Jini before it, went through a phase in which the
>>> developers took a "shoot the messenger" approach to test failures.
>>> If so, some of the skipped tests could be valid tests that, if run,
>>> would tell us something we need to hear. Once I've done with
>>> failures that are preventing whole categories from running, I want
>>> to review the skipped tests to make sure each skip is commented
>>> with a better reason than the test failing.
>>>
>>> 3. Lack of effective testing of concurrency and retry behavior in
>>> JoinManager and ServiceDiscoveryManager. I did start looking at
>>> building a mock environment around SDM for the purpose of forcing
>>> high traffic and difficult combinations of actions. Once we have
>>> all the existing tests running, I'm going to take another look at
>>> how effectively these difficult and important classes are being
>>> tested.
>>>
>>> Patricia
>>
>>
>
>


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
To me, the idea of something as much concerned with remote activity as a 
Jini implementation not having a single QA test that uses more than one 
computer seems implausible and a bit alarming. A developer loading up 
enough services on his workstation for it to play the other computer in 
some remote tests seems quite plausible.

On the other hand, I have not looked at all at the tests in question, so 
they may indeed be tests that *must* run single system and just mention 
"resendes" for historical reasons. In that case, I'll put creating some 
multi-computer tests, and finding a way of running them, on my to-do 
list, unless there is a really convincing argument that River cannot 
possibly have a remote-only bug.

Patricia


On 10/21/2010 11:43 AM, Gregg Wonderly wrote:
> Resendes was one of the original Sun Jini developers, so the use of
> his computer would indicate something that he worked on creating, I'd
> guess.
>
> Gregg Wonderly
>
> On Oct 21, 2010, at 9:18 AM, Patricia Shanahan wrote:
>
>> Jonathan Costers wrote:
>>> If we can get these cleared up, I believe we have a solid test
>>> base in place to start validating some new developments and
>>> experiments.
>>
>> I have a few remaining test-related concerns:
>>
>> 1. Tests with multiple computers. This may be what "resendes" is
>> about. Surely many Apache projects need this for testing, so the
>> infrastructure should have some way of applying multiple virtual
>> computers for a single test.
>>
>> 2. Skipped tests. The category suppression suggests to me that
>> River, or maybe Jini before it, went through a phase in which the
>> developers took a "shoot the messenger" approach to test failures.
>> If so, some of the skipped tests could be valid tests that, if run,
>> would tell us something we need to hear. Once I've done with
>> failures that are preventing whole categories from running, I want
>> to review the skipped tests to make sure each skip is commented
>> with a better reason than the test failing.
>>
>> 3. Lack of effective testing of concurrency and retry behavior in
>> JoinManager and ServiceDiscoveryManager. I did start looking at
>> building a mock environment around SDM for the purpose of forcing
>> high traffic and difficult combinations of actions. Once we have
>> all the existing tests running, I'm going to take another look at
>> how effectively these difficult and important classes are being
>> tested.
>>
>> Patricia
>
>


Re: Bug fixing

Posted by Gregg Wonderly <gr...@wonderly.org>.
Resendes was one of the original Sun Jini developers, so the use of his computer would indicate something that he worked on creating, I'd guess.

Gregg Wonderly

On Oct 21, 2010, at 9:18 AM, Patricia Shanahan wrote:

> Jonathan Costers wrote:
>> If we can get these cleared up, I believe we have a solid test base in place
>> to start validating some new developments and experiments.
> 
> I have a few remaining test-related concerns:
> 
> 1. Tests with multiple computers. This may be what "resendes" is about. Surely many Apache projects need this for testing, so the infrastructure should have some way of applying multiple virtual computers for a single test.
> 
> 2. Skipped tests. The category suppression suggests to me that River, or maybe Jini before it, went through a phase in which the developers took a "shoot the messenger" approach to test failures. If so, some of the skipped tests could be valid tests that, if run, would tell us something we need to hear. Once I've done with failures that are preventing whole categories from running, I want to review the skipped tests to make sure each skip is commented with a better reason than the test failing.
> 
> 3. Lack of effective testing of concurrency and retry behavior in JoinManager and ServiceDiscoveryManager. I did start looking at building a mock environment around SDM for the purpose of forcing high traffic and difficult combinations of actions. Once we have all the existing tests running, I'm going to take another look at how effectively these difficult and important classes are being tested.
> 
> Patricia


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
On 10/21/2010 6:39 PM, Bob Resendes wrote:
...
> Now, I'm not claiming I did an exhaustive search (config files, doc, etc.), but
> from what I've seen, "resendes" was just an example hostname, with perhaps the
> additional condition that it not be a "reachable" hostname (at least from the
> QA/Test network). I'd be surprised if that's not true for the existing QA test
> environment, but that said, it probably makes sense to change any "resendes"
> references to something more explicitly "bogus" to avoid future confusion.
...

Thanks for the information. From that, I would expect the "resendes" 
tests to pass in our QA environments, even without a name change to 
protect the innocent.

Patricia

Re: Bug fixing

Posted by Sim IJskes - QCG <si...@qcg.nl>.
On 10/22/2010 03:39 AM, Bob Resendes wrote:
> QA/Test network). I'd be surprised if that's not true for the existing QA test
> environment, but that said, it probably makes sense to change any "resendes"
> references to something more explicitly "bogus" to avoid future confusion.
>
>
> Hope that helps,
> Bob

Yes. Thanks! I changed the issue accordingly.

Gr. Sim

Re: Bug fixing

Posted by Bob Resendes <re...@yahoo.com>.

> 
> 1. Tests with multiple computers. This may be what "resendes" is about. Surely
> many Apache projects need this for testing, so the infrastructure should have
> some way of applying multiple virtual computers for a single test.
> 

[Sorry for the late reply. I haven't been following the River list, but a little 
birdie mentioned that my name was being taken in vain on the River list.]

I took a quick look at "resendes" references in the tests and could only find 
the following file references (and explanations):

AdminIFShutdownTest & AdminIFTest (for mahalo and mercury services)
====================================================
"resendes" is only used to create a couple of lookup locator URLs which are set 
and get from the service's admin I/F. Any other name could have been used here, 
but you'd probably want to pick a non-existent host for the URL so as not to 
send any communications to that host (possibly interfering with another test). I 
can't say for sure, but I think the QA testing network was isolated from the 
development network and resendes was pretty much guaranteed not to exist there.

SerializedServiceDescriptors
=====================
Again, simply used "resendes" for an example URL and any other hostname would do 
here. The test only checks that serialized form of the descriptor matches before 
and after serialization. 


ServiceDescriptorUtil
================
There are "resendes" references, here, but I didn't see any references to the 
constants that use them anywhere else. Again, nothing special about the name.

SharedActivationPolicyPermissionHTTPEqualsTest and 
SharedActivationPolicyPermissionHTTPImpliesTest
==============================================================================
 "resendes" is just being used for a sample URL hostname. The tests are only 
comparing URL strings. Again, nothing special about the name.


Now, I'm not claiming I did an exhaustive search (config files, doc, etc.), but 
from what I've seen, "resendes" was just an example hostname, with perhaps the 
additional condition that it not be a "reachable" hostname (at least from the 
QA/Test network). I'd be surprised if that's not true for the existing QA test 
environment, but that said, it probably makes sense to change any "resendes" 
references to something more explicitly "bogus" to avoid future confusion. 


Hope that helps,
Bob


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
Jonathan Costers wrote:
> If we can get these cleared up, I believe we have a solid test base in place
> to start validating some new developments and experiments.

I have a few remaining test-related concerns:

1. Tests with multiple computers. This may be what "resendes" is about. 
Surely many Apache projects need this for testing, so the infrastructure 
should have some way of applying multiple virtual computers for a single 
test.

2. Skipped tests. The category suppression suggests to me that River, or 
maybe Jini before it, went through a phase in which the developers took 
a "shoot the messenger" approach to test failures. If so, some of the 
skipped tests could be valid tests that, if run, would tell us something 
we need to hear. Once I've done with failures that are preventing whole 
categories from running, I want to review the skipped tests to make sure 
each skip is commented with a better reason than the test failing.

3. Lack of effective testing of concurrency and retry behavior in 
JoinManager and ServiceDiscoveryManager. I did start looking at building 
a mock environment around SDM for the purpose of forcing high traffic 
and difficult combinations of actions. Once we have all the existing 
tests running, I'm going to take another look at how effectively these 
difficult and important classes are being tested.

Patricia

Re: Bug fixing

Posted by Jonathan Costers <jo...@googlemail.com>.
Great job Patricia!

My vote would go to the "javaspace" test category.
Last time I ran that one (250 or so tests IIRC) I got 16 failures.

If we can get these cleared up, I believe we have a solid test base in place
to start validating some new developments and experiments.

Looks like we are really getting some momentum here, I like it a lot.

Thanks to all for your hard work.
Jonathan

2010/10/20 Patricia Shanahan <pa...@acm.org>

> On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
>
>> I propose modifying TxnManagerImpl to make it match the interface
>> declaration, and allow an abort to be retried. This may break other
>> tests, if they are assuming the behavior that TxnManagerImpl implemented.
>>
>
> I'm doing a full QA test, including txnmanager. Although the test is still
> running, all of the txnmanager tests, including GetStateTest, have passed.
> Those tests are the most likely to notice the change.
>
> If the rest of the QA test is clean when it finishes, I'll check in the fix
> and we can add txnmanager to the default category list.
>
> Any votes on my next bug hunt? For example, are there bug reports in Jira
> that really need to be fixed before the next release?
>
> Patricia
>
>

Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
On 10/19/2010 4:08 PM, Patricia Shanahan wrote:
> I propose modifying TxnManagerImpl to make it match the interface
> declaration, and allow an abort to be retried. This may break other
> tests, if they are assuming the behavior that TxnManagerImpl implemented.

I'm doing a full QA test, including txnmanager. Although the test is 
still running, all of the txnmanager tests, including GetStateTest, have 
passed. Those tests are the most likely to notice the change.

If the rest of the QA test is clean when it finishes, I'll check in the 
fix and we can add txnmanager to the default category list.

Any votes on my next bug hunt? For example, are there bug reports in 
Jira that really need to be fixed before the next release?

Patricia


Re: Bug fixing

Posted by Patricia Shanahan <pa...@acm.org>.
On 10/15/2010 10:51 AM, Patricia Shanahan wrote:
> I seem to have reached a level of understanding of River that lets me
> track down some bugs, and bug fixing is an obviously useful activity, so
> I plan to spend some time on it.
>
> As I mentioned in the "Request for testing help" thread, I've
> investigated the GetStateTest hang. The test is spinning waiting for the
> TransactionManager's getState method to throw an exception because it
> has discarded the aborted transaction. As far as I can tell, there is no
> requirement that a TransactionManager discard a transaction, even when
> it is permitted to do so.
>
> I plan to file a Jira for the test, and modify it to spin for a limited
> time. Treat either UnknownTransactionException or continuous return of
> ABORTED status for e.g. one minute as successful test completion.

I've investigated this some more, and the test is revealing a real 
problem in the transaction abort implementation.

If there is a timeout, it passes the problem to a SettlerTask, a 
subclass of RetryTask, which retries the abort. However, 
com.sun.jini.mahalo.TxnManagerImpl's abort code checks for an attempt to 
abort a transaction for which abort has already been called, and throws 
new CannotAbortException("Transaction previously aborted").

The net.jini.core.transaction.server.TransactionManager interface, which 
it implements, specifies that abort throws CannotAbortException for a 
transaction that has reached the COMMITTED state, but says nothing about 
throwing it for a transaction that is in the ABORTED state.

I propose modifying TxnManagerImpl to make it match the interface 
declaration, and allow an abort to be retried. This may break other 
tests, if they are assuming the behavior that TxnManagerImpl implemented.
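
A minimal sketch of the proposed change, with hypothetical names (this is
not the actual TxnManagerImpl code): abort fails only once the transaction
has COMMITTED, and re-aborting an already ABORTED transaction simply
succeeds, so a SettlerTask-style retry can complete.

```java
// Sketch: abort() per the TransactionManager interface contract --
// CannotAbortException only for COMMITTED, idempotent for ABORTED.
// Hypothetical classes for illustration.
class AbortSketch {
    static class CannotAbortException extends Exception {}
    enum State { ACTIVE, COMMITTED, ABORTED }

    State state = State.ACTIVE;

    void abort() throws CannotAbortException {
        if (state == State.COMMITTED) throw new CannotAbortException();
        state = State.ABORTED;          // no-op if already ABORTED: retry-safe
    }

    public static void main(String[] args) throws Exception {
        AbortSketch txn = new AbortSketch();
        txn.abort();
        txn.abort();                    // a retried abort must not throw
        System.out.println(txn.state);  // ABORTED
    }
}
```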

Patricia