Posted to dev@drill.apache.org by Daniel Barclay <db...@maprtech.com> on 2015/04/29 09:13:11 UTC

TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Does anyone know what's going on with TestDrillbitResilience (rebased
from master today)?  (Is it working right?)


One run, via "mvn install", yielded assertion errors:

...
Error shutting down Drillbit "beta".
Tests run: 11, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 33.811 sec <<< FAILURE! - in org.apache.drill.exec.server.TestDrillbitResilience
cancelAfterEverythingIsCompleted(org.apache.drill.exec.server.TestDrillbitResilience)  Time elapsed: 1.468 sec  <<< FAILURE!
java.lang.AssertionError: null
	at org.apache.drill.exec.server.TestDrillbitResilience.assertCancelled(TestDrillbitResilience.java:459)
	at org.apache.drill.exec.server.TestDrillbitResilience.cancelAfterEverythingIsCompleted(TestDrillbitResilience.java:565)

cancelInMiddleOfFetchingResults(org.apache.drill.exec.server.TestDrillbitResilience)  Time elapsed: 1.496 sec  <<< FAILURE!
java.lang.AssertionError: null
	at org.apache.drill.exec.server.TestDrillbitResilience.assertCancelled(TestDrillbitResilience.java:459)
	at org.apache.drill.exec.server.TestDrillbitResilience.cancelInMiddleOfFetchingResults(TestDrillbitResilience.java:510)

Running <next test>
...


A second run, run individually (but still via Maven), died with different errors.



A third run, via "mvn install" again, seems hung after reporting this
(maybe expected) exception:

Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: run-try-end


[fb9cfe61-af6e-4c9c-b6ab-8a1b8725c6e9 on dev-linux2:31010]


The process is using only about 5% CPU--but has 278 threads!
(That includes about 35 threads all with the same name of "BitClient-1".)
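A minimal sketch, assuming the count is taken from inside the test JVM (for example from a temporary test hook): group the live threads by name so that duplicates like the ~35 "BitClient-1" threads stand out. It relies only on the JDK's Thread.getAllStackTraces(); the class name ThreadCensus is hypothetical, and this is not necessarily how the count above was obtained. Running the JDK's jstack tool against the process id gives similar information from outside the process.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical helper: counts live JVM threads grouped by thread name,
 * which makes many threads sharing one name easy to spot.
 */
public class ThreadCensus {

  /** Returns a sorted map from thread name to the number of live threads with that name. */
  public static Map<String, Integer> countByName() {
    Map<String, Integer> counts = new TreeMap<>();
    // getAllStackTraces() returns a snapshot keyed by every live thread in this JVM.
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      counts.merge(t.getName(), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = countByName();
    int total = counts.values().stream().mapToInt(Integer::intValue).sum();
    System.out.println("live threads: " + total);
    // Print only the names shared by more than one thread.
    counts.forEach((name, n) -> {
      if (n > 1) {
        System.out.println(n + " x " + name);
      }
    });
  }
}
```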


Daniel






-- 
Daniel Barclay
MapR Technologies

Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Sudheesh Katkam <sk...@maprtech.com>.
Correction: I ran the tests before checking them in.

> On Apr 29, 2015, at 7:53 AM, Sudheesh Katkam <sk...@maprtech.com> wrote:
> 
> I am responsible for those tests. I ran the tests at least 10 times on my Linux VM with 1-second pauses, all of which passed.
> 
> On your second run, what different errors did you see?
> 
> On your third run, are you able to reproduce the test case that hangs?
> 
> Sorry that the message is not informative. I already have a patch, a slight improvement to Jacques' change, that improves the message in those tests.
> 
> What tool did you use to get the thread count?
> 
> - Sudheesh
> 
> Sent from my iPhone. Pardon any typos.

Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Sudheesh Katkam <sk...@maprtech.com>.
1) I ran “mvn clean install” back to back (almost), so the runs were serial and covered all tests.

2) The tests rely on pause times so that threads wait 'long enough' for Drill to receive and propagate a cancel signal. The proposal in DRILL-2697 <https://issues.apache.org/jira/browse/DRILL-2697> would make the test cases work without any timing issues.
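To illustrate the difference between a fixed pause and an explicit signal, here is a minimal, generic sketch: the worker announces that it has reached its pause point and then blocks until the test has actually delivered the cancel, so the outcome no longer depends on how long a sleep lasts relative to machine load. This is not the DRILL-2697 proposal, whose details aren't given here; CancelHandshake and its methods are hypothetical names, and only java.util.concurrent classes are used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch: replace "sleep N seconds and hope the cancel arrived"
 * with an explicit handshake between the paused worker and the test.
 */
public class CancelHandshake {
  private final CountDownLatch reachedPausePoint = new CountDownLatch(1);
  private final CountDownLatch cancelDelivered = new CountDownLatch(1);
  private final AtomicBoolean cancelRequested = new AtomicBoolean(false);

  /** Worker side: announce the pause point, then block until the test acts. */
  public boolean runWorker() throws InterruptedException {
    reachedPausePoint.countDown();
    // The timeout is only a safety net for a broken test; normally await()
    // returns as soon as the signal arrives, however slow the machine is.
    if (!cancelDelivered.await(60, TimeUnit.SECONDS)) {
      return false;
    }
    return cancelRequested.get(); // true: the worker observed the cancel request
  }

  /** Test side: wait for the worker to reach the pause point, then cancel. */
  public void cancelWhenPaused() throws InterruptedException {
    reachedPausePoint.await(); // waits for the event itself, not a guessed delay
    cancelRequested.set(true);
    cancelDelivered.countDown();
  }

  public static void main(String[] args) throws Exception {
    CancelHandshake handshake = new CancelHandshake();
    AtomicBoolean sawCancel = new AtomicBoolean(false);
    Thread worker = new Thread(() -> {
      try {
        sawCancel.set(handshake.runWorker());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "worker");
    worker.start();
    handshake.cancelWhenPaused();
    worker.join();
    System.out.println("worker saw cancel: " + sawCancel.get());
  }
}
```

The latch ordering guarantees the cancel is set before the worker checks it, no matter how the two threads are scheduled.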

> On Apr 29, 2015, at 9:15 AM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> Quick question re 10 runs: are these runs that are in parallel with all the
> unit tests or just this test?
> 
> The other question is: how do we construct these tests so that it is
> extremely unlikely to get a failure even if processing is slow or threads
> are suspended?


Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
On Wed, Apr 29, 2015 at 9:15 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Quick question re 10 runs: are these runs that are in parallel with all the
> unit tests or just this test?
>
> The other question is: how do we construct these tests so that it is
> extremely unlikely to get a failure even if processing is slow or threads
> are suspended?
>

The first problems we hit when processing is slow are JUnit timeouts. Once a
unit test times out, its corresponding query isn't cancelled and may
continue running in parallel with other unit tests from the same test class.
When the @AfterClass method then shuts down the Drillbits, they may complain
about allocators not being closed, because some queries are in fact still running.
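One way to keep a timed-out query from leaking into later tests is to cancel it at the moment the test gives up. A minimal sketch under stated assumptions: the query is modeled as a plain Callable run on an ExecutorService, and Future.cancel(true) stands in for whatever cancel call the real client offers (not shown here). CancelOnTimeout and runWithDeadline are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical helper: run a query-like task with a deadline and, if the
 * deadline passes, cancel the task before failing, so it cannot keep running
 * into later tests or into @AfterClass shutdown.
 */
public class CancelOnTimeout {

  public static <T> T runWithDeadline(ExecutorService pool, Callable<T> query, long millis)
      throws Exception {
    Future<T> future = pool.submit(query);
    try {
      return future.get(millis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Stand-in for a real query cancel: interrupt the still-running task
      // instead of leaving it orphaned when the test gives up.
      future.cancel(true);
      throw new AssertionError("query did not finish within " + millis + " ms; cancelled it", e);
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Integer fast = runWithDeadline(pool, () -> 42, 1_000);
      System.out.println("fast query returned " + fast);
      try {
        runWithDeadline(pool, () -> {
          Thread.sleep(10_000); // simulates a query that outlives the test's deadline
          return -1;
        }, 200);
      } catch (AssertionError expected) {
        System.out.println(expected.getMessage());
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```

The design choice is that the failure and the cleanup happen in the same place, so a slow run produces one clear timeout instead of a stray query that later shows up as an allocator complaint at shutdown.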





-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Jacques Nadeau <ja...@apache.org>.
Quick question re the 10 runs: were they runs of this test in parallel with
all the other unit tests, or of just this test?

The other question is: how do we construct these tests so that it is
extremely unlikely to get a failure even if processing is slow or threads
are suspended?

On Wed, Apr 29, 2015 at 7:53 AM, Sudheesh Katkam <sk...@maprtech.com>
wrote:

> [...]

Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Sudheesh Katkam <sk...@maprtech.com>.
I am responsible for those tests. I ran the tests at least 10 times on my Linux VM with 1-second pauses, all of which passed.

On your second run, what different errors did you see?

On your third run, are you able to reproduce the test case that hangs?

Sorry that the message is not informative. I already have a patch, a slight improvement to Jacques' change, that improves the message in those tests.

What tool did you use to get the thread count?

- Sudheesh

Sent from my iPhone. Pardon any typos.

> On Apr 29, 2015, at 6:28 AM, Abdel Hakim Deneche <ad...@maprtech.com> wrote:
> 
> [...]

Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
The output from the first run actually contains two different
issues:

1. The error message "Error shutting down Drillbit 'beta'" is most likely
caused by DRILL-2878
<https://issues.apache.org/jira/browse/DRILL-2878>

2. The test failure with "java.lang.AssertionError: null" is most
likely a bug, because that unit test should not fail. I've seen this error
before, but it only happens intermittently.

The system error reported in the 3rd run is actually an "expected" injected
exception, but a count of 278 threads looks suspicious!!!

On Wed, Apr 29, 2015 at 12:13 AM, Daniel Barclay <db...@maprtech.com>
wrote:

> [...]




Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Jacques Nadeau <ja...@apache.org>.
My sense is that it depends too heavily on timing. I've added better error
messages and disabled it until we get it stable. I've opened DRILL-2903 to
track it.

Note that it does report a bunch of expected exceptions as well. However, I
could never run it by itself without seeing a shutdown thread leak, and when
run in combination with the entire unit test suite, it fails sporadically
on a couple of tests, generally because a query completes successfully when
the test was expecting a cancelled completion.
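The original failures read only "java.lang.AssertionError: null", which is presumably what an assertion raised without a message looks like in the surefire output. A minimal sketch of a message-bearing check of the kind described above, assuming the test can observe the query's final outcome: QueryOutcome and assertEndedCancelled are hypothetical stand-ins, not Drill's own query-state type and not the change that was actually committed.

```java
/**
 * Hypothetical sketch: an outcome check that reports what actually happened,
 * instead of failing with a bare "java.lang.AssertionError: null".
 * QueryOutcome is an illustrative enum, not Drill's own query-state type.
 */
public class OutcomeAssertions {

  /** Illustrative final states a test might observe for a query. */
  public enum QueryOutcome { COMPLETED, CANCELLED, FAILED }

  /** Throws an AssertionError naming the test and the actual outcome. */
  public static void assertEndedCancelled(String testName, QueryOutcome actual) {
    if (actual != QueryOutcome.CANCELLED) {
      throw new AssertionError(String.format(
          "%s: expected the query to end CANCELLED, but it ended %s", testName, actual));
    }
  }

  public static void main(String[] args) {
    assertEndedCancelled("cancelledCase", QueryOutcome.CANCELLED); // passes silently
    try {
      assertEndedCancelled("cancelAfterEverythingIsCompleted", QueryOutcome.COMPLETED);
    } catch (AssertionError e) {
      // prints: cancelAfterEverythingIsCompleted: expected the query to end CANCELLED, but it ended COMPLETED
      System.out.println(e.getMessage());
    }
  }
}
```

With a message like this, the sporadic "completed instead of cancelled" case is visible directly in the test report.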
On Apr 29, 2015 3:21 AM, "Daniel Barclay" <db...@maprtech.com> wrote:

> [...]

Re: TestDrillbitResilience broken? assertion errors; now slow/hung, with 278 threads!

Posted by Jacques Nadeau <ja...@apache.org>.
The thread count doesn't seem that surprising given the nature of the
test. It's starting up three distinct Drillbits plus a DrillClient. That
means a large number of RPC pools (3 server and 2 client pools for each
Drillbit, plus a client pool for the DrillClient).

I'd focus on the two actual failures.
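Spelled out, and assuming exactly the per-Drillbit pool counts given above, the pool total works out to:

```latex
\underbrace{3}_{\text{Drillbits}} \times \bigl(\underbrace{3}_{\text{server pools}} + \underbrace{2}_{\text{client pools}}\bigr) + \underbrace{1}_{\text{DrillClient pool}} = 16 \ \text{RPC pools}
```

How many threads each pool runs isn't stated in this thread, so this is a count of pools, not a predicted thread total.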

On Wed, Apr 29, 2015 at 12:13 AM, Daniel Barclay <db...@maprtech.com>
wrote:

> [...]