Posted to dev@allura.apache.org by Dave Brondsema <da...@brondsema.net> on 2015/04/07 00:02:16 UTC

Re: Tests locking up with 100% CPU usage

I started running into this issue again, but I believe I now have a fix for it
too.

The fix ticket is at https://forge-allura.apache.org/p/allura/tickets/7869/
Here's how I figured it out:

On a fresh CentOS VM, ./run_tests -p Allura worked fine, but on a "well-used"
VM it would lock up at 100% CPU.  So there are definitely some environmental
aspects involved that I did not dig into.  What I did was run the tests in a
single process (just nosetests, no ./run_tests helper).  I also ran
`pip uninstall nose_xunitmp`, since that plugin (even when not activated) was
spawning a secondary process.

Then I used http://pyrasite.readthedocs.org/en/latest/ to attach to the
running Python process, since we needed to know what it was doing to use up
100% CPU.  I used pyrasite's thread dump payload to see what each thread in
the process was doing:
http://pyrasite.readthedocs.org/en/latest/Payloads.html#dumping-thread-stacks
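
(For reference, the "dump thread stacks" payload is just a small Python script
that pyrasite injects into the target process.  Roughly, it boils down to
something like the sketch below -- this is my own illustrative version, not
the exact payload from the pyrasite docs:)

    import sys
    import threading
    import traceback

    def dump_thread_stacks():
        # sys._current_frames() maps each thread id to its current frame
        names = dict((t.ident, t.name) for t in threading.enumerate())
        for thread_id, frame in sys._current_frames().items():
            print('--- Thread %s (%s) ---' % (thread_id, names.get(thread_id, '?')))
            print(''.join(traceback.format_stack(frame)))

    dump_thread_stacks()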

Sometimes I didn't see anything helpful, but once I saw a very
recursive-looking call going through `lookup_template_engine` in `patches.py`
many, many times.  I commented that monkeypatch out and ran the tests again;
they completed, but with a max recursion error on the monkeypatched `encode`
method.  Commenting that one out as well, the tests completed with just a few
test failures (presumably due to the monkeypatches not being applied).  Then I
saw the comment at the end of `patches.py` describing the patches being
applied too many times during tests.  I confirmed that was happening by
restoring the monkeypatch code and logging the stack depth
`len(traceback.extract_stack())` from inside a monkeypatched method.  It would
start around 60 and slowly creep up into the mid-100s; perhaps it would've
gone higher if I had let it keep running.  With the fix to apply the
monkeypatches only once, it stays right around 60.
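
(To illustrate why the stack depth creeps up -- this is a toy sketch with
made-up names, not Allura's actual patching code: re-applying a monkeypatch
wraps the already-wrapped function, so each extra application adds another
frame to every call.  The "apply only once" guard at the end is just one way
the fix could look; the real change is in ticket #7869.)

    import traceback

    class Render(object):
        def lookup_template_engine(self, spec):  # stand-in for the real method
            return spec

    def apply_patches():
        # Wraps whatever is currently installed -- including any previously
        # installed wrapper -- so repeated calls stack wrappers on top of
        # each other.
        original = Render.lookup_template_engine
        def patched(self, spec):
            print('stack depth: %s' % len(traceback.extract_stack()))
            return original(self, spec)
        Render.lookup_template_engine = patched

    # If test setup applies the patches once per test instead of once per
    # process, every call goes through one more wrapper each time:
    for _ in range(5):
        apply_patches()
    Render().lookup_template_engine('json')  # prints a deeper stack per wrapper

    # A simple guard to only ever patch once:
    _already_patched = [False]
    def apply_patches_once():
        if _already_patched[0]:
            return
        _already_patched[0] = True
        apply_patches()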

I don't know why this would cause some environments to completely "lock up".
Maybe they didn't and it just seemed like it; I would've expected a max
recursion error to be thrown.  But the change does seem to fix it, and it
explains why the problem was tied to the number of tests run in a single
process, and why multiprocessing avoided it by dividing the test suites into
multiple pieces.

Surprisingly, I didn't notice any speedup from this change when comparing a
multiprocessing ./run_tests run on master against one with the
monkeypatch-once logic.
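
(Side note on the workaround discussed in the quoted thread below: it depends
on how many nose processes get used.  If we ever do want ./run_tests to force
a minimum, something along these lines would do it -- hypothetical sketch, not
the actual run_tests code:)

    import multiprocessing

    def pick_process_count(minimum=2):
        # One process per CPU core, but never fewer than `minimum`, so even
        # a single-core box or VM still splits the suite across processes.
        try:
            cores = multiprocessing.cpu_count()
        except NotImplementedError:
            cores = 1
        return max(minimum, cores)

    # e.g. pass '--processes=%d' % pick_process_count() to nosetests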

On 9/30/14 7:48 AM, Igor Bondarenko wrote:
> We're currently using `./run_tests -n 2 -m 4` on single-core machines and
> that works fine for us.
> 
> On Mon, Sep 29, 2014 at 5:44 PM, Dave Brondsema <da...@brondsema.net> wrote:
> 
>> Seems like there's no easy fix for this :(
>>
>> Since the workaround we're adopting is to run the Allura package's tests
>> with nosetests --processes=2 (or more), we should probably force
>> ./run_tests to do that, so it doesn't cause problems for somebody trying
>> out Allura on a single-core machine or VM.  Any downside to that?
>>
>> On 9/25/14 12:41 AM, Alex Luberg wrote:
>>> I have discovered that the suite passed with 756 tests, and if I added
>>> another test (just copied some existing one with a different name) it
>>> locked up at some test (which was not the one I copied).  I suspect that
>>> it is not related to the actual test code, but something with
>>> nose/python/sandbox.
>>>
>>> On Mon, Sep 22, 2014 at 3:40 AM, Igor Bondarenko <je...@gmail.com>
>> wrote:
>>>
>>>> On Sat, Sep 20, 2014 at 12:25 AM, Dave Brondsema <da...@brondsema.net>
>>>> wrote:
>>>>
>>>>> On 9/19/14 12:18 PM, Dave Brondsema wrote:
>>>>>> Starting with Igor's comments on
>>>>>> https://sourceforge.net/p/allura/tickets/7657/#c7d9
>>>>>>
>>>>>>> There are a couple of new tests commented out in the last commit. I
>>>>>>> can't figure out why, but they cause allura/tests/test_dispatch.py to
>>>>>>> hang when run together with other tests. Also I have added and then
>>>>>>> removed tests for enable/disable user for the same reason.
>>>>>>>
>>>>>>> I think it needs another pair of eyes on it, since I've already spent
>>>>>>> too much time dealing with these tests and have no idea what's
>>>>>>> happening... Maybe I'm missing something obvious.
>>>>>>
>>>>>> Alex and I have seen this recently too, and it's hard to figure out
>>>>>> what exactly the problem is.  I first noticed it when running
>>>>>> `./run_tests --with-coverage`, which would run nosetests in the Allura
>>>>>> dir and would not use --processes=N because of the with-coverage
>>>>>> param.  So basically just a regular run of the tests in the Allura dir
>>>>>> would cause the CPU to go into 100% usage and the tests wouldn't
>>>>>> finish.  Couldn't ctrl-C or profile them; had to kill -9 it.
>>>>>>
>>>>>> That was on CentOS 5.10, and a workaround was to run with
>>>>>> --processes=N and then the tests would finish fine.  On the Ubuntu
>>>>>> vagrant image, I didn't encounter any problem in the first place.  So
>>>>>> perhaps it's related to the environment.
>>>>>>
>>>>>> I tried to narrow down to a specific test that might be the culprit.
>>>>>> I found tests consistently got up to TestSecurity.test_auth (which is
>>>>>> a bit of a weird and old test anyway).  And also that commenting out
>>>>>> that test let them all pass.
>>>>>>
>>>>>> But I'm pretty sure Alex said he dug into this as well and found
>>>>>> variation in what tests could cause the problem.  I think he told me
>>>>>> that going back in git history before the problem, and then adding a
>>>>>> single test (a copy of an existing one), caused the problem.  So
>>>>>> perhaps some limit or resource tipping point is hit.
>>>>>>
>>>>>> Alex or Igor, any more data points you know from what you've seen?
>>>>>>
>>>>>> Anyone else seen anything like this?  Or have ideas for how to
>>>>>> approach nailing it down better?
>>>>>>
>>>>>>
>>>>>
>>>>> I tried checking out branch je/42cc_7657 and going back to commit
>>>>> 4cc63586e5728d7d0c5c2f09150eb07eb7e4edc1 (before the tests were
>>>>> commented out) to see what happened for me:
>>>>>
>>>>> On vagrant / ubuntu, it froze at test_dispatch.py, same as you.  So
>>>>> some consistency there.  Tests passed when I ran `nosetests
>>>>> --process-timeout=180 --processes=4 -v` in the Allura dir.  Seemed slow
>>>>> at the end though; I almost thought it froze.
>>>>>
>>>>> On CentOS, it froze at a different spot with a regular nosetests run.
>>>>> It passed with `nosetests allura/tests/ --processes=4
>>>>> --process-timeout=180 -v`.  For some reason (hopefully unrelated), I
>>>>> needed to specify the path "allura/tests/" to avoid an IOError from
>>>>> multiprocessing.
>>>>>
>>>>> So at least running the tests multiprocess still seems like a
>>>>> workaround for me.  Note: ./run_tests picks a --processes=N value
>>>>> dynamically based on the machine's CPU cores, so with a single core you
>>>>> don't get multiple processes that way.  Also note: if you have
>>>>> nose-progressive installed and active, it is incompatible with multiple
>>>>> processes.
>>>>>
>>>>>
>>>> It works exactly as you described for me too.
>>>>
>>>> I've reverted some commits with those tests, since the problem is not
>>>> with them and they are useful
>>>> https://sourceforge.net/p/allura/tickets/7657/#8c06 and also made a fix
>>>> in 42cc's Makefile (committed directly in master) so that it would
>>>> always run tests in parallel (turns out here at 42cc we have single-core
>>>> CPUs on the boxes that run tests; that's why I had lockups on our CI
>>>> too :( )
>>>>
>>>
>>
>>
>>
>> --
>> Dave Brondsema : dave@brondsema.net
>> http://www.brondsema.net : personal
>> http://www.splike.com : programming
>>               <><
>>
> 



-- 
Dave Brondsema : dave@brondsema.net
http://www.brondsema.net : personal
http://www.splike.com : programming
              <><

Re: Tests locking up with 100% CPU usage

Posted by Igor Bondarenko <je...@gmail.com>.
Good catch!
