Posted to dev@couchdb.apache.org by Stefan Kögl <ko...@gmail.com> on 2012/03/01 12:17:17 UTC

Crash of CouchDB 1.2.x

Hello,

My experiments to replicate some live data / traffic to a CouchDB
1.2.x (running the current 1.2.x branch + the patch from [1]), the
ones that sparked the indexing speed discussions, also turned up
another (potential) problem. First, sorry for not yet reporting back
any performance measurements; I haven't found the time to run the
tests on my machines.

Anyway, I found the following stack traces in my log (after noticing
that some requests failed and compaction of a view stopped)

http://skoegl.net/~stefan/tmp/couchdb-1.2.x-crash.txt

The file starts at the first failed request. Every request before
that returned a positive (i.e. 2xx) status code.
some "natural" reason (such as timeouts, lack of RAM, etc), but I'm
not sure how to interpret Erlang stack traces. Can somebody point me
in the right direction for diagnosing the problem?


Thanks,

-- Stefan


[1] http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w

Re: Crash of CouchDB 1.2.x

Posted by Jason Smith <jh...@iriscouch.com>.
I seem to remember that, say, ext2 had more or less constant-time unlinking.

On Mon, Mar 12, 2012 at 10:32 AM, Robert Newson <rn...@apache.org> wrote:
> I can confirm that XFS is aggressive when deleting large files (other
> i/o requests are slow or blocked while it does it). It has been
> necessary to iteratively truncate a file instead of a simple 'rm' in
> production to avoid that problem. Increasing the size of extent
> preallocation ought to help considerably but I've not yet deployed
> that change. I *can* confirm that you can't 'ionice' the rm call,
> though.
>
> B.
>
> On 12 March 2012 05:00, Randall Leeds <ra...@gmail.com> wrote:
>> On Mar 11, 2012 7:40 PM, "Jason Smith" <jh...@iriscouch.com> wrote:
>>>
>>> On Mon, Mar 12, 2012 at 8:44 AM, Randall Leeds <ra...@gmail.com>
>> wrote:
>>> > I'm not sure what else you could provide after the fact. If your couch
>>> > came back online automatically, and did so quickly, I would expect to
>>> > see very long response times while the disk was busy freeing the old,
>>> > un-compacted file. We have had some fixes in the last couple releases
>>> > to address similar issues, but maybe there's something lurking still.
>>> > I've got no other ideas/leads at this time.
>>>
>>> Another long shot, but you could try a filesystem that doesn't
>>> synchronously reclaim the space, like (IIRC) XFS, btrfs, or I think
>>> ext2.
>>
>> I think you're referring to extents, which, IIRC, allow large, contiguous
>> sections of a file to be allocated and freed with less bookkeeping and,
>> therefore, fewer writes. This behavior is not any more or less synchronous.
>>
>> In my production experience, XFS does not show much benefit from this,
>> because any machine with more than one growing database still ends up
>> with file fragmentation that limits the gains from extents.
>>
>> I suspect, but have not tried to verify, that very large RAID stripe sizes
>> that force preallocation of larger blocks might deliver some gains.
>>
>> I have an open ticket for a manual delete option, which was designed to
>> allow deletion of trashed files to occur during low-volume hours or using
>> tools like ionice.  Unfortunately, I never got a chance to experiment with
>> that setup in production, though I have seen ionice help significantly to
>> keep request latency down when doing large deletes (just not in this
>> particular use case).



-- 
Iris Couch

Re: Crash of CouchDB 1.2.x

Posted by Robert Newson <rn...@apache.org>.
I can confirm that XFS is aggressive when deleting large files (other
i/o requests are slow or blocked while it does it). It has been
necessary to iteratively truncate a file instead of a simple 'rm' in
production to avoid that problem. Increasing the size of extent
preallocation ought to help considerably but I've not yet deployed
that change. I *can* confirm that you can't 'ionice' the rm call,
though.
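A minimal sketch of that iterative truncation, assuming GNU coreutils
(truncate, stat -c); the file name and sizes are placeholders, scaled
down for illustration:

```shell
# Shrink the file in fixed steps before unlinking it, so the filesystem
# frees extents gradually instead of in one long, blocking burst.
# Demo values: a 16 MiB scratch file freed 4 MiB at a time; a real
# deployment would use steps of e.g. 1 GiB and sleep between steps
# to let other i/o through.
FILE=demo.couch
STEP=$((4 * 1024 * 1024))

dd if=/dev/zero of="$FILE" bs=1M count=16 2>/dev/null

SIZE=$(stat -c %s "$FILE")
while [ "$SIZE" -gt "$STEP" ]; do
    SIZE=$((SIZE - STEP))
    truncate -s "$SIZE" "$FILE"   # drops only the tail extents
done
rm -- "$FILE"
```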

B.

On 12 March 2012 05:00, Randall Leeds <ra...@gmail.com> wrote:
> On Mar 11, 2012 7:40 PM, "Jason Smith" <jh...@iriscouch.com> wrote:
>>
>> On Mon, Mar 12, 2012 at 8:44 AM, Randall Leeds <ra...@gmail.com>
> wrote:
>> > I'm not sure what else you could provide after the fact. If your couch
>> > came back online automatically, and did so quickly, I would expect to
>> > see very long response times while the disk was busy freeing the old,
>> > un-compacted file. We have had some fixes in the last couple releases
>> > to address similar issues, but maybe there's something lurking still.
>> > I've got no other ideas/leads at this time.
>>
>> Another long shot, but you could try a filesystem that doesn't
>> synchronously reclaim the space, like (IIRC) XFS, btrfs, or I think
>> ext2.
>
> I think you're referring to extents, which, IIRC, allow large, contiguous
> sections of a file to be allocated and freed with less bookkeeping and,
> therefore, fewer writes. This behavior is not any more or less synchronous.
>
> In my production experience, XFS does not show much benefit from this,
> because any machine with more than one growing database still ends up
> with file fragmentation that limits the gains from extents.
>
> I suspect, but have not tried to verify, that very large RAID stripe sizes
> that force preallocation of larger blocks might deliver some gains.
>
> I have an open ticket for a manual delete option, which was designed to
> allow deletion of trashed files to occur during low-volume hours or using
> tools like ionice.  Unfortunately, I never got a chance to experiment with
> that setup in production, though I have seen ionice help significantly to
> keep request latency down when doing large deletes (just not in this
> particular use case).

Re: Crash of CouchDB 1.2.x

Posted by Randall Leeds <ra...@gmail.com>.
On Mar 11, 2012 7:40 PM, "Jason Smith" <jh...@iriscouch.com> wrote:
>
> On Mon, Mar 12, 2012 at 8:44 AM, Randall Leeds <ra...@gmail.com>
wrote:
> > I'm not sure what else you could provide after the fact. If your couch
> > came back online automatically, and did so quickly, I would expect to
> > see very long response times while the disk was busy freeing the old,
> > un-compacted file. We have had some fixes in the last couple releases
> > to address similar issues, but maybe there's something lurking still.
> > I've got no other ideas/leads at this time.
>
> Another long shot, but you could try a filesystem that doesn't
> synchronously reclaim the space, like (IIRC) XFS, btrfs, or I think
> ext2.

I think you're referring to extents, which, IIRC, allow large, contiguous
sections of a file to be allocated and freed with less bookkeeping and,
therefore, fewer writes. This behavior is not any more or less synchronous.

In my production experience, XFS does not show much benefit from this,
because any machine with more than one growing database still ends up
with file fragmentation that limits the gains from extents.

I suspect, but have not tried to verify, that very large RAID stripe sizes
that force preallocation of larger blocks might deliver some gains.

I have an open ticket for a manual delete option, which was designed to
allow deletion of trashed files to occur during low-volume hours or using
tools like ionice.  Unfortunately, I never got a chance to experiment with
that setup in production, though I have seen ionice help significantly to
keep request latency down when doing large deletes (just not in this
particular use case).
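That ionice-assisted delete might look like the following sketch; the
file name is a placeholder, and (as noted elsewhere in this thread) the
idle class does not help where the filesystem frees extents in-kernel,
as XFS does:

```shell
# Remove a trashed compaction file under the idle i/o scheduling class
# (-c 3), so the unlink only consumes otherwise-spare disk time.
FILE=trashed.view.compact   # placeholder name
touch "$FILE"
if command -v ionice >/dev/null 2>&1; then
    ionice -c 3 rm -- "$FILE"
else
    rm -- "$FILE"           # ionice unavailable; plain unlink
fi
```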

Re: Crash of CouchDB 1.2.x

Posted by Jason Smith <jh...@iriscouch.com>.
On Mon, Mar 12, 2012 at 8:44 AM, Randall Leeds <ra...@gmail.com> wrote:
> I'm not sure what else you could provide after the fact. If your couch
> came back online automatically, and did so quickly, I would expect to
> see very long response times while the disk was busy freeing the old,
> un-compacted file. We have had some fixes in the last couple releases
> to address similar issues, but maybe there's something lurking still.
> I've got no other ideas/leads at this time.

Another long shot, but you could try a filesystem that doesn't
synchronously reclaim the space, like (IIRC) XFS, btrfs, or I think
ext2.

-- 
Iris Couch

Re: Crash of CouchDB 1.2.x

Posted by Randall Leeds <ra...@gmail.com>.
On Sun, Mar 11, 2012 at 07:56, Stefan Kögl <ko...@gmail.com> wrote:
> On 03/11/2012 02:32 PM, Jason Smith wrote:
>>
>> Longshot, but is it possible that couch had a file handle to an
>> unlinked file, so once the (OS) process crashed, the space was
>> freed?
>
>
> Hmm.. that might be possible. I ran a database compaction before that. When
> I noticed the crash I saw that the db compaction finished, but it might be
> possible that it still had a handle to the old db file.

{badmatch,{error,enospc}} is exactly the out of space error coming
straight up out of the kernel.
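On Linux the same errno can be reproduced against /dev/full, which
rejects every write with ENOSPC; it surfaces in the CouchDB logs as
{badmatch,{error,enospc}} because the write result no longer matches
the expected success tuple:

```shell
# /dev/full fails all writes with ENOSPC, the same kernel error that
# shows up in the Erlang logs as {badmatch,{error,enospc}}.
dd if=/dev/zero of=/dev/full bs=512 count=1 2> err.txt || true
grep -i "no space left on device" err.txt
```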

>
> How should we proceed from here? Is it possible for me to provide further
> information about that in retrospect?

I'm not sure what else you could provide after the fact. If your couch
came back online automatically, and did so quickly, I would expect to
see very long response times while the disk was busy freeing the old,
un-compacted file. We have had some fixes in the last couple releases
to address similar issues, but maybe there's something lurking still.
I've got no other ideas/leads at this time.

>
>
> -- Stefan

Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
On 03/11/2012 02:32 PM, Jason Smith wrote:
> Longshot, but is it possible that couch had a file handle to an
> unlinked file, so once the (OS) process crashed, the space was
> freed?

Hmm.. that might be possible. I ran a database compaction before that. 
When I noticed the crash I saw that the db compaction finished, but it 
might be possible that it still had a handle to the old db file.

How should we proceed from here? Is it possible for me to provide 
further information about that in retrospect?


-- Stefan

Re: Crash of CouchDB 1.2.x

Posted by Jason Smith <jh...@iriscouch.com>.
Longshot, but is it possible that couch had a file handle to an
unlinked file, so once the (OS) process crashed, the space was freed?
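That situation (space pinned by an open descriptor to an unlinked file)
is easy to demonstrate in shell; the scratch file below is a
placeholder:

```shell
# An unlinked file's blocks are reclaimed only once the last open
# descriptor is closed. A background subshell holds fd 3 across the
# rm: the name disappears immediately, but the space is freed only
# when the subshell exits (e.g. when an OS process crashes).
FILE=unlinked_demo.dat
dd if=/dev/zero of="$FILE" bs=1M count=8 2>/dev/null
( exec 3< "$FILE"; rm -- "$FILE"; sleep 2 ) &
sleep 1
[ -e "$FILE" ] || echo "name gone; blocks still pinned by the open fd"
wait   # subshell exits, descriptor closes, kernel frees the blocks
```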

On Sun, Mar 11, 2012 at 7:50 PM, Stefan Kögl <ko...@gmail.com> wrote:
> On 03/11/2012 01:33 PM, Bob Dionne wrote:
>>
>> At a glance, I would suspect you've run out of disk space and the
>> error thrown is not caught, resulting in the badmatch.
>
>
> At the time of the crash there was about 70G of free space left, which
> should have been enough for the compaction to finish.
>
>
> -- Stefan



-- 
Iris Couch

Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
On 03/11/2012 01:33 PM, Bob Dionne wrote:
> At a glance, I would suspect you've run out of disk space and the
> error thrown is not caught, resulting in the badmatch.

At the time of the crash there was about 70G of free space left, which
should have been enough for the compaction to finish.


-- Stefan

Re: Crash of CouchDB 1.2.x

Posted by Bob Dionne <di...@dionne-associates.com>.
Stefan,

At a glance, I would suspect you've run out of disk space and the error thrown is not caught, resulting in the badmatch.

Bob

On Mar 11, 2012, at 6:41 AM, Stefan Kögl wrote:

> Hi,
> 
> I had my CouchDB 1.2.x (fb72251bc7114b07f0667867226ec9e200732dac)
> crash again twice today.
> 
> The first one was while the instance was pull replicating (which
> failed due to the source being unreachable), and compacting a rather
> large view (from ~216G disk size to about 57G data size, if that's
> relevant).
> 
> Here's the log that shows the crash
> 
> http://friendpaste.com/41Idie3gGdQRxJPEyVHpTR
> 
> After the crash the view compaction stopped, and I tried to restart it
> 
> $ curl -H "Content-Type: application/json" -X POST
> http://stefan:********@localhost:5984/mygpo/_compact/users-tmp
> {"error":"timeout","reason":"{gen_server,call,\n
> [<0.19783.69>,\n
> {start_compact,#Fun<couch_view_compactor.0.15011741>}]}"}
> 
> http://friendpaste.com/2A086gHN8dNEJHPpMkDrPO
> 
> I assume this is because deleting the .compact.view file took too
> long. The compaction started anyway, though. Besides the replication,
> there were no other activities on the server.
> 
> Please let me know if I can assist with debugging somehow.
> 
> 
> -- Stefan
> 
> 
> 
> On Fri, Mar 2, 2012 at 11:51 AM, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> On Mar 2, 2012, at 11:29 , Stefan Kögl wrote:
>> 
>>> On Thu, Mar 1, 2012 at 9:39 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>> Where in there did you do the git pull? And was a make clean or git clean
>>>> involved?
>>> 
>>> IIRC I did not pull in between, only apply the patch I mentioned
>>> earlier ( http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w ). And I
>>> probably did a make clean && make && make install.
>>> 
>>> 
>>>> I think you should be in the clear with that procedure, but just to be
>>>> sure, I think it'd be worth rm'ing all .beam files you find manually
>>>> after the uninstall.
>>> 
>>> Done, I'll report back if the problem appears again.
>> 
>> Thanks Stefan, your help here is really appreciated :)
>> 
>> Cheers
>> Jan
>> --
>> 


Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
Hi,

I had my CouchDB 1.2.x (fb72251bc7114b07f0667867226ec9e200732dac)
crash again twice today.

The first one was while the instance was pull replicating (which
failed due to the source being unreachable), and compacting a rather
large view (from ~216G disk size to about 57G data size, if that's
relevant).

Here's the log that shows the crash

http://friendpaste.com/41Idie3gGdQRxJPEyVHpTR

After the crash the view compaction stopped, and I tried to restart it

$ curl -H "Content-Type: application/json" -X POST
http://stefan:********@localhost:5984/mygpo/_compact/users-tmp
{"error":"timeout","reason":"{gen_server,call,\n
[<0.19783.69>,\n
{start_compact,#Fun<couch_view_compactor.0.15011741>}]}"}

http://friendpaste.com/2A086gHN8dNEJHPpMkDrPO

I assume this is because deleting the .compact.view file took too
long. The compaction started anyway, though. Besides the replication,
there were no other activities on the server.

Please let me know if I can assist with debugging somehow.


-- Stefan



On Fri, Mar 2, 2012 at 11:51 AM, Jan Lehnardt <ja...@apache.org> wrote:
>
> On Mar 2, 2012, at 11:29 , Stefan Kögl wrote:
>
>> On Thu, Mar 1, 2012 at 9:39 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>> Where in there did you do the git pull? And was a make clean or git clean
>>> involved?
>>
>> IIRC I did not pull in between, only apply the patch I mentioned
>> earlier ( http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w ). And I
>> probably did a make clean && make && make install.
>>
>>
>>> I think you should be in the clear with that procedure, but just to be
>>> sure, I think it'd be worth rm'ing all .beam files you find manually
>>> after the uninstall.
>>
>> Done, I'll report back if the problem appears again.
>
> Thanks Stefan, your help here is really appreciated :)
>
> Cheers
> Jan
> --
>

Re: Crash of CouchDB 1.2.x

Posted by Jan Lehnardt <ja...@apache.org>.
On Mar 2, 2012, at 11:29 , Stefan Kögl wrote:

> On Thu, Mar 1, 2012 at 9:39 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> Where in there did you do the git pull? And was a make clean or git clean
>> involved?
> 
> IIRC I did not pull in between, only apply the patch I mentioned
> earlier ( http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w ). And I
> probably did a make clean && make && make install.
> 
> 
>> I think you should be in the clear with that procedure, but just to be
>> sure, I think it'd be worth rm'ing all .beam files you find manually
>> after the uninstall.
> 
> Done, I'll report back if the problem appears again.

Thanks Stefan, your help here is really appreciated :)

Cheers
Jan
-- 


Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
On Thu, Mar 1, 2012 at 9:39 PM, Jan Lehnardt <ja...@apache.org> wrote:
> Where in there did you do the git pull? And was a make clean or git clean
> involved?

IIRC I did not pull in between, only apply the patch I mentioned
earlier ( http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w ). And I
probably did a make clean && make && make install.


> I think you should be in the clear with that procedure, but just to be
> sure, I think it'd be worth rm'ing all .beam files you find manually
> after the uninstall.

Done, I'll report back if the problem appears again.


-- Stefan

Re: Crash of CouchDB 1.2.x

Posted by Jan Lehnardt <ja...@apache.org>.
On Mar 1, 2012, at 19:57 , Stefan Kögl wrote:

> On 03/01/2012 07:38 PM, Jan Lehnardt wrote:
>> On Mar 1, 2012, at 19:18 , Stefan Kögl wrote:
>>> If this is a problem, I could remove CouchDB first and do a fresh
>>> install instead. What would be the preferred way to do a clean
>>> uninstall?
>> 
>> I don't want to claim that this is definitely the cause for your
>> problem, but it'd be great if you could do a clean, fresh, empty
>> install to make sure we can rule that out as a cause :)
> 
> I just did
> 
> /etc/init.d/couchdb stop
> make uninstall
> make install
> # edit local.ini -- why does that get removed anyway?
> /etc/init.d/couchdb start
> 
> Is that enough to count as a fresh install, or should I do anything
> else? I'll continue monitoring the instance. Previously the error
> happened after a few days, so I can't say yet if the re-install changed
> anything.

Where in there did you do the git pull? And was a make clean or git clean
involved?

I think you should be in the clear with that procedure, but just to be
sure, I think it'd be worth rm'ing all .beam files you find manually
after the uninstall.
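That sweep could look like the following sketch; the prefix is a
scratch directory here, since the real location depends on your
configure prefix:

```shell
# Delete leftover compiled .beam files under an install prefix after
# `make uninstall`, so no stale modules can be loaded by mistake.
PREFIX=./fake_prefix   # placeholder; use your real install prefix
mkdir -p "$PREFIX/lib/couchdb/erlang"
touch "$PREFIX/lib/couchdb/erlang/couch_db.beam"   # simulate a leftover
find "$PREFIX" -name '*.beam' -delete
find "$PREFIX" -name '*.beam'   # prints nothing: sweep is complete
```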

local.ini is usually preserved across installations, but make uninstall
is obviously an intent to get rid of all traces of a package, so it should
get removed :)

Cheers
Jan
-- 


Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
On 03/01/2012 07:38 PM, Jan Lehnardt wrote:
> On Mar 1, 2012, at 19:18 , Stefan Kögl wrote:
>> If this is a problem, I could remove CouchDB first and do a fresh
>> install instead. What would be the preferred way to do a clean
>> uninstall?
> 
> I don't want to claim that this is definitely the cause for your
> problem, but it'd be great if you could do a clean, fresh, empty
> install to make sure we can rule that out as a cause :)

I just did

/etc/init.d/couchdb stop
make uninstall
make install
# edit local.ini -- why does that get removed anyway?
/etc/init.d/couchdb start

Is that enough to count as a fresh install, or should I do anything
else? I'll continue monitoring the instance. Previously the error
happened after a few days, so I can't say yet if the re-install changed
anything.


-- Stefan

Re: Crash of CouchDB 1.2.x

Posted by Jan Lehnardt <ja...@apache.org>.
On Mar 1, 2012, at 19:18 , Stefan Kögl wrote:

> On Thu, Mar 1, 2012 at 4:52 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> Can you tell us how you installed 1.2.x? Is it a fresh installation,
>> or did you do an in-place update from an earlier installation (earlier
>> 1.2.x, 1.1.x, or 1.0.x)?
> 
> I first did a fresh install of 1.2.x using R15B. I then removed R15B,
> installed R14B04 (both from source), compiled 1.2.x with the patch I
> mentioned earlier, and did an in-place update.
> 
> If this is a problem, I could remove CouchDB first and do a fresh
> install instead. What would be the preferred way to do a clean
> uninstall?

I don't want to claim that this is definitely the cause for your
problem, but it'd be great if you could do a clean, fresh, empty
install to make sure we can rule that out as a cause :)

Cheers
Jan
-- 

> 
> 
> -- Stefan
> 
> 
>> On Mar 1, 2012, at 12:17 , Stefan Kögl wrote:
>> 
>>> Hello,
>>> 
>>> My experiments to replicate some live data / traffic to a CouchDB
>>> 1.2.x (running the current 1.2.x branch + the patch from [1]), the
>>> ones that sparked the indexing speed discussions, also turned up
>>> another (potential) problem. First, sorry for not yet reporting back
>>> any performance measurements; I haven't found the time to run the
>>> tests on my machines.
>>> 
>>> Anyway, I found the following stack traces in my log (after noticing
>>> that some requests failed and compaction of a view stopped)
>>> 
>>> http://skoegl.net/~stefan/tmp/couchdb-1.2.x-crash.txt
>>> 
>>> The file starts at the first failed request. Every request before
>>> that returned a positive (i.e. 2xx) status code.
>>> some "natural" reason (such as timeouts, lack of RAM, etc), but I'm
>>> not sure how to interpret Erlang stack traces. Can somebody point me
>>> in the right direction for diagnosing the problem?
>>> 
>>> 
>>> Thanks,
>>> 
>>> -- Stefan
>>> 
>>> 
>>> [1] http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w
>> 


Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
On Thu, Mar 1, 2012 at 4:52 PM, Jan Lehnardt <ja...@apache.org> wrote:
> Can you tell us how you installed 1.2.x? Is it a fresh installation,
> or did you do an in-place update from an earlier installation (earlier
> 1.2.x, 1.1.x, or 1.0.x)?

I first did a fresh install of 1.2.x using R15B. I then removed R15B,
installed R14B04 (both from source), compiled 1.2.x with the patch I
mentioned earlier, and did an in-place update.

If this is a problem, I could remove CouchDB first and do a fresh
install instead. What would be the preferred way to do a clean
uninstall?


-- Stefan


> On Mar 1, 2012, at 12:17 , Stefan Kögl wrote:
>
>> Hello,
>>
>> My experiments to replicate some live data / traffic to a CouchDB
>> 1.2.x (running the current 1.2.x branch + the patch from [1]), the
>> ones that sparked the indexing speed discussions, also turned up
>> another (potential) problem. First, sorry for not yet reporting back
>> any performance measurements; I haven't found the time to run the
>> tests on my machines.
>>
>> Anyway, I found the following stack traces in my log (after noticing
>> that some requests failed and compaction of a view stopped)
>>
>> http://skoegl.net/~stefan/tmp/couchdb-1.2.x-crash.txt
>>
>> The file starts at the first failed request. Every request before
>> that returned a positive (i.e. 2xx) status code.
>> some "natural" reason (such as timeouts, lack of RAM, etc), but I'm
>> not sure how to interpret Erlang stack traces. Can somebody point me
>> in the right direction for diagnosing the problem?
>>
>>
>> Thanks,
>>
>> -- Stefan
>>
>>
>> [1] http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w
>

Re: Crash of CouchDB 1.2.x

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Stefan,

thanks for the report, this is very helpful!

Can you tell us how you installed 1.2.x? Is it a fresh installation,
or did you do an in-place update from an earlier installation (earlier
1.2.x, 1.1.x, or 1.0.x)?

Without having dug in too deeply yet, and more as a pointer for the
other devs here: I remember seeing io_lib_* errors in cases where we
catch exceptions in an attempt to produce prettier messages, but the
catch-all clause then tries to print whatever is actually unexpected
and fails there. The result is an io_lib_* stacktrace from the attempt
to print the original stacktrace, which subsequently gets lost.

Cheers
Jan
-- 



On Mar 1, 2012, at 12:17 , Stefan Kögl wrote:

> Hello,
> 
> My experiments to replicate some live data / traffic to a CouchDB
> 1.2.x (running the current 1.2.x branch + the patch from [1]), the
> ones that sparked the indexing speed discussions, also turned up
> another (potential) problem. First, sorry for not yet reporting back
> any performance measurements; I haven't found the time to run the
> tests on my machines.
> 
> Anyway, I found the following stack traces in my log (after noticing
> that some requests failed and compaction of a view stopped)
> 
> http://skoegl.net/~stefan/tmp/couchdb-1.2.x-crash.txt
> 
> The file starts at the first failed request. Every request before
> that returned a positive (i.e. 2xx) status code.
> some "natural" reason (such as timeouts, lack of RAM, etc), but I'm
> not sure how to interpret Erlang stack traces. Can somebody point me
> in the right direction for diagnosing the problem?
> 
> 
> Thanks,
> 
> -- Stefan
> 
> 
> [1] http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w


Re: Crash of CouchDB 1.2.x

Posted by Stefan Kögl <ko...@gmail.com>.
Hi,

On Fri, Mar 2, 2012 at 4:33 AM, Nathan Vander Wilt
<na...@calftrail.com> wrote:
> Was your server under heavy load? Did you end up with a bunch of zombie couchjs processes?

The crash occurred under load, but there are no zombies - at least not anymore.

If the crash happens again I'll try to inspect it more closely.
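For next time, one quick way to check for zombie couchjs processes
(assuming a procps-style ps on Linux):

```shell
# List couchjs processes whose state starts with Z (zombie).
# Prints nothing when there are none.
ps -eo pid,stat,comm | awk '$3 == "couchjs" && $2 ~ /^Z/ { print }'
```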


-- Stefan

Re: Crash of CouchDB 1.2.x

Posted by Nathan Vander Wilt <na...@calftrail.com>.
Was your server under heavy load? Did you end up with a bunch of zombie couchjs processes?

I'm a little worried I'm hopping in on something that might be a separate issue, but I was consistently getting nasty crashes the other day when doing a Blitz.io rush on a _list function. The stack traces were similar to this, but in my case the database became completely unresponsive.

Basically:
1. build-couchdb (1.1.1) on an EC2 t1.micro running Ubuntu
2. Add the '42' file at the path Blitz.io is looking for (a simple way to do this is what's held up filing a JIRA ticket)
3. Run their default rush on a _list function (mine happened to do a fair amount of work, doing a Markdown conversion and Mustache templating)
4. At around the 40 concurrent user mark, CouchDB dies a terrible, horrible death with a bunch of zombie couchjs processes, all those numbers in the logs, and some traces like:

                        {pid,<0.116.0>},
                        {registered_name,[]},
                        {error_info,
                         {exit,
                          {noproc,
                           {gen_server,call,
                            [couch_httpd_vhost,
                             {match_vhost,
                              {mochiweb_request,#Port<0.1461>,'GET',"/",
                               {1,0},
                               {7,

[Tue, 28 Feb 2012 23:47:57 GMT] [error] [<0.712.0>] Uncaught error in HTTP request: {exit,
                                                     {timeout,
                                                      {gen_server,call,
                                                       [couch_query_servers,
                                                        {get_proc,
                                                         {doc,
                                                          <<"_design/glob">>,
                                                          {65,


Does this sound related, or is it a separate issue? (Either way, I'm hoping to get a cleaner set of logs to submit.)

thanks,
-natevw


On Mar 1, 2012, at 3:17 AM, Stefan Kögl wrote:
> Hello,
> 
> My experiments to replicate some live data / traffic to a CouchDB
> 1.2.x (running the current 1.2.x branch + the patch from [1]), the
> ones that sparked the indexing speed discussions, also turned up
> another (potential) problem. First, sorry for not yet reporting back
> any performance measurements; I haven't found the time to run the
> tests on my machines.
> 
> Anyway, I found the following stack traces in my log (after noticing
> that some requests failed and compaction of a view stopped)
> 
> http://skoegl.net/~stefan/tmp/couchdb-1.2.x-crash.txt
> 
> The file starts at the first failed request. Every request before
> that returned a positive (i.e. 2xx) status code.
> some "natural" reason (such as timeouts, lack of RAM, etc), but I'm
> not sure how to interpret Erlang stack traces. Can somebody point me
> in the right direction for diagnosing the problem?
> 
> 
> Thanks,
> 
> -- Stefan
> 
> 
> [1] http://friendpaste.com/178nPFgfyyeGf2vtNRpL0w