Posted to user@couchdb.apache.org by Pasi Eronen <pe...@iki.fi> on 2011/02/25 10:18:47 UTC

CouchDB 1.0.2 errors under load

Hi,

I had a big batch job (inserting 10M+ documents and generating views for them)
that ran just fine for about 6 hours, but then I got this error:

[Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
<0.276.0> terminating
** Last message in was delayed_commit
** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
                            <0.273.0>,<0.277.0>,
                            {db_header,5,739828,0,
                                {4778613011,{663866,0}},
                                {4778614954,663866},
                                nil,0,nil,nil,1000},
                            739828,
                            {btree,<0.273.0>,
                                {4778772755,{663866,0}},
                                #Fun<couch_db_updater.7.10053969>,
                                #Fun<couch_db_updater.8.35220795>,
                                #Fun<couch_btree.5.124754102>,
                                #Fun<couch_db_updater.9.107593676>},
                            {btree,<0.273.0>,
                                {4778774698,663866},
                                #Fun<couch_db_updater.10.30996817>,
                                #Fun<couch_db_updater.11.96515267>,
                                #Fun<couch_btree.5.124754102>,
                                #Fun<couch_db_updater.12.117826253>},
                            {btree,<0.273.0>,nil,
                                #Fun<couch_btree.0.83553141>,
                                #Fun<couch_btree.1.30790806>,
                                #Fun<couch_btree.2.124754102>,nil},
                            739831,<<"foo_replication_tmp">>,
                            "/data/foo/couchdb-data/foo_replication_tmp.couch",
                            [],[],nil,
                            {user_ctx,null,[],undefined},
                            #Ref<0.0.1793.256453>,1000,
                            [before_header,after_header,on_file_open],
                            false}
** Reason for termination ==
** {{badmatch,{error,emfile}},
    [{couch_file,sync,1},
     {couch_db_updater,commit_data,2},
     {couch_db_updater,handle_info,2},
     {gen_server,handle_msg,5},
     {proc_lib,init_p_do_apply,3}]}
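[Editor's note: the `{error,emfile}` above is the POSIX "too many open files" error (EMFILE), raised when a process hits its file descriptor limit. A minimal Python sketch of the same condition -- the function name is mine, not anything from CouchDB:]

```python
import errno
import os
import resource


def exhaust_fds():
    """Lower the soft fd limit, then open files until the OS refuses.

    This reproduces in miniature the EMFILE condition that Erlang
    surfaces as {error,emfile}. The original limits are restored on exit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))
    files = []
    try:
        while True:
            files.append(open(os.devnull))
    except OSError as e:
        return e.errno
    finally:
        for f in files:
            f.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


print(errno.errorcode[exhaust_fds()])  # EMFILE
```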

(+ lots of other messages with the same timestamp -- I can send them if they're useful)

Exactly at this time, the client got an HTTP 500 status code; the request
was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).

Just before this request, the client had made a PUT (updating an existing
document) that got a 200 status code, but apparently the update was not
successfully committed to disk (I'm using "delayed_commits=true" -- for my
app, this is just fine). The client had received the new _rev value, but
when it tried updating the same document a minute later, there was a
conflict (and it's not possible that somebody else updated this same
document).
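[Editor's note: the usual client-side recovery for this kind of stale-_rev conflict is to refetch the document's current _rev and retry the update. A hedged sketch -- `get_doc`/`put_doc` are hypothetical stand-ins for the GET and PUT HTTP calls, not CouchDB API:]

```python
def update_with_retry(get_doc, put_doc, doc_id, change, max_tries=3):
    """Apply `change` to a document, refetching on each attempt.

    Refetching picks up the server's current _rev, so a 409 conflict
    (e.g. after a delayed commit was lost) heals on the next try.
    put_doc returns (status, new_rev).
    """
    for _ in range(max_tries):
        doc = get_doc(doc_id)                  # pick up the real _rev
        status, rev = put_doc(doc_id, change(doc))
        if status in (200, 201):
            return rev
    raise RuntimeError("update still conflicting after %d tries" % max_tries)
```

Usage: wire `get_doc`/`put_doc` to whatever HTTP client the batch job uses; the retry loop is the only CouchDB-specific part.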

About four hours later, there was a different error ("accept_failed"
sounds like some temporary problem with sockets?):

[Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
              {<0.20693.4>,std_error,
               [{application,mochiweb},
                "Accept failed error","{error,emfile}"]}}

[Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
    {<0.20693.4>,crash_report,
     [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
       {pid,<0.20693.4>},
       {registered_name,[]},
       {error_info,
           {exit,
               {error,accept_failed},
               [{mochiweb_socket_server,acceptor_loop,1},
                {proc_lib,init_p_do_apply,3}]}},
       {ancestors,
           [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
       {messages,[]},
       {links,[<0.106.0>]},
       {dictionary,[]},
       {trap_exit,false},
       {status,running},
       {heap_size,233},
       {stack_size,24},
       {reductions,200}],
      []]}}

(+lots of other messages within the next couple of minutes)

The same error occurred once more, about four hours later.

I'm quite new to CouchDB, so I'd appreciate any help in interpreting
what these error messages mean. (BTW, are these something I should
report as bugs in JIRA? I can do that, but I'd like to at least understand
which parts of the error messages are actually relevant here :-)

I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.

Best regards,
Pasi

Re: CouchDB 1.0.2 errors under load

Posted by Pasi Eronen <pe...@iki.fi>.
I've finally managed to reproduce this problem in a test environment, and
filed a bug report:

https://issues.apache.org/jira/browse/COUCHDB-1100

Best regards,
Pasi

On Sat, Feb 26, 2011 at 11:19, Pasi Eronen <pe...@iki.fi> wrote:

> [quoted thread snipped; the quoted messages appear in full elsewhere in
> this archive]

Re: CouchDB 1.0.2 errors under load

Posted by Pasi Eronen <pe...@iki.fi>.
BTW, it seems just changing "ulimit -n" did not work. At some point,
the server just stopped responding completely (even the root URL just
times out, and there's nothing in the error log).

A quick googling found references to enabling kernel poll (at build time
and on the erl command line), and to increasing the ERL_MAX_PORTS
environment variable. Based on a quick look, it seems kernel poll
was already enabled, but I'll experiment with ERL_MAX_PORTS...

Best regards,
Pasi
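[Editor's note: whether a raised limit merely delays the failure can be checked by watching the fd count. A small Linux-only Python sketch of counting a process's open descriptors from /proc -- the helper name is mine:]

```python
import os


def open_fd_count(pid):
    """Number of open file descriptors for a process.

    Same figure as `ls /proc/<pid>/fd | wc -l`; relies on the Linux
    /proc filesystem, so it won't work on other platforms.
    """
    return len(os.listdir("/proc/%d/fd" % pid))


# Watching our own process as a demonstration:
print(open_fd_count(os.getpid()))
```

Sampling this every few minutes for the beam.smp pid shows whether the count is stable or still climbing toward the limit.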


On Sat, Feb 26, 2011 at 03:40, Paul J. Davis <pa...@gmail.com> wrote:

> [quoted thread snipped; the quoted messages appear in full elsewhere in
> this archive, apart from this earlier advice:]
>
> >> On Fri, Feb 25, 2011 at 19:01, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> >>> The error you're getting is because CouchDB is running out of
> >>> available file descriptors to use. Try increasing the limit for the
> >>> user running CouchDB.

Re: CouchDB 1.0.2 errors under load

Posted by "Paul J. Davis" <pa...@gmail.com>.

On Feb 25, 2011, at 7:31 PM, Pasi Eronen <pe...@iki.fi> wrote:

> Yes, the client is doing something funky :-) Many of the requests
> it sends are POSTs with chunked transfer encoding, so it's being
> hit hard by issue COUCHDB-682. In other words, there's a lot of closing
> and opening of TCP connections going on, and retrying of requests that
> failed due to COUCHDB-682.
> 
> (The server behavior in COUCHDB-682 is probably strictly speaking not
> violating the HTTP spec -- the server is allowed to close an idle persistent
> connection at any time -- but it's certainly causing unusual behavior
> vs. opening and closing of TCP connections.)
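[Editor's note: for reference, the chunked framing those POSTs use is simple enough to assemble by hand -- each chunk is prefixed with its length in hex, and a zero-length chunk terminates the body. A sketch, not the client's actual code:]

```python
def chunked_body(chunks):
    """Build an HTTP/1.1 chunked transfer-encoded body from byte chunks.

    Per the HTTP/1.1 chunked coding: "<hex size>\r\n<data>\r\n" for each
    chunk, then a final zero-length chunk "0\r\n\r\n".
    """
    out = b""
    for chunk in chunks:
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    return out + b"0\r\n\r\n"


print(chunked_body([b"hello"]))  # b'5\r\nhello\r\n0\r\n\r\n'
```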
> 
> Output from netstat -anp looks like this:
> 
> tcp  1  0  127.0.0.1:5984  127.0.0.1:51681  CLOSE_WAIT  19124/beam.smp
> tcp  1  0  127.0.0.1:5984  127.0.0.1:36576  CLOSE_WAIT  19124/beam.smp
> tcp  1  0  127.0.0.1:5984  127.0.0.1:35311  CLOSE_WAIT  19124/beam.smp
> (...)
> 
> The other ends of these connections (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984)
> are no longer around (if they were, I guess the client would also run out of
> fds -- and the number of fds used by the client is not growing).
> 
> Does this offer any clues for diagnosing the bug? 
> 

It's definitely pointing the finger at CouchDB/Erlang.

> I can also try to reproduce the fd leak with some simple test program...
> 

Please do. I'd start by trying to mimic your script's main request pattern to see if you can trigger individual sockets to end up in CLOSE_WAIT.
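[Editor's note: a minimal sketch -- plain Python sockets, not CouchDB code -- of how an fd ends up stuck in CLOSE_WAIT when the accepting side never closes its end, matching the netstat output above:]

```python
import socket


def leaked_close_wait():
    """Show the CLOSE_WAIT state in miniature.

    When the peer closes its end, our side moves to CLOSE_WAIT and the
    fd stays allocated until we call close(). Forgetting that close()
    is exactly the kind of leak being discussed here.
    """
    lsock = socket.socket()
    lsock.bind(("127.0.0.1", 0))
    lsock.listen(1)
    client = socket.create_connection(lsock.getsockname())
    server, _ = lsock.accept()
    client.close()            # peer's FIN moves our side to CLOSE_WAIT
    eof = server.recv(1)      # b'' is the EOF that should trigger close()
    server.close()            # skipping this line is the fd leak
    lsock.close()
    return eof


print(leaked_close_wait())    # b''
```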

> Best regards,
> Pasi
> 
> On Sat, Feb 26, 2011 at 01:11, Paul Davis <pa...@gmail.com> wrote:
> > It sounds like your client is doing something funky with shutting down
> > connections to the server. If I were to guess I would say that it
> > looks like you're leaking sockets between a call to shutdown and a
> > call to close. Though to say for certain I'd need to see what state
> > both ends of the socket are in.
> >
> > The errors about pread_iolist are hard to diagnose without the
> > surrounding error messages, but if I were to guess, I'd say you're
> > getting an emfile that cascades to the file gen_server, which then
> > notifies the other processes trying to issue read calls.
> >
> > On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <pe...@iki.fi> wrote:
> >> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
> >> is indeed using a lot of file descriptors, and the number is growing.
> >> I restarted CouchDB (and my batch job), and after 40 minutes, it was
> >> already using 300 fds; an hour later, the figure was 600; half
> >> an hour later, almost 700.
> >>
> >> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
> >> As a temporary fix I changed the limit to 64K, so it should work fine
> >> for the weekend :-) (but not much longer)
> >>
> >> Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT
> >> state. Hmm. It seems CLOSE_WAIT means the client closed the connection
> >> but CouchDB didn't call close() to release the fd? And that's presumably
> >> a bug somewhere in CouchDB, right?
> >>
> >> After the initial error message, the log had ~300 almost identical
> >> error messages of the form (with different number after pread_iolist):
> >>
> >> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
> >> Uncaught error in HTTP request: {exit,
> >>                                 {normal,
> >>                                  {gen_server,call,
> >>                                   [<0.288.0>,
> >>                                    {pread_iolist,812881959},
> >>                                    infinity]}}}
> >>
> >> Does this give any hints where the bug might be?
> >>
> >> If not, I'll see if I can reproduce the bug somehow in a more controlled
> >> setting. (I have a hunch that this might involve the use of HTTP
> >> expect/continue or POST requests with chunked transfer encoding,
> >> but let's see....)
> >>
> >> Best regards,
> >> Pasi
> >>
> >> On Fri, Feb 25, 2011 at 19:01, Paul Davis <pa...@gmail.com> wrote:
> >>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <pe...@iki.fi> wrote:
> >>>> [original message quoted in full; snipped -- see the top of this page]
> >>>
> >>> The error you're getting is because CouchDB is running out of
> >>> available file descriptors to use. Try increasing the limit for the
> >>> user running CouchDB.
> >>>> (+lots of other messages within the next couple of minutes)
> >>>>
> >>>> The same error occurred once more, about four hours later.
> >>>>
> >>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
> >>>> what these error messages mean. (BTW, are these something I should
> >>>> report as bugs in JIRA? I can do that, but I'd like to at least understand
> >>>> which parts of the error messages are actually relevant here :-)
> >>>>
> >>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
> >>>>
> >>>> Best regards,
> >>>> Pasi
> >>>>
> >>>
> >>> The error you're getting is because CouchDB is running out of
> >>> available file descriptors to use. Try increasing the limit for the
> >>> user running CouchDB.
> >>>
> >>
> >
> 

Re: CouchDB 1.0.2 errors under load

Posted by Pasi Eronen <pe...@iki.fi>.
Yes, the client is doing something funky :-) Many of the requests
it sends are POSTs with chunked transfer encoding, so it's being
hit hard by issue COUCHDB-682. In other words, there's a lot of closing
and opening of TCP connections going on, plus retries of requests that
failed due to COUCHDB-682.

(The server behavior in COUCHDB-682 is probably not, strictly speaking,
violating the HTTP spec -- the server is allowed to close an idle persistent
connection at any time -- but it certainly causes an unusual amount of
opening and closing of TCP connections.)

Output from netstat -anp looks like this:

tcp  1  0  127.0.0.1:5984  127.0.0.1:51681  CLOSE_WAIT  19124/beam.smp
tcp  1  0  127.0.0.1:5984  127.0.0.1:36576  CLOSE_WAIT  19124/beam.smp
tcp  1  0  127.0.0.1:5984  127.0.0.1:35311  CLOSE_WAIT  19124/beam.smp
(...)

The other directions (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984) are no longer
around (if they were, I guess the client would also run out of fds -- and
the number of fds used by the client is not growing).
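For watching this without eyeballing netstat, one can read /proc/net/tcp
directly. A Linux-only sketch (state code 08 in that table is CLOSE_WAIT;
the port default is CouchDB's 5984):

```python
# Count CLOSE_WAIT sockets bound to a given local port by parsing the
# contents of /proc/net/tcp (Linux). Field layout per row:
#   sl local_address rem_address st ...
# where addresses are hex IP:PORT and st is the TCP state in hex
# (08 = CLOSE_WAIT).

def count_close_wait(tcp_table, port=5984):
    count = 0
    for line in tcp_table.splitlines()[1:]:      # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        local_port = int(local_addr.split(":")[1], 16)  # port is hex
        if local_port == port and state == "08":
            count += 1
    return count

# usage: count_close_wait(open("/proc/net/tcp").read())
```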

Does this offer any clues for diagnosing the bug?

I can also see whether I can reproduce the fd leak with some simple test
program...
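Roughly along these lines (an untested sketch -- the database name and
document bodies are made up, and a server must be listening on 5984):

```python
# Hammer a local CouchDB with chunked-transfer POSTs on short-lived
# connections -- roughly the traffic pattern COUCHDB-682 provokes.
import http.client

def frame_chunks(chunks):
    """What the client puts on the wire for a chunked body:
    <hex length>\r\n<data>\r\n per chunk, then a zero-length chunk."""
    wire = b""
    for c in chunks:
        wire += format(len(c), "x").encode() + b"\r\n" + c + b"\r\n"
    return wire + b"0\r\n\r\n"

def hammer(host="127.0.0.1", port=5984, db="fd_leak_test", n=100):
    # Fresh connection per request; explicit close on the client side,
    # so any fd pile-up in CLOSE_WAIT is the server's.
    for i in range(n):
        conn = http.client.HTTPConnection(host, port)
        conn.request("POST", "/" + db, body=iter([b'{"i": %d}' % i]),
                     headers={"Content-Type": "application/json"},
                     encode_chunked=True)   # Python 3.6+
        conn.getresponse().read()
        conn.close()
```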

Best regards,
Pasi

On Sat, Feb 26, 2011 at 01:11, Paul Davis <pa...@gmail.com> wrote:
> It sounds like your client is doing something funky with shutting down
> connections to the server. If I were to guess I would say that it
> looks like you're leaking sockets between a call to shutdown and a
> call to close. Though to say for certain I'd need to see what state
> both ends of the socket are in.
>
> The errors about pread_iolist are hard to diagnose without the
> surrounding error messages, but if I were to guess, I'd say you're
> getting an emfile that cascades to the file gen_server, which then
> notifies other processes trying to issue read calls.

Re: CouchDB 1.0.2 errors under load

Posted by Paul Davis <pa...@gmail.com>.
It sounds like your client is doing something funky with shutting down
connections to the server. If I were to guess I would say that it
looks like you're leaking sockets between a call to shutdown and a
call to close. Though to say for certain I'd need to see what state
both ends of the socket are in.

The errors about pread_iolist are hard to diagnose without the
surrounding error messages, but if I were to guess, I'd say you're
getting an emfile that cascades to the file gen_server, which then
notifies other processes trying to issue read calls.
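In miniature, the shutdown/close gap I mean looks like this (a Python
sketch, not CouchDB code): a socket that has been shut down still pins
its file descriptor until close() is actually called.

```python
# A shut-down socket still consumes a file descriptor; only close()
# releases it. Descriptors stranded in that gap are what pile up in
# CLOSE_WAIT until the process hits emfile.
import os
import socket

def fd_alive(fd):
    """True while the kernel still considers fd an open descriptor."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

a, b = socket.socketpair()
fd = a.fileno()
a.shutdown(socket.SHUT_RDWR)   # no more I/O, but the fd is still held
assert fd_alive(fd)
a.close()                      # now the descriptor is really released
assert not fd_alive(fd)
b.close()
```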

On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <pe...@iki.fi> wrote:
> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
> is indeed using a lot of file descriptors, and the number is growing.
> I restarted CouchDB (and my batch job), and after 40 minutes, it was
> already using 300 fds; an hour later, the figure was 600; half
> an hour later, almost 700.
>
> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
> As a temporary fix I changed the limit to 64K, so it should work fine
> for the weekend :-) (but not much longer)
>
> Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT
> state. Hmm. It seems CLOSE_WAIT means the client closed the connection
> but CouchDB didn't call close() to release the fd? And that's presumably
> a bug somewhere in CouchDB, right?
>
> After the initial error message, the log had ~300 almost identical
> error messages of the form (with different number after pread_iolist):
>
> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
> Uncaught error in HTTP request: {exit,
>                                 {normal,
>                                  {gen_server,call,
>                                   [<0.288.0>,
>                                    {pread_iolist,812881959},
>                                    infinity]}}}
>
> Does this give any hints where the bug might be?
>
> If not, I'll see if I can reproduce the bug somehow in a more controlled
> setting. (I have a hunch that this might involve the use of HTTP
> expect/continue or POST requests with chunked transfer encoding,
> but let's see....)
>
> Best regards,
> Pasi
>

Re: CouchDB 1.0.2 errors under load

Posted by Pasi Eronen <pe...@iki.fi>.
Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
is indeed using a lot of file descriptors, and the number is growing.
I restarted CouchDB (and my batch job), and after 40 minutes, it was
already using 300 fds; an hour later, the figure was 600; half
an hour later, almost 700.
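The counting itself is just listing /proc/<pid>/fd -- a Linux-only
sketch of what I was doing by hand:

```python
# Count a process's open file descriptors by listing /proc/<pid>/fd
# (Linux) -- the same check I ran against beam.smp every half hour.
import os

def fd_count(pid):
    return len(os.listdir(f"/proc/{pid}/fd"))

# usage: fd_count(<pid of beam.smp>), e.g. in a loop with time.sleep(60)
# to watch the number creep up.
```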

The default "ulimit -n" was 1024, so that adds up to a couple of hours.
As a temporary fix I changed the limit to 64K, so it should work fine
for the weekend :-) (but not much longer)

Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT
state. Hmm. It seems CLOSE_WAIT means the client closed the connection
but CouchDB didn't call close() to release the fd? And that's presumably
a bug somewhere in CouchDB, right?

After the initial error message, the log had ~300 almost identical
error messages of the form (with different number after pread_iolist):

[Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
Uncaught error in HTTP request: {exit,
                                 {normal,
                                  {gen_server,call,
                                   [<0.288.0>,
                                    {pread_iolist,812881959},
                                    infinity]}}}

Does this give any hints where the bug might be?

If not, I'll see if I can reproduce the bug somehow in a more controlled
setting. (I have a hunch that this might involve the use of HTTP
expect/continue or POST requests with chunked transfer encoding,
but let's see....)

Best regards,
Pasi

On Fri, Feb 25, 2011 at 19:01, Paul Davis <pa...@gmail.com> wrote:
> The error you're getting is because CouchDB is running out of
> available file descriptors to use. Try increasing the limit for the
> user running CouchDB.
>

Re: CouchDB 1.0.2 errors under load

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <pe...@iki.fi> wrote:
> Hi,
>
> I had a big batch job (inserting 10M+ documents and generating views for them)
> that ran just fine for about 6 hours, but then I got this error:
>
> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
> <0.276.0> terminating
> ** Last message in was delayed_commit
> ** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
>                            <0.273.0>,<0.277.0>,
>                            {db_header,5,739828,0,
>                                {4778613011,{663866,0}},
>                                {4778614954,663866},
>                                nil,0,nil,nil,1000},
>                            739828,
>                            {btree,<0.273.0>,
>                                {4778772755,{663866,0}},
>                                #Fun<couch_db_updater.7.10053969>,
>                                #Fun<couch_db_updater.8.35220795>,
>                                #Fun<couch_btree.5.124754102>,
>                                #Fun<couch_db_updater.9.107593676>},
>                            {btree,<0.273.0>,
>                                {4778774698,663866},
>                                #Fun<couch_db_updater.10.30996817>,
>                                #Fun<couch_db_updater.11.96515267>,
>                                #Fun<couch_btree.5.124754102>,
>                                #Fun<couch_db_updater.12.117826253>},
>                            {btree,<0.273.0>,nil,
>                                #Fun<couch_btree.0.83553141>,
>                                #Fun<couch_btree.1.30790806>,
>                                #Fun<couch_btree.2.124754102>,nil},
>                            739831,<<"foo_replication_tmp">>,
>                            "/data/foo/couchdb-data/foo_replication_tmp.couch",
>                            [],[],nil,
>                            {user_ctx,null,[],undefined},
>                            #Ref<0.0.1793.256453>,1000,
>                            [before_header,after_header,on_file_open],
>                            false}
> ** Reason for termination ==
> ** {{badmatch,{error,emfile}},
>    [{couch_file,sync,1},
>     {couch_db_updater,commit_data,2},
>     {couch_db_updater,handle_info,2},
>     {gen_server,handle_msg,5},
>     {proc_lib,init_p_do_apply,3}]}
>
> (+lot of other messages with the same timestamp -- can send if they're useful)
>
> Exactly at this time, the client got HTTP 500 status code; the request
> was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).
>
> Just before this request, the client had made a PUT (updating an existing
> document) that got 200 status code, but apparently was not successfully
> committed to the disk (I'm using "delayed_commits=true" - for my app,
> this is just fine). The client had received the new _rev value, but when
> it tried updating the same document a minute later, there was a conflict
> (and it's not possible that somebody else updated this same document).
>
> About four hours later, there was a different error ("accept_failed"
> sounds like some temporary problem with sockets?):
>
> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>              {<0.20693.4>,std_error,
>               [{application,mochiweb},
>                "Accept failed error","{error,emfile}"]}}
>
> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>    {<0.20693.4>,crash_report,
>     [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>       {pid,<0.20693.4>},
>       {registered_name,[]},
>       {error_info,
>           {exit,
>               {error,accept_failed},
>               [{mochiweb_socket_server,acceptor_loop,1},
>                {proc_lib,init_p_do_apply,3}]}},
>       {ancestors,
>           [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
>       {messages,[]},
>       {links,[<0.106.0>]},
>       {dictionary,[]},
>       {trap_exit,false},
>       {status,running},
>       {heap_size,233},
>       {stack_size,24},
>       {reductions,200}],
>      []]}}
>
> (+lots of other messages within the next couple of minutes)
>
> The same error occurred once more, about four hours later.
>
> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
> what these error messages mean. (BTW, are these something I should
> report as bugs in JIRA? I can do that, but I'd like to at least understand
> which parts of the error messages are actually relevant here :-)
>
> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>
> Best regards,
> Pasi
>

The error you're getting is because CouchDB is running out of
available file descriptors. Try increasing the open-files limit
("ulimit -n") for the user running CouchDB.
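Something along these lines verifies what limit a running process
actually got (a Linux-only Python sketch; plug in the pid of beam.smp):

```python
# Read the effective "Max open files" soft limit of a running process
# from /proc/<pid>/limits (Linux) -- useful for confirming that a
# raised limit really applied to the beam.smp process.

def max_open_files(pid):
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                return int(line.split()[3])   # field after the label = soft limit
    raise ValueError(f"no 'Max open files' row in /proc/{pid}/limits")
```

To make a higher limit stick across restarts, the usual route is a
nofile entry for the CouchDB user in /etc/security/limits.conf (or in
the init script), followed by a CouchDB restart.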