You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by Mike Leddy <mi...@loop.com.br> on 2011/04/14 17:14:52 UTC

Couchdb trunk purge_docs timeout

Hi,

I have a couch node current using version 1.2.0abaa0e30-git. I decided
to try a database maintenance task that I formerly used to use on
couchdb 1.0.2 to purge documents in batches of 500 on a database that
contains some 56 million documents.

>From what I can gather from the logs the call to purge_docs is timing
out after 5 seconds and terminating.

[Wed, 13 Apr 2011 20:25:57 GMT] [info] [<0.5192.19>] 172.17.17.3 - - GET /iris/_design/tidy/_view/conflicts?limit=0 200
[Wed, 13 Apr 2011 20:26:02 GMT] [error] [<0.5194.19>] Uncaught error in HTTP request: {exit,
                                                       {timeout,
                                                        {gen_server,call,
                                                         [<0.150.0>,
                                                          {purge_docs,
                                                           [{<<"1294099271F6261">>,
                                                             [{1,
                                                               <<181,64,95,
                                                                 54,247,104,
                                                                 56,34,109,
                                                                 228,7,108,
                                                                 250,72,57,
                                                                 190>>}]},
                                                            {<<"1294099281F7327">>,
                                                             [{1,
                                                               <<80,246,15,
                                                                 155,182,61,
                                                                 43,238,207,
                                                                 43,159,136,
                                                                 178,134,
                                                                 137,214>>}]},
... removed for brevity ....
[Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] Stacktrace: [{io_lib_pretty,cind_tag_tuple,7},
                                  {io_lib_pretty,while_fail,3},
                                  {io_lib_pretty,print,6},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3}]
[Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] 172.17.17.3 - - POST /iris/_purge 500

I am pretty sure that this was not the case with 1.0.2. Does anyone 
have any insight regarding what is the root cause of the problem ?

Meanwhile I'm digging through the code looking for clues....

Thanks,

Mike


Re: Couchdb trunk purge_docs timeout

Posted by Mike Leddy <mi...@loop.com.br>.
Sorry its been two weeks, but I found out what was happening. I was
reprocessing some data which was being passed through a script to
resolve conflicts. However the number of conflicts being reprocessed
was much higher than had earlier been encountered.

Effectively the _bulk_doc insert call to resolve conflicts was taking
more than 5 seconds to complete so that other callers to
couch_db_updater on the same database were failing. As what I was 
doing was pretty extreme it was interfering with purging and 
replication or calls to compact etc.

I could see that the couch_db_updater process for the database was
spending most of its time in {couch_key_tree,merge_one} and
consequently not serving requests (I have to learn how to profile
erlang :-( )

I wrote a simple(ish) benchmarking script which can be used to 
generate what I was seeing:

###################################################################

require 'restclient'
require 'json'

Database='http://127.0.0.1:5984/conflict_test'

def flush_all
  # Flush all to disk
  RestClient::Request.execute(:method => :post,
    :url => Database+'/_ensure_full_commit',
    :headers => {:content_type => 'application/json'})
end


def run(documents)

  setup = Time.new
  # Prepare database
  RestClient::Request.execute(:method => :delete, :url => Database) rescue RestClient::ResourceNotFound
  RestClient::Request.execute(:method => :put, :url => Database)

  # Fill with conflicts
  docs=documents.times.collect {|i| {:_id => 'samedoc', :some_data => i}}
  RestClient::Request.execute(:method => :post,
    :url => Database+'/_bulk_docs',
    :payload => {'all_or_nothing' => true, 'docs' => docs}.to_json,
    :headers => {:content_type => 'application/json'})

  # Resolve the conflicts
  meta=JSON.parse(RestClient::Request.execute(:method => :get,
    :url => Database+'/samedoc?meta=true').body)
  revisions=[meta['_rev']]+meta['_conflicts']
  winner=revisions.delete(rand(revisions.size))

  flush_all

  # Do it
  bulk=revisions.collect {|r| {'_id'=>'samedoc', '_rev'=>r, '_deleted'=>true}}
  resolve = Time.new
  RestClient::Request.execute(:method => :post,
    :url => Database+'/_bulk_docs',
    :payload => {'all_or_nothing' => true, 'docs' => bulk}.to_json,
    :headers => {:content_type => 'application/json'},
    :timeout => 3600)

  to_disk = Time.new
  flush_all

  now=Time.new
  puts "#{documents} #{resolve-setup} #{to_disk-resolve} #{now-to_disk}"

end

10.step(500, 10) {|n| run n}

###################################################################

If you try to 

curl -s -X POST -H 'Content-Type: application/json' 127.0.0.1:5984/conflict_test/_compact

during a long call to resolve the conflicts you'll get an internal server
error:

[Mon, 02 May 2011 16:56:01 GMT] [info] [<0.10147.0>] 127.0.0.1 - - POST /conflict_test/_bulk_docs 201
[Mon, 02 May 2011 16:56:06 GMT] [error] [<0.10193.0>] Uncaught error in HTTP request: {exit,
                                                       {timeout,
                                                        {gen_server,call,
                                                         [<0.10188.0>,
                                                          start_compact]}}}
[Mon, 02 May 2011 16:56:06 GMT] [info] [<0.10193.0>] Stacktrace: [{io_lib_pretty,cind_tag_tuple,7},
                                  {io_lib_pretty,while_fail,3},
                                  {io_lib_pretty,print,6},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3},
                                  {io_lib_format,build,3}]
[Mon, 02 May 2011 16:56:06 GMT] [info] [<0.10193.0>] 127.0.0.1 - - POST /conflict_test/_compact 500

I altered my script to avoid excessively large conflict resolution
updates so everything is working fine now. (my test machine starts
taking 5 seconds at 200 conflicts resolved in one go).

I did note that when you plot the data the bulk conflict resolution
call is significantly more exponential than the setup (bulk insert 
and get with meta) calls.....

Regards,

Mike




On Thu, 2011-04-14 at 12:14 -0300, Mike Leddy wrote:
> Hi,
> 
> I have a couch node current using version 1.2.0abaa0e30-git. I decided
> to try a database maintenance task that I formerly used to use on
> couchdb 1.0.2 to purge documents in batches of 500 on a database that
> contains some 56 million documents.
> 
> >From what I can gather from the logs the call to purge_docs is timing
> out after 5 seconds and terminating.
> 
> [Wed, 13 Apr 2011 20:25:57 GMT] [info] [<0.5192.19>] 172.17.17.3 - - GET /iris/_design/tidy/_view/conflicts?limit=0 200
> [Wed, 13 Apr 2011 20:26:02 GMT] [error] [<0.5194.19>] Uncaught error in HTTP request: {exit,
>                                                        {timeout,
>                                                         {gen_server,call,
>                                                          [<0.150.0>,
>                                                           {purge_docs,
>                                                            [{<<"1294099271F6261">>,
>                                                              [{1,
>                                                                <<181,64,95,
>                                                                  54,247,104,
>                                                                  56,34,109,
>                                                                  228,7,108,
>                                                                  250,72,57,
>                                                                  190>>}]},
>                                                             {<<"1294099281F7327">>,
>                                                              [{1,
>                                                                <<80,246,15,
>                                                                  155,182,61,
>                                                                  43,238,207,
>                                                                  43,159,136,
>                                                                  178,134,
>                                                                  137,214>>}]},
> ... removed for brevity ....
> [Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] Stacktrace: [{io_lib_pretty,cind_tag_tuple,7},
>                                   {io_lib_pretty,while_fail,3},
>                                   {io_lib_pretty,print,6},
>                                   {io_lib_format,build,3},
>                                   {io_lib_format,build,3},
>                                   {io_lib_format,build,3},
>                                   {io_lib_format,build,3},
>                                   {io_lib_format,build,3}]
> [Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] 172.17.17.3 - - POST /iris/_purge 500
> 
> I am pretty sure that this was not the case with 1.0.2. Does anyone 
> have any insight regarding what is the root cause of the problem ?
> 
> Meanwhile I'm digging through the code looking for clues....
> 
> Thanks,
> 
> Mike
> 
> 



Re: Couchdb trunk purge_docs timeout

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Mike,

we did a fix in this area recently that affected purging of docs in conflict:

  http://svn.apache.org/viewvc?rev=1086241&view=rev

A couple of reviewers deemed the patch safe, but this is a seldom exercised
part of the code, so we may have introduced your issue.

Can you provide us with a reproducing script that maybe doesn't depend on
56m docs? :)

Also, can you paste the full error stack trace?

A few more questions:

 - Is replication involved here?
 - Do you have more I/O than before on the system?

CCing dev@.

Cheers
Jan
-- 

On 14 Apr 2011, at 17:14, Mike Leddy wrote:

> Hi,
> 
> I have a couch node current using version 1.2.0abaa0e30-git. I decided
> to try a database maintenance task that I formerly used to use on
> couchdb 1.0.2 to purge documents in batches of 500 on a database that
> contains some 56 million documents.
> 
> From what I can gather from the logs the call to purge_docs is timing
> out after 5 seconds and terminating.
> 
> [Wed, 13 Apr 2011 20:25:57 GMT] [info] [<0.5192.19>] 172.17.17.3 - - GET /iris/_design/tidy/_view/conflicts?limit=0 200
> [Wed, 13 Apr 2011 20:26:02 GMT] [error] [<0.5194.19>] Uncaught error in HTTP request: {exit,
>                                                       {timeout,
>                                                        {gen_server,call,
>                                                         [<0.150.0>,
>                                                          {purge_docs,
>                                                           [{<<"1294099271F6261">>,
>                                                             [{1,
>                                                               <<181,64,95,
>                                                                 54,247,104,
>                                                                 56,34,109,
>                                                                 228,7,108,
>                                                                 250,72,57,
>                                                                 190>>}]},
>                                                            {<<"1294099281F7327">>,
>                                                             [{1,
>                                                               <<80,246,15,
>                                                                 155,182,61,
>                                                                 43,238,207,
>                                                                 43,159,136,
>                                                                 178,134,
>                                                                 137,214>>}]},
> ... removed for brevity ....
> [Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] Stacktrace: [{io_lib_pretty,cind_tag_tuple,7},
>                                  {io_lib_pretty,while_fail,3},
>                                  {io_lib_pretty,print,6},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3}]
> [Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] 172.17.17.3 - - POST /iris/_purge 500
> 
> I am pretty sure that this was not the case with 1.0.2. Does anyone 
> have any insight regarding what is the root cause of the problem ?
> 
> Meanwhile I'm digging through the code looking for clues....
> 
> Thanks,
> 
> Mike
> 


Re: Couchdb trunk purge_docs timeout

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Mike,

we did a fix in this area recently that affected purging of docs in conflict:

  http://svn.apache.org/viewvc?rev=1086241&view=rev

A couple of reviewers deemed the patch safe, but this is a seldom exercised
part of the code, so we may have introduced your issue.

Can you provide us with a reproducing script that maybe doesn't depend on
56m docs? :)

Also, can you paste the full error stack trace?

A few more questions:

 - Is replication involved here?
 - Do you have more I/O than before on the system?

CCing dev@.

Cheers
Jan
-- 

On 14 Apr 2011, at 17:14, Mike Leddy wrote:

> Hi,
> 
> I have a couch node current using version 1.2.0abaa0e30-git. I decided
> to try a database maintenance task that I formerly used to use on
> couchdb 1.0.2 to purge documents in batches of 500 on a database that
> contains some 56 million documents.
> 
> From what I can gather from the logs the call to purge_docs is timing
> out after 5 seconds and terminating.
> 
> [Wed, 13 Apr 2011 20:25:57 GMT] [info] [<0.5192.19>] 172.17.17.3 - - GET /iris/_design/tidy/_view/conflicts?limit=0 200
> [Wed, 13 Apr 2011 20:26:02 GMT] [error] [<0.5194.19>] Uncaught error in HTTP request: {exit,
>                                                       {timeout,
>                                                        {gen_server,call,
>                                                         [<0.150.0>,
>                                                          {purge_docs,
>                                                           [{<<"1294099271F6261">>,
>                                                             [{1,
>                                                               <<181,64,95,
>                                                                 54,247,104,
>                                                                 56,34,109,
>                                                                 228,7,108,
>                                                                 250,72,57,
>                                                                 190>>}]},
>                                                            {<<"1294099281F7327">>,
>                                                             [{1,
>                                                               <<80,246,15,
>                                                                 155,182,61,
>                                                                 43,238,207,
>                                                                 43,159,136,
>                                                                 178,134,
>                                                                 137,214>>}]},
> ... removed for brevity ....
> [Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] Stacktrace: [{io_lib_pretty,cind_tag_tuple,7},
>                                  {io_lib_pretty,while_fail,3},
>                                  {io_lib_pretty,print,6},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3},
>                                  {io_lib_format,build,3}]
> [Wed, 13 Apr 2011 20:26:02 GMT] [info] [<0.5194.19>] 172.17.17.3 - - POST /iris/_purge 500
> 
> I am pretty sure that this was not the case with 1.0.2. Does anyone 
> have any insight regarding what is the root cause of the problem ?
> 
> Meanwhile I'm digging through the code looking for clues....
> 
> Thanks,
> 
> Mike
>