Posted to user@couchdb.apache.org by Sho Fukamachi <sh...@gmail.com> on 2009/02/01 09:48:02 UTC

Re: replication error

On 30/01/2009, at 6:03 PM, Adam Kocoloski wrote:

> Hi Jeff, it's starting to make some more sense now.  How big are the  
> normal attachments?  At present, Couch encodes all attachments using  
> Base64 and inlines them in the JSON representation of the document  
> during replication.  We'll fix this in the 0.9 release by taking  
> advantage of new support for multipart requests[1], but until then  
> replicating big attachments is iffy at best.


In the meantime, is there any way to increase the timeout, or limit  
the number of docs couch tries to send in one bulk_docs transaction?  
Replication is failing for me even with attachments under 500K, since  
my upload speed from home isn't that good.


Sho



> Regards,
>
> Adam
>
> [1] https://issues.apache.org/jira/browse/COUCHDB-163


Re: replication error

Posted by Sho Fukamachi <sh...@gmail.com>.
On 02/02/2009, at 5:01 AM, Adam Kocoloski wrote:

>> [...]
>
> That's odd.  I tried setting a 120 second timeout and didn't have  
> any trouble.  Then again, I only ran the test suite; I didn't  
> actually force a timeout to occur or anything.  Sorry, I don't have  
> any hints at the moment.

Guh. I'm an idiot. I'd forgotten to create the destination database.
In my haste to test it I used Futon, not my normal script, and of
course interpreted the error as something to do with the code I'd changed.

Sorry about that. : /

With the changes, it worked the first time .. although it did give a
spurious error saying the server had restarted.


> Multipart won't solve the problem where ibrowse throws a timeout  
> error even while it's still sending data.  That seems like a pretty  
> curious choice on ibrowse's part to me.  Maybe when I have some more
> free time I can look into the timeout algo and see if it can be  
> tweaked so that it only starts after the request has been fully  
> transmitted.  I think that would pretty much solve this problem.   
> Barring that, I agree that some sort of back-off algorithm that  
> lengthens the timeout after each failed request is warranted.
>
> There's also one more knob we can turn.  During replication we are  
> checking the memory consumption of the process collecting docs to  
> send to the target.  If it hits 10MB we send the bulk immediately,  
> regardless of whether it's 1 doc, 10, or 99.  10MB may be much too  
> high given a 30 second timeout window in which we have to transmit  
> the data; 1MB is possibly a better fit for home broadband users.  If  
> you want to fiddle with that knob instead of the ibrowse timeout you  
> can try changing line 224 of couch_rep.erl so that instead of
>
> couch_util:should_flush()
>
> it would read (value is in bytes)
>
> couch_util:should_flush(1000000)

Awesome tip. Thanks. Yeah, I had never noticed any problem with
server-to-server replication... only when I then tried to do it from home...

> I don't have a strong opinion at this point in time about how many  
> of these parameters ought to be tunable in local.ini.  Best,

My opinion is usually that pretty much anything with an effect this
big should ship with a sensible default but be overridable in config.
Failing that, maybe the default timeout should be raised?
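
Something along these lines in local.ini would do the job - to be clear,
the section and key names here are made up purely to illustrate the
idea, they aren't real options today:

; hypothetical stanza for local.ini
[replication]
request_timeout = 120000          ; ibrowse request timeout in milliseconds
bulk_flush_threshold = 1048576    ; flush buffered docs once they exceed this many bytes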

Thanks heaps for your help.

Sho


Re: replication error

Posted by Adam Kocoloski <ad...@gmail.com>.
On Feb 1, 2009, at 11:52 AM, Sho Fukamachi wrote:

>
> On 02/02/2009, at 2:23 AM, Adam Kocoloski wrote:
>>
>> Hi Sho, are you getting req_timedout errors in the logs?  It seems  
>> a little weird to me that ibrowse starts the timer while it's still  
>> sending data to the server; perhaps there's an alternative we  
>> haven't noticed.
>
> Yeah. Like this:
>
> [<0.166.0>] retrying couch_rep HTTP post request due to {error,  
> req_timedout}: "http://localhost:2808/media/_bulk_docs"
>
> then bombs out:
>
> [error] [emulator] Error in process <0.166.0> with exit value:  
> {{badmatch,ok},[{couch_rep,update_docs,4}, 
> {couch_rep,save_docs_buffer,3}]}
>
> After the 10 retries it gives an error report but I assume you know  
> what it says .. if not I can post it.

No need, I understand what's going on.

> Anyway, it finishes eventually, just needs a lot of babysitting.
>
>> There's no way to change the request timeout or bulk docs size at  
>> runtime right now, but if you don't mind digging into the source  
>> yourself you can change these as follows:
>>
>> 1) request timeout -- line 187 of couch_rep.erl looks like
>>
>> case ibrowse:send_req(Url, Headers, Action, Body, Options) of
>>
>> You can add a timeout in milliseconds as a sixth parameter to  
>> ibrowse:send_req.  The default is 30000.  I think the atom  
>> 'infinity' also works.
>
> OK, I tried this. Unfortunately I have no idea what I am doing in
> Erlang, so I completely screwed it up. I changed it to this:
>
> case ibrowse:send_req(Url, Headers, Action, Body, Options, 120000)
>
> Compiles fine but now throws this error if I try to replicate:
>
> [error] [<0.50.0>] Uncaught error in HTTP request: {error, 
> {badmatch,undefined}}
>
> No doubt every Erlang programmer here wants to punch me for doing  
> something that dumb, but putting that aside for the moment .. any  
> hints? : )

That's odd.  I tried setting a 120 second timeout and didn't have any  
trouble.  Then again, I only ran the test suite; I didn't actually  
force a timeout to occur or anything.  Sorry, I don't have any hints  
at the moment.

>> 2) bulk_docs size -- The number "100" is mentioned three times in  
>> couch_rep:get_doc_info_list/2.  You can lower that to something  
>> that works better for you.
>
> Well, a change in 1 place seems better than in 3 places .. I'll  
> stick to the timeout for now.
>
> My feeling is that CouchDB should probably start reducing the bulk
> docs size, or increasing the timeout, or both, automatically when it
> hits a timeout error - or make them configurable in local.ini. As
> discussed here before, people are using Couch to store largish
> attachments, and this is an intended use, so this kind of thing will
> definitely come up again. Or, of course, if the upcoming multipart
> feature solves all of this, then never mind, heh.

Multipart won't solve the problem where ibrowse throws a timeout error  
even while it's still sending data.  That seems like a pretty curious  
choice on ibrowse's part to me.  Maybe when I have some more free time
I can look into the timeout algo and see if it can be tweaked so that  
it only starts after the request has been fully transmitted.  I think  
that would pretty much solve this problem.  Barring that, I agree that  
some sort of back-off algorithm that lengthens the timeout after each  
failed request is warranted.
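
For illustration, a back-off along these lines could work - this is a
rough sketch only, the function name and structure are made up and not
the actual couch_rep code:

send_with_backoff(Url, Headers, Action, Body, Options, Timeout, 0) ->
    ibrowse:send_req(Url, Headers, Action, Body, Options, Timeout);
send_with_backoff(Url, Headers, Action, Body, Options, Timeout, Retries) ->
    case ibrowse:send_req(Url, Headers, Action, Body, Options, Timeout) of
        {error, req_timedout} ->
            %% lengthen the ibrowse timeout and retry the request
            send_with_backoff(Url, Headers, Action, Body, Options,
                              Timeout * 2, Retries - 1);
        Other ->
            Other
    end.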

There's also one more knob we can turn.  During replication we are  
checking the memory consumption of the process collecting docs to send  
to the target.  If it hits 10MB we send the bulk immediately,  
regardless of whether it's 1 doc, 10, or 99.  10MB may be much too  
high given a 30 second timeout window in which we have to transmit the  
data; 1MB is possibly a better fit for home broadband users.  If you  
want to fiddle with that knob instead of the ibrowse timeout you can  
try changing line 224 of couch_rep.erl so that instead of

couch_util:should_flush()

it would read (value is in bytes)

couch_util:should_flush(1000000)
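
For context, should_flush is essentially a check of the buffering
process's memory footprint against a byte threshold - roughly along
these lines, though this is an illustration rather than the actual
couch_util source:

should_flush(MemThreshold) ->
    {memory, Bytes} = erlang:process_info(self(), memory),
    %% flush the buffered docs once we cross the threshold
    Bytes > MemThreshold.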

I don't have a strong opinion at this point in time about how many of  
these parameters ought to be tunable in local.ini.  Best,

Adam

>
>
> Thanks a lot for the help ..
>
> Sho
>
>


Re: replication error

Posted by Sho Fukamachi <sh...@gmail.com>.
On 02/02/2009, at 2:23 AM, Adam Kocoloski wrote:
>
> Hi Sho, are you getting req_timedout errors in the logs?  It seems a  
> little weird to me that ibrowse starts the timer while it's still  
> sending data to the server; perhaps there's an alternative we  
> haven't noticed.

Yeah. Like this:

[<0.166.0>] retrying couch_rep HTTP post request due to {error,  
req_timedout}: "http://localhost:2808/media/_bulk_docs"

then bombs out:

[error] [emulator] Error in process <0.166.0> with exit value:  
{{badmatch,ok},[{couch_rep,update_docs,4},{couch_rep,save_docs_buffer, 
3}]}

After the 10 retries it gives an error report but I assume you know  
what it says .. if not I can post it. Anyway, it finishes eventually,  
just needs a lot of babysitting.

> There's no way to change the request timeout or bulk docs size at  
> runtime right now, but if you don't mind digging into the source  
> yourself you can change these as follows:
>
> 1) request timeout -- line 187 of couch_rep.erl looks like
>
> case ibrowse:send_req(Url, Headers, Action, Body, Options) of
>
> You can add a timeout in milliseconds as a sixth parameter to  
> ibrowse:send_req.  The default is 30000.  I think the atom  
> 'infinity' also works.

OK, I tried this. Unfortunately I have no idea what I am doing in
Erlang, so I completely screwed it up. I changed it to this:

case ibrowse:send_req(Url, Headers, Action, Body, Options, 120000)

Compiles fine but now throws this error if I try to replicate:

[error] [<0.50.0>] Uncaught error in HTTP request: {error, 
{badmatch,undefined}}

No doubt every Erlang programmer here wants to punch me for doing  
something that dumb, but putting that aside for the moment .. any  
hints? : )

> 2) bulk_docs size -- The number "100" is mentioned three times in  
> couch_rep:get_doc_info_list/2.  You can lower that to something that  
> works better for you.

Well, a change in 1 place seems better than in 3 places .. I'll stick  
to the timeout for now.

My feeling is that CouchDB should probably start reducing the bulk
docs size, or increasing the timeout, or both, automatically when it
hits a timeout error - or make them configurable in local.ini. As
discussed here before, people are using Couch to store largish
attachments, and this is an intended use, so this kind of thing will
definitely come up again. Or, of course, if the upcoming multipart
feature solves all of this, then never mind, heh.

Thanks a lot for the help ..

Sho



Re: replication error

Posted by Adam Kocoloski <ad...@gmail.com>.
On Feb 1, 2009, at 3:48 AM, Sho Fukamachi wrote:

> On 30/01/2009, at 6:03 PM, Adam Kocoloski wrote:
>
>> Hi Jeff, it's starting to make some more sense now.  How big are  
>> the normal attachments?  At present, Couch encodes all attachments  
>> using Base64 and inlines them in the JSON representation of the  
>> document during replication.  We'll fix this in the 0.9 release by  
>> taking advantage of new support for multipart requests[1], but  
>> until then replicating big attachments is iffy at best.
>
>
> In the meantime, is there any way to increase the timeout, or limit  
> the number of docs couch tries to send in one bulk_docs transaction?  
> Replication is failing for me even with attachments under 500K,  
> since my upload speed from home isn't that good.
>
>
> Sho

Hi Sho, are you getting req_timedout errors in the logs?  It seems a  
little weird to me that ibrowse starts the timer while it's still  
sending data to the server; perhaps there's an alternative we haven't  
noticed.  There's no way to change the request timeout or bulk docs  
size at runtime right now, but if you don't mind digging into the  
source yourself you can change these as follows:

1) request timeout -- line 187 of couch_rep.erl looks like

case ibrowse:send_req(Url, Headers, Action, Body, Options) of

You can add a timeout in milliseconds as a sixth parameter to  
ibrowse:send_req.  The default is 30000.  I think the atom 'infinity'  
also works.
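
For example, the modified call could look like the sketch below - the
120000 and the macro name are just illustrative values, not anything
that exists in the source today:

-define(HTTP_TIMEOUT, 120000).  %% milliseconds; the atom 'infinity' should also work

case ibrowse:send_req(Url, Headers, Action, Body, Options, ?HTTP_TIMEOUT) of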

2) bulk_docs size -- The number "100" is mentioned three times in  
couch_rep:get_doc_info_list/2.  You can lower that to something that  
works better for you.
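
If you'd rather only change it in one place, you could pull the literal
into a macro - the name here is made up:

-define(BULK_GET_SIZE, 100).

and then use ?BULK_GET_SIZE wherever the 100 appears in
couch_rep:get_doc_info_list/2.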

Best, Adam