Posted to dev@couchdb.apache.org by Antony Blakey <an...@gmail.com> on 2009/05/15 08:44:42 UTC

Attachment Replication Problem

I have a 3.5G Couchdb database, consisting of 1000 small documents,  
each with many attachments (0-30 per document), each attachment  
varying wildly in size (1K..10M).

To test replication I am running a server on my MBPro and another  
under Ubuntu in VMWare on the same machine. I'm testing using a pure  
trunk.

Doing a pull-replicate from OSX to Linux fails to complete. The point  
at which it fails is constant. I've added some debug logs into  
couch_rep/attachment_loop like this: http://gist.github.com/112070 and  
made the suggested "couch_util:should_flush(1000)" mod to try and  
guarantee progress (but to no avail). The debug output shows this:
http://gist.github.com/112069 and the document it seems to fail on is
this: http://gist.github.com/112074 . I'm only just starting to look at
this - any pointers would be appreciated.
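
For reference, the pull replication is kicked off with a POST to the
target's _replicate resource; from an Erlang shell it looks roughly like
this (host, port and database names here are placeholders, not my actual
setup):

    %% Sketch only: trigger a pull replication on the target node.
    %% Assumes ibrowse is already started; names and ports are placeholders.
    Body = <<"{\"source\": \"http://osx-host:5984/sourcedb\", \"target\": \"sourcedb\"}">>,
    ibrowse:send_req("http://localhost:5984/_replicate",
                     [{"Content-Type", "application/json"}],
                     post, Body).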

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The project was so plagued by politics and ego that when the engineers  
requested technical oversight, our manager hired a psychologist instead.
   -- Ron Avitzur


Re: Attachment Replication Problem

Posted by Antony Blakey <an...@gmail.com>.
On 15/05/2009, at 2:44 PM, Antony Blakey wrote:

> I have a 3.5G Couchdb database, consisting of 1000 small documents,  
> each with many attachments (0-30 per document), each attachment  
> varying wildly in size (1K..10M).
>
> To test replication I am running a server on my MBPro and another  
> under Ubuntu in VMWare on the same machine. I'm testing using a pure  
> trunk.
>
> Doing a pull-replicate from OSX to Linux fails to complete. The  
> point at which it fails is constant. I've added some debug logs into  
> couch_rep/attachment_loop like this: http://gist.github.com/112070  
> and made the suggested "couch_util:should_flush(1000)" mod to try  
> and guarantee progress (but to no avail). The debug output shows  
> this: http://gist.github.com/112069 and the document it seems to  
> fail on is this: http://gist.github.com/112074 . I'm only just  
> starting to look at this - any pointers would be appreciated.

It's not immediately obvious to me at the moment, but I wonder whether,
with many small documents whose attachments are considerably larger than
the documents themselves, it's possible to spawn too many async
attachment downloads. In that case, maybe throttling the number of
concurrent downloads would be a good idea.
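
Something like a counting semaphore in front of the attachment requests
would do it. A rough sketch (purely illustrative; none of these names
exist in couch_rep):

    %% Sketch of a download throttle: a process that hands out at most
    %% Max concurrent slots; callers block in acquire/1 until a slot frees.
    throttle_start(Max) ->
        spawn_link(fun() -> throttle_loop(Max) end).

    throttle_loop(0) ->
        receive release -> throttle_loop(1) end;
    throttle_loop(Free) ->
        receive
            {acquire, From} -> From ! {self(), ok}, throttle_loop(Free - 1);
            release         -> throttle_loop(Free + 1)
        end.

    acquire(Throttle) ->
        Throttle ! {acquire, self()},
        receive {Throttle, ok} -> ok end.

    release(Throttle) ->
        Throttle ! release.
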
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The greatest challenge to any thinker is stating the problem in a way  
that will allow a solution
   -- Bertrand Russell


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 29/05/2009, at 1:09 PM, Damien Katz wrote:

> Antony, there is a good chance this is related to COUCHDB-366, which
> I just submitted a fix for. Update to the latest trunk and see if it
> fixes it.

Unfortunately not.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

When I hear somebody sigh, 'Life is hard,' I am always tempted to ask,  
'Compared to what?'
   -- Sydney Harris



Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 29/05/2009, at 2:07 PM, Antony Blakey wrote:

> Problem solved! It's actually the original problem - the patch I  
> submitted was misapplied. The problem is in couch_rep - this code:

Actually, that would be couch_db, not couch_rep.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Success is not the key to happiness. Happiness is the key to success.
  -- Albert Schweitzer


Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
On May 29, 2009, at 1:08 AM, Antony Blakey wrote:

> On 29/05/2009, at 2:28 PM, Adam Kocoloski wrote:
>
>> I did apply your patch correctly, but the tail-append merge  
>> introduced a regression.
>
> Sorry, didn't mean to cast aspersions.
>
> BTW, did you see this: https://issues.apache.org/jira/browse/COUCHDB-365?
> Your mods plus the fix I made for that mean that the trunk now has
> no problems replicating my very attachment-heavy dbs (apart from the
> purely aesthetic issue of the chunked-with-no-body PUT retries).

Hey, that's great to hear!  I committed a couple of fixes related to  
that ticket, including one that's hopefully equivalent to your fix.   
Best, Adam

Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 29/05/2009, at 2:28 PM, Adam Kocoloski wrote:

> I did apply your patch correctly, but the tail-append merge  
> introduced a regression.

Sorry, didn't mean to cast aspersions.

BTW, did you see this: https://issues.apache.org/jira/browse/COUCHDB-365?
Your mods plus the fix I made for that mean that the trunk now has
no problems replicating my very attachment-heavy dbs (apart from the
purely aesthetic issue of the chunked-with-no-body PUT retries).

Cheers,

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The project was so plagued by politics and ego that when the engineers  
requested technical oversight, our manager hired a psychologist instead.
   -- Ron Avitzur


Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
Thanks for catching this, Antony.  I did apply your patch correctly,  
but the tail-append merge introduced a regression.  Best,

Adam

On May 29, 2009, at 12:37 AM, Antony Blakey wrote:

> Problem solved! It's actually the original problem - the patch I  
> submitted was misapplied. The problem is in couch_rep - this code:
>
> write_streamed_attachment(Stream, F, LenLeft) ->
>    Bin = F(),
>    ok = couch_stream:write(Stream, check_bin_length(LenLeft, Bin)),
>    write_streamed_attachment(Stream, F, LenLeft - size(Bin)).
>
> needs to be this:
>
> write_streamed_attachment(Stream, F, LenLeft) ->
>    Bin = F(),
>    TruncatedBin = check_bin_length(LenLeft, Bin),
>    ok = couch_stream:write(Stream, TruncatedBin),
>    write_streamed_attachment(Stream, F, LenLeft - size(TruncatedBin)).
>
> The problem is the last parameter to write_streamed_attachment being  
> < 0 when the data is too long.
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Always have a vision. Why spend your life making other people’s  
> dreams?
> -- Orson Welles (1915-1985)
>


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
Problem solved! It's actually the original problem - the patch I  
submitted was misapplied. The problem is in couch_rep - this code:

write_streamed_attachment(Stream, F, LenLeft) ->
     Bin = F(),
     ok = couch_stream:write(Stream, check_bin_length(LenLeft, Bin)),
     write_streamed_attachment(Stream, F, LenLeft - size(Bin)).

needs to be this:

write_streamed_attachment(Stream, F, LenLeft) ->
     Bin = F(),
     TruncatedBin = check_bin_length(LenLeft, Bin),
     ok = couch_stream:write(Stream, TruncatedBin),
     write_streamed_attachment(Stream, F, LenLeft - size(TruncatedBin)).

The problem is the last parameter to write_streamed_attachment going
negative when the data is too long: one extra byte leaves LenLeft at -1,
so the clause that terminates on 0 never matches and the writer keeps
waiting for data that never arrives.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Always have a vision. Why spend your life making other people’s dreams?
  -- Orson Welles (1915-1985)


Re: Attachment Replication Problem - Bug Found

Posted by Damien Katz <da...@apache.org>.
Antony, there is a good chance this is related to COUCHDB-366, which I
just submitted a fix for. Update to the latest trunk and see if it
fixes it.

-Damien


On May 28, 2009, at 11:08 PM, Antony Blakey wrote:

> Further to this issue, using trunk my repeatable error hangs  
> replication. I get a result like this:
>
> ------------------------------------------------------------------------------------------
> [debug] [<0.125.0>] Attachment URL http://localhost:5985/acumen-curricula/3861b6572b83310c8d1bd4af19c24960/generated_learning_guide.pdf?rev=12-3009421878
> [debug] [<0.125.0>] streaming attachment Status "200" Headers  
> [{"Transfer-Encoding","chunked"},
>                                           {"Server",
>                                            "CouchDB/0.10.0a (Erlang  
> OTP/R13B)"},
>                                            
> {"ETag","\"12-3009421878\""},
>                                           {"Date",
>                                            "Fri, 29 May 2009  
> 02:29:05 GMT"},
>                                           {"Content- 
> Type","application/pdf"},
>                                           {"Cache-Control",
>                                            "must-revalidate"}]
> [debug] [<0.125.0>] REPLICATOR: about to update_docs
> [debug] [<0.125.0>] REPLICATOR: in update_docs
> [debug] [<0.125.0>] REPLICATOR: about to write_and_commit
> [debug] [<0.125.0>] REPLICATOR: about to doc_flush_binaries
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for Learning  
> Guide.odt
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for Pre- 
> Service Lesson Planning Guide.doc
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for Trainers  
> Guide.odt
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for New  
> Mindmap.png
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for  
> Presentation Guide.odp
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for mpabroa.mm
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for New  
> Mindmap.mm
> [debug] [<0.125.0>] REPLICATOR: about to flush_binary for  
> generated_learning_guide.pdf
> [debug] [<0.125.0>] write_streamed_attachment has written too much  
> expected: 81912 got: 81913 tail: <<"\r">>
> ------------------------------------------------------------------------------------------
>
> And it's all over for that couchdb. Restart is required to continue.
>
> I've set the ibrowse # of sessions and pipeline to 1 to try and  
> remove the pipelining and concurrent connections from the mix, but  
> still it happens - then again it seems I'm not being entirely  
> successful in that regard because ...
>
> When I turn ibrowse tracing on (in make_attachment_receiver) like  
> this:
>
>    ...
>    {ok, Conn} = ibrowse:spawn_link_worker_process(Host, Port),
>    Conn ! {trace, true},
>    ...
>
> then I see this result:
>
> ------------------------------------------------------------------------------------------
> 2009-5-28_23:18:25:922 -- (localhost:5985) - Recvd more data: size:  
> 69. NeedBytes: 62
> 2009-5-28_23:18:25:922 -- (localhost:5985) - Recvd another chunk...
> 2009-5-28_23:18:25:922 -- (localhost:5985) - RemData -> "\r\n0\r\n\r 
> \n"
> 2009-5-28_23:18:25:925 -- (localhost:5985) - Determined chunk size:  
> 0. Already recvd: 2
> 2009-5-28_23:18:25:925 -- (localhost:5985) - Detected end of chunked  
> transfer...
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
> Presentation Guide.odp
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for New  
> Mindmap.mm
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
> Mathematics 2 Application of Subtraction.odt
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
> generated_learning_guide.pdf
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Learning  
> Guide.odt
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Pre- 
> Service Lesson Planning Guide.doc
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Trainers  
> Guide.odt
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
> Presentation Guide.odp
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for New  
> Mindmap.mm
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
> generated_learning_guide.pdf
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Learning  
> Guide.odt
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Trainers  
> Guide.odt
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Pre- 
> Service Lesson Planning Guide.doc
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Gases.png
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Behaviour  
> of Gases1.odt
> [debug] [<0.120.0>] REPLICATOR: about to flush_binary for Science 9  
> - Behaviour of Gases.pdf
> [debug] [<0.120.0>] write_streamed_attachment has written too much  
> expected: 383360 got: 383361 tail: <<"\r">>
> 2009-5-28_23:18:51:418 -- (localhost:5985) - TCP connection closed  
> by peer!
> 2009-5-28_23:18:51:637 -- (localhost:5985) - TCP connection closed  
> by peer!
> 2009-5-28_23:18:51:678 -- (localhost:5985) - TCP connection closed  
> by peer!
> 2009-5-28_23:18:51:932 -- (localhost:5985) - TCP connection closed  
> by peer!
> 2009-5-28_23:18:52:75 -- (localhost:5985) - TCP connection closed by  
> peer!
> ------------------------------------------------------------------------------------------
>
> And once again, all over. I initially suspected that the ibrowse  
> error was terminating the stream readers without sending the failure  
> response, but I'm not really sure.
>
> I suspect from the ibrowse tracing showing interleaved data responses
> (and those 5 connection close messages) that I haven't succeeded in
> making ibrowse linear-one-request, which I need to do to find this
> problem in ibrowse.
>
> Any hints on how to truly make ibrowse single-connection without  
> pipelining?
>
> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Some defeats are instalments to victory.
>  -- Jacob Riis
>
>


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 29/05/2009, at 12:38 PM, Antony Blakey wrote:

> Any hints on how to truly make ibrowse single-connection without  
> pipelining?

My error - attachment downloads bypass the ibrowse pool.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Don't anthropomorphize computers. They hate that.



Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
Further to this issue, using trunk my repeatable error hangs  
replication. I get a result like this:

------------------------------------------------------------------------------------------
[debug] [<0.125.0>] Attachment URL http://localhost:5985/acumen-curricula/3861b6572b83310c8d1bd4af19c24960/generated_learning_guide.pdf?rev=12-3009421878
[debug] [<0.125.0>] streaming attachment Status "200" Headers  
[{"Transfer-Encoding","chunked"},
                                            {"Server",
                                             "CouchDB/0.10.0a (Erlang  
OTP/R13B)"},
                                             
{"ETag","\"12-3009421878\""},
                                            {"Date",
                                             "Fri, 29 May 2009  
02:29:05 GMT"},
                                            {"Content- 
Type","application/pdf"},
                                            {"Cache-Control",
                                             "must-revalidate"}]
[debug] [<0.125.0>] REPLICATOR: about to update_docs
[debug] [<0.125.0>] REPLICATOR: in update_docs
[debug] [<0.125.0>] REPLICATOR: about to write_and_commit
[debug] [<0.125.0>] REPLICATOR: about to doc_flush_binaries
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for Learning  
Guide.odt
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for Pre-Service  
Lesson Planning Guide.doc
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for Trainers  
Guide.odt
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for New  
Mindmap.png
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for Presentation  
Guide.odp
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for mpabroa.mm
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for New Mindmap.mm
[debug] [<0.125.0>] REPLICATOR: about to flush_binary for  
generated_learning_guide.pdf
[debug] [<0.125.0>] write_streamed_attachment has written too much  
expected: 81912 got: 81913 tail: <<"\r">>
------------------------------------------------------------------------------------------

And it's all over for that couchdb. Restart is required to continue.

I've set the ibrowse # of sessions and pipeline to 1 to try and remove  
the pipelining and concurrent connections from the mix, but still it  
happens - then again it seems I'm not being entirely successful in  
that regard because ...

When I turn ibrowse tracing on (in make_attachment_receiver) like this:

     ...
     {ok, Conn} = ibrowse:spawn_link_worker_process(Host, Port),
     Conn ! {trace, true},
     ...

then I see this result:

------------------------------------------------------------------------------------------
2009-5-28_23:18:25:922 -- (localhost:5985) - Recvd more data: size:  
69. NeedBytes: 62
2009-5-28_23:18:25:922 -- (localhost:5985) - Recvd another chunk...
2009-5-28_23:18:25:922 -- (localhost:5985) - RemData -> "\r\n0\r\n\r\n"
2009-5-28_23:18:25:925 -- (localhost:5985) - Determined chunk size: 0.  
Already recvd: 2
2009-5-28_23:18:25:925 -- (localhost:5985) - Detected end of chunked  
transfer...
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Presentation  
Guide.odp
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for New Mindmap.mm
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Mathematics  
2 Application of Subtraction.odt
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
generated_learning_guide.pdf
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Learning  
Guide.odt
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Pre-Service  
Lesson Planning Guide.doc
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Trainers  
Guide.odt
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Presentation  
Guide.odp
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for New Mindmap.mm
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for  
generated_learning_guide.pdf
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Learning  
Guide.odt
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Trainers  
Guide.odt
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Pre-Service  
Lesson Planning Guide.doc
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Gases.png
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Behaviour of  
Gases1.odt
[debug] [<0.120.0>] REPLICATOR: about to flush_binary for Science 9 -  
Behaviour of Gases.pdf
[debug] [<0.120.0>] write_streamed_attachment has written too much  
expected: 383360 got: 383361 tail: <<"\r">>
2009-5-28_23:18:51:418 -- (localhost:5985) - TCP connection closed by  
peer!
2009-5-28_23:18:51:637 -- (localhost:5985) - TCP connection closed by  
peer!
2009-5-28_23:18:51:678 -- (localhost:5985) - TCP connection closed by  
peer!
2009-5-28_23:18:51:932 -- (localhost:5985) - TCP connection closed by  
peer!
2009-5-28_23:18:52:75 -- (localhost:5985) - TCP connection closed by  
peer!
------------------------------------------------------------------------------------------

And once again, all over. I initially suspected that the ibrowse error  
was terminating the stream readers without sending the failure  
response, but I'm not really sure.

I suspect from the ibrowse tracing showing interleaved data responses
(and those 5 connection close messages) that I haven't succeeded in
making ibrowse linear-one-request, which I need to do to find this
problem in ibrowse.

Any hints on how to truly make ibrowse single-connection without  
pipelining?
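
For the record, what I mean by setting those limits is roughly the
following (assuming the bundled ibrowse exposes set_max_sessions/3 and
set_max_pipeline_size/3; the per-request options should be equivalent):

    %% Illustrative only: pin ibrowse to one connection and no pipelining
    %% for the source host, either globally or per request.
    Url = "http://localhost:5985/acumen-curricula/"
          "3861b6572b83310c8d1bd4af19c24960/generated_learning_guide.pdf",
    ibrowse:set_max_sessions("localhost", 5985, 1),
    ibrowse:set_max_pipeline_size("localhost", 5985, 1),
    ibrowse:send_req(Url, [], get, [],
                     [{max_sessions, 1}, {max_pipeline_size, 1}]).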

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Some defeats are instalments to victory.
   -- Jacob Riis



Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
On May 16, 2009, at 8:38 PM, Antony Blakey wrote:

> On 16/05/2009, at 8:45 PM, Adam Kocoloski wrote:
>
>> Hi Antony, thank you!  I encountered this "one more byte" problem  
>> once before, but it wasn't 100% reproducible, so I wasn't really  
>> comfortable checking in a workaround.  I've basically been waiting  
>> to see if it would ever show up for anyone else :-/
>>
>> I think we should commit this change, but I'd still like to confirm  
>> that the attachment on the target is not corrupted by the chunk  
>> processing issue (i.e. the last chunk starts with a \r or something  
>> like that).  Or even better, fix the chunk processing issue.
>
> Given my results, will you commit that fix?

Yes, I committed it now.  Thanks, Adam

Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 16/05/2009, at 8:45 PM, Adam Kocoloski wrote:

> Hi Antony, thank you!  I encountered this "one more byte" problem  
> once before, but it wasn't 100% reproducible, so I wasn't really  
> comfortable checking in a workaround.  I've basically been waiting  
> to see if it would ever show up for anyone else :-/
>
> I think we should commit this change, but I'd still like to confirm  
> that the attachment on the target is not corrupted by the chunk  
> processing issue (i.e. the last chunk starts with a \r or something  
> like that).  Or even better, fix the chunk processing issue.

Given my results, will you commit that fix?

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

When I hear somebody sigh, 'Life is hard,' I am always tempted to ask,  
'Compared to what?'
   -- Sydney Harris



Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 17/05/2009, at 9:27 PM, Adam Kocoloski wrote:

> On May 16, 2009, at 8:30 PM, Antony Blakey wrote:
>
>> On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:
>>
>>> So, I think there's still some confusion here.  By "open  
>>> connections" do you mean TCP connections to the source?  That  
>>> number is never higher than 10.  ibrowse does pipeline requests on  
>>> those 10 connections, so there could be as many as 1000  
>>> simultaneous HTTP requests.  However, those requests complete as  
>>> soon as the data reaches the ibrowse client process, so in fact  
>>> the number of outstanding request during replication is usually  
>>> very small.  We're not doing flow control at the TCP socket layer.
>>
>> OK, I understand that now. That means that a document with > 1000  
>> attachments can't be replicated because ibrowse will never send  
>> ibrowse_async_headers for the excess attachments to  
>> attachment_loop, which needs to happen for every attachment before  
>> any of the data is read by doc_flush_binaries. Which is to say that  
>> every document attachment needs to start e.g. receive headers,  
>> before any attachment bodies are consumed.
>
> Not quite.  So, this discussion is going to quickly become even more  
> confusing because as of yesterday attachments are downloaded on  
> dedicated connections outside the load-balanced connection pool.   
> For the sake of argument let's stick with the behavior as of 2 days  
> ago at first.
>
> I keep coming back to this key point: _ibrowse has no flow  
> control_.  It doesn't matter whether we consume the  
> ibrowse_async_headers message in the attachment receiver or not;  
> ibrowse is still going to immediately send all those  
> ibrowse_async_response messages our way.

Sure, my point was that once the queue is full it won't send the  
ibrowse_async_headers (because it will never start the connection). I  
didn't realise that it would fail before that (as you explain below).  
I was assuming it would just block. Hence all my previous comments.

> Now, your point about limits on the number of attachments in a  
> document is a good one.  What I imagine would happen is the following:
>
> 1) couch_rep spawns off 1000+ attachment requests to ibrowse for a  
> single document
> 2) ibrowse starts sending back {error, retry_later} responses when  
> the queue is full
> 3) the attachment receiver processes start exiting with  
> attachment_request_failed
> 4) couch_rep traps the exits and reboots the document enumerator  
> starting at current_seq
> 5) repeat
>
> Obviously this is not a good situation.  Now, I mentioned earlier  
> that as of yesterday the attachment downloads are each done on  
> dedicated connections.  I pulled them out of the connection pool so  
> that a document download didn't get stuck behind a giant attachment  
> download (the end result would be one way to make couch run out of  
> memory).  This means that the max_connections x max_pipeline doesn't  
> apply to attachments.  Of course, using dedicated connections has  
> its own scalability problems.  Setting up and tearing down all of  
> those connections for the "lots of small attachments" case  
> introduces a significant cost, and eventually we could have so many  
> connections in TIME_WAIT that we run out of ephemeral ports.

That new scalability problem is what I thought the original problem  
was with ibrowse before I learnt it had a pool.

> A better solution might be to have a separate load-balanced  
> connection pool just for attachments.  We'd have to exercise some  
> care not to retry attachment requests on a connection that already  
> has requests in the pipeline.
>> In my case, I have some large attachments and unreliable links, so  
>> I'm partial to a solution that allows progress even of partial  
>> attachments during link failure. We could get this by not delaying  
>> the attachments, and buffering them to disk, using range requests  
>> on the get for partial downloads. This would solve some problems  
>> because it starts with the requirement to always make progress,  
>> never redoing work. This seems like it could be done reasonably  
>> transparently just by modifying the attachment download code.
>
> I definitely like the idea of Range support for making progress in  
> the event of link failure.  In theory, it would be possible to build  
> this into ibrowse so we could transparently use it for very large  
> documents as well.
>
> I'm not absolutely opposed to saving attachments to temporary files  
> on disk, but I'd prefer to exhaust in-memory options first.


I'm pretty sure that the only scalable solution for documents with
significant numbers of attachments is to avoid having every attachment
in-progress downloading before the document is written, e.g. either
buffering to disk or the more radical mod of allowing attachments to be
written before the document, which I guess is not going to happen. And
once you allow buffering to disk as a last resort, you may as well use
it as the default mechanism. Apart from anything else, it's a good basis
for partial attachment download restart.

I'm wondering if it's worth exhausting in-memory options if disk  
buffering is absolutely required for at least one use case?

The problem I see with building it into ibrowse is the requirement to  
inject the restart/file management/expiration policies into ibrowse.

Cheers,

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

In anything at all, perfection is finally attained not when there is  
no longer anything to add, but when there is no longer anything to  
take away.
   -- Antoine de Saint-Exupery



Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
On May 16, 2009, at 8:30 PM, Antony Blakey wrote:

> On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:
>
>> So, I think there's still some confusion here.  By "open  
>> connections" do you mean TCP connections to the source?  That  
>> number is never higher than 10.  ibrowse does pipeline requests on  
>> those 10 connections, so there could be as many as 1000  
>> simultaneous HTTP requests.  However, those requests complete as  
>> soon as the data reaches the ibrowse client process, so in fact the  
>> number of outstanding request during replication is usually very  
>> small.  We're not doing flow control at the TCP socket layer.
>
> OK, I understand that now. That means that a document with > 1000  
> attachments can't be replicated because ibrowse will never send  
> ibrowse_async_headers for the excess attachments to attachment_loop,  
> which needs to happen for every attachment before any of the data is  
> read by doc_flush_binaries. Which is to say that every document  
> attachment needs to start e.g. receive headers, before any  
> attachment bodies are consumed.

Not quite.  So, this discussion is going to quickly become even more  
confusing because as of yesterday attachments are downloaded on  
dedicated connections outside the load-balanced connection pool.  For  
the sake of argument let's stick with the behavior as of 2 days ago at  
first.

I keep coming back to this key point: _ibrowse has no flow control_.   
It doesn't matter whether we consume the ibrowse_async_headers message  
in the attachment receiver or not; ibrowse is still going to  
immediately send all those ibrowse_async_response messages our way.

Now, your point about limits on the number of attachments in a  
document is a good one.  What I imagine would happen is the following:

1) couch_rep spawns off 1000+ attachment requests to ibrowse for a  
single document
2) ibrowse starts sending back {error, retry_later} responses when the  
queue is full
3) the attachment receiver processes start exiting with  
attachment_request_failed
4) couch_rep traps the exits and reboots the document enumerator  
starting at current_seq
5) repeat

Obviously this is not a good situation.  Now, I mentioned earlier that  
as of yesterday the attachment downloads are each done on dedicated  
connections.  I pulled them out of the connection pool so that a  
document download didn't get stuck behind a giant attachment download  
(the end result would be one way to make couch run out of memory).   
This means that the max_connections x max_pipeline doesn't apply to  
attachments.  Of course, using dedicated connections has its own  
scalability problems.  Setting up and tearing down all of those  
connections for the "lots of small attachments" case introduces a  
significant cost, and eventually we could have so many connections in  
TIME_WAIT that we run out of ephemeral ports.

A better solution might be to have a separate load-balanced connection  
pool just for attachments.  We'd have to exercise some care not to  
retry attachment requests on a connection that already has requests in  
the pipeline.

> In my case, I have some large attachments and unreliable links, so  
> I'm partial to a solution that allows progress even of partial  
> attachments during link failure. We could get this by not delaying  
> the attachments, and buffering them to disk, using range requests on  
> the get for partial downloads. This would solve some problems  
> because it starts with the requirement to always make progress,  
> never redoing work. This seems like it could be done reasonably  
> transparently just by modifying the attachment download code.

I definitely like the idea of Range support for making progress in the  
event of link failure.  In theory, it would be possible to build this  
into ibrowse so we could transparently use it for very large documents  
as well.

I'm not absolutely opposed to saving attachments to temporary files on  
disk, but I'd prefer to exhaust in-memory options first.

Cheers, Adam


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:

> So, I think there's still some confusion here.  By "open  
> connections" do you mean TCP connections to the source?  That number  
> is never higher than 10.  ibrowse does pipeline requests on those 10  
> connections, so there could be as many as 1000 simultaneous HTTP  
> requests.  However, those requests complete as soon as the data  
> reaches the ibrowse client process, so in fact the number of  
> outstanding request during replication is usually very small.  We're  
> not doing flow control at the TCP socket layer.

OK, I understand that now. That means that a document with > 1000
attachments can't be replicated, because ibrowse will never send
ibrowse_async_headers for the excess attachments to attachment_loop,
which needs to happen for every attachment before any of the data is
read by doc_flush_binaries. In other words, every attachment of a
document needs to have started (i.e. received its headers) before any
attachment bodies are consumed.

With concurrent replications the maximum number of attachments is
reduced, and it's possible to get a deadlock where the ibrowse queue
is full but no document has all of its attachment downloads started.

> I'm not sure I understand what part is "not scalable".  I agree that  
> ignoring the attachment receivers and their mailboxes when deciding  
> whether to checkpoint is a big problem.  I'm testing a fix for that  
> right now.  Is there something else you meant by that statement?   
> Best,

I didn't know about the ibrowse pool, so that part is scalable i.e.  
bounded number of connections and requests. If my comments above are  
correct, then the current architecture isn't scalable in respect to  
the number of attachments in the single-replicator case, and a more  
complicated equation in the multiple-replicator case.

> P.S. One issue in my mind is that we only do the checkpoint test  
> after we receive a document.  We could end up in a situation where a  
> document request is sitting in a pipeline behind a huge attachment,  
> and the checkpoint test won't execute until the entire attachment is  
> downloaded into memory.  There are ways around this, e.g. using  
> ibrowse:spawn_link_worker_process/2 to bypass the default connection  
> pool for attachment downloads.

Requiring every attachment to be started but not completed seems to me  
to be a fundamental issue.

In my case, I have some large attachments and unreliable links, so I'm  
partial to a solution that allows progress even of partial attachments  
during link failure. We could get this by not delaying the  
attachments, and buffering them to disk, using range requests on the  
get for partial downloads. This would solve some problems because it  
starts with the requirement to always make progress, never redoing  
work. This seems like it could be done reasonably transparently just  
by modifying the attachment download code.
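
A resumable download along those lines could be as simple as
re-requesting from the current offset with a Range header, assuming the
source server honours Range on attachment GETs. A sketch (illustrative
only, not existing couch_rep code; it reuses the stream_to and
stream_chunk_size options mentioned elsewhere in this thread):

    %% Illustrative: resume an attachment download from byte Offset via an
    %% HTTP Range request, streaming 1MB chunks to ReceiverPid.
    resume_attachment(Url, Offset, ReceiverPid) ->
        Headers = [{"Range", "bytes=" ++ integer_to_list(Offset) ++ "-"}],
        ibrowse:send_req(Url, Headers, get, [],
                         [{stream_to, ReceiverPid},
                          {stream_chunk_size, 1048576}]).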

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Nothing is really work unless you would rather be doing something else.
   -- J. M. Barre



Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
On May 16, 2009, at 8:53 PM, Antony Blakey wrote:

> On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:
>
>> So, I think there's still some confusion here.  By "open  
>> connections" do you mean TCP connections to the source?  That  
>> number is never higher than 10.  ibrowse does pipeline requests on  
>> those 10 connections, so there could be as many as 1000  
>> simultaneous HTTP requests.  However, those requests complete as  
>> soon as the data reaches the ibrowse client process, so in fact the  
>> number of outstanding request during replication is usually very  
>> small.  We're not doing flow control at the TCP socket layer.
>
> IIUC, given that no attachment bodies are consumed by the
> replicator until the documents are checkpointed, it's possible for  
> the replicator to block if the number of pending attachments in a  
> checkpoint buffer is greater than the ibrowse concurrent request  
> limit. In a case like mine, with many attachments on very small  
> documents, this is very likely. Or am I still confused? :/

There's one key point that you're overlooking.  From ibrowse'  
perspective, _there is no checkpoint buffer_.  ibrowse gets a request  
to download an attachment, and it immediately starts that request,  
sends the data in 1MB chunks to an attachment receiver process, and  
completes the request.  In theory, we could have 10,000 attachment  
receiver processes holding binaries to be written to disk, and ibrowse  
would be none the wiser.

Best, Adam

Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:

> So, I think there's still some confusion here.  By "open  
> connections" do you mean TCP connections to the source?  That number  
> is never higher than 10.  ibrowse does pipeline requests on those 10  
> connections, so there could be as many as 1000 simultaneous HTTP  
> requests.  However, those requests complete as soon as the data  
> reaches the ibrowse client process, so in fact the number of  
> outstanding request during replication is usually very small.  We're  
> not doing flow control at the TCP socket layer.

IIUC, given that no attachment bodies are consumed by the replicator
until the documents are checkpointed, it's possible for the replicator  
to block if the number of pending attachments in a checkpoint buffer  
is greater than the ibrowse concurrent request limit. In a case like  
mine, with many attachments on very small documents, this is very  
likely. Or am I still confused? :/

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

A priest, a minister and a rabbi walk into a bar. The bartender says  
"What is this, a joke?"



Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
On May 16, 2009, at 11:22 AM, Antony Blakey wrote:

> On 16/05/2009, at 11:07 PM, Adam Kocoloski wrote:
>
>> No, I don't believe so.  ibrowse accepts a {stream_to, pid()}  
>> option.  It accumulates packets until it reaches a threshold  
>> configurable by {stream_chunk_size, integer()} (default 1MB), then  
>> sends the data to the Pid.  I don't think ibrowse is writing to  
>> disk at any point  in the process.  We do see that when streaming  
>> really large attachments, ibrowse becomes the biggest memory user  
>> in the emulator.
>
> This is what I thought was happening, which means that with small  
> documents with many attachments (say > 1Mb) you could potentially  
> end up with masses of open connections representing data promises  
> that are only forced at checkpoint time, so that's not scalable. I  
> think the number of open ibrowse connections (which I see doesn't  
> necessarily match the number of unforced promises) needs to be an
> input to the checkpoint decision.

So, I think there's still some confusion here.  By "open connections"  
do you mean TCP connections to the source?  That number is never  
higher than 10.  ibrowse does pipeline requests on those 10  
connections, so there could be as many as 1000 simultaneous HTTP  
requests.  However, those requests complete as soon as the data  
reaches the ibrowse client process, so in fact the number of  
outstanding requests during replication is usually very small.  We're
not doing flow control at the TCP socket layer.

If by "open connections" you really mean "attachment receiver  
processes spawned by the couch_rep gen_server" I think you'd be closer  
to the mark.  We can get an approximate handle on that just by  
counting the number of links to the gen_server.
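
In code terms, something as crude as this would do for a first
approximation (illustrative only):

    %% Illustrative: approximate count of attachment receivers by counting
    %% processes linked to the replication gen_server.
    count_receivers(RepPid) ->
        {links, Links} = process_info(RepPid, links),
        length(Links).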

I'm not sure I understand what part is "not scalable".  I agree that  
ignoring the attachment receivers and their mailboxes when deciding  
whether to checkpoint is a big problem.  I'm testing a fix for that  
right now.  Is there something else you meant by that statement?  Best,

Adam

P.S. One issue in my mind is that we only do the checkpoint test after  
we receive a document.  We could end up in a situation where a  
document request is sitting in a pipeline behind a huge attachment,  
and the checkpoint test won't execute until the entire attachment is  
downloaded into memory.  There are ways around this, e.g. using  
ibrowse:spawn_link_worker_process/2 to bypass the default connection  
pool for attachment downloads.
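
For completeness, bypassing the pool for a single request would look
roughly like this (assuming ibrowse's send_req_direct and
stop_worker_process calls behave as I remember; illustrative only):

    %% Illustrative: fetch one resource on a dedicated connection, outside
    %% the shared ibrowse pool, then tear the worker down again.
    fetch_on_dedicated_conn(Url, Host, Port) ->
        {ok, Conn} = ibrowse:spawn_link_worker_process(Host, Port),
        Result = ibrowse:send_req_direct(Conn, Url, [], get),
        ibrowse:stop_worker_process(Conn),
        Result.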


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 16/05/2009, at 11:07 PM, Adam Kocoloski wrote:

> No, I don't believe so.  ibrowse accepts a {stream_to, pid()}  
> option.  It accumulates packets until it reaches a threshold  
> configurable by {stream_chunk_size, integer()} (default 1MB), then  
> sends the data to the Pid.  I don't think ibrowse is writing to disk  
> at any point  in the process.  We do see that when streaming really  
> large attachments, ibrowse becomes the biggest memory user in the  
> emulator.

This is what I thought was happening, which means that with small  
documents with many attachments (say > 1Mb) you could potentially end  
up with masses of open connections representing data promises that are  
only forced at checkpoint time, so that's not scalable. I think the  
number of open ibrowse connections (which I see doesn't necessarily
match the number of unforced promises) needs to be an input to the
checkpoint decision.

> ibrowse does offer a {save_response_to_file, boolean()|filename()}  
> option that we could possibly leverage.

That sounds like a better idea.

> Are you keeping an eye on the memory usage?  I think an out of  
> memory error can trigger this sudden death in Erlang.

Not a memory failure - it happens at the same place with either 1 or  
1.5G. Once again I have a repeatable failure that would normally be  
random :/ I'm just wondering how to debug it.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

A priest, a minister and a rabbi walk into a bar. The bartender says  
"What is this, a joke?"



Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Antony,

On May 16, 2009, at 10:39 AM, Antony Blakey wrote:

> I can confirm that the target and source of replicated resources  
> affected by this issue are identical with this fix, and both are  
> correct i.e. uncorrupted, although this is only according to the  
> failures I've seen.

Thanks!  Makes me feel better, at least.

>> Now, on to the checkpointing conditions.  I think there's some  
>> confusion about the attachment workflow.  Attachments are  
>> downloaded _immediately_ and in their entirety by ibrowse, which  
>> then sends the data as 1MB binary chunks to the attachment receiver  
>> processes.
>
> Are they downloaded to disk by ibrowse?

No, I don't believe so.  ibrowse accepts a {stream_to, pid()} option.   
It accumulates packets until it reaches a threshold configurable by  
{stream_chunk_size, integer()} (default 1MB), then sends the data to  
the Pid.  I don't think ibrowse is writing to disk at any point  in  
the process.  We do see that when streaming really large attachments,  
ibrowse becomes the biggest memory user in the emulator.

ibrowse does offer a {save_response_to_file, boolean()|filename()}  
option that we could possibly leverage.
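
Using it would be straightforward; something like this (path and names
are illustrative):

    %% Illustrative: stream a response straight to a file on disk via
    %% ibrowse's save_response_to_file option, so the body never sits in a
    %% process mailbox.
    fetch_to_file(Url, Path) ->
        ibrowse:send_req(Url, [], get, [], [{save_response_to_file, Path}]).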

>> In another thread Matt Goodall suggested checkpointing after a  
>> certain amount of time has passed.  So we'd have a checkpointing  
>> algo that considers
>>
>> * memory utilization
>> * number of pending writes
>> * time elapsed
>
> That seems to cover both resource usage and incremental progress. As  
> far as the couch_util:should_flush mechanism is concerned, I think a  
> good idea would be to commit 1 document, then 2, then 4 i.e. a  
> binary increasing window which adapts well to both unreliable and  
> reliable connections without requiring configuration, which is  
> tricky because you may want to run the system in a variety of  
> scenarios, and you might not know what the failure characteristics  
> are (and they may change over time).

It sounds like a good idea.  I had thought about doing the same for  
the process that pulls new docs from the source server, so that we  
could do a better job of filling up the pipes when we're dealing with  
the common case of small documents without significant attachment data.

> While we're on this - any idea about why couchdb is quitting during
> replication? It's not giving me any errors.

Errm, no, I'm afraid I don't have any idea there.  I remember one or  
two other reports in JIRA that sounds similar, but I've not been able  
to reproduce them.  Are you keeping an eye on the memory usage?  I  
think an out of memory error can trigger this sudden death in Erlang.   
Sorry, that's the best I've got at the moment.

Adam


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 16/05/2009, at 8:45 PM, Adam Kocoloski wrote:

> I encountered this "one more byte" problem once before, but it  
> wasn't 100% reproducible, so I wasn't really comfortable checking in  
> a workaround.  I've basically been waiting to see if it would ever  
> show up for anyone else :-/

I'm lucky to have a repeatable failure :/

> I think we should commit this change, but I'd still like to confirm  
> that the attachment on the target is not corrupted by the chunk  
> processing issue (i.e. the last chunk starts with a \r or something  
> like that).

I can confirm that the target and source of replicated resources  
affected by this issue are identical with this fix, and both are  
correct i.e. uncorrupted, although this is only according to the  
failures I've seen.

> Now, on to the checkpointing conditions.  I think there's some  
> confusion about the attachment workflow.  Attachments are downloaded  
> _immediately_ and in their entirety by ibrowse, which then sends the  
> data as 1MB binary chunks to the attachment receiver processes.

Are they downloaded to disk by ibrowse?

> In another thread Matt Goodall suggested checkpointing after a  
> certain amount of time has passed.  So we'd have a checkpointing  
> algo that considers
>
> * memory utilization
> * number of pending writes
> * time elapsed

That seems to cover both resource usage and incremental progress. As
far as the couch_util:should_flush mechanism is concerned, I think a
good idea would be to commit 1 document, then 2, then 4, i.e. a binary
increasing window. That adapts well to both unreliable and reliable
connections without requiring configuration, which is tricky because
you may want to run the system in a variety of scenarios and might not
know what the failure characteristics are (and they may change over
time).
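
A sketch of what I mean (names and the cap are illustrative):

    %% Illustrative: binary increasing checkpoint window. Commit after
    %% Window documents; double the window after each successful checkpoint
    %% (up to a cap) and drop back to 1 whenever replication has to restart.
    -define(MAX_WINDOW, 256).

    next_window(ok, Window) when Window < ?MAX_WINDOW -> Window * 2;
    next_window(ok, _Window)                          -> ?MAX_WINDOW;
    next_window(restart, _Window)                     -> 1.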

While we're on this - any idea about why couchdb is quitting during
replication? It's not giving me any errors.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Hi, I'd like to do $THING. I know that $SOLUTION_A and $SOLUTION_B  
will do it very easily and for a very reasonable price, but I don't  
want to use $SOLUTION_A or $SOLUTION_B because $VAGUE_REASON and  
$CONTRADICTORY_REASON. Instead, I'd like your under-informed ideas on  
how to achieve my $POORLY_CONCEIVED_AMBITIONS using Linux, duct tape,  
an iPod, and hours and hours of my precious time.
   -- Slashdot response to an enquiry



Re: Attachment Replication Problem - Bug Found

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Antony, thank you!  I encountered this "one more byte" problem once  
before, but it wasn't 100% reproducible, so I wasn't really  
comfortable checking in a workaround.  I've basically been waiting to  
see if it would ever show up for anyone else :-/

I think we should commit this change, but I'd still like to confirm  
that the attachment on the target is not corrupted by the chunk  
processing issue (i.e. the last chunk starts with a \r or something  
like that).  Or even better, fix the chunk processing issue.

Now, on to the checkpointing conditions.  I think there's some  
confusion about the attachment workflow.  Attachments are downloaded  
_immediately_ and in their entirety by ibrowse, which then sends the  
data as 1MB binary chunks to the attachment receiver processes.  The  
data sits in these processes' mailboxes until the next checkpoint.   
The flow control occurs entirely in Couch, not in ibrowse or the TCP  
layer.  We shouldn't end up with too many open connections this way --  
but if we do, we can tweak the max_connections and max_pipeline_size  
ibrowse parameters to throttle it.

You all appear to be right, the pending attachment data are not  
considered when deciding when to checkpoint.  That's a major bug and a  
regression in my opinion.  My bad.

In another thread Matt Goodall suggested checkpointing after a certain  
amount of time has passed.  So we'd have a checkpointing algo that  
considers

* memory utilization
* number of pending writes
* time elapsed

Anything else we ought to take as an input?  I've got some time to  
hack on this today.
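
In other words, a decision function shaped something like this
(thresholds and names purely illustrative):

    %% Illustrative: checkpoint as soon as any input crosses its threshold,
    %% i.e. memory held by pending attachment data, number of pending
    %% document writes, or time since the last checkpoint.
    should_checkpoint(MemBytes, PendingWrites, ElapsedMs) ->
        MemBytes      > 100 * 1024 * 1024 orelse
        PendingWrites > 1000              orelse
        ElapsedMs     > 60 * 1000.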

Adam

On May 16, 2009, at 5:08 AM, Antony Blakey wrote:

>
> On 16/05/2009, at 12:59 PM, Antony Blakey wrote:
>
>> and truncate the binary to the expected length. I'm not familiar  
>> with ibrowse in terms of debugging this problem further.
>
> The final mod I've ended up with is this, which deals with the  
> ibrowse problem:
>
> ------------------------------------------------------------------------------
>
> write_streamed_attachment(_Stream, _F, 0, SpAcc) ->
>    {ok, SpAcc};
> write_streamed_attachment(Stream, F, LenLeft, nil) ->
>    Bin = F(),
>    TruncatedBin = check_bin_length(LenLeft, Bin),
>    {ok, StreamPointer} = couch_stream:write(Stream, TruncatedBin),
>    write_streamed_attachment(Stream, F, LenLeft -  
> size(TruncatedBin), StreamPointer);
> write_streamed_attachment(Stream, F, LenLeft, SpAcc) ->
>    Bin = F(),
>    TruncatedBin = check_bin_length(LenLeft, Bin),
>    {ok, _} = couch_stream:write(Stream, TruncatedBin),
>    write_streamed_attachment(Stream, F, LenLeft -  
> size(TruncatedBin), SpAcc).
>
> check_bin_length(LenLeft, Bin) when size(Bin) > LenLeft ->
>    <<ValidData:LenLeft/binary, Crap/binary>> = Bin,
>    ?LOG_ERROR("write_streamed_attachment has written too much  
> expected: ~p got: ~p tail: ~p", [LenLeft, size(Bin), Crap]),
>    ValidData;
> check_bin_length(_, Bin) -> Bin.
>
> ------------------------------------------------------------------------------
>
> Interestingly, the problems occur at exactly the same points
> during replication, and in each case the excess tail is <<"\r">>,  
> which suggests to me a boundary condition processing a chunked  
> response. It's probably not a problem creating the response because  
> direct access using wget returns the right amount of data.
>
> My replication still fails near the end, this time silently killing  
> couchdb, but it's getting closer.
>
> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Always have a vision. Why spend your life making other people’s  
> dreams?
> -- Orson Welles (1915-1985)
>


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 16/05/2009, at 12:59 PM, Antony Blakey wrote:

> and truncate the binary to the expected length. I'm not familiar  
> with ibrowse in terms of debugging this problem further.

The final mod I've ended up with is this, which deals with the ibrowse  
problem:

------------------------------------------------------------------------------

write_streamed_attachment(_Stream, _F, 0, SpAcc) ->
     {ok, SpAcc};
write_streamed_attachment(Stream, F, LenLeft, nil) ->
     Bin = F(),
     TruncatedBin = check_bin_length(LenLeft, Bin),
     {ok, StreamPointer} = couch_stream:write(Stream, TruncatedBin),
     write_streamed_attachment(Stream, F, LenLeft -  
size(TruncatedBin), StreamPointer);
write_streamed_attachment(Stream, F, LenLeft, SpAcc) ->
     Bin = F(),
     TruncatedBin = check_bin_length(LenLeft, Bin),
     {ok, _} = couch_stream:write(Stream, TruncatedBin),
     write_streamed_attachment(Stream, F, LenLeft -  
size(TruncatedBin), SpAcc).

check_bin_length(LenLeft, Bin) when size(Bin) > LenLeft ->
     <<ValidData:LenLeft/binary, Crap/binary>> = Bin,
     ?LOG_ERROR("write_streamed_attachment has written too much  
expected: ~p got: ~p tail: ~p", [LenLeft, size(Bin), Crap]),
     ValidData;
check_bin_length(_, Bin) -> Bin.

------------------------------------------------------------------------------

Interestingly, the problems occur at exactly the same points
during replication, and in each case the excess tail is <<"\r">>,  
which suggests to me a boundary condition processing a chunked  
response. It's probably not a problem creating the response because  
direct access using wget returns the right amount of data.
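
For reference, each chunk on the wire is laid out as the chunk size in
hex, CRLF, the data, then another CRLF. A receiver that consumes only
one of the two trailing bytes hands exactly one stray <<"\r">> to the
caller, which matches the one-extra-byte symptom. A toy illustration
(not the actual ibrowse code):

    %% Toy illustration of chunk framing: after Size bytes of payload the
    %% parser must also consume the trailing CRLF. Consuming only one of
    %% those two bytes leaves a stray <<"\r">> glued to the data.
    take_chunk_data(Size, Wire) ->
        <<Data:Size/binary, "\r\n", Rest/binary>> = Wire,
        {Data, Rest}.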

My replication still fails near the end, this time silently killing  
couchdb, but it's getting closer.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Always have a vision. Why spend your life making other people’s dreams?
  -- Orson Welles (1915-1985)


Re: Attachment Replication Problem - Bug Found

Posted by Antony Blakey <an...@gmail.com>.
On 16/05/2009, at 9:15 AM, Antony Blakey wrote:

> On 16/05/2009, at 8:27 AM, Chris Anderson wrote:
>
>> Thanks for reporting this. I'm not sure I can see the issue in the
>> last logfile you posted (I haven't gone through the diffs to see  
>> where
>> you added log statements...) It seems that the attachment size is not
>> an issue, it's the fact that there are many many attachments on each
>> doc. This means it should be fairly easy to make a reproducible
>> JavaScript test case, that causes a never-finishing replication. Once
>> we have that, I'd be happy to run it and bang on the code till I get
>> it to pass.
>
> I've created a test case with many documents, but it doesn't cause  
> the problem, so it must be somewhat more subtle than it looks.  
> Specifically, it may have something to do with the replication state  
> to that point.

To deal with the problem of outstanding promises I set  
couch_util:should_flush(1) - that's a separate issue. The bug that  
causes my replication to hang seems to be in ibrowse. The problem is  
that ibrowse is returning 1 more byte of data than it should, and so  
the following code in couch_db is failing because the case where  
LenLeft - size(Bin) < 0 isn't being caught. This blocks replication.  
When I wget the offending resource I get the correct length. The  
problem is with the second attachment (Perceive.png) in
http://gist.github.com/112074 .

    write_streamed_attachment(_Stream, _F, 0, SpAcc) ->
        {ok, SpAcc};
   write_streamed_attachment(Stream, F, LenLeft, nil) ->
       Bin = F(),
       {ok, StreamPointer} = couch_stream:write(Stream, Bin),
       write_streamed_attachment(Stream, F, LenLeft - size(Bin),  
StreamPointer);
   write_streamed_attachment(Stream, F, LenLeft, SpAcc) ->
       Bin = F(),
       {ok, _} = couch_stream:write(Stream, Bin),
       write_streamed_attachment(Stream, F, LenLeft - size(Bin), SpAcc).

To enable replication to continue, a temporary fix is to replace the  
first case with this:

    write_streamed_attachment(_Stream, _F, LenLeft, SpAcc) when 1 > LenLeft ->
        {ok, SpAcc};

although maybe a better option is to *add* this case:

    write_streamed_attachment(_Stream, _F, LenLeft, SpAcc) when 0 > LenLeft ->
        ?LOG_ERROR("write_streamed_attachment has written too much data", []),
        {ok, SpAcc};

and truncate the binary to the expected length. I'm not familiar with  
ibrowse in terms of debugging this problem further.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

I contend that we are both atheists. I just believe in one fewer god  
than you do. When you understand why you dismiss all the other  
possible gods, you will understand why I dismiss yours.
   --Stephen F Roberts



Re: Attachment Replication Problem

Posted by Antony Blakey <an...@gmail.com>.
On 16/05/2009, at 8:27 AM, Chris Anderson wrote:

> Thanks for reporting this. I'm not sure I can see the issue in the
> last logfile you posted (I haven't gone through the diffs to see where
> you added log statements...) It seems that the attachment size is not
> an issue, its the fact that there are many many attachments on each
> doc. This means it should be fairly easy to make a reproducible
> JavaScript test case, that causes a never-finishing replication. Once
> we have that, I'd be happy to run it and bang on the code till I get
> it to pass.

I've created a test case with many documents, but it doesn't cause the  
problem, so it must be somewhat more subtle than it looks.  
Specifically, it may have something to do with the replication state  
to that point.

> I think the big problem is the architecture where attachments aren't
> started streaming until the doc itself is written to disk. There's no
> reason it should have to be this way, as we could set up a queue of
> attachments (and docs that are waiting on them) and make its width
> configurable, beginning the attachment transfer right away. I've
> written code like this a few times, and it should be totally doable in
> this context.

That's what I was thinking, although I think it's a considerable  
rewrite from what is currently there, and a significant issue is  
avoiding out-of-order writes. A better option might be to trigger  
checkpoints based on the number of outstanding promises, in  
combination with buffering attachment downloads to disk.
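
As a rough illustration (hypothetical names, not a patch), the flush  
test could consider both the buffered document size and the number of  
un-forced attachment promises:

   %% Hypothetical sketch: checkpoint when either the accumulated
   %% document bytes or the number of pending attachment promises
   %% crosses a threshold.
   -define(FLUSH_BYTES, 10 * 1024 * 1024).    % the current 10MB default
   -define(MAX_PENDING_ATTACHMENTS, 100).     % assumed limit

   should_checkpoint(DocBytes, PendingAttachments) ->
       DocBytes >= ?FLUSH_BYTES
           orelse PendingAttachments >= ?MAX_PENDING_ATTACHMENTS.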

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

I contend that we are both atheists. I just believe in one fewer god  
than you do. When you understand why you dismiss all the other  
possible gods, you will understand why I dismiss yours.
   --Stephen F Roberts



Re: Attachment Replication Problem

Posted by Chris Anderson <jc...@apache.org>.
On Fri, May 15, 2009 at 5:16 PM, Antony Blakey <an...@gmail.com> wrote:
>
> On 15/05/2009, at 2:44 PM, Antony Blakey wrote:
>
>> I have a 3.5G Couchdb database, consisting of 1000 small documents, each
>> with many attachments (0-30 per document), each attachment varying wildly in
>> size (1K..10M).
>>
>> To test replication I am running a server on my MBPro and another under
>> Ubuntu in VMWare on the same machine. I'm testing using a pure trunk.
>>
>> Doing a pull-replicate from OSX to Linux fails to complete. The point at
>> which it fails is constant. I've added some debug logs into
>> couch_rep/attachment_loop like this: http://gist.github.com/112070 and made
>> the suggested "couch_util:should_flush(1000)" mod to try and guarantee
>> progress (but to no avail). The debug output shows this:
>> http://gist.github.com/112069 and the document it seems to fail on is this:
>> http://gist.github.com/112074 . I'm only just starting to look at this - any
>> pointers would be appreciated.
>
> I put some more logging in attachment_loop, specifically this:
>
>        {ibrowse_async_response, ReqId, Data} ->
>            ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data A ~p",
> [Url]),
>            receive {From, gimme_data} -> From ! {self(), Data} end,
>            ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data B ~p",
> [Url]),
>            attachment_loop(ReqId);
>
> The result of this is to see an enormous number of 'Data A' logs without a
> corresponding 'Data B'. This happens because make_attachment_stub_receiver
> uses a promise to read the data, created like this:
>
>        ResponseCode >= 200, ResponseCode < 300 ->
>            % the normal case
>            Pid ! {self(), continue},
>            %% this function goes into the streaming attachment code.
>            %% It gets executed by the replication gen_server, so it can't
>            %% be the one to actually receive the ibrowse data.
>            {ok, fun() ->
>                Pid ! {self(), gimme_data},
>                receive {Pid, Data} -> Data end
>            end};
>
> It seems that the promise is forced (i.e. the data is read) only when the
> documents are checkpointed. If, as in my case, you have lots of small
> documents with many attachments, this results in massive numbers of open
> connections to download the attachments, each blocked reading the first bit
> of data from the first chunk, because the checkpointing occurs by default
> after 10MB of document data has been read, excluding attachments. In any
> case purely using size as a trigger won't work if you have lots of small
> documents with lots of small attachments. It would seem that the
> checkpointing, and hence forcing of the http-reading promises needs to also
> account for the number of promises.
>
> To overcome this problem I used couch_util:should_flush(1) to ensure that
> each document would be checkpointed, but that simply demonstrated that this
> isn't the cause of the 100% repeatable replication hang that I have. Now I
> get a log trace like this: http://gist.github.com/112512 (ignoring the crap
> at the end of each log statement, which is my incomplete attempt to link
> each log to the associated url).
>
> Anyone with any thoughts?
>

Thanks for reporting this. I'm not sure I can see the issue in the
last logfile you posted (I haven't gone through the diffs to see where
you added log statements...) It seems that the attachment size is not
an issue, it's the fact that there are many, many attachments on each
doc. This means it should be fairly easy to make a reproducible
JavaScript test case that causes a never-finishing replication. Once
we have that, I'd be happy to run it and bang on the code till I get
it to pass.

I think the big problem is the architecture where attachments aren't
started streaming until the doc itself is written to disk. There's no
reason it should have to be this way, as we could set up a queue of
attachments (and docs that are waiting on them) and make its width
configurable, beginning the attachment transfer right away. I've
written code like this a few times, and it should be totally doable in
this context.
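
A very rough sketch of that shape (hypothetical code, not a patch;  
fetch_attachment/1 stands in for the real ibrowse streaming request):

   %% Run at most Width attachment downloads at a time, starting the
   %% next one as each in-flight download finishes.
   download_attachments(Urls, Width) ->
       download_attachments(Urls, Width, 0).

   download_attachments([], _Width, 0) ->
       ok;
   download_attachments(Urls, Width, InFlight)
           when Urls =:= []; InFlight >= Width ->
       receive {attachment_done, _Url} ->
           download_attachments(Urls, Width, InFlight - 1)
       end;
   download_attachments([Url | Rest], Width, InFlight) ->
       Parent = self(),
       spawn_link(fun() ->
           fetch_attachment(Url),            % assumed helper
           Parent ! {attachment_done, Url}
       end),
       download_attachments(Rest, Width, InFlight + 1).

Docs waiting on their attachments could be queued behind the same  
counter, so each doc is only written once its downloads have completed.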

If you create a JS test case that'll kick us into gear looking for the best fix.

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Attachment Replication Problem

Posted by Antony Blakey <an...@gmail.com>.
On 15/05/2009, at 2:44 PM, Antony Blakey wrote:

> I have a 3.5G Couchdb database, consisting of 1000 small documents,  
> each with many attachments (0-30 per document), each attachment  
> varying wildly in size (1K..10M).
>
> To test replication I am running a server on my MBPro and another  
> under Ubuntu in VMWare on the same machine. I'm testing using a pure  
> trunk.
>
> Doing a pull-replicate from OSX to Linux fails to complete. The  
> point at which it fails is constant. I've added some debug logs into  
> couch_rep/attachment_loop like this: http://gist.github.com/112070  
> and made the suggested "couch_util:should_flush(1000)" mod to try  
> and guarantee progress (but to no avail). The debug output shows  
> this: http://gist.github.com/112069 and the document it seems to  
> fail on is this: http://gist.github.com/112074 . I'm only just  
> starting to look at this - any pointers would be appreciated.

I put some more logging in attachment_loop, specifically this:

         {ibrowse_async_response, ReqId, Data} ->
             ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data A ~p",
                 [Url]),
             receive {From, gimme_data} -> From ! {self(), Data} end,
             ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data B ~p",
                 [Url]),
             attachment_loop(ReqId);

The result of this is to see an enormous number of 'Data A' logs  
without a corresponding 'Data B'. This happens because  
make_attachment_stub_receiver uses a promise to read the data, created  
like this:

         ResponseCode >= 200, ResponseCode < 300 ->
             % the normal case
             Pid ! {self(), continue},
             %% this function goes into the streaming attachment code.
             %% It gets executed by the replication gen_server, so it can't
             %% be the one to actually receive the ibrowse data.
             {ok, fun() ->
                 Pid ! {self(), gimme_data},
                 receive {Pid, Data} -> Data end
             end};

It seems that the promise is forced (i.e. the data is read) only when  
the documents are checkpointed. If, as in my case, you have lots of  
small documents with many attachments, this results in massive numbers  
of open connections to download the attachments, each blocked reading  
the first bit of data from the first chunk, because the checkpointing  
occurs by default after 10MB of document data has been read, excluding  
attachments. In any case, purely using size as a trigger won't work if  
you have lots of small documents with lots of small attachments. It  
would seem that the checkpointing, and hence the forcing of the  
http-reading promises, needs to also account for the number of promises.

To overcome this problem I used couch_util:should_flush(1) to ensure  
that each document would be checkpointed, but that simply demonstrated  
that this isn't the cause of the 100% repeatable replication hang that  
I have. Now I get a log trace like this: http://gist.github.com/112512  
(ignoring the crap at the end of each log statement, which is my  
incomplete attempt to link each log to the associated url).

Anyone with any thoughts?

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

What can be done with fewer [assumptions] is done in vain with more
   -- William of Ockham (ca. 1285-1349)