Posted to dev@couchdb.apache.org by Sebastian Cohnen <se...@googlemail.com> on 2010/12/08 16:54:07 UTC

Re: svn commit: r1043461 - /couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl

Do I read this correctly: will two normal compaction runs take care of dupes in both _all_docs and _changes?

On 08.12.2010, at 16:48, kocolosk@apache.org wrote:

> Author: kocolosk
> Date: Wed Dec  8 15:48:52 2010
> New Revision: 1043461
> 
> URL: http://svn.apache.org/viewvc?rev=1043461&view=rev
> Log:
> Usort the infos during compaction to remove dupes, COUCHDB-968
> 
> This is not a bulletproof solution; it only removes dupes when
> they appear in the same batch of 1000 updates.  However, for dupes
> that show up in _all_docs the probability of that happening is quite
> high.  If the dupes are only in _changes a user may need to compact
> twice, once to get the dupes ordered together and a second time to
> remove them.
> 
> A more complete solution would be to trigger the compaction in "retry"
> mode, but this is significantly slower.
> 
> Modified:
>    couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl
> 
> Modified: couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl
> URL: http://svn.apache.org/viewvc/couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl?rev=1043461&r1=1043460&r2=1043461&view=diff
> ==============================================================================
> --- couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl (original)
> +++ couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl Wed Dec  8 15:48:52 2010
> @@ -775,7 +775,10 @@ copy_rev_tree_attachments(SrcDb, DestFd,
>         end, Tree).
> 
> 
> -copy_docs(Db, #db{fd=DestFd}=NewDb, InfoBySeq, Retry) ->
> +copy_docs(Db, #db{fd=DestFd}=NewDb, InfoBySeq0, Retry) ->
> +    % COUCHDB-968, make sure we prune duplicates during compaction
> +    InfoBySeq = lists:usort(fun(#doc_info{id=A}, #doc_info{id=B}) -> A =< B end,
> +        InfoBySeq0),
>     Ids = [Id || #doc_info{id=Id} <- InfoBySeq],
>     LookupResults = couch_btree:lookup(Db#db.fulldocinfo_by_id_btree, Ids),
> 
> 
> 
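
The batching caveat in the commit log above can be illustrated with a small model. This is a minimal Python sketch, not CouchDB code: DocInfo, usort_by_id, compact_pass and rewrite_seqs are hypothetical names, and the assumption that a pass rewrites each copied entry under the document's latest sequence number is a simplification made here only to show why a second pass can help.

```python
from typing import Dict, List, NamedTuple


class DocInfo(NamedTuple):
    """Hypothetical stand-in for a by-sequence index entry."""
    id: str
    seq: int


def usort_by_id(batch: List[DocInfo]) -> List[DocInfo]:
    """Sort a batch by id and keep only the first entry per id,
    mirroring what lists:usort/2 does with an id-only comparison."""
    out: List[DocInfo] = []
    for info in sorted(batch, key=lambda d: d.id):  # sorted() is stable
        if not out or out[-1].id != info.id:
            out.append(info)
    return out


def compact_pass(by_seq: List[DocInfo], batch_size: int = 1000) -> List[DocInfo]:
    """One simplified pass: walk entries in seq order and dedupe
    each batch independently of every other batch."""
    ordered = sorted(by_seq, key=lambda d: d.seq)
    copied: List[DocInfo] = []
    for start in range(0, len(ordered), batch_size):
        copied.extend(usort_by_id(ordered[start:start + batch_size]))
    return copied


def rewrite_seqs(copied: List[DocInfo], latest: Dict[str, int]) -> List[DocInfo]:
    """ASSUMPTION: model each copied entry as stored under the document's
    latest seq, which places duplicates next to each other."""
    return [DocInfo(d.id, latest[d.id]) for d in copied]


# Doc "a" appears twice, far enough apart to land in different batches
# when the batch size is shrunk to 2 for the demonstration.
entries = [DocInfo("a", 1), DocInfo("b", 2), DocInfo("c", 3), DocInfo("a", 4)]
latest_seq = {"a": 4, "b": 2, "c": 3}

first = compact_pass(entries, batch_size=2)
second = compact_pass(rewrite_seqs(first, latest_seq), batch_size=2)
print(len(first))   # 4: the dupe survives, the copies were in different batches
print(len(second))  # 3: the copies are now adjacent and share a batch
```

Even in this model a pair of adjacent duplicates can still straddle a batch boundary, so the second pass is likely, but not guaranteed, to remove them.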


Re: svn commit: r1043461 - /couchdb/branches/1.1.x/src/couchdb/couch_db_updater.erl

Posted by Adam Kocoloski <ko...@apache.org>.
With this patch applied, in ~99% of cases, yes.

Best,
Adam
