Posted to dev@couchdb.apache.org by "Filipe Manana (JIRA)" <ji...@apache.org> on 2011/09/19 05:41:08 UTC

[jira] [Created] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

More efficient builtin filters _doc_ids and _design
---------------------------------------------------

                 Key: COUCHDB-1288
                 URL: https://issues.apache.org/jira/browse/COUCHDB-1288
             Project: CouchDB
          Issue Type: Improvement
            Reporter: Filipe Manana


The builtin _changes filters _doc_ids and _design have been available since CouchDB 1.1.0.
While they meet the expectations of applications and users, they're far from efficient for large databases.
The current implementation folds the entire seq btree and then filters entries by document ID, causing too much IO and busting caches. This makes replication by doc IDs less efficient than it could be.

The proposed patch avoids this by doing direct lookups in the ID btree for _doc_ids, and a ranged fold for _design.
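
The difference between the two strategies can be sketched in Python. This is only an illustrative model, not CouchDB code: the seq btree is modelled as a list of (seq, doc_id) pairs, the ID btree as a dict plus a sorted list of IDs, and all function names are hypothetical. Sorting the lookup results by sequence is an assumption made here so that both strategies return comparable output.

```python
from bisect import bisect_left

def changes_by_seq_scan(by_seq, wanted_ids):
    """Full fold: visit every (seq, doc_id) entry and keep the matching IDs."""
    wanted = set(wanted_ids)
    return [(seq, doc_id) for seq, doc_id in by_seq if doc_id in wanted]

def changes_by_id_lookup(by_id, wanted_ids):
    """Direct lookups: one index lookup per requested ID, no full scan."""
    hits = [(by_id[doc_id], doc_id) for doc_id in wanted_ids if doc_id in by_id]
    return sorted(hits)  # assumption: report hits in sequence order

def design_doc_changes(sorted_ids, by_id):
    """Ranged fold: start at the "_design/" prefix and stop once it ends."""
    prefix = "_design/"
    start = bisect_left(sorted_ids, prefix)
    hits = []
    for doc_id in sorted_ids[start:]:
        if not doc_id.startswith(prefix):
            break  # left the "_design/" range, nothing more to visit
        hits.append((by_id[doc_id], doc_id))
    return sorted(hits)
```

In this model the full fold touches every entry in the database, while the other two functions touch only the requested IDs or the contiguous "_design/" range.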

If there are no objections, I would also apply it to branch 1.2.x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108036#comment-13108036 ] 

Filipe Manana commented on COUCHDB-1288:
----------------------------------------

Thanks Bob.

If it's a separate issue, unrelated to any changes from this patch, it should go into a separate patch/ticket :)


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment: couchdb_1288_2.patch

Second version of the patch. For _doc_ids, the optimized code path is only triggered if the number of doc IDs is not greater than 100. This is to avoid loading too many full_doc_info records into memory, which can be big if the rev trees are long and/or have many branches.
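
As a rough illustration of that cutoff, the choice between the two code paths might look like the Python sketch below. The limit of 100 comes from the comment above; the constant and function names are hypothetical and do not come from the patch.

```python
MAX_DIRECT_LOOKUP_IDS = 100  # threshold stated in the ticket comment

def choose_strategy(doc_ids, limit=MAX_DIRECT_LOOKUP_IDS):
    """Pick the direct ID lookups only when the ID list is small enough."""
    if len(doc_ids) <= limit:  # "not greater than 100"
        return "id_lookup"
    return "seq_fold"  # fall back to folding the seq btree
```

With this rule a request for exactly 100 IDs still takes the optimized path, and a request for 101 falls back to the full fold.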


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment:     (was: couchdb_1288_2.patch)


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment: couchdb_1288_2.patch


[jira] [Resolved] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana resolved COUCHDB-1288.
------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2

Applied to trunk and branch 1.2.x


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment: couchdb_1288.patch


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment:     (was: couchdb_1288.patch)


[jira] [Commented] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107653#comment-13107653 ] 

Filipe Manana commented on COUCHDB-1288:
----------------------------------------

This still needs some small work for the continuous case and a test.


[jira] [Commented] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Bob Dionne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107757#comment-13107757 ] 

Bob Dionne commented on COUCHDB-1288:
-------------------------------------

Filipe,

  I started reviewing this and it looks good so far. There's an edge case we ran into the other day that @davisp and @kocolosk ran down. When you have `feed=continuous`, a heartbeat, and a filter function that fails often enough, the heartbeat timeout never triggers and no changes are sent. It's easy to reproduce; you can see how it's handled in fabric[1]. I can probably add it to this patch or open a second ticket if you prefer.

   Also, as an aside, `couch_changes:get_changes_timeout` is slightly awkward in the way heartbeat is handled. It appears to allow `heartbeat=true` and in that case defaults to the timeout in the config. That certainly does not agree with the documented semantics.

Cheers,

Bob


[1] https://github.com/cloudant/fabric/commit/f9eea28e62496afcb
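
One way to picture the heartbeat edge case described in this comment is the small Python model below. It is not the CouchDB or fabric implementation: the function, the (timestamp, doc_id) row format, and the HEARTBEAT marker are all assumptions made for illustration. The idea shown is that the heartbeat check also runs on rows the filter rejects, so a long run of rejected changes cannot starve the client of output.

```python
HEARTBEAT = "\n"  # hypothetical marker for a heartbeat written to the feed

def stream_changes(rows, accept, heartbeat_every):
    """Yield accepted doc IDs, plus a heartbeat when output has been idle.

    rows is an iterable of (timestamp, doc_id) pairs in processing order.
    """
    last_sent = 0.0
    for now, doc_id in rows:
        if accept(doc_id):
            last_sent = now
            yield doc_id
        elif now - last_sent >= heartbeat_every:
            # The filter rejected this row, but the client has heard
            # nothing for too long, so send a heartbeat anyway.
            last_sent = now
            yield HEARTBEAT
```

If the `elif` branch were missing, a feed whose filter rejects every row would produce no output at all, which matches the symptom reported above.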


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment: couchdb_1288_3.patch

Added patch with test case, including the case for continuous changes.


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment:     (was: couchdb_1288.patch)


[jira] [Updated] (COUCHDB-1288) More efficient builtin filters _doc_ids and _design

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Filipe Manana updated COUCHDB-1288:
-----------------------------------

    Attachment: couchdb_1288.patch
