You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@couchdb.apache.org by "Nathan Vander Wilt (JIRA)" <ji...@apache.org> on 2011/02/25 20:18:22 UTC

[jira] Created: (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

_all_docs performance degrades as doc_del_count increases
---------------------------------------------------------

                 Key: COUCHDB-1076
                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
             Project: CouchDB
          Issue Type: Bug
            Reporter: Nathan Vander Wilt


The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:

The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response

In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".

To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:

[10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
[10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
[10:31am] kocolosk: and then it can skip that part of the tree entirely
[10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
[10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
[10:31am] kocolosk: right
[10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
[10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
...
[10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
...
[10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
[10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
[10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Damien Katz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037682#comment-13037682 ] 

Damien Katz commented on COUCHDB-1076:
--------------------------------------

I haven't looked very closely at this patch line for line, so I can't say if there are bugs, etc. But I definitely agree with this approach and structure. Nice work.

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds closed COUCHDB-1076.
----------------------------------


> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Sub-task
>            Reporter: Nathan Vander Wilt
>            Assignee: Randall Leeds
>             Fix For: 1.1.1, 1.2
>
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038083#comment-13038083 ] 

Randall Leeds commented on COUCHDB-1076:
----------------------------------------

Thanks, Filipe and Damien.

Filipe:
1) I'd rather keep it in couch_db_updater. The other functions depending on (or generating) the reduction values for the by_id tree are in that module. It makes most sense to me to keep all the code that relies on that format in one place.

2) Cool. Easy fix. Good catch.

3) I'll see if I can add some trivial tests.

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-1076:
-----------------------------------

    Fix Version/s: 1.2
                   1.1.1
         Assignee: Randall Leeds

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>            Assignee: Randall Leeds
>             Fix For: 1.1.1, 1.2
>
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003752#comment-13003752 ] 

Randall Leeds commented on COUCHDB-1076:
----------------------------------------

I did the digging and I think I've got an elegant fix. A little more work to be done before I post a patch but I wanted to let you all know so we avoid duplicate work. If you've thought about, or are curious about, this or have partial work of your own, ping me on IRC (tilgovi).

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Filipe Manana (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037900#comment-13037900 ] 

Filipe Manana commented on COUCHDB-1076:
----------------------------------------

Looks good Randall, only 2 minor comments:

1) Do you have to add the skip function to couch_db_updater and export it if it's only used in couch_httpd_db.erl?

2) The skip function doesn't account for db reduction values created by 1.1.x and older releases, which are 2 tuple terms {NotDeleted, Deleted}; So either another clause should be added or simply do a  "Skip >= element(1, Reduction)"

Do you think there's a sane way do add some etap tests (to 020-btree-basics.t)? At least for the trivial cases it's easy to test (all branches skipped, or none of the branches is skipped)

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-1076:
-----------------------------------

    Issue Type: Sub-task  (was: Bug)
        Parent: COUCHDB-977

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Sub-task
>            Reporter: Nathan Vander Wilt
>            Assignee: Randall Leeds
>             Fix For: 1.1.1, 1.2
>
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds resolved COUCHDB-1076.
------------------------------------

    Resolution: Fixed

Fixed on trunk (r1152398) and 1.1.x (r1152406)

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Sub-task
>            Reporter: Nathan Vander Wilt
>            Assignee: Randall Leeds
>             Fix For: 1.1.1, 1.2
>
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037672#comment-13037672 ] 

Randall Leeds commented on COUCHDB-1076:
----------------------------------------

Got it. Came out very clean. If there are no complaints I'll commit this in a couple of days.
I haven't done the work to apply this to ddoc views yet, but it should be pretty easy.

Differences here: https://github.com/tilgovi/couchdb/compare/trunk...skiptree

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Work started] (COUCHDB-1076) _all_docs performance degrades as doc_del_count increases

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on COUCHDB-1076 started by Randall Leeds.

> _all_docs performance degrades as doc_del_count increases
> ---------------------------------------------------------
>
>                 Key: COUCHDB-1076
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1076
>             Project: CouchDB
>          Issue Type: Sub-task
>            Reporter: Nathan Vander Wilt
>            Assignee: Randall Leeds
>             Fix For: 1.1.1, 1.2
>
>
> The time required to query _all_docs?limit=1 can be proportional to the doc_del_count of the database depending on where the deleted docs are in the sort order. As kocolosk explained on IRC:
> The tree that serves as the source for _all_docs contains a small record for each document, deleted or not ... if you have a large number of deleted docs and their IDs are interspersed with the non-deleted ones i can imagine that it would cause additional seeks when streaming the _all_docs response
> In my use case (https://github.com/natevw/RQMS) and in other cases (e.g. a rolling log or any long-lived database), the deleted docs may all be at the beginning of the _all_docs view, making query performance end up like using "?limit=N&skip=doc_del_count".
> To improve the performance in cases where large blocks of documents have been deleted, kocolosk notes:
> [10:30am] kocolosk: the inner nodes in the btree currently report the doc_count and doc_del_count
> [10:31am] kocolosk: we might be able to rewrite the function that walks the btree so that it checks if the doc_count underneath an inner node is zero
> [10:31am] kocolosk: and then it can skip that part of the tree entirely
> [10:31am] kocolosk: instead of descending all the way to the leaf nodes and skipping deleted documents one by one
> [10:31am] n[ate]vw: yeah, that'd have the benefit of being a code-only change
> [10:31am] kocolosk: right
> [10:32am] kocolosk: i think it should work rather nicely, though it probably requires some interesting spelunking into the depths of couch_btree
> [10:33am] kocolosk: davisp: whaddya think?  would what i'm proposing be possible? (checking for doc_count=0 in the inner node reduction and then skipping ahead when serving _all_docs)
> ...
> [10:39am] davisp: n[ate]vw: Its definitely doable, its just a question of how to do it so that I don't bleed from the eyes when trying to read through the implementation
> ...
> [10:40am] rnewson: davisp: right. we definitely want to preserve the high readability of the current btree code. /s
> [10:40am] davisp: kocolosk: The only thing I can think of is passing a function to the view iteration that gets called to evaluate whether it should decend to a child node based on the key/reduction pair
> [10:41am] kocolosk: davisp: what you're proposing would also allow for (more) efficient implementation of skip

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira