Posted to notifications@couchdb.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/07/16 00:06:20 UTC

[jira] [Commented] (COUCHDB-3061) Decrease search time for deeply buried headers

    [ https://issues.apache.org/jira/browse/COUCHDB-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380335#comment-15380335 ] 

ASF GitHub Bot commented on COUCHDB-3061:
-----------------------------------------

GitHub user jaydoane opened a pull request:

    https://github.com/apache/couchdb-couch/pull/185

    3061 adaptive header search

    The current find_header algorithm performs one file read per block,
    which can be inefficient when the most recent header is buried deeply in
    a large file.
    
    With default config params, this change essentially keeps the current
    behavior, reading each block into memory one at a time until a header is
    found or the beginning of the file is reached. However, by setting new
    config params, it's possible to exponentially increase the number of
    blocks read into a "chunk" of memory with a single read, up to a
    configurable limit.
    
    For example, the following settings begin with a chunk size of one
    block and then double the chunk size on each additional backward search
    step, up to a maximum of 16MB (4096*4096 bytes):
    
    [couchdb]
    chunk_max_size = 4096*4096
    chunk_exponent_base = 2
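
    A minimal sketch (not the code in this PR; the chunk_growth module is
    purely illustrative) of the chunk growth implied by the settings above,
    assuming CouchDB's 4096-byte block size:
    
    %% Sketch only: chunk sizes used on successive backward search steps
    %% with chunk_exponent_base = 2 and chunk_max_size = 4096*4096.
    -module(chunk_growth).
    -export([sizes/1]).
    
    sizes(Steps) ->
        sizes(4096, 2, 4096 * 4096, Steps).
    
    sizes(_Size, _Base, _Max, 0) ->
        [];
    sizes(Size, Base, Max, Steps) ->
        [Size | sizes(min(Size * Base, Max), Base, Max, Steps - 1)].
    
    For instance, chunk_growth:sizes(8) returns [4096, 8192, 16384, 32768,
    65536, 131072, 262144, 524288]; once the cap is reached, every further
    read covers 16MB of blocks in a single operation.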
    
    Measurements for a 12GB .couch.meta file with its last header at 4GB
    show a speed improvement of 17x (server) to 27x (laptop).
    
    COUCHDB-3061

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloudant/couchdb-couch 3061-adaptive-header-search

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/couchdb-couch/pull/185.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #185
    
----
commit 1c3db513a029351af6af26b4dd1da1f72f978789
Author: Jay Doane <ja...@gmail.com>
Date:   2016-07-14T02:05:56Z

    Rename function and variables:
    
    - Drop "calculate_" prefix since all functions perform calculations
    - Change "_len" to "_size" for better consistency
    - Change TotalBytes to TotalSize for better consistency

commit abf1e76fa996327bb52ac038b3d4520662cbab15
Author: Jay Doane <ja...@gmail.com>
Date:   2016-07-15T23:29:02Z

    Implement adaptive header search
    
    The current find_header algorithm performs one file read per block,
    which can be inefficient when the most recent header is buried deeply in
    a large file.
    
    With default config params, this change essentially keeps the current
    behavior, reading each block into memory one at a time until a header is
    found or the beginning of the file is reached. However, by setting new
    config params, it's possible to exponentially increase the number of
    blocks read into a "chunk" of memory with a single read, up to a
    configurable limit.
    
    For example, the following settings begin with a chunk size of one
    block and then double the chunk size on each additional backward search
    step, up to a maximum of 16MB (4096*4096 bytes):
    
    [couchdb]
    chunk_max_size = 4096*4096
    chunk_exponent_base = 2
    
    Measurements for a 12GB .couch.meta file with its last header at 4GB
    show a speed improvement of 17x (server) to 27x (laptop).
    
    COUCHDB-3061

----


> Decrease search time for deeply buried headers
> ----------------------------------------------
>
>                 Key: COUCHDB-3061
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-3061
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>            Reporter: Jay Doane
>            Assignee: Jay Doane
>
> When a db compaction is interrupted and restarted, it's possible for the most recent header written to the .compact.meta file to be buried many GB from the end of the file. For example, an interrupted compaction left the following files:
> {code}
> -rw-r--r-- 1 dbcore dbcore 47G Jun 27 04:29 /srv/db/shards/40000000-7fffffff/opendi/yellow-ng-loadbalancer.1461156255.couch.compact.data
> -rw-r--r-- 1 dbcore dbcore 12G Jun 27 05:20 /srv/db/shards/40000000-7fffffff/opendi/yellow-ng-loadbalancer.1461156255.couch.compact.meta
> {code}
> but
> {code}
> # grep -abo db_header yellow-ng-loadbalancer.1461156255.couch.compact.meta | tail -1
> 4364894251:db_header
> {code}
> which means the current algorithm must search through about 8GB of the file (the 12GB file size minus the ~4.06GB header offset) before it reaches a header. I measured how long that takes on both an SSD laptop and an SSD server, and got similar numbers:
> {code}
> (dbcore@db3.testy012.cloudant.net)18> timer:tc(couch_file,test_find_header,["/srv/jdoane/yellow-ng-loadbalancer.1461156255.couch.compact.meta", default]).
> {328335338, ...
> (node1@127.0.0.1)27> timer:tc(couch_file,test_find_header,["/Users/jay/proj/ibm/sample-data/yellow-ng-loadbalancer.1461156255.couch.compact.meta", default]).
> {426650530, ...
> {code}
> which is 328-427 seconds, or roughly 19-25 MB/s.
> One reason for this relative slowness is that the current algorithm performs a disk read for every block it searches:
> https://github.com/apache/couchdb-couch/blob/master/src/couch_file.erl#L537-L539
> We can improve the speed by loading a "chunk" of many blocks into memory with a single read operation and then searching each block in memory, trading memory for speed. Ideally, the tradeoff can be made configurable, so that the existing speed/memory behavior is retained by default.
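> A rough sketch of that approach (illustrative only, not the eventual patch; the chunked_search module name and the ScanFun callback are assumptions): read a whole chunk of blocks with a single pread, scan it in memory, and grow the chunk size on each further step backward:
> {code}
> %% Sketch only, not couch_file code. ScanFun is assumed to scan an
> %% in-memory chunk (newest block first) and return {ok, Header} or
> %% not_found. EndPos and ChunkSize are assumed to be multiples of the
> %% 4096-byte block size.
> -module(chunked_search).
> -export([find_header/5]).
>
> find_header(_Fd, 0, _ChunkSize, _MaxChunk, _ScanFun) ->
>     no_valid_header;
> find_header(Fd, EndPos, ChunkSize, MaxChunk, ScanFun) ->
>     Pos = max(EndPos - ChunkSize, 0),
>     {ok, Chunk} = file:pread(Fd, Pos, EndPos - Pos),
>     case ScanFun(Chunk, Pos) of
>         {ok, Header} ->
>             {ok, Header};
>         not_found ->
>             %% One disk read covered ChunkSize bytes; grow the next read
>             %% geometrically, capped at MaxChunk.
>             find_header(Fd, Pos, min(ChunkSize * 2, MaxChunk), MaxChunk, ScanFun)
>     end.
> {code}
> With a 16MB cap, the ~8GB scan above needs on the order of 500 large reads instead of roughly two million single-block reads.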



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)