Posted to dev@lucene.apache.org by "Doron Cohen (JIRA)" <ji...@apache.org> on 2006/12/06 22:42:22 UTC

[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

     [ http://issues.apache.org/jira/browse/LUCENE-738?page=all ]

Doron Cohen updated LUCENE-738:
-------------------------------

    Attachment: del.dgap.patch.txt

Patch added: "del.dgap.patch.txt" for the above option "(1) writing d-gaps for ids of deleted docs".

The patch changes the index format but is backwards compatible.

I still need to update the FileFormats document - will add that part of the patch later.


> read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-738
>                 URL: http://issues.apache.org/jira/browse/LUCENE-738
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>         Attachments: del.dgap.patch.txt
>
>
> The .del file of a segment records which documents in that segment are deleted. The file exists only for segments that have deleted docs, so it does not exist for newly created segments (e.g. those resulting from a merge). Each time an index reader that deleted any document is closed, the .del file is rewritten. In fact, since the lock-less commits change, a new (generation of the) .del file is created on each such occasion.
> For small indexes the current situation is not a real problem. But for very large indexes, creating a new bit vector each time such an index reader is closed seems like unnecessary overhead when the bit vector is sparse (just a few docs were deleted). For instance, for an index with a segment of 1M docs, the sequence {open reader; delete 1 doc from that segment; close reader;} would write a file of ~128KB. Repeat this sequence 8 times and 8 new files with a total size of ~1MB are written to disk.
> Whether this is a bottleneck depends on the application's delete pattern, but when deleted docs are sparse, writing just the d-gaps would save space and time.
> I have this (simple) change to BitVector running and am currently running some performance tests to convince myself that it is worthwhile.
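
To make the d-gap idea in the description concrete, here is a minimal sketch. It is not the code in del.dgap.patch.txt. It assumes the deleted-doc ids are held in a java.util.BitSet and that each gap is written as a variable-length int (7 bits per byte). The class and method names (DGapSketch, writeVInt, encodeDGaps, decodeDGaps, denseSizeBytes) are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;

// Illustrative only: a hypothetical helper, not Lucene's BitVector.
final class DGapSketch {

    // Writes a non-negative int using 7 data bits per byte;
    // the high bit of a byte signals that more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes the ids of deleted docs as gaps between consecutive ids.
    // The first gap is measured from 0, so it equals the first deleted id.
    static byte[] encodeDGaps(BitSet deleted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
            writeVInt(out, doc - last);
            last = doc;
        }
        return out.toByteArray();
    }

    // Rebuilds the bit set by summing the gaps back into absolute doc ids.
    static BitSet decodeDGaps(byte[] data, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        int pos = 0;
        int last = 0;
        while (pos < data.length) {
            int gap = 0;
            int shift = 0;
            int b;
            do {
                b = data[pos++] & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            last += gap;
            bits.set(last);
        }
        return bits;
    }

    // Size in bytes of a dense bit vector with one bit per doc.
    static int denseSizeBytes(int maxDoc) {
        return (maxDoc + 7) / 8;
    }
}
```

Taking "1M docs" as 2^20 docs, denseSizeBytes(1 << 20) is 131072 bytes, which matches the ~128KB figure above, while a handful of deleted ids encode to a few bytes of gaps. A real on-disk format would also need a header (for example size and count) and some way for a reader to tell the two encodings apart. The sketch leaves that out.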

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org