Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2013/03/04 11:45:36 UTC
[jira] [Commented] (LUCENE-4752) Merge segments to sort them

    [ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592107#comment-13592107 ] 

Adrien Grand commented on LUCENE-4752:
--------------------------------------

I think a very simple first step could be to have an experimental IndexWriterConfig option that tells IndexWriter to give SegmentMerger an atomic sorted view of the segments to merge (easy once LUCENE-3918 is committed) instead of the segments themselves (see the calls to SegmentMerger.add(SegmentReader) in IndexWriter.mergeMiddle). Initial segments would remain unsorted, but the big ones, which are the interesting ones for both index compression and early query termination, would be sorted.

It can seem inefficient to sort segments over and over but I don't think we should worry too much:
 - if we are merging "initial" segments (those created from IndexWriter.flush), they would be small so sorting/merging them would be fast?
 - if we are merging big segments, I think that the following reasons could make merging slower than a regular merge:
   1. computing the new doc ID mapping,
   2. random I/O access,
   3. not being able to use the specialized codec merging methods.

But if the segments to merge are already sorted, computing the new doc ID mapping could actually be fast (some sorting algorithms such as [TimSort|http://en.wikipedia.org/wiki/Timsort] run in O(n) when the input is a succession of sorted sequences), and the access patterns to the individual segments would be I/O cache-friendly (because each segment would be read sequentially). So I think this approach could be fast enough?
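
To illustrate the doc ID mapping step, here is a minimal standalone sketch. It is not Lucene code: the class name SortedMergeDocMap, the method computeDocMap, and the use of a single long sort key per document are all assumptions made for the example. It concatenates segments that are each already sorted and sorts the resulting doc IDs with Arrays.sort on an object array, which in the JDK is a stable sort adapted from TimSort and so benefits from the pre-sorted runs.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortedMergeDocMap {

    /**
     * Hypothetical helper: segmentKeys[i][j] is the sort key of document j in
     * segment i, and every segment is assumed to be sorted by key already.
     * Returns oldToNew, where oldToNew[g] is the new doc ID of the document
     * whose doc ID in the plain concatenation of the segments is g.
     */
    public static int[] computeDocMap(long[][] segmentKeys) {
        int total = 0;
        for (long[] segment : segmentKeys) {
            total += segment.length;
        }

        // Concatenate the segments: the input is now a succession of sorted runs.
        final long[] keys = new long[total];
        Integer[] order = new Integer[total];
        int g = 0;
        for (long[] segment : segmentKeys) {
            for (long key : segment) {
                keys[g] = key;
                order[g] = g;
                g++;
            }
        }

        // Arrays.sort(T[], Comparator) is stable and adapted from TimSort,
        // so it can exploit the runs that are already in order.
        Comparator<Integer> byKey = Comparator.comparingLong(i -> keys[i]);
        Arrays.sort(order, byKey);

        // order[newId] is the old doc ID; invert it to get old -> new.
        int[] oldToNew = new int[total];
        for (int newId = 0; newId < total; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        long[][] segments = { {10, 30, 50}, {20, 40}, {5, 60} };
        // prints [1, 3, 5, 2, 4, 0, 6]
        System.out.println(Arrays.toString(computeDocMap(segments)));
    }
}
```

Because the sort is stable, documents with equal keys keep their relative order from the concatenated view, which keeps the mapping deterministic.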
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment based on a configurable order.  This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together.  This often applies to documents near each other in time, but also spatially.
