Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/07/01 11:42:49 UTC

[jira] Commented: (LUCENE-2373) Create a Codec to work with streaming and append-only filesystems

    [ https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884232#action_12884232 ] 

Michael McCandless commented on LUCENE-2373:
--------------------------------------------

This looks great, Andrzej!  It gives codecs full control over reading and writing SegmentInfo/s, so a Codec can now store any per-segment info it needs (e.g., hasProx, which is currently a hardwired bit in SegmentInfo but is really a codec-level detail).  The codec could probably return a subclass of SegmentInfo, private to it, to hold such extra info...

Maybe we should provide default impls for CodecProvider.getSegmentInfosReader/Writer?  (I.e., returning the Default impls.)

Also, should we factor out the "leave space for index pointer" step (out.writeLong(0)) to the subclass, along with the reading of that dirOffset?  Right now that reserved space is wasted for the appending codec...
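
To make the factoring-out idea concrete, here is a minimal sketch using only java.io and java.nio, not Lucene's actual classes.  The names PointerLayoutSketch, Layout, SeekingLayout and AppendingLayout are hypothetical, and an in-memory byte array stands in for the output file (patching the array stands in for seek + writeLong).  The point is only that the shared writer never calls writeLong(0) itself; the subclass decides whether space is reserved and how dirOffset is read back.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class PointerLayoutSketch {

    /** Hypothetical hook: the subclass decides where the 8-byte directory pointer lives. */
    public abstract static class Layout {
        /** Called right after the header; may reserve space for the pointer. */
        public abstract void reserve(DataOutputStream out) throws IOException;
        /** Called once dirStart is known; returns the final file bytes. */
        public abstract byte[] finish(byte[] written, int headerLength, long dirStart);
        /** Recovers dirStart from a finished file. */
        public abstract long readDirStart(byte[] file, int headerLength);
    }

    /** Seek-and-patch layout: reserve 8 bytes after the header, overwrite them at the end. */
    public static class SeekingLayout extends Layout {
        @Override public void reserve(DataOutputStream out) throws IOException {
            out.writeLong(0); // leave space for the pointer
        }
        @Override public byte[] finish(byte[] written, int headerLength, long dirStart) {
            // Stands in for out.seek(headerLength); out.writeLong(dirStart);
            ByteBuffer.wrap(written).putLong(headerLength, dirStart);
            return written;
        }
        @Override public long readDirStart(byte[] file, int headerLength) {
            return ByteBuffer.wrap(file).getLong(headerLength);
        }
    }

    /** Append-only layout: reserve nothing, write the pointer as the last 8 bytes. */
    public static class AppendingLayout extends Layout {
        @Override public void reserve(DataOutputStream out) {
            // Nothing to reserve, so no bytes are wasted after the header.
        }
        @Override public byte[] finish(byte[] written, int headerLength, long dirStart) {
            return ByteBuffer.allocate(written.length + 8).put(written).putLong(dirStart).array();
        }
        @Override public long readDirStart(byte[] file, int headerLength) {
            return ByteBuffer.wrap(file).getLong(file.length - 8);
        }
    }

    /** Shared write path: header, optional reserved slot, body, directory, then layout-specific finish. */
    public static byte[] write(Layout layout, byte[] header, byte[] body, byte[] directory) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(header);
            layout.reserve(out);
            out.write(body);
            long dirStart = out.size(); // offset at which the directory begins
            out.write(directory);
            out.flush();
            return layout.finish(bytes.toByteArray(), header.length, dirStart);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] header = {1, 2, 3, 4}, body = {5, 6, 7}, directory = {8, 9};
        Layout seeking = new SeekingLayout();
        Layout appending = new AppendingLayout();
        System.out.println(seeking.readDirStart(write(seeking, header, body, directory), header.length));
        System.out.println(appending.readDirStart(write(appending, header, body, directory), header.length));
    }
}
```

With this split, the appending variant writes nothing between the header and the body, which is the saving the question above is after.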


> Create a Codec to work with streaming and append-only filesystems
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2373
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Andrzej Bialecki 
>             Fix For: 4.0
>
>         Attachments: appending.patch
>
>
> Since the early 2.x days, Lucene has used a skip/seek/write trick to patch the length of the terms dict into a place near the start of the output data file. This, however, makes it impossible to use Lucene with append-only filesystems such as HDFS.
> In the post-flex trunk the following code in StandardTermsDictWriter initiates this:
> {code}
>     // Count indexed fields up front
>     CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT); 
>     out.writeLong(0);                             // leave space for end index pointer
> {code}
> and completes this in close():
> {code}
>       out.seek(CodecUtil.headerLength(CODEC_NAME));
>       out.writeLong(dirStart);
> {code}
> I propose to change this layout so that this pointer is simply stored at the end of the file. It's always 8 bytes long, and we know the final length of the file from Directory, so it takes a single additional seek(length - 8) to read it, which is not much considering the benefits.
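
The proposed trailer layout can be sketched in plain Java as follows.  This is an illustration under assumptions, not the attached patch: TrailerPointerSketch, writeFile and readDirStart are hypothetical names, and a byte array stands in for the file that Lucene would write through a Directory.  The writer only ever appends, and the reader finds the pointer at offset length - 8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class TrailerPointerSketch {

    /** Appends body, then directory, then the 8-byte directory offset; never seeks backwards. */
    public static byte[] writeFile(byte[] body, byte[] directory) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(body);
            long dirStart = out.size(); // offset at which the directory begins
            out.write(directory);
            out.writeLong(dirStart);    // trailer: always the last 8 bytes of the file
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the pointer from the last 8 bytes, the in-memory analogue of seek(length - 8). */
    public static long readDirStart(byte[] file) {
        return ByteBuffer.wrap(file).getLong(file.length - 8);
    }

    public static void main(String[] args) {
        byte[] file = writeFile(new byte[] {1, 2, 3, 4, 5}, new byte[] {9, 9});
        System.out.println(readDirStart(file)); // prints 5
    }
}
```

Both DataOutputStream.writeLong and ByteBuffer.getLong use big-endian byte order by default, so the two sides agree without extra configuration.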
