Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2012/05/12 08:31:51 UTC

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

    [ https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273858#comment-13273858 ] 

Robert Muir commented on LUCENE-4050:
-------------------------------------

I agree this is a total mess. We should really revisit how we handle:

# commit file (in my opinion this should just be a list of segments, and nothing more!)
  Currently SegmentInfos stores a ton of stuff beyond this: it keeps
  per-segment metadata in this file when it really should not.
# per-segment metadata. Here we have a lot of confusion between
  SegmentInfo and FieldInfo. It would be great for the codec to have more
  flexibility here, via abstract classes/interfaces+attributes or something
  that ensures it's lossless while still letting a codec add what it needs. Really,
  for the most part SegmentInfo is basically useless, since many values actually
  amount to "well, if you want to know this, then go look at the FieldInfos".
# actual commit strategy. We do a lot of funky stuff like writing fake bogus
  data, seeking backwards, etc. Why not just a normal atomic rename like
  any other computer program on the planet?
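
To make the third point concrete, here is a minimal sketch of the usual write-temp-file, sync, rename pattern. It uses plain java.nio.file rather than Lucene's Directory API, and the class name, the ".tmp" suffix and the one-segment-name-per-line content are all assumptions made up for illustration, not anything that exists in Lucene:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch, not Lucene code: publish a commit file via atomic rename.
final class AtomicCommitSketch {

    /** Writes the segment names to a temp file, syncs it, then renames it into place. */
    static Path commit(Path dir, String commitName, List<String> segmentNames) throws IOException {
        Path tmp = dir.resolve(commitName + ".tmp"); // assumed temp-name convention
        Path target = dir.resolve(commitName);

        // 1. Write the complete commit under a name that readers do not look for.
        Files.write(tmp, segmentNames, StandardCharsets.UTF_8);

        // 2. Force the bytes to stable storage before the file becomes visible.
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            channel.force(true);
        }

        // 3. Publish: a reader sees either no commit file or the complete one.
        //    Throws AtomicMoveNotSupportedException if the file system cannot do this.
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The sketch leaves out syncing the directory entry itself and cleaning up a stale temp file after a crash, both of which a real implementation would probably need.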

                
> Change SegmentInfos format to plain text
> ----------------------------------------
>
>                 Key: LUCENE-4050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>             Fix For: 4.0
>
>
> I propose to change the format of the SegmentInfos file (segments_NN) to use plain text instead of the current binary format.
> The SegmentInfos file represents a commit point, and it also declares which codecs were used for writing each of the segments that the commit point consists of. However, this is a chicken-and-egg situation: in theory the format of this file is customizable via Codec.getSegmentInfosFormat, but in practice we first have to discover which codec implementation wrote this file. So SegmentCoreReaders assumes a certain fixed binary layout for a preamble of this file that contains the codec name... and then the file is read again, only this time using the right Codec.
> This is ugly. Instead I propose to use a simple plain text format, either line-oriented properties or JSON, designed so that newer versions could easily extend it, and which wouldn't require any special Codec to read and parse. Consequently we could remove SegmentInfosFormat altogether, and instead add SegmentInfoFormat (notice the singular) to Codec to read single per-segment SegmentInfo-s in a codec-specific way. E.g. for the Lucene40 codec we could either add another file or extend the .fnm file (FieldInfos) to also contain this information.
> Then the plain text SegmentInfos would contain just the following information:
> * list of global files for this commit point (if any)
> * list of segments for this commit point, and their corresponding codec class names
> * user data map
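
For illustration only, a line-oriented variant of the proposed file could be parsed without any codec along the lines of the sketch below. The "file" / "segment" / "userdata" keywords, the whitespace-separated layout and the class name are assumptions invented for this example; the issue does not specify a concrete syntax:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene code: a codec-independent plain-text commit file.
final class PlainTextCommitSketch {
    final List<String> globalFiles = new ArrayList<>();
    final Map<String, String> segmentCodecs = new LinkedHashMap<>(); // segment name -> codec name
    final Map<String, String> userData = new LinkedHashMap<>();

    /** Parses lines such as "segment _0 Lucene40" or "userdata key value". */
    static PlainTextCommitSketch parse(List<String> lines) {
        PlainTextCommitSketch commit = new PlainTextCommitSketch();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = trimmed.split("\\s+", 3);
            switch (parts[0]) {
                case "file":
                    require(parts, 2, trimmed);
                    commit.globalFiles.add(parts[1]);
                    break;
                case "segment":
                    require(parts, 3, trimmed);
                    commit.segmentCodecs.put(parts[1], parts[2]);
                    break;
                case "userdata":
                    require(parts, 3, trimmed);
                    commit.userData.put(parts[1], parts[2]);
                    break;
                default:
                    // Unknown keywords are ignored so that newer versions can extend the format.
                    break;
            }
        }
        return commit;
    }

    private static void require(String[] parts, int count, String line) {
        if (parts.length < count) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
    }
}
```

A reader would parse this file first and then look up each segment's codec by name to read the per-segment metadata. Ignoring unknown keywords is one possible way to get the forward extensibility the description asks for.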
