Posted to dev@lucene.apache.org by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org> on 2012/03/20 17:13:38 UTC
[jira] [Resolved] (LUCENE-783) Store all metadata in human-readable segments file
[ https://issues.apache.org/jira/browse/LUCENE-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-783.
---------------------------------------
Resolution: Fixed
Actually I think SimpleText's SegmentInfosFormat does this well?
> Store all metadata in human-readable segments file
> --------------------------------------------------
>
> Key: LUCENE-783
> URL: https://issues.apache.org/jira/browse/LUCENE-783
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Marvin Humphrey
> Priority: Minor
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
>
> Various index-reading components in Lucene need metadata in addition to data.
> This metadata is presently stored in arbitrary binary headers and spread out
> over several files. We should move to concentrate it in a single file, and
> this file should be encoded using a human-readable, extensible, standardized
> data serialization language -- either XML or YAML.
> * Making metadata human-readable makes debugging easier. Centralizing it
>   makes debugging easier still. Developers benefit from being able to scan
>   and locate relevant information quickly and with less debug printing. Users
>   get a new window through which to peer into the index structure.
> * Since metadata is written to a separate file, there would no longer be a
>   need to seek back to the beginning of any data file to finish a header,
>   solving issue LUCENE-532.
> * Special-case parsing code needed for extracting metadata supplied by
>   different index formats can be pared down. If a value is no longer
>   necessary, it can just be ignored/discarded.
> * Removing headers from the data files simplifies them and makes the file
>   format easier to implement.
> * With headers removed, all or nearly all data structures can take the
>   form of records stacked end to end, so that once a decoder has been
>   selected, an iterator can read the file from top to tail. To an extent,
>   this allows us to separate our data-processing algorithms from our
>   serialization algorithms, decoupling Lucene's code base from its file
>   format. For instance, instead of further subclassing TermDocs to deal with
>   "flexible indexing" formats, we might replace it with a PostingList which
>   returns a subclass of Posting. The deserialization code would be wholly
>   contained within the Posting subclass rather than spread out over several
>   subclasses of TermDocs.
> * YAML and XML are equally well suited for the task of storing metadata,
>   but in either case a complete parser would not be needed -- a small subset
>   of the language will do. KinoSearch 0.20's custom-coded YAML parser
>   occupies about 600 lines of C -- not too bad, considering how miserable C's
>   string handling capabilities are.
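
To make the "small subset of the language" point concrete, here is a minimal Java sketch of a reader for flat "key: value" lines, roughly the simplest slice of YAML that could carry segment metadata. It is an illustration only, not Lucene's or KinoSearch's actual format: the class name SegmentMetadataSketch and the field names (segment, doc_count, future_field) are hypothetical, and nesting, quoting, and lists are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: reads a flat "key: value" subset of YAML. */
final class SegmentMetadataSketch {

    private SegmentMetadataSketch() {}

    /** Parses one "key: value" pair per line; blank lines and '#' comments are skipped. */
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Split on the first colon only, so values may themselves contain colons.
            int colon = line.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Not a 'key: value' line: " + rawLine);
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // All field names below are made up for illustration.
        String metadata = String.join("\n",
                "# hypothetical segment metadata",
                "segment: _0",
                "doc_count: 1200",
                "future_field: some value an older reader does not know about");

        Map<String, String> fields = parse(metadata);

        // A reader picks out the keys it understands and ignores the rest.
        int docCount = Integer.parseInt(fields.get("doc_count"));
        System.out.println(fields.get("segment") + " has " + docCount + " documents");
    }
}
```

Because the reader looks up only the keys it knows, an unrecognized entry such as future_field is carried along and ignored, which is the "ignored/discarded" behavior described in the list above.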