Posted to dev@lucene.apache.org by "Marvin Humphrey (JIRA)" <ji...@apache.org> on 2007/01/23 20:41:49 UTC

[jira] Created: (LUCENE-783) Store all metadata in human-readable segments file

Store all metadata in human-readable segments file
--------------------------------------------------

                 Key: LUCENE-783
                 URL: https://issues.apache.org/jira/browse/LUCENE-783
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Marvin Humphrey
            Priority: Minor


Various index-reading components in Lucene need metadata in addition to data.
This metadata is presently stored in arbitrary binary headers and spread out
over several files.  We should move to concentrate it in a single file, and 
this file should be encoded using a human-readable, extensible, standardized 
data serialization language -- either XML or YAML.
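
As a rough illustration of the idea (not part of any existing Lucene format),
a segments file written in a small YAML subset could be a handful of flat
"key: value" lines.  The Java sketch below parses such a file.  The key names
(format, segment_name, doc_count, term_index_interval) and the class name
SegmentsMetaSketch are hypothetical, chosen only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SegmentsMetaSketch {
    // Hypothetical file contents; none of these keys come from Lucene itself.
    static final String SAMPLE =
        "# segments metadata (illustrative only)\n" +
        "format: 1\n" +
        "segment_name: _3\n" +
        "doc_count: 1200\n" +
        "term_index_interval: 128\n";

    /** Parses flat "key: value" lines; blank lines and '#' comments are skipped. */
    static Map<String, String> parse(String text) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Not a key/value line: " + line);
            }
            // Everything before the first colon is the key, the rest is the value.
            meta.put(trimmed.substring(0, colon).trim(),
                     trimmed.substring(colon + 1).trim());
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> meta = parse(SAMPLE);
        int docCount = Integer.parseInt(meta.get("doc_count"));
        System.out.println(meta.get("segment_name") + " holds " + docCount + " docs");
    }
}
```

Running main prints "_3 holds 1200 docs".  A reader that does not recognize a
key simply never looks it up in the map.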

* Making metadata human-readable makes debugging easier.  Centralizing it
  makes debugging easier still.  Developers benefit from being able to scan
  and locate relevant information quickly and with less debug printing.  Users
  get a new window through which to peer into the index structure.
* Since metadata is written to a separate file, there would no longer be a 
  need to seek back to the beginning of any data file to finish a header, 
  solving issue LUCENE-532.
* Special-case parsing code needed for extracting metadata supplied by 
  different index formats can be pared down.  If a value is no longer 
  necessary, it can just be ignored/discarded.
* Removing headers from the data files simplifies them and makes the file
  format easier to implement. 
* With headers removed, all or nearly all data structures can take the
  form of records stacked end to end, so that once a decoder has been
  selected, an iterator can read the file from top to tail.  To an extent,
  this allows us to separate our data-processing algorithms from our
  serialization algorithms, decoupling Lucene's code base from its file
  format.  For instance, instead of further subclassing TermDocs to deal with
  "flexible indexing" formats, we might replace it with a PostingList which
  returns a subclass of Posting.  The deserialization code would be wholly
  contained within the Posting subclass rather than spread out over several
  subclasses of TermDocs.
* YAML and XML are equally well suited for the task of storing metadata, 
  but in either case a complete parser would not be needed -- a small subset 
  of the language will do.  KinoSearch 0.20's custom-coded YAML parser 
  occupies about 600 lines of C -- not too bad, considering how miserable C's 
  string handling capabilities are. 
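
The PostingList/Posting idea above might be sketched in Java roughly as
follows.  This is only an illustration under stated assumptions: the record
layout (two ints per record), the DocFreqPosting subclass, and the idea of
passing the record count in from the central metadata file are all
hypothetical, and none of these classes exist in Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class PostingSketch {
    /** One record in a headerless data file; each subclass owns its decoding. */
    abstract static class Posting {
        int doc;
        abstract void read(DataInputStream in) throws IOException;
    }

    /** Hypothetical format: document number followed by a frequency. */
    static final class DocFreqPosting extends Posting {
        int freq;
        @Override
        void read(DataInputStream in) throws IOException {
            doc = in.readInt();
            freq = in.readInt();
        }
    }

    /** Iterates over records stacked end to end; knows nothing of their layout. */
    static final class PostingList {
        private final DataInputStream in;
        private final Posting posting;
        private int remaining;

        // recordCount is assumed to come from the separate metadata file,
        // so the data file itself needs no header.
        PostingList(DataInputStream in, Posting decoder, int recordCount) {
            this.in = in;
            this.posting = decoder;
            this.remaining = recordCount;
        }

        /** Decodes the next record into the Posting; false when exhausted. */
        boolean next() throws IOException {
            if (remaining == 0) {
                return false;
            }
            posting.read(in);
            remaining--;
            return true;
        }
    }

    /** Writes {doc, freq} pairs back to back, with no header. */
    static byte[] encode(int[][] records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int[] r : records) {
            out.writeInt(r[0]);
            out.writeInt(r[1]);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Sums the freq field over the first recordCount records. */
    static int totalFreq(byte[] data, int recordCount) throws IOException {
        DocFreqPosting p = new DocFreqPosting();
        PostingList list = new PostingList(
            new DataInputStream(new ByteArrayInputStream(data)), p, recordCount);
        int total = 0;
        while (list.next()) {
            total += p.freq;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode(new int[][] {{1, 3}, {4, 1}, {9, 2}});
        System.out.println(totalFreq(data, 3));
    }
}
```

Running main prints 6.  Supporting a new record format would mean adding
another Posting subclass; PostingList itself would not change.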

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-783) Store all metadata in human-readable segments file

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-783:
------------------------------------

    Assignee: Michael Busch


[jira] Assigned: (LUCENE-783) Store all metadata in human-readable segments file

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-783:
------------------------------------

    Assignee:     (was: Michael Busch)

I don't think I will be able to work on this anytime soon.

I still think this is a good idea.  Maybe someone else would like to take it?
