Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2009/03/23 03:24:54 UTC

Segment

Greets,

In another thread, I described a proposed "Snapshot" class.  The goal we're
working towards is pluggable index reading/writing via Architecture and
DataReader/DataWriter.  The constructor for the prototype implementation of
DataWriter I've worked up in KS takes three arguments: a Snapshot, a
PolyReader (analogous to Lucene's MultiSegmentReader)... and a "Segment".

The Segment class has three main responsibilities:

  * Keep track of how many documents are in the segment (not counting
    deletions).
  * Maintain per-segment field-name-to-field-number associations.
  * Write the "segmeta" file, which stores arbitrary metadata.

The Segment's doc count is used both at index time...

    void
    SegWriter_add_doc(SegWriter *self, Doc *doc)
    {
        i32_t doc_num = Seg_Increment_Doc_Count(self->segment, 1);
        Inverter_Invert_Doc(self->inverter, doc);
        SegWriter_Add_Inverted_Doc(self, self->inverter, doc_num);
    }

... and at search-time:

    i32_t
    SegReader_doc_max(SegReader *self)
    {
        return Seg_Get_Doc_Count(self->segment);
    }

In Lucene, field-name-to-field-number mappings are the province of the
FieldInfos class, which also tracks field characteristics such as "isStored".
Lucy uses global field semantics, though, so there's no need for per-segment
field specs.
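
To make the field-number bookkeeping concrete, here is a minimal sketch of a
per-segment name-to-number table.  It is not the KS code: FieldMap,
FieldMap_init, FieldMap_num and FieldMap_add are hypothetical names, and the
fixed-size array is an assumption made to keep the example short.  Slot 0
holds the empty string so that 0 can mean "unknown field", which is
presumably why "field_names" in the segmeta example starts with "".

```c
#include <string.h>

#define FIELDMAP_MAX 64   /* arbitrary capacity for this sketch */

typedef struct FieldMap {
    const char *names[FIELDMAP_MAX];  /* names[n] is the name of field n */
    int         count;                /* slots in use, including slot 0 */
} FieldMap;

/* Reserve slot 0 for the empty string so that 0 can signal "unknown". */
static void
FieldMap_init(FieldMap *self)
{
    self->names[0] = "";
    self->count    = 1;
}

/* Return the number for `name`, or 0 if this segment hasn't seen it. */
static int
FieldMap_num(const FieldMap *self, const char *name)
{
    for (int i = 1; i < self->count; i++) {
        if (strcmp(self->names[i], name) == 0) { return i; }
    }
    return 0;
}

/* Return the existing number for `name`, assigning the next free one if
 * needed.  Returns 0 for an empty name or a full table.  The table stores
 * the pointer, so `name` must outlive the FieldMap. */
static int
FieldMap_add(FieldMap *self, const char *name)
{
    int num;
    if (name[0] == '\0')              { return 0; }
    num = FieldMap_num(self, name);
    if (num != 0)                     { return num; }
    if (self->count >= FIELDMAP_MAX)  { return 0; }
    self->names[self->count] = name;
    return self->count++;
}
```

Adding "title" and then "category" to a fresh table yields 1 and 2; adding
"title" again returns 1 rather than creating a duplicate.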

The "segmeta" file is used to store both metadata needed by Segment itself and
metadata belonging to other index components:

    {
       "lexicon" : { 
          "counts" : { 
             "content" : "20576"
          },  
          "format" : "2",
          "index_counts" : { 
             "content" : "161"
          }   
       },  
       "postings" : { 
          "format" : "1" 
       },  
       "records" : { 
          "format" : "1" 
       },  
       "segmeta" : { 
          "doc_count" : "11054",
          "field_names" : [ 
             "", 
             "title",
             "category",
             "content",
             "url"
          ],  
          "format" : "1",
          "name" : "seg_3"
       },  
       "term_vectors" : { 
          "format" : "1" 
       }   
    }

Providing a place for plugin indexing components to store arbitrary metadata
relieves them from the responsibility for writing and parsing metadata
themselves.  In Lucene, metadata classes such as FieldInfos have their own
binary file formats and maintain their own parsing routines, bloating the
Lucene file format documentation and adding maintenance overhead.  While
binary formats are necessary for bulk data, for small amounts of metadata they
hinder bare-eye browsing and provide no significant performance advantage.

In some sense Segment is similar to the Lucene class SegmentInfo.  For
example, both of them store format version data; however, Segment is only
aware of its own format, and it is up to individual plugins to track their own
format versions and adjust behavior as needed.  SegmentInfo is tightly bound
to other Lucene classes because it knows too much about them, hindering
extensibility; Segment, while capable of storing much more data than
SegmentInfo since it uses generic scalar-list-mapping data structures, knows
nothing about any of the plugin components that access that data.

Prototype code:

  http://tinyurl.com/proto-seg-bp
  http://tinyurl.com/proto-seg-c

HTML presentation of public API documentation for Perl binding:

  http://tinyurl.com/seg-dev-docs

Marvin Humphrey

Re: Segment

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sat, Mar 28, 2009 at 10:54 PM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> On Wed, Mar 25, 2009 at 07:34:01AM -0400, Michael McCandless wrote:
>> >> What does "incremented" mean?
>> >
>> > It means that the caller has to take responsibility for one refcount.  Usually
>> > you'll see that on constructors and factory methods.
>> >
>> > Having "incremented" as part of the method/function signature makes it easier
>> > to autogenerate binding code that doesn't make refcounting errors and leak
>> > memory.
>>
>> OK got it.  It's like when Python's docs say "returns a reference".
>> It's great to make this a "formal" part of the API.
>
> I'm pretty sure you grok this already but for clarity's sake: this is
> Boilerplater syntax -- so it's a "formal" part of an *internal* API.

Yeah got it.

> Even though Boilerplater is a very small language, I was deeply reluctant to
> write it.  Naturally I hate all programming languages and I have fantasies of
> replacing C with something "better" :) -- but I recognize the challenges that
> language authors face and have no desire to expose Boilerplater outside of
> Lucy.  It's just a means to an end.

Yes all languages have their problems.  Our species hasn't quite
figured out the best way to program these computers just yet...

> The C API docs -- which I expect we'll autogenerate from the .bp source files
> just as I'm currently generating Perl POD docs from .bp files -- will probably
> be HTML files and will say "returns a new reference" or "returns a borrowed
> reference" just like the Python docs.

Sounds good.

>> Instead of having a bunch of version constants at the top of a class
>> (eg FieldsReader.java), we'd invoke the "Versions.add(...)"  to create
>> each version.
>
> Where would we keep track of the registrations?  Will each DataReader subclass
> keep a class Hash variable?
>
>  static Hash* versions = NULL;
>
>  static void
>  S_init_versions_hash()
>  {
>      versions = Hash_new(2);
>      Hash_Store_Str(versions, "1", 1, CB_newf("initial format"));
>      Hash_Store_Str(versions, "2", 1, CB_newf("fixed stoopid mistake"));
>  }
>
>  Hash*
>  LexWriter_versions(LexWriter *self)
>  {
>      UNUSED_VAR(self);
>      if (!versions) { S_init_versions_hash(); }
>      return versions;
>  }
>
> Actually that'll leak memory without an atexit() or something like that, but
> you get the idea.

What does UNUSED_VAR(self) do?

Yes, I think the registrations'd be stored only in memory, but I
wasn't picturing you'd interact w/ a hash directly; I thought a
Versions class that holds the hash, and you statically instantiate
Versions and call "add" to store your versions.  Then you consult that
instance to get latest() (used when writing), to check a version
number, for transparency when writing comments into the JSON, etc.
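
Very roughly, and only as a sketch of the idea rather than a worked-out API:
the names below (Versions, Versions_add, Versions_latest, Versions_describe)
are made up for illustration, and a real version would presumably sit on top
of Lucy's Hash rather than a fixed array.

```c
#include <stddef.h>

#define VERSIONS_MAX 32   /* arbitrary capacity for this sketch */

typedef struct Versions {
    /* Indexed by version number; slot 0 is unused. */
    const char *descriptions[VERSIONS_MAX + 1];
    int         latest;   /* 0 until something is registered */
} Versions;

/* Register the next format version and return its number.  Numbers are
 * handed out sequentially, so nobody adds one to an int by hand.
 * Returns 0 if the table is full. */
static int
Versions_add(Versions *self, const char *description)
{
    if (self->latest >= VERSIONS_MAX) { return 0; }
    self->latest++;
    self->descriptions[self->latest] = description;
    return self->latest;
}

/* The version to use when writing. */
static int
Versions_latest(const Versions *self)
{
    return self->latest;
}

/* Description for a version, or NULL if it was never registered. */
static const char*
Versions_describe(const Versions *self, int version)
{
    if (version < 1 || version > self->latest) { return NULL; }
    return self->descriptions[version];
}
```

A statically allocated Versions starts out zeroed, so a component could
declare one at file scope and call Versions_add() once per format during
initialization.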

>> Introspection/transparency is the primary reason I can think of --
>> it's the same motivation that led you to JSON over private binary.
>> Ie, it'd be great to see a string description of what "format: '2'"
>> means; eg if each int has a known corresponding description, you could
>> add a comment on that line of the JSON.
>>
>> And, in the source code, we of course assign symbolic names to these
>> constants anyway.
>>
>> Also, having an explicit method call to "add" a new version avoids
>> silly risks that when adding a new version someone messes up adding
>> one to the int :) Or, messes up keeping track of the latest format
>> (the format that's written).  It may help with the back compat unit
>> tests, too, ensuring that each supported version is tested.
>>
>> I guess it's a matter of where do you draw the line b/w browseability
>> of your JSON metadata vs "you must pull in an external tool to get
>> more details".
>
> OK, I'm cool with this so long as we can come up with a sensible API.

Yeah I haven't fleshed out a full API just yet...

> There are no performance implications or significant shared-object-bloat issues.
>
>> You are needing to bring online a scary amount of basic
>> infrastructure (GC, exception handling, object vtables, etc.) just to
>> get the ball rolling.
>
> True to an extent, but there's a huge payoff: the actual search code -- where
> the rubber hits the road -- is only marginally harder to follow than Java.

I agree.  This is simply the ante for the game you want to play, here.

Mike

Re: Segment

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Wed, Mar 25, 2009 at 07:34:01AM -0400, Michael McCandless wrote:
> >> What does "incremented" mean?
> >
> > It means that the caller has to take responsibility for one refcount.  Usually
> > you'll see that on constructors and factory methods.
> >
> > Having "incremented" as part of the method/function signature makes it easier
> > to autogenerate binding code that doesn't make refcounting errors and leak
> > memory.
> 
> OK got it.  It's like when Python's docs say "returns a reference".
> It's great to make this a "formal" part of the API.

I'm pretty sure you grok this already but for clarity's sake: this is
Boilerplater syntax -- so it's a "formal" part of an *internal* API.

Even though Boilerplater is a very small language, I was deeply reluctant to
write it.  Naturally I hate all programming languages and I have fantasies of
replacing C with something "better" :) -- but I recognize the challenges that
language authors face and have no desire to expose Boilerplater outside of
Lucy.  It's just a means to an end.

The C API docs -- which I expect we'll autogenerate from the .bp source files
just as I'm currently generating Perl POD docs from .bp files -- will probably
be HTML files and will say "returns a new reference" or "returns a borrowed
reference" just like the Python docs.

> Instead of having a bunch of version constants at the top of a class
> (eg FieldsReader.java), we'd invoke the "Versions.add(...)"  to create
> each version.

Where would we keep track of the registrations?  Will each DataReader subclass
keep a class Hash variable?

  static Hash* versions = NULL;

  static void
  S_init_versions_hash()
  {
      versions = Hash_new(2);
      Hash_Store_Str(versions, "1", 1, CB_newf("initial format"));
      Hash_Store_Str(versions, "2", 1, CB_newf("fixed stoopid mistake"));
  }

  Hash*
  LexWriter_versions(LexWriter *self)
  {
      UNUSED_VAR(self);
      if (!versions) { S_init_versions_hash(); }
      return versions;
  }

Actually that'll leak memory without an atexit() or something like that, but
you get the idea.

> Introspection/transparency is the primary reason I can think of --
> it's the same motivation that led you to JSON over private binary.
> Ie, it'd be great to see a string description of what "format: '2'"
> means; eg if each int has a known corresponding description, you could
> add a comment on that line of the JSON.
> 
> And, in the source code, we of course assign symbolic names to these
> constants anyway.
> 
> Also, having an explicit method call to "add" a new version avoids
> silly risks that when adding a new version someone messes up adding
> one to the int :) Or, messes up keeping track of the latest format
> (the format that's written).  It may help with the back compat unit
> tests, too, ensuring that each supported version is tested.
> 
> I guess it's a matter of where do you draw the line b/w browseability
> of your JSON metadata vs "you must pull in an external tool to get
> more details".

OK, I'm cool with this so long as we can come up with a sensible API.

There are no performance implications or significant shared-object-bloat issues.

> You are needing to bring online a scary amount of basic
> infrastructure (GC, exception handling, object vtables, etc.) just to
> get the ball rolling.

True to an extent, but there's a huge payoff: the actual search code -- where
the rubber hits the road -- is only marginally harder to follow than Java.

Marvin Humphrey

Re: Segment

Posted by Michael McCandless <lu...@mikemccandless.com>.

Marvin Humphrey <ma...@rectangular.com> wrote:

>> >    public incremented Hash*
>> >    Metadata(DataWriter *self);
>
>> What does "incremented" mean?
>
> It means that the caller has to take responsibility for one refcount.  Usually
> you'll see that on constructors and factory methods.
>
> Having "incremented" as part of the method/function signature makes it easier
> to autogenerate binding code that doesn't make refcounting errors and leak
> memory.

OK got it.  It's like when Python's docs say "returns a reference".
It's great to make this a "formal" part of the API.

>> Looks good, though, I might add a way for a given module to register
>> the versions it reads & writes (presumably it only writes the most
>> recent one); then min/max can be derived based on what was registered.
>
> I thought about something like that.  It's more awkward, though, and I'm not
> sure how much it buys us.  I think the common case would be to drop support
> for versions below a certain minimum and to support anything later.  In the
> event that your DataReader really does support a discontiguous set of
> versions, you can just do extra error checking yourself.
>
> Even though DataReader is an advanced class, we should still value simplicity
> and try to make it as easy to use as possible.

Agreed, though I don't think recording int <-> string adds much
complexity to the impl nor the API.  That mapping need not be
recorded permanently anywhere (except in the source code).

Instead of having a bunch of version constants at the top of a class
(eg FieldsReader.java), we'd invoke the "Versions.add(...)"  to create
each version.

>>  This can be useful for introspection too, so instead of just seeing
>> "format 2" something could decode that to the string describing what
>> format 2 was (eg "added omitTermFreqAndPositions capability").
>
> So, the advantage would be that we could throw more meaningful error messages?
>
> The thing is, I'm not sure how useful it is to tell the user what kind of
> change occurred at "format 2".  How would that help them to recover?
>
> There's also Luke-style index browsing.  But there's only so much screen
> space, and I can't see how that info has utility compared to other things that
> Luke can show you.
>
> It seems to me that that kind of thing belongs in the plugin class
> documentation.  Am I missing another important runtime application?

Introspection/transparency is the primary reason I can think of --
it's the same motivation that led you to JSON over private binary.
Ie, it'd be great to see a string description of what "format: '2'"
means; eg if each int has a known corresponding description, you could
add a comment on that line of the JSON.

And, in the source code, we of course assign symbolic names to these
constants anyway.

Also, having an explicit method call to "add" a new version avoids
silly risks that when adding a new version someone messes up adding
one to the int :) Or, messes up keeping track of the latest format
(the format that's written).  It may help with the back compat unit
tests, too, ensuring that each supported version is tested.

I guess it's a matter of where do you draw the line b/w browseability
of your JSON metadata vs "you must pull in an external tool to get
more details".

> We can throw exceptions that belong to meaningful classes without too much
> difficulty.  We just can't set up try-catch-finally.
>
> But that's not a big deal.  We can just set most things up to check return
> values, and throw fatal errors when necessary.

OK

>> > However, we could create full-fledged exception objects for Lucy, so that THROW
>> > calls might look something like this:
>> >
>> >    THROW(Err_data_component_version, /* <--- An integer error id */
>> >        "Format version '%i32' is less than the minimum "
>> >        "supported version '%i32' for %o", format, min,
>> >        DataReader_Get_Class_Name(self));
>> >
>> > The exception objects generated by THROW calls do not have to subclass
>> > Lucy::Obj, because we will always be returning control to the host.  So, they
>> > could be, for example, plain old Java Exception subclasses.
>>
>> What would THROW try to do, and, how?
>
> The Lucy core code would format an error message and choose an error number
> from a list of Lucy error codes.  A stack trace would be great, too, though
> that's hard to do portably.
>
> Then it would call a method which would have to be implemented per-Host.

OK so the primary goal of THROW is to cross the bridge back to the
host language and throw the exception there.

> For Java, the implementation might contain something like this:
>
>  if (errorNumber == lucy_Err_data_component_version) {
>    throw new DataComponentVersionException(message);
>  }
>  else if (...) {
>
> I should also mention that THROW would be a macro, as implied by the all-caps.
> It would call the function lucy_Err_throw_at, automatically inserting line and
> function name information when possible:
>
>  #ifdef CHY_HAS_VARIADIC_MACROS
>    void
>    lucy_Err_throw_at(const char *file, int line, const char *func,
>                      const char *pattern, ...);
>    #ifdef CHY_HAS_ISO_VARIADIC_MACROS
>      #define LUCY_THROW(...) \
>        lucy_Err_throw_at(__FILE__, __LINE__, LUCY_ERR_FUNC_MACRO, \
>                         __VA_ARGS__)
>
> Some compilers don't support variadic macros, though (cough cough MSVC cough),
> so we have to omit the context data and define THROW as a variadic function.
>
>    void
>    LUCY_THROW(const char *pattern, ...);
>
> How about "Lucy::Util::Err" for the exception handling code?  I've been trying
> to avoid things like "String", "Array", "Exception" and such so that we don't
> conflict with core host symbols -- hence the funny names like "CharBuf" and
> "VArray".

Sounds good.  You are needing to bring online a scary amount of basic
infrastructure (GC, exception handling, object vtables, etc.) just to
get the ball rolling.

Mike

Re: Segment

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Tue, Mar 24, 2009 at 03:54:32PM -0400, Michael McCandless wrote:
> Marvin Humphrey <ma...@rectangular.com> wrote:

> >    public incremented Hash*
> >    Metadata(DataWriter *self);

> What does "incremented" mean?

It means that the caller has to take responsibility for one refcount.  Usually
you'll see that on constructors and factory methods.

Having "incremented" as part of the method/function signature makes it easier
to autogenerate binding code that doesn't make refcounting errors and leak
memory.
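
Here is a stripped-down illustration of the distinction, assuming a toy
refcounted object.  Obj, Holder and the function names are stand-ins invented
for this example; they are not the Lucy or KS API.

```c
#include <stdlib.h>

typedef struct Obj {
    unsigned refcount;
} Obj;

/* "incremented": the caller receives one refcount and must release it. */
static Obj*
Obj_new(void)
{
    Obj *self = (Obj*)malloc(sizeof(Obj));
    if (self) { self->refcount = 1; }
    return self;
}

static Obj*
Obj_inc_ref(Obj *self)
{
    self->refcount++;
    return self;
}

/* Drop one refcount, freeing at zero.  Returns the remaining count. */
static unsigned
Obj_dec_ref(Obj *self)
{
    unsigned remaining = --self->refcount;
    if (remaining == 0) { free(self); }
    return remaining;
}

typedef struct Holder {
    Obj *member;
} Holder;

/* Not "incremented": a borrowed reference.  The Holder still owns it and
 * the caller must not release it. */
static Obj*
Holder_get_member(Holder *self)
{
    return self->member;
}

/* "incremented": the refcount is bumped on the caller's behalf, so the
 * caller owes one Obj_dec_ref(). */
static Obj*
Holder_fetch_member(Holder *self)
{
    return Obj_inc_ref(self->member);
}
```

Generated binding code can key off the "incremented" marker: for
Holder_fetch_member() it has to arrange a matching decrement, while for
Holder_get_member() it must not.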

> Looks good, though, I might add a way for a given module to register
> the versions it reads & writes (presumably it only writes the most
> recent one); then min/max can be derived based on what was registered.

I thought about something like that.  It's more awkward, though, and I'm not
sure how much it buys us.  I think the common case would be to drop support
for versions below a certain minimum and to support anything later.  In the
event that your DataReader really does support a discontiguous set of
versions, you can just do extra error checking yourself.
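
For the discontiguous case, the extra checking could be as small as the
following sketch.  The function names and the "dropped" list are hypothetical;
nothing like this exists in the prototype.

```c
#include <stdbool.h>
#include <stddef.h>

/* The common case: everything from min through max is readable. */
static bool
format_in_range(int format, int min, int max)
{
    return format >= min && format <= max;
}

/* The uncommon case: some versions inside the range were dropped.
 * `dropped` lists them; pass NULL and 0 when there are none. */
static bool
format_is_readable(int format, int min, int max,
                   const int *dropped, size_t num_dropped)
{
    if (!format_in_range(format, min, max)) { return false; }
    for (size_t i = 0; i < num_dropped; i++) {
        if (dropped[i] == format) { return false; }
    }
    return true;
}
```

With min 2, max 5 and version 3 dropped, formats 2, 4 and 5 pass while 1, 3
and 6 are rejected.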

Even though DataReader is an advanced class, we should still value simplicity
and try to make it as easy to use as possible.

>  This can be useful for introspection too, so instead of just seeing
> "format 2" something could decode that to the string describing what
> format 2 was (eg "added omitTermFreqAndPositions capability").

So, the advantage would be that we could throw more meaningful error messages?  

The thing is, I'm not sure how useful it is to tell the user what kind of
change occurred at "format 2".  How would that help them to recover? 

There's also Luke-style index browsing.  But there's only so much screen
space, and I can't see how that info has utility compared to other things that
Luke can show you.

It seems to me that that kind of thing belongs in the plugin class
documentation.  Am I missing another important runtime application?

> > It might make sense to throw specific exception classes in Lucy.  I haven't
> > worked something like that out in KS for three reasons.  First, it's hard to
> > catch exceptions from C without leaking memory.  Second, Perl's try-catch
> > mechanism isn't very elegant.  Third, faking up a try-catch-finally interface
> > in C that would be abstract enough to handle all potential host
> > exception-handling mechanisms is, uh, challenging.
> 
> This sounds very difficult!

We can throw exceptions that belong to meaningful classes without too much
difficulty.  We just can't set up try-catch-finally.

But that's not a big deal.  We can just set most things up to check return
values, and throw fatal errors when necessary.

> > However, we could create full-fledged exception objects for Lucy, so that THROW
> > calls might look something like this:
> >
> >    THROW(Err_data_component_version, /* <--- An integer error id */
> >        "Format version '%i32' is less than the minimum "
> >        "supported version '%i32' for %o", format, min,
> >        DataReader_Get_Class_Name(self));
> >
> > The exception objects generated by THROW calls do not have to subclass
> > Lucy::Obj, because we will always be returning control to the host.  So, they
> > could be, for example, plain old Java Exception subclasses.
> 
> What would THROW try to do, and, how?

The Lucy core code would format an error message and choose an error number
from a list of Lucy error codes.  A stack trace would be great, too, though
that's hard to do portably.

Then it would call a method which would have to be implemented per-Host.

For Java, the implementation might contain something like this:

  if (errorNumber == lucy_Err_data_component_version) {
    throw new DataComponentVersionException(message);
  }
  else if (...) {

I should also mention that THROW would be a macro, as implied by the all-caps.
It would call the function lucy_Err_throw_at, automatically inserting line and
function name information when possible:

  #ifdef CHY_HAS_VARIADIC_MACROS
    void
    lucy_Err_throw_at(const char *file, int line, const char *func,
                      const char *pattern, ...);
    #ifdef CHY_HAS_ISO_VARIADIC_MACROS
      #define LUCY_THROW(...) \
        lucy_Err_throw_at(__FILE__, __LINE__, LUCY_ERR_FUNC_MACRO, \
                         __VA_ARGS__)

Some compilers don't support variadic macros, though (cough cough MSVC cough),
so we have to omit the context data and define THROW as a variadic function.

    void
    LUCY_THROW(const char *pattern, ...);
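
To show the variadic-macro half in isolation, here is a self-contained sketch
that compiles as plain C99.  It makes several assumptions: sketch_throw_at,
SKETCH_THROW and last_error are invented names, the patterns are ordinary
printf patterns rather than Lucy's '%i32' and '%o', and the message is simply
recorded in a buffer where the real thing would hand off to the host.

```c
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for the host hand-off: the sketch just records the message. */
static char last_error[512];

static void
sketch_throw_at(const char *file, int line, const char *func,
                const char *pattern, ...)
{
    va_list args;
    int     used;

    va_start(args, pattern);
    used = vsnprintf(last_error, sizeof(last_error), pattern, args);
    va_end(args);

    /* Append the context if the message left room for it. */
    if (used >= 0 && (size_t)used < sizeof(last_error)) {
        snprintf(last_error + used, sizeof(last_error) - (size_t)used,
                 " at %s:%d (%s)", file, line, func);
    }
}

/* C99 variadic macro; __func__ is C99 as well.  The call site's file,
 * line and function are captured without the caller typing them. */
#define SKETCH_THROW(...) \
    sketch_throw_at(__FILE__, __LINE__, __func__, __VA_ARGS__)
```

A call such as SKETCH_THROW("Format version '%d' is too old", 1) leaves the
formatted message in last_error, followed by " at ", the file name, the line
number and the enclosing function name in parentheses.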

How about "Lucy::Util::Err" for the exception handling code?  I've been trying
to avoid things like "String", "Array", "Exception" and such so that we don't
conflict with core host symbols -- hence the funny names like "CharBuf" and
"VArray".

Marvin Humphrey

Re: Segment

Posted by Michael McCandless <lu...@mikemccandless.com>.

Marvin Humphrey <ma...@rectangular.com> wrote:

>> Shouldn't segmeta itself have a format too?
>
> Yes -- it's in there, just under the "segmeta" key rather than at the root
> level.

Woops, missed it, good.

>> Are you going to provide utility APIs that components can use to deal
>> with the format number?
>
> A good plan.  DataWriter already has two relevant methods.
>
>    /** Create a Hash of arbitrary metadata to be serialized and stored
>     * by the Segment.  The default implementation supplies a Hash with
>     * a single key-value pair for "format".
>     */
>    public incremented Hash*
>    Metadata(DataWriter *self);
>
>    /** Every writer must specify a file format revision number, which should
>     * increment each time the format changes. Responsibility for revision
>     * checking is left to the companion DataReader.
>     */
>    public abstract i32_t
>    Format(DataWriter *self);

What does "incremented" mean?

>> eg so a component can register the N formats it's able to deal with,
>> so a consistent error is thrown if a format is too old or too new,
>> etc.
>
> Haven't got standardized methods to perform format checking in DataReader yet.
> How do these look?
>
>    /** Throw an error unless the supplied format version is at least
>     * <code>min</code> and no more than <code>max</code>.
>     *
>     * @param format Format version.
>     * @param min Minimum supported format version, which must be at least 1.
>     * @param max Maximum supported format version, which must be at least 1.
>     * @return the version.
>     */
>    public i32_t
>    Validate_Format(DataReader *self, i32_t format, i32_t min, i32_t max);
>
>    /** Attempt to extract a "format" value from the supplied metadata Hash.
>     * If the extraction is a success, calls Validate_Format().
>     *
>     * @return either the return value of Validate_Format() or 0 (an invalid
>     * format value).
>     */
>    i32_t
>    Check_Format(DataReader *self, Hash *metadata = NULL,
>                 i32_t min, i32_t max);
>
> Note that Validate_Format() is public, but that Check_Format(), which would be
> used by core components, is not.
>
> Implementation code (unverified):
>
>    i32_t
>    DataReader_validate_format(DataReader *self, i32_t format,
>                               i32_t min, i32_t max)
>    {
>        if (format < min) {
>            THROW("Format version '%i32' is less than the minimum "
>                "supported version '%i32' for %o", format, min,
>                DataReader_Get_Class_Name(self));
>        }
>        else if (format > max) {
>            THROW("Format version '%i32' is greater than the maximum "
>                "supported version '%i32' for %o", format, max,
>                DataReader_Get_Class_Name(self));
>        }
>        return format;
>    }
>
>    i32_t
>    DataReader_check_format(DataReader *self, Hash *metadata,
>                            i32_t min, i32_t max)
>    {
>        i32_t version = 0;  /* 0 is an invalid format value */
>        if (metadata) {
>            Obj *format = Hash_Fetch_Str(metadata, "format", 6);
>            if (format) {
>                version = DataReader_Validate_Format(self,
>                    (i32_t)Obj_To_I64(format), min, max);
>            }
>        }
>        return version;
>    }

Looks good, though, I might add a way for a given module to register
the versions it reads & writes (presumably it only writes the most
recent one); then min/max can be derived based on what was registered.
This can be useful for introspection too, so instead of just seeing
"format 2" something could decode that to the string describing what
format 2 was (eg "added omitTermFreqAndPositions capability").

> It might make sense to throw specific exception classes in Lucy.  I haven't
> worked something like that out in KS for three reasons.  First, it's hard to
> catch exceptions from C without leaking memory.  Second, Perl's try-catch
> mechanism isn't very elegant.  Third, faking up a try-catch-finally interface
> in C that would be abstract enough to handle all potential host
> exception-handling mechanisms is, uh, challenging.

This sounds very difficult!

> The only caught exceptions in the KS core happen in IndexReader's open()
> command, due to the lockless opening code and for reasons you are no doubt
> familiar with. ;)  All other errors are fatal.

I think I might know.  So, that answers my earlier question about the
snapshots file.

> However, we could create full-fledged exception objects for Lucy, so that THROW
> calls might look something like this:
>
>    THROW(Err_data_component_version, /* <--- An integer error id */
>        "Format version '%i32' is less than the minimum "
>        "supported version '%i32' for %o", format, min,
>        DataReader_Get_Class_Name(self));
>
> The exception objects generated by THROW calls do not have to subclass
> Lucy::Obj, because we will always be returning control to the host.  So, they
> could be, for example, plain old Java Exception subclasses.

What would THROW try to do, and, how?

Mike

Re: Segment

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Tue, Mar 24, 2009 at 08:11:09AM -0400, Michael McCandless wrote:

> Shouldn't segmeta itself have a format too?

Yes -- it's in there, just under the "segmeta" key rather than at the root
level.

      "segmeta" : { 
         "doc_count" : "11054",
         "field_names" : [ 
            "", 
            "title",
            "category",
            "content",
            "url"
         ], 
         "format" : "1",   <--------------------
         "name" : "seg_3"
      },

> Are you going to provide utility APIs that components can use to deal
> with the format number?  

A good plan.  DataWriter already has two relevant methods.

    /** Create a Hash of arbitrary metadata to be serialized and stored 
     * by the Segment.  The default implementation supplies a Hash with 
     * a single key-value pair for "format".
     */  
    public incremented Hash*
    Metadata(DataWriter *self);

    /** Every writer must specify a file format revision number, which should
     * increment each time the format changes. Responsibility for revision
     * checking is left to the companion DataReader.
     */  
    public abstract i32_t
    Format(DataWriter *self);

> eg so a component can register the N formats it's able to deal with,
> so a consistent error is thrown if a format is too old or too new,
> etc.

Haven't got standardized methods to perform format checking in DataReader yet. 
How do these look?

    /** Throw an error unless the supplied format version is at least
     * <code>min</code> and no more than <code>max</code>.
     *
     * @param format Format version.
     * @param min Minimum supported format version, which must be at least 1.
     * @param max Maximum supported format version, which must be at least 1.
     * @return the version.
     */
    public i32_t
    Validate_Format(DataReader *self, i32_t format, i32_t min, i32_t max);

    /** Attempt to extract a "format" value from the supplied metadata Hash.
     * If the extraction is a success, calls Validate_Format().
     * 
     * @return either the return value of Validate_Format() or 0 (an invalid
     * format value).
     */
    i32_t 
    Check_Format(DataReader *self, Hash *metadata = NULL,
                 i32_t min, i32_t max);

Note that Validate_Format() is public, but that Check_Format(), which would be
used by core components, is not.

Implementation code (unverified): 

    i32_t
    DataReader_validate_format(DataReader *self, i32_t format,
                               i32_t min, i32_t max)
    {
        if (format < min) {
            THROW("Format version '%i32' is less than the minimum "
                "supported version '%i32' for %o", format, min,
                DataReader_Get_Class_Name(self));
        }
        else if (format > max) {
            THROW("Format version '%i32' is greater than the maximum "
                "supported version '%i32' for %o", format, max,
                DataReader_Get_Class_Name(self));
        }
        return format;
    }

    i32_t
    DataReader_check_format(DataReader *self, Hash *metadata,
                            i32_t min, i32_t max)
    {
        i32_t version = 0;  /* 0 is an invalid format value */
        if (metadata) {
            Obj *format = Hash_Fetch_Str(metadata, "format", 6);
            if (format) {
                version = DataReader_Validate_Format(self,
                    (i32_t)Obj_To_I64(format), min, max);
            }
        }
        return version;
    }

It might make sense to throw specific exception classes in Lucy.  I haven't
worked something like that out in KS for three reasons.  First, it's hard to
catch exceptions from C without leaking memory.  Second, Perl's try-catch
mechanism isn't very elegant.  Third, faking up a try-catch-finally interface
in C that would be abstract enough to handle all potential host
exception-handling mechanisms is, uh, challenging.

The only caught exceptions in the KS core happen in IndexReader's open()
command, due to the lockless opening code and for reasons you are no doubt
familiar with. ;)  All other errors are fatal.

However, we could create full-fledged exception objects for Lucy, so that THROW
calls might look something like this:

    THROW(Err_data_component_version, /* <--- An integer error id */
        "Format version '%i32' is less than the minimum "
        "supported version '%i32' for %o", format, min,
        DataReader_Get_Class_Name(self));

The exception objects generated by THROW calls do not have to subclass
Lucy::Obj, because we will always be returning control to the host.  So, they
could be, for example, plain old Java Exception subclasses.
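
For a host with no exception classes of its own to borrow, the C side of such
an object could be as small as an integer id plus a formatted message.  The
sketch below is hypothetical: SketchErr, SketchErrId and SketchErr_new are
invented names, and it uses standard printf directives ('%d', '%s') rather
than the '%i32' and '%o' directives shown above.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error ids; not part of KS or Lucy. */
typedef enum {
    SKETCH_ERR_GENERIC                = 1,
    SKETCH_ERR_DATA_COMPONENT_VERSION = 2
} SketchErrId;

/* Hypothetical error object: an integer id plus a formatted message. */
typedef struct {
    SketchErrId id;
    char        message[256];
} SketchErr;

/* Build an error object from an id and a printf-style pattern.  The message
 * is truncated if it exceeds the buffer.  Returns NULL if allocation fails;
 * the caller owns the result and releases it with free(). */
SketchErr*
SketchErr_new(SketchErrId id, const char *pattern, ...)
{
    SketchErr *self = malloc(sizeof(SketchErr));
    va_list    args;

    if (self == NULL) {
        return NULL;
    }
    self->id = id;
    va_start(args, pattern);
    vsnprintf(self->message, sizeof(self->message), pattern, args);
    va_end(args);
    return self;
}
```

A binding layer could then switch on the id to pick which host exception
class to raise, using the message as-is.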

Marvin Humphrey

Re: Segment

Posted by Michael McCandless <lu...@mikemccandless.com>.

Marvin Humphrey <ma...@rectangular.com> wrote:

> In another thread, I described a proposed "Snapshot" class.  The goal we're
> working towards is pluggable index reading/writing via Architecture and
> DataReader/DataWriter.  The constructor for the prototype implementation of
> DataWriter I've worked up in KS takes three arguments: a Snapshot, a
> PolyReader (analogous to Lucene's MultiSegmentReader)... and a "Segment".
>
> The Segment class has three main responsibilities:
>
>  * Keep track of how many documents are in the segment (not counting
>    deletions).
>  * Maintain per-segment field-name-to-field-number associations.
>  * Write the "segmeta" file, which stores arbitrary metadata.

Ahh, there's the answer (to my "what about segment metadata"
questions).  This is good!

Shouldn't segmeta itself have a format too?

Are you going to provide utility APIs that components can use to deal
with the format number?  We don't in Lucene but it's been broached...
eg so a component can register the N formats it's able to deal with,
so a consistent error is thrown if a format is too old or too new,
etc.

> The Segment's doc count is used both at index time...
>
>    void
>    SegWriter_add_doc(SegWriter *self, Doc *doc)
>    {
>        i32_t doc_num = Seg_Increment_Doc_Count(self->segment, 1);
>        Inverter_Invert_Doc(self->inverter, doc);
>        SegWriter_Add_Inverted_Doc(self, self->inverter, doc_num);
>    }
>
> ... and at search-time:
>
>    i32_t
>    SegReader_doc_max(SegReader *self)
>    {
>        return Seg_Get_Doc_Count(self->segment);
>    }
>
> In Lucene, field-name-to-field-number mappings are the province of the
> FieldInfos class, which also tracks field characteristics such as "isStored".
> Lucy uses global field semantics, though, so there's no need for per-segment
> field specs.
>
> The "segmeta" file is used to store both metadata needed by Segment itself and
> metadata belonging to other index components:
>
>    {
>       "lexicon" : {
>          "counts" : {
>             "content" : "20576"
>          },
>          "format" : "2",
>          "index_counts" : {
>             "content" : "161"
>          }
>       },
>       "postings" : {
>          "format" : "1"
>       },
>       "records" : {
>          "format" : "1"
>       },
>       "segmeta" : {
>          "doc_count" : "11054",
>          "field_names" : [
>             "",
>             "title",
>             "category",
>             "content",
>             "url"
>          ],
>          "format" : "1",
>          "name" : "seg_3"
>       },
>       "term_vectors" : {
>          "format" : "1"
>       }
>    }
>
> Providing a place for plugin indexing components to store arbitrary metadata
> relieves them from the responsibility for writing and parsing metadata
> themselves.  In Lucene, metadata classes such as FieldInfos have their own
> binary file formats and maintain their own parsing routines, bloating the
> Lucene file format documentation and adding maintenance overhead.  While
> binary formats are necessary for bulk data, for small amounts of metadata they
> hinder bare-eye browsing and provide no significant performance advantage.
>
> In some sense Segment is similar to the Lucene class SegmentInfo.  For
> example, both of them store format version data; however, Segment is only
> aware of its own format, and it is up to individual plugins to track their own
> format versions and adjust behavior as needed.  SegmentInfo is tightly bound
> to other Lucene classes because it knows too much about them, hindering
> extensibility; Segment, while capable of storing much more data than
> SegmentInfo since it uses generic scalar-list-mapping data structures, knows
> nothing about any of the plugin components that access that data.
>
> Prototype code:
>
>  http://tinyurl.com/proto-seg-bp
>  http://tinyurl.com/proto-seg-c
>
> HTML presentation of public API documentation for Perl binding:
>
>  http://tinyurl.com/seg-dev-docs
>
> Marvin Humphrey
>
>