You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/03/03 01:07:56 UTC

[lucy-dev] FieldType Equals() and class membership

Greets,

Lucy is not supposed to let you change field definitions for an index.  If you
supply a Schema with a modified field type to an Indexer and attempt to modify
an existing index, Lucy will throw an exception.  You're allowed a new schema
if the "truncate" flag is supplied to Indexer, but in that case there will be
a hard cutover to the new Schema when Indexer_Commit() completes -- there is
never a situation where the same index hosts distinct segments with
conflicting field definitions.

The enforcement happens here, in core/Lucy/Index/Indexer.c:

            // Make sure than any existing fields which may have been
            // dynamically added during past indexing sessions get added.
            Schema *old_schema = PolyReader_Get_Schema(self->polyreader);
            Schema_Eat(schema, old_schema);

However, there is a weakness the definition checking performed by
Schema_Eat(): it relies upon FType_Equals() to detect conflicts, and
FType_Equals() doesn't enforce class membership.  That means that an
incompatible index definition can overwrite another one.

This bug was uncovered when we changed an index to use a new FieldType
subclass that spec'd an alternate posting format. (Alternate posting formats
are an undocumented feature of Lucy.) The new index was written successfully,
displacing the old when Indexer_Commit() finished.  An errant process using
the old schema then modified the index, adding a new segment in an
incompatible format and overwriting the new schema with the old.  Once that
update completed, the index was no longer usable, because the incorrect
posting decoder spec'd by the new FieldType was incorrect with the posting
data in all the old segments -- it read a bunch of junk, then finally yielded
a read-past-EOF error.

As a solution, I believe that FieldType's Equals() method should change to
enforce that both "self" and "other" belong to the same class.  We don't care
*which* class, we only care that they be the *same* class.  So long as we
don't require that that class be Lucy::Plan::FieldType itself, the method is
still inheritable and can be invoked via SUPER.

For some classes, e.g. CharBuf, Equals() should emphatically *not* test for
class membership, so that e.g. a CharBuf and a ViewCharBuf with the same
logical content test as equal.  For FieldType, I can't think of a scenario
where having objects which belong to different classes test as equal would be
desirable.

Patch below.

Marvin Humphrey


Index: ../core/Lucy/Plan/FieldType.c
===================================================================
--- ../core/Lucy/Plan/FieldType.c   (revision 1075667)
+++ ../core/Lucy/Plan/FieldType.c   (working copy)
@@ -87,6 +87,7 @@
 {
     FieldType *evil_twin = (FieldType*)other;
     if (evil_twin == self) return true;
+    if (FType_Get_VTable(self) != FType_Get_VTable(evil_twin)) return false;
     if (self->boost != evil_twin->boost) return false;
     if (!!self->indexed    != !!evil_twin->indexed)    return false;
     if (!!self->stored     != !!evil_twin->stored)     return false;


Re: [lucy-dev] FieldType Equals() and class membership

Posted by Nathan Kurz <na...@verse.com>.
On Wed, Mar 2, 2011 at 5:18 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> My only thought would be that it might be better not to mentally overload
>> the name Equals.
>
> We use Equals() and Hash_Sum() the way that several other object systems do:
> objects for which Equals() returns true must produce identical values for
> Hash_Sum().  This allows us to use many different kinds of objects as hash
> keys.

I agree that is is common practice.  I was questioning, though,
whether it was best practice.  But when it comes to OO techniques,
it's hard for me to tell whether I'm a canary in the coal mine or just
a mad dog barking at the moon.  I felt I should bring it up, but I'll
trust your instinct.

>> While I'm nitpicking, is the '!!' really necessary there?  I'm not
>> sure how these fields are being used, but is it certain that one
>> always wants to consider things the same if these have different
>> values?
>
> Those members you're referring to have boolean values.
>
>    if (!!self->indexed    != !!evil_twin->indexed)    return false;
>
> Since C doesn't provide a boolean type with values restricted to true and
> false, using !! provides an extra level of safety, insulating this method from
> a glitch elsewhere in the code.  Whether that's appropriate caution or
> paranoia is a matter for debate.  :)

I understand what it does, but wondered if it was being
_insufficiently_ paranoid.  Hypothetically, who knows why, but someone
might try:

#define INDEXED_BASIC 1
#define INDEXED_EXTRA 2
BasicType->indexed = INDEXED_BASIC;
ExtraType->indexed = INDEXED_EXTRA;

What should FieldType_Equals() return?  If one was being fully
paranoid, one might conclude that if the values are anything other
than numerically (not logically) equal, that it's always safer to
return false rather than taking any risk of corrupting the index.
Currently, the 'error' is silently 'fixed'.  I was wondering if this
is really a good thing.  I'd personally probably do it numerically and
might even add an ASSERT(indexed == !!index) somewhere, but I can get
a little ASSERT happy at times.

Nathan Kurz
nate@verse.com

Re: [lucy-dev] FieldType Equals() and class membership

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Mar 02, 2011 at 04:40:45PM -0800, Nathan Kurz wrote:
> On Wed, Mar 2, 2011 at 4:07 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> > For some classes, e.g. CharBuf, Equals() should emphatically *not* test for
> > class membership, so that e.g. a CharBuf and a ViewCharBuf with the same
> > logical content test as equal.  For FieldType, I can't think of a scenario
> > where having objects which belong to different classes test as equal would be
> > desirable.
> 
> All sounds reasonable and like good changes.  

Cool, thanks for thinking it through.

> My only thought would be that it might be better not to mentally overload
> the name Equals.

We use Equals() and Hash_Sum() the way that several other object systems do:
objects for which Equals() returns true must produce identical values for
Hash_Sum().  This allows us to use many different kinds of objects as hash
keys.  

Off the top of my head, I know that Python, Java, Ruby, and C# all provide
similar contracts:

    Python:  __hash__, __eq__
    Java:    hashCode, equals
    Ruby     hash, eql?
    C#       GetHashCode, Equals

Perl and Javascript are the odd ones out, since they limit their hash keys to
string types.

> Maybe FType_SameClass() would be clearer?

It's not uncommon to provide variants on equals() methods -- for example, Ruby
provides the following:

    http://www.ruby-doc.org/core/classes/Object.html#M001011
    ==
    eql?
    equals?

I think that the semantics of FieldType's Equals() method fit within
traditional definitions pretty well: we want to know whether these objects
have the same value.  We don't want to allow two different FieldType objects
which produce different index formats to be conflated.  There can be only one
index format per field name, therefore there can be only one FieldType per
field name in a Schema.  This is very similar to the problem of testing
whether one key should should displace another within a set/hashset.

> While I'm nitpicking, is the '!!' really necessary there?  I'm not
> sure how these fields are being used, but is it certain that one
> always wants to consider things the same if these have different
> values?

Those members you're referring to have boolean values.  

    if (!!self->indexed    != !!evil_twin->indexed)    return false;

Since C doesn't provide a boolean type with values restricted to true and
false, using !! provides an extra level of safety, insulating this method from
a glitch elsewhere in the code.  Whether that's appropriate caution or
paranoia is a matter for debate.  :)

Marvin Humphrey


Re: [lucy-dev] FieldType Equals() and class membership

Posted by Nathan Kurz <na...@verse.com>.
On Wed, Mar 2, 2011 at 4:07 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> For some classes, e.g. CharBuf, Equals() should emphatically *not* test for
> class membership, so that e.g. a CharBuf and a ViewCharBuf with the same
> logical content test as equal.  For FieldType, I can't think of a scenario
> where having objects which belong to different classes test as equal would be
> desirable.

All sounds reasonable and like good changes.  My only thought would be
that it might be better not to mentally overload the name Equals.
Maybe FType_SameClass() would be clearer?

At heart, I'm really not in favor of overloading anything, so I'd even
be happy to rename both of them if it would improve clarity.  Should a
CharBuf and a ViewCharBuf  have the SameContent()?

While I'm nitpicking, is the '!!' really necessary there?  I'm not
sure how these fields are being used, but is it certain that one
always wants to consider things the same if these have different
values?

--nate