Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2009/04/12 15:08:50 UTC

Types and Schemas (was "Sort cache file format")

On Sat, Apr 11, 2009 at 10:58:44AM -0400, Michael McCandless wrote:

> Does FieldSpec sub divide the options?  Eg options about indexing
> could live in its own class, with commonly used constants like "NO".
> 
> This was the motivation of that comment in Lucene (the fact that we
> don't subdivide means suddenly stored only fields have to figure out
> what to do with omitNorms, omitTFAP booleans; if we had Field.Index.NO
> that'd be better).

Right now, FieldSpec doesn't subdivide, but it's not a least common
denominator, either.  To illustrate: FieldSpec has boolean members for
"indexed", "stored", and "sortable", but knows nothing about Analyzers.
Analyzers are the exclusive province of the FullTextField subclass.

If you don't permit automatic merging of field types, then there isn't a need
for FieldSpec to know everything about all its subclasses.  I see why
subdividing options might be useful in Lucene, but I'm not sure it's necessary
for Lucy.  

I think it's better OO design for the parent class to be simple rather than
comprehensive.
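
A minimal sketch of that split, in C.  The struct layouts, the function names, and the default flag values are assumptions made for illustration, not the actual KS or Lucy definitions.  The base class carries only the three shared booleans; the analyzer lives in the subclass alone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an analyzer; not the real KS/Lucy type. */
typedef struct Analyzer {
    const char *name;
} Analyzer;

/* Base class: only the options every field type shares. */
typedef struct FieldSpec {
    bool indexed;
    bool stored;
    bool sortable;
} FieldSpec;

/* Subclass: embeds the base as its first member and adds the analyzer. */
typedef struct FullTextField {
    FieldSpec  base;
    Analyzer  *analyzer;
} FullTextField;

/* Assumed defaults: indexed and stored, but not sortable. */
static void
FullTextField_init(FullTextField *self, Analyzer *analyzer) {
    self->base.indexed  = true;
    self->base.stored   = true;
    self->base.sortable = false;
    self->analyzer      = analyzer;
}

/* Code that only needs the shared flags works on the base pointer and
 * never has to know that analyzers exist. */
static bool
FieldSpec_needs_sort_cache(const FieldSpec *spec) {
    return spec->sortable;
}
```

Because the base is the first member, a pointer to a FullTextField can be handed to anything that expects a FieldSpec, which is all the parent needs to offer.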

> Well, in Lucene we could better decouple a Field's value from its
> "extended type".  The type would still be attached to the Field's
> value (not to the global schema as in KS), but strongly decoupled &
> shared across Field instances.

That makes sense.  The "extended type" class could look almost identical, but
in Lucene the user would make the connection directly, while in Lucy it
would be made indirectly via the field name.

> [A fun aside: Wow I just did a Google search for "javascript self" and
> it offered up respelling to "javascript this" -- they've got one smart
> respeller!]

Haha, awesome. :)

> Lucene in fact implicitly has a global schema in that when segments
> are merged, or when docs are added into a single segment, the schema
> for each document or segment are "merged" according to certain rules.
> When your index is optimized then you have your global schema.

That's a good way of putting it.

> > Dump them to a JSON-izable data structure.  Include the class name so that you
> > can pick a deserialization routine at load time.
> 
> You rely on the same namespace -> obj mapping being present at
> deserialize time?  I.e. it's the caller's responsibility to import the
> same modules, ensure the names "map" to the same objs (or at least
> compatible ones) as were used during serialization, etc.

If the user has implemented custom subclasses, then yes, the subclasses must be
loaded or you'll get a "class not found" error.

> Though, for core objects, you would use the global name -> vtable
> mapping that Lucy core maintains?  

Yes.  Any core class would already be loaded.

> (I still don't fully understand why Lucy needs that global hash -- this is
> what namespaces are for).

If we didn't implement it internally, we'd need to implement it in the
bindings for e.g. looking up deserialization routines.  Furthermore, we need
some mechanism for C-level subclassing, since that's not part of the C
language.  No namespaces there.  :)
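
To make the lookup concrete, here is a rough sketch of a name-to-deserializer registry in C.  Everything in it is hypothetical: a fixed-size array stands in for the global hash, and the names (registry_add, registry_fetch, load_my_tokenizer) are invented for the example.  It only shows why the table has to stay writable if custom subclasses are to be found at load time.

```c
#include <stddef.h>
#include <string.h>

/* A loader turns a dumped representation back into an object. */
typedef void *(*load_fn)(const char *dump);

typedef struct RegistryEntry {
    const char *class_name;
    load_fn     load;
} RegistryEntry;

#define MAX_CLASSES 64
static RegistryEntry registry[MAX_CLASSES];
static size_t        registry_size = 0;

/* Register a class.  Returns 0 on success, -1 if the table is full or
 * the name is already taken. */
static int
registry_add(const char *class_name, load_fn load) {
    if (registry_size >= MAX_CLASSES) { return -1; }
    for (size_t i = 0; i < registry_size; i++) {
        if (strcmp(registry[i].class_name, class_name) == 0) { return -1; }
    }
    registry[registry_size].class_name = class_name;
    registry[registry_size].load       = load;
    registry_size++;
    return 0;
}

/* Look up a loader by class name.  NULL means "class not found". */
static load_fn
registry_fetch(const char *class_name) {
    for (size_t i = 0; i < registry_size; i++) {
        if (strcmp(registry[i].class_name, class_name) == 0) {
            return registry[i].load;
        }
    }
    return NULL;
}

/* Hypothetical loader for a user-defined subclass. */
static int placeholder_tokenizer;
static void *
load_my_tokenizer(const char *dump) {
    (void)dump;  /* A real loader would parse the dump. */
    return &placeholder_tokenizer;
}
```

A load routine would read the class name out of the dump, call registry_fetch(), and report "class not found" when it gets NULL back.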

> OK, so if I've made a custom Tokenizer doing some funky Python code
> instead of a regexp, I could simply implement dump/load to do the
> right thing.

Yes.

BTW, I saw that Earwin Burrfoot calls his type class "FieldType".  

"FieldType" is probably a better name than "FieldSpec", as it implies
subclasses with "Type" as a suffix: FullTextType, StringType, BlobType,
Int32Type, etc.

Marvin Humphrey

Re: Types and Schemas

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Apr 14, 2009 at 6:38 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Mon, Apr 13, 2009 at 09:43:06AM -0400, Michael McCandless wrote:
>
>> > Because Lucy's Doc objects will be hash based, there will *never* be a case
>> > where the same field has two "values" per se within the same doc.
>> >
>> > However, it's fine if we support compound types via specific FieldType
>> > subclasses, e.g. Float32ArrayType, or StringArrayType.
>>
>> I see -- does KS support multi-valued (compound) types today?
>
> No.  There hasn't been a pressing need for them.  (IMO, the fact that Lucene
> allows multiple "values" per field is a misfeature.  Effectively, *all* fields
> in Lucene are compound types, which is limiting.)

I think compound types are important (eg "author"), though "compound"
is a bit too powerful sounding (eg a "struct" is a compound type, but
we're not going there, I hope).  Maybe we can call them "arrays" or
"lists" or "multi-valued".

Maybe you mean Lucene's weak typing (of multi-valued types, in
particular) is the misfeature here?

> Nevertheless, I'm up for supplementing scalar types with compound types in
> Lucy.  The "tags" use case in particular might be more elegantly handled with
> a StringArrayType.  Or maybe a FullTextArrayType, if it was important for the
> field to be analyzed.
>
> Right now, you can fake up a "tags" field in KS using a dedicated Tokenizer
> pretty easily, but the scoring is kind of messed up because of length
> normalization.

Another example might be a product that comes in three sizes (S, M, L)
where your current search is filtering by "size == S".  That's hard to
emulate well w/o compound types (you could do substring search, but
that scales poorly).

>> For which "types"?  And I imagine for such types, "sortable" is not allowed
>
> There's an inherent confusion in how fields that can "hit" in multiple ways
> should sort.  On one hand, you might want to sort by the value that "hit".  On
> the other hand, you might want to sort by the first value in the field.
>
> In the face of that confusion, I think it makes sense to just disable sorting
> for compound types.

Or allow custom comparator, or custom "ValueSource".  Hmm, I wonder
whether ValueSource should make it possible to eg return multiple ints
for a single doc.

>> (yet "sortable" is set at the top FieldSpec, right?)?
>
> Sure, but subclasses of FieldType can override Set_Sortable() to throw an
> error and avoid it as a constructor arg.

Ahh, right, you can "subtract" functionality from the base.  OK.
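
A small sketch of that "subtraction" in C, assuming a hand-rolled vtable.  The layout and names are illustrative only, and since C has no exceptions the refusal is modelled as an error return rather than a thrown error.

```c
#include <stdbool.h>

typedef struct FieldType FieldType;

typedef struct FieldTypeVTable {
    /* Returns 0 on success, -1 if the type refuses the setting. */
    int (*set_sortable)(FieldType *self, bool sortable);
} FieldTypeVTable;

struct FieldType {
    const FieldTypeVTable *vtable;
    bool                   sortable;
};

/* Base-class behavior: just record the flag. */
static int
FieldType_set_sortable_default(FieldType *self, bool sortable) {
    self->sortable = sortable;
    return 0;
}

/* Compound type: turning sorting off is harmless, turning it on is an
 * error. */
static int
StringArrayType_set_sortable(FieldType *self, bool sortable) {
    (void)self;
    return sortable ? -1 : 0;
}

static const FieldTypeVTable STRINGTYPE_VTABLE = {
    FieldType_set_sortable_default
};
static const FieldTypeVTable STRINGARRAYTYPE_VTABLE = {
    StringArrayType_set_sortable
};

/* Dispatch through the vtable so callers need not know the subclass. */
static int
FieldType_Set_Sortable(FieldType *self, bool sortable) {
    return self->vtable->set_sortable(self, sortable);
}
```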

>> > It's also important to distinguish between "multi-valued" and the
>> > "multi-token" FullTextType.  FullTextType fields are tokenized within the
>> > index, but in the context of the doc reader, they only have one string
>> > "value".  Note, however, that you cannot sort on a FullTextType field in KS.
>>
>> So if I want to index & sort by "title" field, I make 2 separate fields?
>
> Hmm.  Good point, that's a waste and shouldn't be necessary.
>
> That behavior is an artifact of using Lexicon data to build the sort cache.
> Once we move to a dedicated SortWriter/SortReader, though, we'll be building
> the sort cache at index time from the full field value, and that problem goes
> away.
>
> So, I think it makes sense to allow sorting on FullTextType fields after all.

OK that sounds right.  Lucene won't be able to do this until we have
CSF, or until we also write sort caches at index time, which does make
sense.

>> >>   * Open vs closed (known set of values) enums
>> >
>> > It would be nice to add this later.  I don't think it's a high priority, since
>> > it's an optimization.
>>
>> You mean you'd start with "open" enums?
>
> I meant no enum type right now.

OK, since at index time we can basically deduce ourselves if it's
"relatively" enumerated and act accordingly (or simply treat all as
enums for now, as you've suggested).

>> >>   * Sortable
>> >
>> > I think this belongs in the base class -- that's where KS has it now.  That
>> > way, we can perform the following test, regardless of what the type is.
>> >
>> >   if (FieldType_Sortable(field_type)) {
>> >        /* Build sort cache. */
>> >        ...
>> >   }
>>
>> Yeah... except multi-valued (compound) types would disable this, I
>> guess.  Though Lucene users seem to hit this limitation enough to make
>> it relaxable... and customize how SortCache gets created.
>
> In the abstract, that sounds like a can of worms, but we can revisit after the
> sort cache writer (SortWriter?) gets a provisional implementation.

OK.

>> >>   * nulls sort on top or bottom
>> >
>> > This would be individual to each sort comparator.  Note that we might want to
>> > use a different sort comparator for NOT NULL fields for efficiency's sake,
>> > which complicates making the comparator a method on FieldSpec.
>>
>> Yes, we're iterating on this now in LUCENE-831.  Though I wonder if
>> this ought to be the realm of source code specialization...
>> multiplying out all the combinations of "single comparator or not",
>> "scoring or not", "track max score or not", "string index may have
>> nulls or not", in Lucene's "true" sources (vs generated sources)
>> starts to get crazy.  Soon we'll also multiply in "docIDs guaranteed
>> to arrive in order to the collector, or not" as well.
>
> Actually, you know what?  The vast majority of our sort costs come at
> index-time, when we build the ords array.  At search time, the only time we
> have to worry about the cost of the comparator is when comparing values across
> segments.  So: we can afford to have NULL checks in the default comparator
> routines.

I think the search time optimizations add up... not having to break
ties on docID is a good gain, for example, if the sort has ties.

I'm seeing sizable gains by specializing the source code (in Java, at
least).  Though, a good chunk of that is pushing random-access filters
down low, so that's a low hanging fruit for the true source code.

>> > My general inclination is to have NULLs sort towards the end of the array.
>> >
>> >>   * Omit norms, omit TFAP
>> >
>> > I'm putting this off for now.  It will be addressed when we refactor for
>> > flexible indexing.
>>
>> OK.  These would seem to live nicely under FullTextType... oh actually
>> maybe not, because presumably I can index single-valued fields (the
>> equivalent of NOT_ANALYZED in Lucene).
>
> Yes.  Right now in KS, StringType fields -- which are single-valued -- can be
> indexed.

OK

>> EG an Int32Type may in fact be indexed, and I would at that point want to
>> put omit norms/TFAP there.  Hmmm, cross cutting concerns.  Maybe sub-typing
>> is needed...
>
> Right now in KS, norms are stored in the postings files, a la the original
> "flexible indexing" design that Doug, Grant and I hashed out a while back.
> It's inefficient and needs refactoring.
>
> However, I plan to wait on that until after the next dev release.

OK

>> >>   * Term vectors or not, positions, offsets
>> >
>> > Term vectors are unique to FullTextType, since it is the only multi-token
>> > field.  Right now in KS, it's a boolean member var in FullTextType.
>>
>> Single-token indexed fields might want term vectors too?
>
> I dunno, is that necessary?  I guess it's not a big deal to move it down into
> FieldType.
>
> Right now in KS, there's only one flag, "vectorized", and start offsets and
> end offsets are always included.  That's because the only significant use case
> is highlighting.  (I've always regarded MoreLikeThis queries based on term
> vectors as fatally flawed.)

Why flawed?

> I've often wondered whether or not to call that flag "highlightable" rather
> than the obtuse "vectorized".

> IMO, it's important to have a high quality
> highlighter/excerpter in core, and perhaps the API should be adjusted to
> reflect that priority.

+1

> If you really need "term vectors" per se, you can
> either go with a dedicated plugin or specify "highlightable" and exploit the
> fact that it's a term-vector based implementation.

Yeah, maybe.  Besides highlighting, MoreLikeThis and maybe
clustering/categorizing, I don't have a good sense of what
else term vectors are "typically" used for.

>> >>   * CSF'd or not
>> >
>> > Right now, I'd say keep this out of core.
>>
>> OK, and, merge with sort cache somehow.  For most types they are one
>> and the same.
>
> Yeah, I think that's right.  The only difference is the extra deref in cases
> where high levels of uniqueness suggest a pure array would be ideal.

Right.

>> >>   * I will use RangeFilter on this field
>> >
>> > The "sortable" boolean member var fills this need, no?
>>
>> They are different?  Eg you'll add aggregates (Trie*) to your index
>> for fast range constraints, but for sorting you just need a sort cache
>> computed.
>
> I haven't really looked at the TrieRange stuff yet...
>
> In KS, range queries are implemented to just look up a term number in the
> Lexicon for both the lower and upper terms before scoring commences, then see
> if the ord value from the sort cache falls between them for each document.

Ahh got it.  Lucene recently added that approach (using our FieldCache
to check inclusion in the range).  We now have too many RangeQueries.

Trie simply aggregates big ranges at indexing time (logically
equivalent to 0-10, 10-20, then next trie 0-100, 100-200, etc.), ie
each range is new term on the doc, and then at search time you can
pick a much smaller set of terms to iterate.
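
A toy illustration of the idea in C.  It uses decimal buckets of width 1, 10 and 100 to mirror the ranges mentioned above; this is an assumption made for readability and is not how Lucene's trie encoding is actually laid out.  All names are hypothetical.

```c
#include <stdint.h>

/* Bucket widths, coarsest first. */
static const uint32_t WIDTHS[] = { 100, 10, 1 };
#define NUM_WIDTHS 3

typedef struct RangeTerm {
    uint32_t width;  /* size of the bucket */
    uint32_t start;  /* first value the bucket covers */
} RangeTerm;

/* Index time: emit one extra term per level for a value.  `out` must
 * have room for NUM_WIDTHS entries. */
static void
emit_range_terms(uint32_t value, RangeTerm *out) {
    for (int i = 0; i < NUM_WIDTHS; i++) {
        out[i].width = WIDTHS[i];
        out[i].start = value - (value % WIDTHS[i]);
    }
}

/* Search time: count how many bucket terms are needed to cover the
 * half-open range [lo, hi), always taking the widest aligned bucket
 * that still fits. */
static uint32_t
count_cover_terms(uint32_t lo, uint32_t hi) {
    uint32_t count = 0;
    uint32_t pos   = lo;
    while (pos < hi) {
        for (int i = 0; i < NUM_WIDTHS; i++) {
            uint32_t w = WIDTHS[i];
            if (pos % w == 0 && hi - pos >= w) {
                pos += w;
                count++;
                break;  /* Width 1 always fits, so the loop advances. */
            }
        }
    }
    return count;
}
```

With these widths, the range [5, 125) is covered by 21 bucket terms, where matching every exact value would take 120.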

>> > FWIW, the current implementation of Boilerplater only supports two level
>> > namespacing (with nicknames).  Outside of core, fully qualified code would
>> > look like this:
>> >
>> >  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
>> >  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);
>>
>> What are the two levels here?  Level 1 is "StdAnalyzer", and Level 2
>> is "new" and "Transform_Text"?
>
> Level 1 is "lucy_".  Level 2 is "StdAnalyzer_".

OK.

>> > If we readonly that Hash, we can't add subclasses to it -- and therefore we
>> > won't be able to retrieve their deserializers.
>>
>> I guess it's only subclasses implemented in C where this is important?
>>
>> Because a hosty subclass's deserializer is using/relying on the host's
>> namespace to find classes by name.
>
> Within Schema_deserialize, Lucy will have to be able to track down
> deserializers for custom subclasses of Analyzer and FieldType.  Same thing
> with custom Query subclasses and remote searching.
>
> We either deal with that need in the Lucy core or punt back to the host.

Seems like if the subclass is in the host, the host's namespace should
locate it (and its deserializer method).  The global hash that expands
at runtime to include all known named things in the universe still
doesn't quite sit right w/ me... but I agree based on deserializer's
needs, and lack of namespaces in C, it seems to solve those needs.

Mike

Re: Types and Schemas

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Mon, Apr 13, 2009 at 09:43:06AM -0400, Michael McCandless wrote:

> > Because Lucy's Doc objects will be hash based, there will *never* be a case
> > where the same field has two "values" per se within the same doc.
> >
> > However, it's fine if we support compound types via specific FieldType
> > subclasses, e.g. Float32ArrayType, or StringArrayType.
> 
> I see -- does KS support multi-valued (compound) types today?  

No.  There hasn't been a pressing need for them.  (IMO, the fact that Lucene
allows multiple "values" per field is a misfeature.  Effectively, *all* fields
in Lucene are compound types, which is limiting.)

Nevertheless, I'm up for supplementing scalar types with compound types in
Lucy.  The "tags" use case in particular might be more elegantly handled with
a StringArrayType.  Or maybe a FullTextArrayType, if it was important for the
field to be analyzed.

Right now, you can fake up a "tags" field in KS using a dedicated Tokenizer
pretty easily, but the scoring is kind of messed up because of length
normalization.  

> For which "types"?  And I imagine for such types, "sortable" is not allowed

There's an inherent confusion in how fields that can "hit" in multiple ways
should sort.  On one hand, you might want to sort by the value that "hit".  On
the other hand, you might want to sort by the first value in the field.

In the face of that confusion, I think it makes sense to just disable sorting
for compound types.

> (yet "sortable" is set at the top FieldSpec, right?)?

Sure, but subclasses of FieldType can override Set_Sortable() to throw an
error and avoid it as a constructor arg.

> > It's also important to distinguish between "multi-valued" and the
> > "multi-token" FullTextType.  FullTextType fields are tokenized within the
> > index, but in the context of the doc reader, they only have one string
> > "value".  Note, however, that you cannot sort on a FullTextType field in KS.
> 
> So if I want to index & sort by "title" field, I make 2 separate fields?

Hmm.  Good point, that's a waste and shouldn't be necessary.

That behavior is an artifact of using Lexicon data to build the sort cache.
Once we move to a dedicated SortWriter/SortReader, though, we'll be building
the sort cache at index time from the full field value, and that problem goes
away.

So, I think it makes sense to allow sorting on FullTextType fields after all.
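
For illustration, a sketch of building an ords array from full field values, as opposed to from tokenized Lexicon terms.  This is an assumption about the general shape only; the function name and the in-memory approach are hypothetical and say nothing about what SortWriter will actually do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort/bsearch comparator over an array of C strings. */
static int
cmp_str(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Fill ords[doc] with the rank of values[doc] among the sorted unique
 * values.  Returns the number of unique values, or -1 if allocation
 * fails.  Values must be non-NULL. */
static int32_t
build_ords(const char *const *values, int32_t num_docs, int32_t *ords) {
    if (num_docs <= 0) { return 0; }
    const char **sorted = malloc((size_t)num_docs * sizeof(*sorted));
    if (sorted == NULL) { return -1; }
    memcpy(sorted, values, (size_t)num_docs * sizeof(*sorted));
    qsort(sorted, (size_t)num_docs, sizeof(*sorted), cmp_str);

    /* Collapse duplicates in place. */
    int32_t num_unique = 1;
    for (int32_t i = 1; i < num_docs; i++) {
        if (strcmp(sorted[i], sorted[num_unique - 1]) != 0) {
            sorted[num_unique++] = sorted[i];
        }
    }

    /* Each doc's ord is the position of its whole value. */
    for (int32_t doc = 0; doc < num_docs; doc++) {
        const char **hit = bsearch(&values[doc], sorted, (size_t)num_unique,
                                   sizeof(*sorted), cmp_str);
        assert(hit != NULL);  /* Every value is present by construction. */
        ords[doc] = (int32_t)(hit - sorted);
    }
    free(sorted);
    return num_unique;
}
```

The point is that the whole title is ranked as a single value, so the same field can be tokenized for search and still sort sensibly.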

> >>   * Open vs closed (known set of values) enums
> >
> > It would be nice to add this later.  I don't think it's a high priority, since
> > it's an optimization.
> 
> You mean you'd start with "open" enums?

I meant no enum type right now.

> >>   * Sortable
> >
> > I think this belongs in the base class -- that's where KS has it now.  That
> > way, we can perform the following test, regardless of what the type is.
> >
> >   if (FieldType_Sortable(field_type)) {
> >        /* Build sort cache. */
> >        ...
> >   }
> 
> Yeah... except multi-valued (compound) types would disable this, I
> guess.  Though Lucene users seem to hit this limitation enough to make
> it relaxable... and customize how SortCache gets created.

In the abstract, that sounds like a can of worms, but we can revisit after the
sort cache writer (SortWriter?) gets a provisional implementation.

> >>   * nulls sort on top or bottom
> >
> > This would be individual to each sort comparator.  Note that we might want to
> > use a different sort comparator for NOT NULL fields for efficiency's sake,
> > which complicates making the comparator a method on FieldSpec.
> 
> Yes, we're iterating on this now in LUCENE-831.  Though I wonder if
> this ought to be the realm of source code specialization...
> multiplying out all the combinations of "single comparator or not",
> "scoring or not", "track max score or not", "string index may have
> nulls or not", in Lucene's "true" sources (vs generated sources)
> starts to get crazy.  Soon we'll also multiply in "docIDs guaranteed
> to arrive in order to the collector, or not" as well.

Actually, you know what?  The vast majority of our sort costs come at
index-time, when we build the ords array.  At search time, the only time we
have to worry about the cost of the comparator is when comparing values across
segments.  So: we can afford to have NULL checks in the default comparator
routines.
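
A sketch of what a default comparator with NULL checks might look like, written against the standard qsort() interface.  The function name is made up, and sorting NULLs last is one possible policy, chosen here for the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort-compatible comparator over an array of C strings in which
 * missing values are represented as NULL.  NULL sorts after every
 * non-NULL value. */
static int
compare_nulls_last(const void *va, const void *vb) {
    const char *a = *(const char *const *)va;
    const char *b = *(const char *const *)vb;
    if (a == NULL && b == NULL) { return 0; }
    if (a == NULL)              { return 1; }
    if (b == NULL)              { return -1; }
    return strcmp(a, b);
}
```

The extra cost is three pointer tests per comparison, which only matters on the paths where values are actually compared.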

> > My general inclination is to have NULLs sort towards the end of the array.
> >
> >>   * Omit norms, omit TFAP
> >
> > I'm putting this off for now.  It will be addressed when we refactor for
> > flexible indexing.
> 
> OK.  These would seem to live nicely under FullTextType... oh actually
> maybe not, because presumably I can index single-valued fields (the
> equivalent of NOT_ANALYZED in Lucene).  

Yes.  Right now in KS, StringType fields -- which are single-valued -- can be
indexed.  

> EG an Int32Type may in fact be indexed, and I would at that point want to
> put omit norms/TFAP there.  Hmmm, cross cutting concerns.  Maybe sub-typing
> is needed...

Right now in KS, norms are stored in the postings files, a la the original
"flexible indexing" design that Doug, Grant and I hashed out a while back.
It's inefficient and needs refactoring.

However, I plan to wait on that until after the next dev release.

> >>   * Term vectors or not, positions, offsets
> >
> > Term vectors are unique to FullTextType, since it is the only multi-token
> > field.  Right now in KS, it's a boolean member var in FullTextType.
> 
> Single-token indexed fields might want term vectors too?

I dunno, is that necessary?  I guess it's not a big deal to move it down into
FieldType.

Right now in KS, there's only one flag, "vectorized", and start offsets and
end offsets are always included.  That's because the only significant use case
is highlighting.  (I've always regarded MoreLikeThis queries based on term
vectors as fatally flawed.)

I've often wondered whether or not to call that flag "highlightable" rather
than the obtuse "vectorized".  IMO, it's important to have a high quality
highlighter/excerpter in core, and perhaps the API should be adjusted to
reflect that priority.  If you really need "term vectors" per se, you can
either go with a dedicated plugin or specify "highlightable" and exploit the
fact that it's a term-vector based implementation.

> >>   * CSF'd or not
> >
> > Right now, I'd say keep this out of core.
> 
> OK, and, merge with sort cache somehow.  For most types they are one
> and the same.

Yeah, I think that's right.  The only difference is the extra deref in cases
where high levels of uniqueness suggest a pure array would be ideal.

> >>   * I will use RangeFilter on this field
> >
> > The "sortable" boolean member var fills this need, no?
> 
> They are different?  Eg you'll add aggregates (Trie*) to your index
> for fast range constraints, but for sorting you just need a sort cache
> computed.

I haven't really looked at the TrieRange stuff yet...

In KS, range queries are implemented to just look up a term number in the
Lexicon for both the lower and upper terms before scoring commences, then see
if the ord value from the sort cache falls between them for each document.
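
Roughly, in C.  The lexicon here is a sorted array of unique strings and the sort cache is a plain array of ords; both are simplifications, and the function names are hypothetical rather than KS API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Index of the first term >= target in a sorted, unique lexicon. */
static int32_t
lex_lower_bound(const char *const *lexicon, int32_t num_terms,
                const char *target) {
    int32_t lo = 0, hi = num_terms;
    while (lo < hi) {
        int32_t mid = lo + (hi - lo) / 2;
        if (strcmp(lexicon[mid], target) < 0) { lo = mid + 1; }
        else                                  { hi = mid; }
    }
    return lo;
}

/* Index of the first term > target. */
static int32_t
lex_upper_bound(const char *const *lexicon, int32_t num_terms,
                const char *target) {
    int32_t lo = 0, hi = num_terms;
    while (lo < hi) {
        int32_t mid = lo + (hi - lo) / 2;
        if (strcmp(lexicon[mid], target) <= 0) { lo = mid + 1; }
        else                                   { hi = mid; }
    }
    return lo;
}

/* Count docs whose value falls in [lower, upper], inclusive.  The two
 * lexicon lookups happen once, up front; the per-document work is just
 * two integer comparisons against the ord. */
static int32_t
count_in_range(const char *const *lexicon, int32_t num_terms,
               const int32_t *ords, int32_t num_docs,
               const char *lower, const char *upper) {
    int32_t min_ord = lex_lower_bound(lexicon, num_terms, lower);
    int32_t max_ord = lex_upper_bound(lexicon, num_terms, upper) - 1;
    int32_t count   = 0;
    for (int32_t doc = 0; doc < num_docs; doc++) {
        if (ords[doc] >= min_ord && ords[doc] <= max_ord) { count++; }
    }
    return count;
}
```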

> > FWIW, the current implementation of Boilerplater only supports two level
> > namespacing (with nicknames).  Outside of core, fully qualified code would
> > look like this:
> >
> >  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
> >  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);
> 
> What are the two levels here?  Level 1 is "StdAnalyzer", and Level 2
> is "new" and "Transform_Text"?

Level 1 is "lucy_".  Level 2 is "StdAnalyzer_".
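
As an illustration of the two levels, here is a sketch in which a file opts in to dropping the first-level prefix.  The SKETCH_USE_SHORT_NAMES switch and the macro aliases are assumptions made for this example; they are not taken from Boilerplater's generated output.

```c
/* Fully qualified names: level 1 is "lucy_", level 2 is "StdAnalyzer_"
 * (the nickname for StandardAnalyzer). */
typedef struct lucy_StandardAnalyzer {
    int placeholder;  /* Stand-in for real members. */
} lucy_StandardAnalyzer;

static lucy_StandardAnalyzer *
lucy_StdAnalyzer_new(void) {
    /* A static instance keeps the sketch free of allocation. */
    static lucy_StandardAnalyzer instance;
    return &instance;
}

/* Hypothetical opt-in: code that defines this switch may omit the
 * first-level prefix and write only the second level. */
#define SKETCH_USE_SHORT_NAMES 1

#ifdef SKETCH_USE_SHORT_NAMES
  #define StandardAnalyzer lucy_StandardAnalyzer
  #define StdAnalyzer_new  lucy_StdAnalyzer_new
#endif
```

Either spelling resolves to the same symbol, which is why the last part of every class name has to be unique.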

> > If we readonly that Hash, we can't add subclasses to it -- and therefore we
> > won't be able to retrieve their deserializers.
> 
> I guess it's only subclasses implemented in C where this is important?
> 
> Because a hosty subclass's deserializer is using/relying on the host's
> namespace to find classes by name.

Within Schema_deserialize, Lucy will have to be able to track down
deserializers for custom subclasses of Analyzer and FieldType.  Same thing
with custom Query subclasses and remote searching.

We either deal with that need in the Lucy core or punt back to the host.

Marvin Humphrey

Re: Types and Schemas (was "Sort cache file format")

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sun, Apr 12, 2009 at 5:04 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>> I think Lucene could continue to merge yet isolate information
>> (subdivision, subclassing).  At least I sure hope so :)
>>
>> > I see why subdividing options might be useful in Lucene, but I'm not
>> > sure it's necessary for Lucy.
>>
>> It's all still hazy to me :) Hopefully once we talk about it enough
>> I'll get some clarity...
>
> Actually, what we probably need are Python bindings so that you can start
> playing around.  :)

That'd be nice but I'm quite hurting for time these days ;)  Sudden
bursts of innovation all over the place...

> I've been trying to clean up Boilerplater enough so that porting
> Boilerplater::Binding::Perl to Boilerplater::Binding::Python would be a
> reasonable undertaking.  Perl's C API and object model are so complicated that
> other languages will probably be a lot easier -- but right now, it's not
> apparent from Boilerplater's API how you would get started.

OK.  It would also be good to have > 1 host language driving the
design... to keep things generic/portable.

>> it is sort of scary that we're inventing a type system.
>
> What's scary is that Java Lucene *has* a type system but won't admit it.

Yah.  In fact Lucene is "weakly typed", like Tcl.  We gleefully,
secretly "merge" one type with another.  I'd be happy to get to strong
but dynamic typing (ie the write once schema).

>> EG there are many things the FieldType should somehow tell us:
>>
>>   * How does FieldSpec model "multi-valued" fields? Is there a
>>     boolean in the base class?
>
> Because Lucy's Doc objects will be hash based, there will *never* be a case
> where the same field has two "values" per se within the same doc.
>
> However, it's fine if we support compound types via specific FieldType
> subclasses, e.g. Float32ArrayType, or StringArrayType.

I see -- does KS support multi-valued (compound) types today?  For
which "types"?  And I imagine for such types, "sortable" is not
allowed (yet "sortable" is set at the top FieldSpec, right?)?

> It's also important to distinguish between "multi-valued" and the
> "multi-token" FullTextType.  FullTextType fields are tokenized within the
> index, but in the context of the doc reader, they only have one string
> "value".  Note, however, that you cannot sort on a FullTextType field in KS.

So if I want to index & sort by "title" field, I make 2 separate fields?

>>   * "Has only one token" -- I guess this is implied by the class (ie
>>     only FullTextType may have > 1 token)
>
> For the near-to-middle-term future, yes -- FullTextType is the only
> multi-token, single-valued type.
>
> Looking down the road, I suppose other types like Int32ArrayType could have
> more than one "token", but it wouldn't be an ordinary string "token".

OK

>>   * Open vs closed (known set of values) enums
>
> It would be nice to add this later.  I don't think it's a high priority, since
> it's an optimization.

You mean you'd start with "open" enums?

>>   * Sortable
>
> I think this belongs in the base class -- that's where KS has it now.  That
> way, we can perform the following test, regardless of what the type is.
>
>   if (FieldType_Sortable(field_type)) {
>        /* Build sort cache. */
>        ...
>   }

Yeah... except multi-valued (compound) types would disable this, I
guess.  Though Lucene users seem to hit this limitation enough to make
it relaxable... and customize how SortCache gets created.

>>   * nulls sort on top or bottom
>
> This would be individual to each sort comparator.  Note that we might want to
> use a different sort comparator for NOT NULL fields for efficiency's sake,
> which complicates making the comparator a method on FieldSpec.

Yes, we're iterating on this now in LUCENE-831.  Though I wonder if
this ought to be the realm of source code specialization...
multiplying out all the combinations of "single comparator or not",
"scoring or not", "track max score or not", "string index may have
nulls or not", in Lucene's "true" sources (vs generated sources)
starts to get crazy.  Soon we'll also multiply in "docIDs guaranteed
to arrive in order to the collector, or not" as well.

> My general inclination is to have NULLs sort towards the end of the array.
>
>>   * Omit norms, omit TFAP
>
> I'm putting this off for now.  It will be addressed when we refactor for
> flexible indexing.

OK.  These would seem to live nicely under FullTextType... oh actually
maybe not, because presumably I can index single-valued fields (the
equivalent of NOT_ANALYZED in Lucene).  EG an Int32Type may in fact be
indexed, and I would at that point want to put omit norms/TFAP there.
Hmmm, cross cutting concerns.  Maybe sub-typing is needed...

>>   * Binary or not (I guess BlobType <-> binary)
>
> BlobType is one binary type, but I propose adding others, e.g. Int32Type.
>
> Binary() should be an abstract method on the base class.  It shouldn't be a
> boolean flag member, because it's not something that can be switched up within
> a class.

OK.
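
A sketch of the distinction in C, with invented names: "binary" is a property of the class, so it is answered by a method that each concrete type supplies, while "stored" is a per-instance setting and stays a boolean member.

```c
#include <stdbool.h>

typedef struct FieldType FieldType;

struct FieldType {
    /* "Abstract method": every concrete type must supply it. */
    bool (*binary)(const FieldType *self);
    /* Per-instance flag that can legitimately differ between fields. */
    bool stored;
};

static bool
BlobType_binary(const FieldType *self) {
    (void)self;
    return true;
}

static bool
Int32Type_binary(const FieldType *self) {
    (void)self;
    return true;
}

static bool
StringType_binary(const FieldType *self) {
    (void)self;
    return false;
}

/* Callers ask the type; nothing can flip the answer on an instance. */
static bool
FieldType_Binary(const FieldType *self) {
    return self->binary(self);
}
```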

>>   * Term vectors or not, positions, offsets
>
> Term vectors are unique to FullTextType, since it is the only multi-token
> field.  Right now in KS, it's a boolean member var in FullTextType.

Single-token indexed fields might want term vectors too?

>>   * Stored or not -- toplevel?
>
> Yes.  As a boolean member.

Makes sense.

>>   * CSF'd or not
>
> Right now, I'd say keep this out of core.

OK, and, merge with sort cache somehow.  For most types they are one
and the same.

>>   * ValueSource is XYZ for this field
>
> I'd like to avoid ValueSource if we can.  I think it's better to add real
> binary types like Int32Type, DateStamp32, and so on -- instead of faking them
> with strings.

Well, that's UninversionValueSource you're thinking of (faking w/
strings).

But, yes, it's not good that ValueSource has type switching internal
to itself... vs. you look up the FieldType for the field and use it
to "switch".

>>   * I will use RangeFilter on this field
>
> The "sortable" boolean member var fills this need, no?

They are different?  Eg you'll add aggregates (Trie*) to your index
for fast range constraints, but for sorting you just need a sort cache
computed.

>>   * Analyzer to use (exposed only FullTextType)
>
> Analyzer should be a required constructor arg to FullTextType.

OK

>>   * Extensibility -- so app can enroll new attrs / make new type
>>     subclasses
>
> So long as the core performs inheritance checks rather than absolute class
> membership checks, subclasses will work fine.

OK.
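
The difference between the two kinds of check, sketched in C.  The VTable layout with a parent pointer is an assumption for the example, and MyTextType is a made-up user subclass.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct VTable {
    const char          *name;
    const struct VTable *parent;  /* NULL at the root of the hierarchy. */
} VTable;

static const VTable FIELDTYPE    = { "FieldType",    NULL };
static const VTable FULLTEXTTYPE = { "FullTextType", &FIELDTYPE };
static const VTable BLOBTYPE     = { "BlobType",     &FIELDTYPE };
/* Hypothetical application-defined subclass. */
static const VTable MYTEXTTYPE   = { "MyTextType",   &FULLTEXTTYPE };

/* Inheritance check: walk up the parent chain looking for `ancestor`. */
static bool
VTable_is_a(const VTable *vt, const VTable *ancestor) {
    for (; vt != NULL; vt = vt->parent) {
        if (vt == ancestor) { return true; }
    }
    return false;
}
```

An absolute membership test such as `vt == &FULLTEXTTYPE` would reject MyTextType, while VTable_is_a() accepts it.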

>> Remind me again: do custom subclasses get enrolled into the global
>> hash in Lucy's core?  I know you had said it's a thread risk, ie, not
>> read only...
>
> Yes.
>
>> I'm still confused.  Say StandardAnalyzer is implemented in C; maybe
>> you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
>> namespaces you put prefixes in front).
>
> FWIW, the current implementation of Boilerplater only supports two level
> namespacing (with nicknames).  Outside of core, fully qualified code would
> look like this:
>
>  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
>  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);

What are the two levels here?  Level 1 is "StdAnalyzer", and Level 2
is "new" and "Transform_Text"?

> One of the constraints the two-level limitation imposes is that the last part
> of every core class name must be unique.  However, it makes for fully
> qualified C names that are just cumbersome rather than unworkably long.

OK

>> Any time something in core wants to use that class, it refers to it by
>> name (and the C compiler/linker maps it), not via the global hash?
>
> For the most part.  A quick once-over of the KS code seems to indicate that
> the exceptions to that rule are all related to Deserialize() and Load().

OK

>> But for deserializing a core object, when the deserializer is
>> implemented in C, I agree you'd need a global lookup; basically
>> because you can't consult the OBJ's symbol table dynamically.  (If you
>> have a hosty deserializer, then it would "import lucy; lucy.XXX" to
>> find its classes).
>>
>> (But it seems like that global hash should be readonly-able).
>
> If we readonly that Hash, we can't add subclasses to it -- and therefore we
> won't be able to retrieve their deserializers.

I guess it's only subclasses implemented in C where this is important?

Because a hosty subclass's deserializer is using/relying on the host's
namespace to find classes by name.

Mike

Re: Types and Schemas (was "Sort cache file format")

Posted by Marvin Humphrey <ma...@rectangular.com>.

> I think Lucene could continue to merge yet isolate information
> (subdivision, subclassing).  At least I sure hope so :)
> 
> > I see why subdividing options might be useful in Lucene, but I'm not
> > sure it's necessary for Lucy.
> 
> It's all still hazy to me :) Hopefully once we talk about it enough
> I'll get some clarity... 

Actually, what we probably need are Python bindings so that you can start
playing around.  :)

I've been trying to clean up Boilerplater enough so that porting
Boilerplater::Binding::Perl to Boilerplater::Binding::Python would be a
reasonable undertaking.  Perl's C API and object model are so complicated that
other languages will probably be a lot easier -- but right now, it's not
apparent from Boilerplater's API how you would get started.

> it is sort of scary that we're inventing a type system.

What's scary is that Java Lucene *has* a type system but won't admit it.

> EG there are many things the FieldType should somehow tell us:
> 
>   * How does FieldSpec model "multi-valued" fields? Is there a
>     boolean in the base class?

Because Lucy's Doc objects will be hash based, there will *never* be a case
where the same field has two "values" per se within the same doc.

However, it's fine if we support compound types via specific FieldType
subclasses, e.g. Float32ArrayType, or StringArrayType.

It's also important to distinguish between "multi-valued" and the
"multi-token" FullTextType.  FullTextType fields are tokenized within the
index, but in the context of the doc reader, they only have one string
"value".  Note, however, that you cannot sort on a FullTextType field in KS.

>   * Must not be null -- base class?

Yes, I think that makes sense.

>   * "Has only one token" -- I guess this is implied by the class (ie
>     only FullTextType may have > 1 token)

For the near-to-middle-term future, yes -- FullTextType is the only
multi-token, single-valued type.

Looking down the road, I suppose other types like Int32ArrayType could have
more than one "token", but it wouldn't be an ordinary string "token".

>   * Open vs closed (known set of values) enums

It would be nice to add this later.  I don't think it's a high priority, since
it's an optimization.

>   * Sortable

I think this belongs in the base class -- that's where KS has it now.  That
way, we can perform the following test, regardless of what the type is.

   if (FieldType_Sortable(field_type)) {
        /* Build sort cache. */
        ...
   }

>   * nulls sort on top or bottom

This would be individual to each sort comparator.  Note that we might want to
use a different sort comparator for NOT NULL fields for efficiency's sake,
which complicates making the comparator a method on FieldSpec.

My general inclination is to have NULLs sort towards the end of the array.  
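
To make the NULL handling concrete, here is a minimal sketch of two
comparators over an array of string values.  All of the names are
hypothetical and nothing here is KS or Lucy API; it only illustrates sorting
NULLs towards the end, plus the cheaper comparator you could pick for a
NOT NULL field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Compare two possibly-NULL strings; NULL sorts after any real value. */
static int
compare_nulls_last(const void *va, const void *vb) {
    const char *a = *(const char *const *)va;
    const char *b = *(const char *const *)vb;
    if (a == NULL && b == NULL) { return 0; }
    if (a == NULL) { return 1; }
    if (b == NULL) { return -1; }
    return strcmp(a, b);
}

/* Cheaper comparator, valid only when the field is declared NOT NULL. */
static int
compare_not_null(const void *va, const void *vb) {
    const char *a = *(const char *const *)va;
    const char *b = *(const char *const *)vb;
    return strcmp(a, b);
}

/* Sort field values in place, choosing the comparator by nullability. */
void
sort_values(const char **values, size_t count, int not_null) {
    assert(values != NULL || count == 0);
    qsort(values, count, sizeof(const char *),
          not_null ? compare_not_null : compare_nulls_last);
}
```

Because the choice of comparator depends on a per-field setting rather than
on the class alone, it is picked at call time here instead of being a fixed
method.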

>   * Omit norms, omit TFAP

I'm putting this off for now.  It will be addressed when we refactor for
flexible indexing.

>   * Binary or not (I guess BlobType <-> binary)

BlobType is one binary type, but I propose adding others, e.g. Int32Type.  

Binary() should be an abstract method on the base class.  It shouldn't be a
boolean flag member, because it's not something that can be switched up within
a class.
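
A rough sketch of what that could look like in plain C, assuming a
hand-rolled vtable.  The struct layout and every name below are
illustrative only, not what Boilerplater actually generates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct FieldType FieldType;

/* Hypothetical vtable: one slot per overridable method. */
typedef struct FieldTypeVTable {
    const char *class_name;
    bool (*binary)(const FieldType *self);  /* abstract in the base class */
} FieldTypeVTable;

/* Base class: only the settings that every field type shares. */
struct FieldType {
    const FieldTypeVTable *vtable;
    bool indexed;
    bool stored;
    bool sortable;
};

/* Each subclass answers Binary() once, for the whole class. */
static bool BlobType_binary(const FieldType *self)   { (void)self; return true; }
static bool StringType_binary(const FieldType *self) { (void)self; return false; }

const FieldTypeVTable BLOBTYPE   = { "BlobType",   BlobType_binary };
const FieldTypeVTable STRINGTYPE = { "StringType", StringType_binary };

/* Dispatch through the vtable; a missing override is a programming error. */
bool
FieldType_Binary(const FieldType *self) {
    assert(self != NULL && self->vtable != NULL);
    assert(self->vtable->binary != NULL);
    return self->vtable->binary(self);
}
```

Since the answer lives in the vtable rather than in the object, no instance
can flip it, which is the point of making it a method instead of a flag.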

>   * Term vectors or not, positions, offsets

Term vectors are unique to FullTextType, since it is the only multi-token
field.  Right now in KS, it's a boolean member var in FullTextType.

>   * Stored or not -- toplevel?

Yes.  As a boolean member.

>   * CSF'd or not

Right now, I'd say keep this out of core.

>   * ValueSource is XYZ for this field

I'd like to avoid ValueSource if we can.  I think it's better to add real
binary types like Int32Type, DateStamp32, and so on -- instead of faking them
with strings.

>   * I will use RangeFilter on this field

The "sortable" boolean member var fills this need, no?

>   * Analyzer to use (exposed only FullTextType)

Analyzer should be a required constructor arg to FullTextType.
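
As a minimal sketch, assuming simplified stand-in structs (the Analyzer
struct, the default flag values, and the function names are all made up for
illustration), a constructor that refuses to build a FullTextType without an
Analyzer might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a real Analyzer class. */
typedef struct Analyzer { const char *language; } Analyzer;

/* Base class knows nothing about Analyzers. */
typedef struct FieldType {
    bool indexed;
    bool stored;
    bool sortable;
} FieldType;

typedef struct FullTextType {
    FieldType base;            /* first member, so the base view is valid */
    const Analyzer *analyzer;  /* required */
    bool term_vectors;
} FullTextType;

/* Returns NULL if no analyzer is supplied or allocation fails. */
FullTextType *
FullTextType_new(const Analyzer *analyzer) {
    if (analyzer == NULL) { return NULL; }
    FullTextType *self = malloc(sizeof *self);
    if (self == NULL) { return NULL; }
    self->base.indexed  = true;   /* assumed default */
    self->base.stored   = true;   /* assumed default */
    self->base.sortable = false;  /* full text fields are not sortable */
    self->analyzer      = analyzer;
    self->term_vectors  = false;  /* assumed default */
    return self;
}

const Analyzer *
FullTextType_get_analyzer(const FullTextType *self) {
    assert(self != NULL);
    return self->analyzer;
}
```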

>   * Extensibility -- so app can enroll new attrs / make new type
>     subclasses

So long as the core performs inheritance checks rather than absolute class
membership checks, subclasses will work fine.
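
Roughly, the difference is between comparing vtable pointers for equality
and walking the parent chain.  A minimal sketch, with a hypothetical VTable
struct and a made-up user subclass MyTextType:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct VTable {
    const char *name;
    const struct VTable *parent;  /* NULL for the root class */
} VTable;

const VTable FIELDTYPE    = { "FieldType",    NULL };
const VTable FULLTEXTTYPE = { "FullTextType", &FIELDTYPE };
const VTable MYTEXTTYPE   = { "MyTextType",   &FULLTEXTTYPE };  /* user subclass */

/* True if `vtable` is `ancestor` or descends from it. */
bool
VTable_is_a(const VTable *vtable, const VTable *ancestor) {
    assert(ancestor != NULL);
    for (; vtable != NULL; vtable = vtable->parent) {
        if (vtable == ancestor) { return true; }
    }
    return false;
}
```

A core check written as "vtable == &FULLTEXTTYPE" would reject MyTextType;
VTable_is_a(vtable, &FULLTEXTTYPE) accepts it.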

> Remind me again: do custom subclasses get enrolled into the global
> hash in Lucy's core?  I know you had said it's a thread risk, ie, not
> read only...

Yes.

> I'm still confused.  Say StandardAnalyzer is implemented in C; maybe
> you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
> namespaces you put prefixes in front).

FWIW, the current implementation of Boilerplater only supports two level
namespacing (with nicknames).  Outside of core, fully qualified code would
look like this:

  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(analyzer, charbuf);

One of the constraints the two-level limitation imposes is that the last part
of every core class name must be unique.  However, it makes for fully
qualified C names that are just cumbersome rather than unworkably long.

> Any time something in core wants to use that class, it refers to it by
> name (and the C compiler/linker maps it), not via the global hash?

For the most part.  A quick once-over of the KS code seems to indicate that
the exceptions to that rule are all related to Deserialize() and Load().

> But for deserializing a core object, when the deserializer is
> implemented in C, I agree you'd need a global lookup; basically
> because you can't consult the OBJ's symbol table dynamically.  (If you
> have a hosty deserializer, then it would "import lucy; lucy.XXX" to
> find its classes).
> 
> (But it seems like that global hash should be readonly-able).

If we readonly that Hash, we can't add subclasses to it -- and therefore we
won't be able to retrieve their deserializers.
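
To illustrate why the table has to stay writable, here is a minimal sketch
of a name -> deserializer registry.  It uses a fixed-size array instead of a
real Hash, does no locking (hence the thread risk), and every name in it is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A deserializer turns a dumped representation back into an object. */
typedef void *(*load_fn)(const char *dump);

#define MAX_CLASSES 64

static struct { const char *class_name; load_fn load; } registry[MAX_CLASSES];
static size_t registry_size = 0;

/* Look up a deserializer by class name; NULL means "class not found". */
load_fn
registry_fetch(const char *class_name) {
    assert(class_name != NULL);
    for (size_t i = 0; i < registry_size; i++) {
        if (strcmp(registry[i].class_name, class_name) == 0) {
            return registry[i].load;
        }
    }
    return NULL;
}

/* Enroll a class.  Returns 0 on success, -1 if full or already present. */
int
registry_add(const char *class_name, load_fn load) {
    assert(class_name != NULL && load != NULL);
    if (registry_size == MAX_CLASSES) { return -1; }
    if (registry_fetch(class_name) != NULL) { return -1; }
    registry[registry_size].class_name = class_name;
    registry[registry_size].load = load;
    registry_size++;
    return 0;
}

/* Placeholder deserializer for a user-defined subclass. */
void *
example_load(const char *dump) {
    return (void *)dump;
}
```

Freezing the table after core startup would make registry_add() fail for
every user subclass, so registry_fetch() would report "class not found" at
load time.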

Marvin Humphrey

Re: Types and Schemas (was "Sort cache file format")

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sun, Apr 12, 2009 at 9:08 AM, Marvin Humphrey <ma...@rectangular.com> wrote:

>> Does FieldSpec sub divide the options?  Eg options about indexing
>> could live in its own class, with commonly used constants like "NO".
>>
>> This was the motivation of that comment in Lucene (the fact that we
>> don't subdivide means suddenly stored only fields have to figure out
>> what to do with omitNorms, omitTFAP booleans; if we had Field.Index.NO
>> that'd be better).
>
> Right now, FieldSpec doesn't subdivide, but it's not a least common
> denominator, either.  To illustrate: FieldSpec has boolean members for
> "indexed", "stored", and "sortable", but knows nothing about Analyzers.
> Analyzers are the exclusive province of the FullTextField subclass.

OK.

> If you don't permit automatic merging of field types, then there isn't a need
> for FieldSpec to know everything about all its subclasses.

I think Lucene could continue to merge yet isolate information
(subdivision, subclassing).  At least I sure hope so :)

> I see why subdividing options might be useful in Lucene, but I'm not
> sure it's necessary for Lucy.

It's all still hazy to me :) Hopefully once we talk about it enough
I'll get some clarity... it is sort of scary that we're inventing a
type system.

EG there are many things the FieldType should somehow tell us:

  * How does FieldSpec model "multi-valued" fields?  Is there a
    boolean in the base class?

  * Must not be null -- base class?

  * "Has only one token" -- I guess this is implied by the class (ie
    only FullTextType may have > 1 token)

  * Open vs closed (known set of values) enums

  * Sortable

  * nulls sort on top or bottom

  * Omit norms, omit TFAP

  * Binary or not (I guess BlobType <-> binary)

  * Term vectors or not, positions, offsets

  * Stored or not -- toplevel?

  * CSF'd or not

  * ValueSource is XYZ for this field

  * I will use RangeFilter on this field

  * Analyzer to use (exposed only FullTextType)

  * Extensibility -- so app can enroll new attrs / make new type
    subclasses

> I think it's better OO design for the parent class to be simple rather than
> comprehensive.
>
>> Well, in Lucene we could better decouple a Field's value from its
>> "extended type".  The type would still be attached to the Field's
>> value (not to the global schema as in KS), but strongly decoupled &
>> shared across Field instances.
>
> That makes sense.  The "extended type" class could look almost identical, but
> in Lucene the user would make the connection directly, while in Lucy it
> would be made indirectly via the field name.

Right.

>> > Dump them to a JSON-izable data structure.  Include the class name so that you
>> > can pick a deserialization routine at load time.
>>
>> You rely on the same namespace -> obj mapping being present at
>> deserialize time?  Ie it's the caller's responsibility to import the
>> same modules, ensure the names "map" to the same objs (or at least
>> compatible ones) as were used during serialization, etc.
>
> If the user has implemented custom subclasses, then yes, the subclasses must be
> loaded or you'll get a "class not found" error.

OK just like unpickling in Python...

Remind me again: do custom subclasses get enrolled into the global
hash in Lucy's core?  I know you had said it's a thread risk, ie, not
read only...

>> Though, for core objects, you would use the global name -> vtable
>> mapping that Lucy core maintains?
>
> Yes.  Any core class would already be loaded.
>
>> (I still don't fully understand why Lucy needs that global hash -- this is
>> what namespaces are for).
>
> If we didn't implement it internally, we'd need to implement it in the
> bindings for e.g. looking up deserialization routines.  Furthermore, we need
> some mechanism for C-level subclassing, since that's not part of the C
> language.  No namespaces there.  :)

I'm still confused.  Say StandardAnalyzer is implemented in C; maybe
you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
namespaces you put prefixes in front).

Any time something in core wants to use that class, it refers to it by
name (and the C compiler/linker maps it), not via the global hash?

But for deserializing a core object, when the deserializer is
implemented in C, I agree you'd need a global lookup; basically
because you can't consult the OBJ's symbol table dynamically.  (If you
have a hosty deserializer, then it would "import lucy; lucy.XXX" to
find its classes).

(But it seems like that global hash should be readonly-able).

>> OK, so if I've made a custom Tokenizer doing some funky Python code
>> instead of a regexp, I could simply implement dump/load to do the
>> right thing.
>
> Yes.
>
> BTW, I saw that Earwin Burrfoot calls his type class "FieldType".
>
> "FieldType" is probably a better name than "FieldSpec", as it implies
> subclasses with "Type" as a suffix: FullTextType, StringType, BlobType,
> Int32Type, etc.

Agreed.

Mike