Posted to dev@lucene.apache.org by Nikola Tanković <ni...@gmail.com> on 2011/05/09 21:45:19 UTC

Re: GSoC: LUCENE-2308: Separately specify a field's type

My answers are inline.

2011/4/14 Michael McCandless <lu...@mikemccandless.com>

> 2011/4/13 Nikola Tanković <ni...@gmail.com>:
> > Hi all,
> > if everything goes well I'll be delighted to be part of this project this
> > summer together with my assigned mentor Mike. My task will be to introduce
> > new classes to Lucene core that separate a Field's Lucene
> > properties from its value
> > (https://issues.apache.org/jira/browse/LUCENE-2308).
>
> Welcome Nikola!
>
> > Changes will include:
> >
> > Introduction of a FieldType class that will hold all the extra properties
> > now stored inside a Field instance, other than the field value itself.
>
> Seems like this is an easy first baby step -- leave current Field
> class, but break out the "type" details into a separate class that can
> be shared across Field instances.
>

Yes, I agree, this could be a good first step. Mike submitted a patch on
issue #2308. I think it's a solid base for this.


>
> > A new FieldTypeAttribute interface will be added to handle extension with new
> > field properties, inspired by IndexWriterConfig.
>
> How would this work?  What's an example compelling usage?  An app
> could use this for extensibility, and then make a matching codec that
> picks up this attr?  EG, say, maybe for marking that a field is a
> "primary key field" and then codec could optimize accordingly...?
>

Well, that could be a very interesting scenario. Possible codec usage hadn't
occurred to me, but it seems very reasonable. Otherwise attributes
don't make much sense, unless properly used in custom codecs.

How will we ensure attribute and codec compatibility?

> > Refactoring and dividing of settings for term frequency and positioning can
> > also be done (LUCENE-2048)
>
> Ahh great!  So we can omit-positions-but-not-TF.
>
> > Discuss possible effects of completion of LUCENE-2310 on this project
>
> This one is badly needed... but we should keep your project focused.
>

We'll tackle this one afterwards.


> > Adequate Factory class for easier configuration of new Field instances
> > together with manually added new FieldTypeAttributes
> > FieldType, once instantiated, is read-only. Only the field's value can be
> > changed.
>
> OK.
>
> > Simple hierarchy of Field classes with core properties logically
> > predefaulted. E.g.:
> >
> > NumberField,
>
> Can't this just be our existing NumericField?
>

Yes, this is the classic NumericField with the changes proposed in LUCENE-2310.
Tim Smith mentioned that Fieldable should be kept for custom
implementations, to reduce the number of setters (for defaults).
Chris Male suggested a new CoreFieldTypeAttribute interface, so maybe custom
implementations should implement that instead of Fieldable, and then
neither Fieldable nor AbstractField would be needed anymore.
In my opinion Field should become abstract and be extended by the others.

Another proposal: how about keeping only Field (with no hierarchy) and moving
the hierarchy to FieldType, such as NumericFieldType and StringFieldType, since
this hierarchy concerns type information only?

e.g. Usage:

FieldType number = new NumericFieldType();
Field price = new Field();
price.setType(number);

// but this is much cleaner...

Field price = new NumericField();

so maybe we should have a parallel XYZField for each XYZFieldType...

Am I overcomplicating this?


> > StringField,
>
> This would be like NOT_ANALYZED?
>

Yes, strings are often one word only. Or maybe we can name it NameField,
NonAnalyzedField or something.


>
> > TextField,
>
> This would be ANALYZED?
>

Yes.


>
> > NonIndexedField,
>
> This would be only stored?
>
> > My questions and issues:
> >
> > Backward compatibility? Will this go to Lucene 3.0?
>
> Maybe focus on 4.0 for starters and then if there's a nice backport we
> can do that...?
>

OK, that also seems reasonable.


>
> > What is the best way to break this into small baby steps?
>
> Hopefully this becomes clearer as we iterate.
>

Well, we know the first step: moving type details into FieldType class.
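
A minimal sketch of what that first step could look like. The class shapes,
constructor arguments, and accessor names below are assumptions made for
illustration and are not taken from the LUCENE-2308 patch. It only shows the
idea discussed above: the type details live in a read-only object that several
Field instances share, and only the value can change.

```java
// Hypothetical sketch; not the API from the LUCENE-2308 patch.

// Holds the per-field settings. Immutable, so one instance can be shared safely.
final class FieldType {
    private final boolean indexed;
    private final boolean stored;
    private final boolean tokenized;

    FieldType(boolean indexed, boolean stored, boolean tokenized) {
        this.indexed = indexed;
        this.stored = stored;
        this.tokenized = tokenized;
    }

    boolean isIndexed() { return indexed; }
    boolean isStored() { return stored; }
    boolean isTokenized() { return tokenized; }
}

// Holds only the name, the value, and a reference to the shared type.
final class Field {
    private final String name;
    private final FieldType type;
    private String value; // the only part that may change

    Field(String name, String value, FieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }

    String name() { return name; }
    FieldType type() { return type; }
    String value() { return value; }
    void setValue(String value) { this.value = value; }
}

class FieldTypeSketch {
    public static void main(String[] args) {
        // One type object, configured once: indexed, not stored, tokenized.
        FieldType text = new FieldType(true, false, true);

        Field title = new Field("title", "hello", text);
        Field body = new Field("body", "some longer text", text);

        // Both fields point at the same FieldType instance.
        System.out.println(title.type() == body.type());

        // Changing a value leaves the shared type untouched.
        title.setValue("another title");
        System.out.println(title.value() + " indexed=" + title.type().isIndexed());
    }
}
```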


>
> Mike
>

Re: GSoC: LUCENE-2308: Separately specify a field's type

Posted by Chris Male <ge...@gmail.com>.

2011/5/14 Nikola Tanković <ni...@gmail.com>

> 2011/5/12 Michael McCandless <lu...@mikemccandless.com>
>
>> 2011/5/9 Nikola Tanković <ni...@gmail.com>:
>>
>>
>> >> > Introduction of an FieldType class that will hold all the extra
>> >> > properties
>> >> > now stored inside Field instance other than field value itself.
>> >>
>> >> Seems like this is an easy first baby step -- leave current Field
>> >> class, but break out the "type" details into a separate class that can
>> >> be shared across Field instances.
>> >
>> > Yes, I agree, this could be a good first step. Mike submitted a patch on
>> > issue #2308. I think it's a solid base for this.
>>
>> Make that Chris.
>>
>
> Ouch, sorry!
>
>
>>
>> >> > New FieldTypeAttribute interface will be added to handle extension
>> with
>> >> > new
>> >> > field properties inspired by IndexWriterConfig.
>> >>
>> >> How would this work?  What's an example compelling usage?  An app
>> >> could use this for extensibility, and then make a matching codec that
>> >> picks up this attr?  EG, say, maybe for marking that a field is a
>> >> "primary key field" and then codec could optimize accordingly...?
>> >
>> > Well, that could be a very interesting scenario. Possible codec usage
>> > hadn't occurred to me, but it seems very reasonable. Otherwise attributes
>> > don't make much sense, unless properly used in custom codecs.
>> >
>> > How will we ensure attribute and codec compatibility?
>>
>> I'm just thinking we should have concrete reasons in mind for cutting
>> over to attributes here... I'd rather see a fixed, well thought out
>> concrete FieldType hierarchy first...
>>
>
> Yes, I couldn't agree more, and I also think Chris has some great ideas in
> this area, given his work on spatial indexing, which tends to make use of
> these additional attributes.
>

I think Attributes should be used sparingly, but I do think they make sense.
I do use a similar idea in some spatial work where different fields have
different requirements but need to work with the same set of strategies.  I
feel this is metadata and doesn't belong in an extension to Field.  But
equally it's not 'core' to FieldType either, which is why I added the
FieldTypeAttribute idea.

In the end I feel we should provide maximum flexibility here, especially if
we are going to move over to a more minimal API for the indexer.  We need to
allow custom extensions to FieldType, and I'm not sure that having 'instanceof'
statements every time I need to do something specific to a subtype is the best
way to go.
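
To make the contrast with 'instanceof' concrete, here is a rough sketch of how
an attribute lookup might work. Everything in it is hypothetical:
FieldTypeAttribute is only a proposed name in this thread, and
PrecisionStepAttribute, the varargs constructor, and getAttribute are invented
for the example. The point is that a consumer asks the type for the metadata it
cares about, rather than testing for a particular FieldType subclass.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical marker interface for optional, non-core metadata on a FieldType.
interface FieldTypeAttribute {}

// Hypothetical attribute carrying a numeric precision step.
final class PrecisionStepAttribute implements FieldTypeAttribute {
    final int precisionStep;

    PrecisionStepAttribute(int precisionStep) {
        this.precisionStep = precisionStep;
    }
}

// A single FieldType class; extensions are attached as attributes, not subclasses.
final class FieldType {
    private final Map<Class<? extends FieldTypeAttribute>, FieldTypeAttribute> attributes;

    FieldType(FieldTypeAttribute... attrs) {
        Map<Class<? extends FieldTypeAttribute>, FieldTypeAttribute> map = new HashMap<>();
        for (FieldTypeAttribute a : attrs) {
            map.put(a.getClass(), a); // keyed by the attribute's concrete class
        }
        this.attributes = Collections.unmodifiableMap(map); // read-only after construction
    }

    // Returns the attribute registered under the given class, or null if absent.
    <A extends FieldTypeAttribute> A getAttribute(Class<A> key) {
        return key.cast(attributes.get(key));
    }
}

class AttributeSketch {
    public static void main(String[] args) {
        FieldType numeric = new FieldType(new PrecisionStepAttribute(4));
        FieldType plain = new FieldType();

        // No instanceof check on the FieldType itself: just ask for the attribute.
        PrecisionStepAttribute step = numeric.getAttribute(PrecisionStepAttribute.class);
        System.out.println(step.precisionStep);
        System.out.println(plain.getAttribute(PrecisionStepAttribute.class));
    }
}
```

The trade-off is that a missing attribute shows up as a null at runtime rather
than as a compile-time type, which is one reason to use attributes sparingly.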


>
>
>>
>> >> > Refactoring and dividing of settings for term frequency and
>> positioning
>> >> > can
>> >> > also be done (LUCENE-2048)
>> >>
>> >> Ahh great!  So we can omit-positions-but-not-TF.
>> >>
>> >> > Discuss possible effects of completion of LUCENE-2310 on this project
>> >>
>> >> This one is badly needed... but we should keep your project focused.
>> >
>> >
>> > We'll tackle this one afterwards.
>>
>> Good.
>>
>>
>> >> > Adequate Factory class for easier configuration of new Field
>> instances
>> >> > together with manually added new FieldTypeAttributes
>> >> > FieldType, once instantiated is read-only. Only fields value can be
>> >> > changed.
>> >>
>> >> OK.
>> >>
>> >> > Simple hierarchy of Field classes with core properties logically
>> >> > predefaulted. E.g.:
>> >> >
>> >> > NumberField,
>> >>
>> >> Can't this just be our existing NumericField?
>> >
>> > Yes, this is classic NumericField with changes proposed in LUCENE-2310.
>> Tim
>> > Smith mentioned that Fieldable class should be kept for custom
>> > implementations to reduce number of setters (for defaults).
>> > Chris Male suggested new CoreFieldTypeAttribute interface, so maybe it
>> > should be implemented instead of Fieldable for custom implementations,
>> so
>> > both Fieldable and AbstractField are not needed anymore.
>> > In my opinion Field should become abstract and be extended by the others.
>> > Another proposal: how about keeping only Field (with no hierarchy) and
>> move
>> > hierarchy to FieldType, such as NumericFieldType, StringFieldType since
>> this
>> > hierarchy concerns type information only?
>>
>> I think hierarchy of both types and the "value containers" that hold
>> the corresponding values could make sense?
>>
>
> Hmm, I think we should get more opinions on this one also.
>

I'm unsure about this.  What information would a StringFieldType have over a
NumericFieldType? I can imagine NumericFieldType maybe having a precision
step.  Couldn't that be an Attribute?  I can see the benefit of a
StringField though, and a NumericField, since they provide different
implementations of the same fundamental needs of a Field: its name, its
value, its type and its tokenstream.  I think we should use hierarchies
sparingly as well, since really we want to make this as simple as possible.
But we should also keep our eye on those fundamental needs of the indexer.


>
>
>>
>> > e.g. Usage:
>> > FieldType number = new NumericFieldType();
>> > Field price = new Field();
>> > price.setType(number);
>> > // but this is much cleaner...
>> > Field price = new NumericField();
>> > so maybe we should have a parallel XYZField for each XYZFieldType...
>> > Am I overcomplicating this?
>> >>
>> >> > StringField,
>> >>
>> >> This would be like NOT_ANALYZED?
>> >
>> > Yes, strings are often one word only. Or maybe we can name it NameField,
>> > NonAnalyzedField or something.
>>
>> StringField sounds good actually...
>>
>>
>> >> > TextField,
>> >>
>> >> This would be ANALYZED?
>> >
>> > Yes.
>> >
>>
>> OK.
>>
>>
>> >> > What is the best way to break this into small baby steps?
>> >>
>> >> Hopefully this becomes clearer as we iterate.
>> >
>> > Well, we know the first step: moving type details into FieldType class.
>>
>> Yes!
>>
>> Somehow tying into this as well is a stronger decoupling of the
>> indexer from analysis/document.  Ie, what indexer needs of a document
>> is very minimal -- just an iterable over indexed & stored values.
>> Separately we can still provide a "full featured" Document class w/
>> add, get, remove, etc., but that's "outside" of the indexer.
>>
>
> I'll get back to this one after additional research. Maybe we should do
> a couple more rounds, then I'll summarize the conclusions.
>

I've been going back and forth in my mind about whether it's best to
pursue this idea as part of the FieldType changes or after FieldType.  My plan
has always been to move the indexer over to an Indexable abstraction which
returns a list of Fields (nothing more).  Field would then be simple (name,
value, type, tokenstream).  Document and a full suite of user-friendly
classes can then go 'outside' as Mike says.
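
A rough sketch of that split, under stated assumptions: Indexable is the name
used above, but the method names, the cut-down Field (name and value only, with
type and tokenstream left out for brevity), and the toy Indexer are all
invented for illustration. The indexer sees nothing but a list of fields, while
the friendlier add/get/remove API lives on a Document class outside it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, cut-down field: type and tokenstream are omitted in this sketch.
final class Field {
    final String name;
    final String value;

    Field(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// The only thing the indexer needs from a document: its fields.
interface Indexable {
    List<Field> fields();
}

// User-friendly container that lives outside the indexer.
final class Document implements Indexable {
    private final List<Field> fields = new ArrayList<>();

    void add(Field field) { fields.add(field); }

    // Returns the first field with the given name, or null if there is none.
    Field get(String name) {
        for (Field f : fields) {
            if (f.name.equals(name)) return f;
        }
        return null;
    }

    // Removes every field with the given name; true if anything was removed.
    boolean remove(String name) {
        return fields.removeIf(f -> f.name.equals(name));
    }

    @Override
    public List<Field> fields() {
        return Collections.unmodifiableList(fields);
    }
}

// Toy indexer: depends on Indexable only, never on Document.
final class Indexer {
    static List<String> index(Indexable doc) {
        List<String> seen = new ArrayList<>();
        for (Field f : doc.fields()) {
            seen.add(f.name + "=" + f.value);
        }
        return seen;
    }
}

class IndexableSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new Field("id", "42"));
        doc.add(new Field("title", "hello"));
        System.out.println(Indexer.index(doc));
    }
}
```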

The advantage of pursuing this now is that we can get the messy classes of
Document and Field* out of the core, where they can be improved without fear
of impacting the indexer.  At the same time there is a chicken-and-egg
situation: we need FieldType working before we can cut over.


>
>
>>
>> Mike
>>
>> http://blog.mikemccandless.com
>
>
> Nikola
>
>


-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl

Re: GSoC: LUCENE-2308: Separately specify a field's type

Posted by Nikola Tanković <ni...@gmail.com>.

2011/5/12 Michael McCandless <lu...@mikemccandless.com>

> 2011/5/9 Nikola Tanković <ni...@gmail.com>:
>
> >> > Introduction of an FieldType class that will hold all the extra
> >> > properties
> >> > now stored inside Field instance other than field value itself.
> >>
> >> Seems like this is an easy first baby step -- leave current Field
> >> class, but break out the "type" details into a separate class that can
> >> be shared across Field instances.
> >
> > Yes, I agree, this could be a good first step. Mike submitted a patch on
> > issue #2308. I think it's a solid base for this.
>
> Make that Chris.
>

Ouch, sorry!


>
> >> > New FieldTypeAttribute interface will be added to handle extension
> with
> >> > new
> >> > field properties inspired by IndexWriterConfig.
> >>
> >> How would this work?  What's an example compelling usage?  An app
> >> could use this for extensibility, and then make a matching codec that
> >> picks up this attr?  EG, say, maybe for marking that a field is a
> >> "primary key field" and then codec could optimize accordingly...?
> >
> > Well, that could be a very interesting scenario. Possible codec usage
> > hadn't occurred to me, but it seems very reasonable. Otherwise attributes
> > don't make much sense, unless properly used in custom codecs.
> >
> > How will we ensure attribute and codec compatibility?
>
> I'm just thinking we should have concrete reasons in mind for cutting
> over to attributes here... I'd rather see a fixed, well thought out
> concrete FieldType hierarchy first...
>

Yes, I couldn't agree more, and I also think Chris has some great ideas in
this area, given his work on spatial indexing, which tends to make use of
these additional attributes.


>
> >> > Refactoring and dividing of settings for term frequency and
> positioning
> >> > can
> >> > also be done (LUCENE-2048)
> >>
> >> Ahh great!  So we can omit-positions-but-not-TF.
> >>
> >> > Discuss possible effects of completion of LUCENE-2310 on this project
> >>
> >> This one is badly needed... but we should keep your project focused.
> >
> >
> > We'll tackle this one afterwards.
>
> Good.
>
> >> > Adequate Factory class for easier configuration of new Field instances
> >> > together with manually added new FieldTypeAttributes
> >> > FieldType, once instantiated is read-only. Only fields value can be
> >> > changed.
> >>
> >> OK.
> >>
> >> > Simple hierarchy of Field classes with core properties logically
> >> > predefaulted. E.g.:
> >> >
> >> > NumberField,
> >>
> >> Can't this just be our existing NumericField?
> >
> > Yes, this is classic NumericField with changes proposed in LUCENE-2310.
> Tim
> > Smith mentioned that Fieldable class should be kept for custom
> > implementations to reduce number of setters (for defaults).
> > Chris Male suggested new CoreFieldTypeAttribute interface, so maybe it
> > should be implemented instead of Fieldable for custom implementations, so
> > both Fieldable and AbstractField are not needed anymore.
> > In my opinion Field should become abstract and be extended by the others.
> > Another proposal: how about keeping only Field (with no hierarchy) and
> move
> > hierarchy to FieldType, such as NumericFieldType, StringFieldType since
> this
> > hierarchy concerns type information only?
>
> I think hierarchy of both types and the "value containers" that hold
> the corresponding values could make sense?
>

Hmm, I think we should get more opinions on this one also.


>
> > e.g. Usage:
> > FieldType number = new NumericFieldType();
> > Field price = new Field();
> > price.setType(number);
> > // but this is much cleaner...
> > Field price = new NumericField();
> > so maybe we should have a parallel XYZField for each XYZFieldType...
> > Am I overcomplicating this?
> >>
> >> > StringField,
> >>
> >> This would be like NOT_ANALYZED?
> >
> > Yes, strings are often one word only. Or maybe we can name it NameField,
> > NonAnalyzedField or something.
>
> StringField sounds good actually...
>
> >> > TextField,
> >>
> >> This would be ANALYZED?
> >
> > Yes.
> >
>
> OK.
>
> >> > What is the best way to break this into small baby steps?
> >>
> >> Hopefully this becomes clearer as we iterate.
> >
> > Well, we know the first step: moving type details into FieldType class.
>
> Yes!
>
> Somehow tying into this as well is a stronger decoupling of the
> indexer from analysis/document.  Ie, what indexer needs of a document
> is very minimal -- just an iterable over indexed & stored values.
> Separately we can still provide a "full featured" Document class w/
> add, get, remove, etc., but that's "outside" of the indexer.
>

I'll get back to this one after additional research. Maybe we should do
a couple more rounds, then I'll summarize the conclusions.


>
> Mike
>
> http://blog.mikemccandless.com


Nikola