Posted to solr-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/06/30 16:22:47 UTC

Extending Field and FieldType properties

Currently, FieldType throws a RuntimeException if there are any  
"extra" properties in the configuration.  I think SchemaField does  
something similar.

I'd like to consider not doing this.  My main case is that I want to be
able to store semantic information about the FieldType with the
FieldType.  Doing this now requires creating a whole separate object
model that overlays the FieldType and stores the information elsewhere
(e.g. in a DB).  For example, if you want to denote what language a given
field type supports, you have to store this information elsewhere, when
it could easily be seen as a property of the FieldType.  I think right
now, people often rely on naming conventions to convey this, such as
text_zh or text_it, and that doesn't extend very well, IMO.  These new
attributes would allow applications to make use of richer semantics for
FieldType w/o harming Solr in any way (I think).

From the looks of it, FieldType has all the functionality already
built in, minus a few lines where the exception is thrown if there are  
"extra" attributes.

I think a similar argument can be made for SchemaField as well (and  
probably other things like RequestHandler, etc. but "baby steps" first)

Any thoughts/objections?


-Grant

Re: Extending Field and FieldType properties

Posted by Chris Hostetter <ho...@fucit.org>.

: You make a good point about the countless hours debugging.  On the flip side,
: one could ask the question as to whether the Solr schema is stable enough that
: we should publish an XML Schema for it, thus helping alleviate some of the
: pain.

it's pretty much impossible since "Map" initialized plugins like FieldType,
TokenizerFactory, TokenFilterFactory can have user created subclasses that
accept arbitrary XML attributes.  We *could* say "Here's the 'base' XSD
for schema.xml, and if you define your own schema plugins (or use someone
else's plugins) you have to specify your own schema that extends the base"
... except that right now a plugin could accept any XML attribute and do
interesting things with it (i.e. iterate over all the keys in the map no
matter what they are) and that wouldn't be expressible in an XSD.  (as far
as i can tell anyway)
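
To illustrate the kind of plugin described above, here is a minimal sketch of one that accepts whatever attributes it is handed. OpenEndedPlugin and its methods are hypothetical names made up for this example, not part of Solr; the point is only that the set of valid attribute names is not known ahead of time.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plugin; the class name and behaviour are illustrative only.
class OpenEndedPlugin {
    // Sorted map so the stored settings have a stable iteration order.
    private final Map<String, String> settings = new TreeMap<>();

    // Accepts every attribute it is given, whatever the key is called.
    void init(Map<String, String> args) {
        for (Map.Entry<String, String> e : args.entrySet()) {
            settings.put(e.getKey(), e.getValue());
        }
    }

    // Returns the stored settings in key order, e.g. "{a=1, b=2}".
    String describe() {
        return settings.toString();
    }
}
```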


-Hoss

Re: Extending Field and FieldType properties

Posted by "J.J. Larrea" <jj...@panix.com>.

At 5:12 PM -0400 7/1/08, Grant Ingersoll wrote:
>You make a good point about the countless hours debugging.  On the flip side, one could ask the question as to whether the Solr schema is stable enough that we should publish an XML Schema for it, thus helping alleviate some of the pain.

That's a very good point: a lot of the internal code-based validation of the .xml configuration files could be replaced by parse-time validation, and with a well-defined .xsd the schema itself would be user-extensible/restrictable.

>More below...

-- snip --

>This seems a bit clunky to me, syntax-wise, but the idea seems right.    I suppose another option is that I could just extend the FieldType and have it look for my own attributes.

Well, for a specific field type there's already the init(...) method, designed to allow subclasses to parse and remove attributes before the bad-argument test, e.g. as done in CompressibleField.
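
The subclass approach can be sketched roughly as follows. BaseType and LanguageAwareType are hypothetical stand-ins written for this example (they are not Solr's FieldType or CompressibleField), and the "lang" attribute and its "en" default are assumptions; the sketch only shows the order of operations: the subclass hook removes what it understands, then the leftovers are rejected.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class mimicking the "consume, then reject leftovers" flow.
class BaseType {
    final void setArgs(Map<String, String> config) {
        Map<String, String> initArgs = new HashMap<>(config); // work on a copy
        initArgs.remove("name");
        init(initArgs);                 // subclass hook runs first
        if (!initArgs.isEmpty()) {      // bad-argument test runs last
            throw new RuntimeException("unknown attributes: " + initArgs.keySet());
        }
    }

    // Subclasses remove the attributes they understand; the base claims none.
    protected void init(Map<String, String> initArgs) { }
}

// Hypothetical subclass that claims a "lang" attribute for itself.
class LanguageAwareType extends BaseType {
    String lang = "en"; // assumed default, for illustration only

    @Override
    protected void init(Map<String, String> initArgs) {
        String value = initArgs.remove("lang");
        if (value != null) {
            lang = value;
        }
    }
}
```

With this shape, lang="zh" is accepted on a LanguageAwareType but still rejected on a plain BaseType, so misspelled attributes keep producing an error.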

Where this won't work without a user-extensible dictionary is if one wants a new attribute across all field types.  I did, and so had to modify FieldType itself, which was a bit clunky in a different way.

Either way, by adding a getAttribute to FieldType such as I described, it's only necessary for init (in FieldType or a subclass) to remove the argument from initArgs, so the attribute can be retrieved and parsed on demand rather than creating an instance variable to store it.

But stepping back, is language-dependent analysis really the goal? As Erik Hatcher notes, there is this complication:

>Further on this.... if metadata is added to a field type, it needs to somehow make it down to the tokenizer and filter factories to use if desired.  Language, for example, could be attached to a field type, but then could be leveraged by a stop word filter to pick up a language-specific stop word file.

And perhaps what one really needs is not a static attribute added to the field type, but one that can vary from document to document, e.g. via a different field's value or a payload affixed to the tokens.  I remember a thread on payloads being used for that purpose (and I see you contributed to the Lucene-side design of payloads), but I don't recall whether it converged on a usable Solr-side implementation.

>I'll have to think some more about it...

Me too... the use-cases for the schema.xml-driven extension I proposed may be so rare that it's not at all worth considering.

- J.J.

Re: Extending Field and FieldType properties

Posted by Grant Ingersoll <gs...@apache.org>.

You make a good point about the countless hours debugging.  On the  
flip side, one could ask the question as to whether the Solr schema is  
stable enough that we should publish an XML Schema for it, thus  
helping alleviate some of the pain.

More below...

On Jun 30, 2008, at 3:28 PM, J.J. Larrea wrote:

> I heartily agree with you Grant that these objects should be user- 
> extensible. But removing the exception test entirely would probably  
> be a great disservice to Solr users, who could spend untold hours  
> debugging problems in schema.xml (eg. misspelled or contextually  
> inappropriate properties) without the valuable feedback it  
> provides.  So to do this right there should be a way to define  
> additional properties (defined as booleans in Solr) and attributes  
> (which can be string-valued).
>
> Thinking aloud here...
>
> For properties, something like this added to FieldProperties would  
> allow user-defined global properties:
>
> final static int USER_DEFINED = 0x00010000;
>
> static int nextIndex = USER_DEFINED;
>
> static int addPropertyType(String prop) {
>    if( propertyMap.containsKey(prop) ) throw ...
>    if( nextIndex > 31 ) throw ...
>    i = nextIndex++;
>    propertyMap.put(prop, i);
>    return i;
> }
>
> Which could be enabled by parsing a new <fieldProperty name="..."/>  
> tag from schema.xml before any of the fieldType or field declarations.
>
> For string-valued attributes, FieldType could be extended with a Set  
> of user-defined names (or name/type mappings?) which would be  
> removed from initArgs before the exception test.  The values could  
> be returned by a trivial method
>
>  public String getAttribute(String name) {
> 	return args.get(name);
>  }
>
> so other code could repeatedly get access to them (initArgs are  
> progressively removed until the null set or error, but args persist)  
> without having to parse and store the value somewhere.
>
> Simplest would be for the attribute name set to be global across all  
> field-types, with a static addAttributeType method and a  
> freestanding tag in schema.xml similar to the above for properties.   
> But one could argue for the set of user-defined attribute to be  
> local to a particular fieldType and all fields defined from it,  
> perhaps set from an XML attribute:
>
>    <!-- text fields have an attribute lang defaulting to 'american'  
> -->
>    <fieldType name="text" extra="lang" lang="american" ... />

This seems a bit clunky to me, syntax-wise, but the idea seems  
right.    I suppose another option is that I could just extend the  
FieldType and have it look for my own attributes.

I'll have to think some more about it...

>
>
>    <field name="Prenom" type="text" lang="french" ... />
>
> Anyway, does this make sense and fit with what you were thinking of?
>
> - J.J.
>
> At 10:22 AM -0400 6/30/08, Grant Ingersoll wrote:
>> Currently, FieldType throws a RuntimeException if there are any  
>> "extra" properties in the configuration.  I think SchemaField does  
>> something similar.
>>
>> I'd like to consider not doing this.  My main case is I want to be  
>> able to store semantic information about the FieldType with the  
>> FieldType.  Doing this now, requires creating a whole separate  
>> object model that overlays the FieldType and stores the information  
>> elsewhere (i.e. DB).  For example, say you want to denote what  
>> language a given field type supports, one has to store this  
>> information elsewhere, when it could easily be seen as a property  
>> of the FieldType.  I think right now, people often rely on naming  
>> conventions to convey this, such as text_zh or text_it or something  
>> like that and that doesn't extend very well, IMO.  These new  
>> attributes would allow applications to make use of richer semantics  
>> for FieldType w/o harming Solr in anyway (I think.)
>>
>> From the looks of it, FieldType has all the functionality already  
>> built in, minus a few lines where the exception is thrown if there  
>> are "extra" attributes.
>>
>> I think a similar argument can be made for SchemaField as well (and  
>> probably other things like RequestHandler, etc. but "baby steps"  
>> first)
>>
>> Any thoughts/objections?
>>
>>
>> -Grant
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Extending Field and FieldType properties

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Further on this.... if metadata is added to a field type, it needs to  
somehow make it down to the tokenizer and filter factories to use if  
desired.  Language, for example, could be attached to a field type,  
but then could be leveraged by a stop word filter to pick up a  
language-specific stop word file.
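
As a rough illustration of the stop word case, a factory that could see the field type's metadata might pick its file like this. StopWordFileResolver, the "lang" key and the stopwords_<lang>.txt naming scheme are all assumptions made for this sketch; how the metadata would actually reach the factory is exactly the open question.

```java
import java.util.Map;

// Hypothetical helper; neither the class nor the file-naming scheme comes from Solr.
class StopWordFileResolver {
    // Picks a stop word file from field type metadata, falling back to a default file.
    static String resolve(Map<String, String> fieldTypeArgs) {
        String lang = fieldTypeArgs.get("lang");
        if (lang == null || lang.isEmpty()) {
            return "stopwords.txt";
        }
        return "stopwords_" + lang + ".txt";
    }
}
```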

Food for thought.

	Erik

On Jun 30, 2008, at 3:28 PM, J.J. Larrea wrote:

> I heartily agree with you Grant that these objects should be user- 
> extensible. But removing the exception test entirely would probably  
> be a great disservice to Solr users, who could spend untold hours  
> debugging problems in schema.xml (eg. misspelled or contextually  
> inappropriate properties) without the valuable feedback it  
> provides.  So to do this right there should be a way to define  
> additional properties (defined as booleans in Solr) and attributes  
> (which can be string-valued).
>
> Thinking aloud here...
>
> For properties, something like this added to FieldProperties would  
> allow user-defined global properties:
>
> final static int USER_DEFINED = 0x00010000;
>
> static int nextIndex = USER_DEFINED;
>
> static int addPropertyType(String prop) {
>    if( propertyMap.containsKey(prop) ) throw ...
>    if( nextIndex > 31 ) throw ...
>    i = nextIndex++;
>    propertyMap.put(prop, i);
>    return i;
> }
>
> Which could be enabled by parsing a new <fieldProperty name="..."/>  
> tag from schema.xml before any of the fieldType or field declarations.
>
> For string-valued attributes, FieldType could be extended with a Set  
> of user-defined names (or name/type mappings?) which would be  
> removed from initArgs before the exception test.  The values could  
> be returned by a trivial method
>
>  public String getAttribute(String name) {
> 	return args.get(name);
>  }
>
> so other code could repeatedly get access to them (initArgs are  
> progressively removed until the null set or error, but args persist)  
> without having to parse and store the value somewhere.
>
> Simplest would be for the attribute name set to be global across all  
> field-types, with a static addAttributeType method and a  
> freestanding tag in schema.xml similar to the above for properties.   
> But one could argue for the set of user-defined attribute to be  
> local to a particular fieldType and all fields defined from it,  
> perhaps set from an XML attribute:
>
>    <!-- text fields have an attribute lang defaulting to 'american'  
> -->
>    <fieldType name="text" extra="lang" lang="american" ... />
>
>    <field name="Prenom" type="text" lang="french" ... />
>
> Anyway, does this make sense and fit with what you were thinking of?
>
> - J.J.
>
> At 10:22 AM -0400 6/30/08, Grant Ingersoll wrote:
>> Currently, FieldType throws a RuntimeException if there are any  
>> "extra" properties in the configuration.  I think SchemaField does  
>> something similar.
>>
>> I'd like to consider not doing this.  My main case is I want to be  
>> able to store semantic information about the FieldType with the  
>> FieldType.  Doing this now, requires creating a whole separate  
>> object model that overlays the FieldType and stores the information  
>> elsewhere (i.e. DB).  For example, say you want to denote what  
>> language a given field type supports, one has to store this  
>> information elsewhere, when it could easily be seen as a property  
>> of the FieldType.  I think right now, people often rely on naming  
>> conventions to convey this, such as text_zh or text_it or something  
>> like that and that doesn't extend very well, IMO.  These new  
>> attributes would allow applications to make use of richer semantics  
>> for FieldType w/o harming Solr in anyway (I think.)
>>
>> From the looks of it, FieldType has all the functionality already  
>> built in, minus a few lines where the exception is thrown if there  
>> are "extra" attributes.
>>
>> I think a similar argument can be made for SchemaField as well (and  
>> probably other things like RequestHandler, etc. but "baby steps"  
>> first)
>>
>> Any thoughts/objections?
>>
>>
>> -Grant

Re: Extending Field and FieldType properties

Posted by "J.J. Larrea" <jj...@panix.com>.

I heartily agree with you, Grant, that these objects should be user-extensible. But removing the exception test entirely would probably be a great disservice to Solr users, who could spend untold hours debugging problems in schema.xml (e.g. misspelled or contextually inappropriate properties) without the valuable feedback it provides.  So to do this right there should be a way to define additional properties (defined as booleans in Solr) and attributes (which can be string-valued).

Thinking aloud here...

For properties, something like this added to FieldProperties would allow user-defined global properties:

final static int USER_DEFINED = 0x00010000;

static int nextIndex = USER_DEFINED;

static int addPropertyType(String prop) {
    if( propertyMap.containsKey(prop) ) throw ...
    if( nextIndex > 31 ) throw ...
    i = nextIndex++;
    propertyMap.put(prop, i);
    return i;
}

Which could be enabled by parsing a new <fieldProperty name="..."/> tag from schema.xml before any of the fieldType or field declarations.
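
A self-contained version of that idea might look like the sketch below. Note that the outline above initialises nextIndex from a bit mask (0x00010000) but then compares it against 31 as if it were a bit index; this version assumes the intent was a bit index starting at 16 (the bit that 0x00010000 sets) and that the map stores bit masks. PropertyRegistry, the two built-in entries and the exception types are assumptions for illustration, not Solr's FieldProperties.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry; the built-in names and bit layout below are assumptions.
class PropertyRegistry {
    static final int FIRST_USER_BIT = 16; // bit index of the mask 0x00010000
    private static final Map<String, Integer> propertyMap = new HashMap<>();
    private static int nextBit = FIRST_USER_BIT;

    static {
        // Stand-ins for built-in properties occupying the low bits.
        propertyMap.put("indexed", 0x00000001);
        propertyMap.put("stored", 0x00000004);
    }

    // Registers a user-defined boolean property and returns its bit mask.
    static synchronized int addPropertyType(String prop) {
        if (propertyMap.containsKey(prop)) {
            throw new IllegalArgumentException("property already defined: " + prop);
        }
        if (nextBit > 31) {
            throw new IllegalStateException("no free property bits left");
        }
        int mask = 1 << nextBit++; // allocate the next free bit
        propertyMap.put(prop, mask);
        return mask;
    }
}
```

Under these assumptions an int leaves room for 16 user-defined properties (bits 16 through 31) before registration fails.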

For string-valued attributes, FieldType could be extended with a Set of user-defined names (or name/type mappings?) which would be removed from initArgs before the exception test.  The values could be returned by a trivial method

  public String getAttribute(String name) {
 	return args.get(name);
  }

so other code could repeatedly get access to them (entries are progressively removed from initArgs until it is empty or an error is thrown, but args persists) without having to parse and store the value somewhere.
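
Putting the two maps side by side, a minimal sketch could look like this. AttributedType is a hypothetical class written for this example, and the way the user-defined names are supplied (a Set passed to the constructor) is an assumption; the sketch only shows a persistent args map next to a consumable initArgs copy.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical field type keeping both a persistent and a consumable argument map.
class AttributedType {
    private final Set<String> userAttributes; // names declared as user-defined
    private Map<String, String> args = Collections.emptyMap();

    AttributedType(Set<String> userAttributes) {
        this.userAttributes = userAttributes;
    }

    void setArgs(Map<String, String> config) {
        args = Collections.unmodifiableMap(new HashMap<>(config)); // persists
        Map<String, String> initArgs = new HashMap<>(config);      // consumed
        initArgs.remove("name");
        initArgs.keySet().removeAll(userAttributes); // declared names pass the test
        if (!initArgs.isEmpty()) {
            throw new RuntimeException("unknown attributes: " + initArgs.keySet());
        }
    }

    // Returns the raw string value, or null if the attribute was not set.
    String getAttribute(String name) {
        return args.get(name);
    }
}
```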

Simplest would be for the attribute name set to be global across all field-types, with a static addAttributeType method and a freestanding tag in schema.xml similar to the above for properties.  But one could argue for the set of user-defined attributes to be local to a particular fieldType and all fields defined from it, perhaps set from an XML attribute:

    <!-- text fields have an attribute lang defaulting to 'american' -->
    <fieldType name="text" extra="lang" lang="american" ... />

    <field name="Prenom" type="text" lang="french" ... />
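
The lookup implied by that example, where a field's own value overrides the default declared on its fieldType, could be sketched as below. AttributeLookup and its resolve method are hypothetical, and the two maps simply stand in for the parsed XML attributes of the field and fieldType elements.

```java
import java.util.Map;

// Hypothetical lookup helper; the maps stand in for parsed schema.xml attributes.
class AttributeLookup {
    // A field's own value wins; otherwise the fieldType's default applies.
    static String resolve(String name,
                          Map<String, String> fieldArgs,
                          Map<String, String> typeArgs) {
        String value = fieldArgs.get(name);
        return value != null ? value : typeArgs.get(name);
    }
}
```

For the schema fragment above, the Prenom field would resolve lang to "french", while a field of type text with no lang of its own would fall back to "american".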

Anyway, does this make sense and fit with what you were thinking of?

- J.J.

At 10:22 AM -0400 6/30/08, Grant Ingersoll wrote:
>Currently, FieldType throws a RuntimeException if there are any "extra" properties in the configuration.  I think SchemaField does something similar.
>
>I'd like to consider not doing this.  My main case is I want to be able to store semantic information about the FieldType with the FieldType.  Doing this now, requires creating a whole separate object model that overlays the FieldType and stores the information elsewhere (i.e. DB).  For example, say you want to denote what language a given field type supports, one has to store this information elsewhere, when it could easily be seen as a property of the FieldType.  I think right now, people often rely on naming conventions to convey this, such as text_zh or text_it or something like that and that doesn't extend very well, IMO.  These new attributes would allow applications to make use of richer semantics for FieldType w/o harming Solr in anyway (I think.)
>
>From the looks of it, FieldType has all the functionality already built in, minus a few lines where the exception is thrown if there are "extra" attributes.
>
>I think a similar argument can be made for SchemaField as well (and probably other things like RequestHandler, etc. but "baby steps" first)
>
>Any thoughts/objections?
>
>
>-Grant