Posted to dev@lucene.apache.org by Steve Rowe <sa...@gmail.com> on 2013/08/01 03:21:59 UTC

Proposal/request for comments: Solr schema annotation

In thinking about making the entire Solr schema REST-API-addressable (SOLR-4898), I'd like to be able to add arbitrary metadata at both the top level of the schema and at each leaf node, and allow read/write access to that metadata via the REST API.

Some uses I've thought of for such a facility: 

1. The managed schema now drops XML comments from schema.xml upon conversion to managed-schema format, but it would be much better if these were somehow preserved, as well as round-trippable when retrieving the schema and its constituents via the REST API.

2. Some comments in the example schemas don't refer to just one or to all leaf nodes, but rather to a group of them. I'd like to be able to group nodes by adding same-named "tags" to multiple nodes, and also have a top-level (optional) "tag description" - this description could then be presented with tagged nodes in various output formats.

3. Some comments in the example schema are documentation about a feature, e.g. copyFields.  A top-level "documentation" annotation could take a leaf node element name (or maybe an XPath? probably overkill) and apply to all matching elements. 

4. When modifying the schema via REST API, a "last-modified" annotation could be automatically added.

5. There were a couple of user complaints recently when schema.xml parsing was tightened to disallow unknown attributes on field declarations (SOLR-4641): people were storing their own information there.  User-level metadata would support this in a round-trippable way - I'm thinking we could restrict it to flat string-typed key/value pairs, with no nested structure.
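To make the restriction in item 5 concrete, here is a minimal sketch of how "flat string-typed key/value pairs, with no nested structure" might be checked. It assumes user-level metadata is written as child elements of an <annotation> element, as in the examples further down; the function name flat_metadata and the "owner" key in the usage note are hypothetical, and none of this is existing Solr code.

```python
import xml.etree.ElementTree as ET


def flat_metadata(annotation_xml):
    """Read an <annotation> element as flat string key/value pairs.

    Hypothetical helper: each child element is one pair (element name ->
    text content). A child that itself contains elements is rejected,
    since the proposal allows no nested structure.
    """
    annotation = ET.fromstring(annotation_xml)
    pairs = {}
    for child in annotation:
        if len(child) > 0:  # the child has element children of its own
            raise ValueError(
                "nested structure under <%s> is not allowed" % child.tag
            )
        # Repeated keys: the last one wins in this sketch.
        pairs[child.tag] = (child.text or "").strip()
    return pairs
```

For example, flat_metadata("<annotation><visibility>public</visibility></annotation>") yields a one-entry dict, while an annotation such as <owner><name>x</name></owner> raises ValueError.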

W3C XML Schema has a similar facility: <http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-annotation>.

Thoughts?

Some concrete examples of what I'm thinking of in schema.xml format (syntax/naming as yet unsettled):

<schema name="example" version="1.5">
  <annotation>
    <description element="tag" content="plain-numeric-field-types">
      Plain numeric field types store and index the text value verbatim.
    </description>
    <documentation element="copyField">
      copyField commands copy one field to another at the time a document
      is added to the index.  It's used either to index the same field differently,
      or to add multiple fields to the same field for easier/faster searching.
    </documentation>
    <last-modified>2014-03-08T12:14:02Z</last-modified>
    …
  </annotation>
…
  <fieldType name="pint" class="solr.IntField">
    <annotation>
      <tag>plain-numeric-field-types</tag>
    </annotation>
  </fieldType>
  <fieldType name="plong" class="solr.LongField">
    <annotation>
      <tag>plain-numeric-field-types</tag>
    </annotation>
  </fieldType>
  …
  <copyField source="cat" dest="text">
    <annotation>
      <todo>Should this field really be copied to the catchall text field?</todo>
    </annotation>
  </copyField>
  …
  <field name="text" type="text_general">
    <annotation>
      <description>catchall field</description>
      <visibility>public</visibility>
    </annotation>
  </field>
  …
</schema>
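As a rough illustration of use 2 (grouping nodes by tag and attaching the top-level tag description), the following sketch reads a cut-down version of the example above with Python's standard xml.etree.ElementTree. The element and attribute names follow the unsettled syntax in this message; the function name group_by_tag and the shape of its return value are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Cut-down copy of the example schema from this message.
SCHEMA = """\
<schema name="example" version="1.5">
  <annotation>
    <description element="tag" content="plain-numeric-field-types">
      Plain numeric field types store and index the text value verbatim.
    </description>
  </annotation>
  <fieldType name="pint" class="solr.IntField">
    <annotation><tag>plain-numeric-field-types</tag></annotation>
  </fieldType>
  <fieldType name="plong" class="solr.LongField">
    <annotation><tag>plain-numeric-field-types</tag></annotation>
  </fieldType>
  <field name="text" type="text_general">
    <annotation><description>catchall field</description></annotation>
  </field>
</schema>
"""


def group_by_tag(schema_xml):
    """Map each tag to its top-level description and the nodes carrying it."""
    root = ET.fromstring(schema_xml)

    # Top-level tag descriptions: <description element="tag" content="TAG">.
    descriptions = {
        d.get("content"): " ".join((d.text or "").split())
        for d in root.findall("./annotation/description[@element='tag']")
    }

    # Collect (element name, name attribute) for every tagged leaf node.
    groups = defaultdict(list)
    for node in root:
        if node.tag == "annotation":
            continue  # skip the schema-level annotation itself
        for tag in node.findall("./annotation/tag"):
            groups[(tag.text or "").strip()].append((node.tag, node.get("name")))

    return {
        name: {"description": descriptions.get(name), "nodes": nodes}
        for name, nodes in groups.items()
    }
```

Calling group_by_tag(SCHEMA) groups pint and plong under plain-numeric-field-types together with the description text; a tag with no top-level description simply gets None, matching the "optional" tag description in the proposal.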


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Proposal/request for comments: Solr schema annotation

Posted by Walter Underwood <wu...@wunderwood.org>.
An annotation field would be much better than the current "anything goes" schema-less schema.xml.

Has anyone built an XML Schema for schema.xml? I know it is extensible, but it would be worth a try.

wunder

--
Walter Underwood
wunder@wunderwood.org



