Posted to dev@lucene.apache.org by "Mike Sokolov (JIRA)" <ji...@apache.org> on 2011/05/24 03:50:47 UTC

[jira] [Commented] (SOLR-1758) schema definition for configuration files

    [ https://issues.apache.org/jira/browse/SOLR-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038355#comment-13038355 ] 

Mike Sokolov commented on SOLR-1758:
------------------------------------

This was originally reported in the context of DIH, but as the OP said, it applies equally well to all configuration.

The config-validation.patch changes Config so that it validates every XML configuration file loaded through it.  The patch includes a schema with rules for <config/>, <schema/>, <solr/>, <elevate/> and <root/> (the last is used in tests), and it could be extended to cover other files as well.  With the change, Config looks in solr.home for a file called config.xsd.  If the file is found, it is loaded and used to validate whatever configuration file is being loaded.  If a validation error occurs, an exception is raised and also logged.  Logging there seemed to be the way it was done before, although it struck me as odd: I'd have thought exception logging would be handled at an outermost layer.
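
For readers who want a picture of the lookup-and-validate flow described above, here is a minimal sketch using the standard javax.xml.validation API.  It is not the code from the patch: the class name, method name and the boolean return value are made up for illustration, and the "home" directory is simply passed in as a parameter.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Solr or of the patch.
final class ConfigValidationSketch {

    /**
     * Validates configFile against homeDir/config.xsd if that schema exists.
     * Returns true if validation ran and passed, false if no schema was found.
     * A validation error surfaces as a SAXException, because a Validator with
     * no ErrorHandler set throws on errors and fatal errors.
     */
    static boolean validateIfSchemaPresent(File homeDir, File configFile)
            throws SAXException, IOException {
        File xsd = new File(homeDir, "config.xsd");
        if (!xsd.isFile()) {
            return false; // no schema present: skip validation entirely
        }
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsd);
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(configFile));
        return true;
    }
}
```

In this sketch the caller decides what to do with the exception, which is one way of leaving the logging to an outer layer.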

Solr's use of XML seems to be very flexible in practice, so the schema tries to allow a fair amount of flexibility: for elements marked as "plugins" in the Wiki documentation, I've allowed pretty much arbitrary child content.  The wildcards in the schema are "lax", which means that they allow any element, even unknown ones, but when known elements are found, they are validated against the model in the schema (e.g. <str> is not allowed to have any child elements).
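
To illustrate what "lax" means here, the following is a small self-contained sketch.  The tiny inline schema is made up for the example and is not taken from config.xsd: the <plugin> element and the exact model for <str> are assumptions chosen only to show the behaviour.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class; the schema below is illustrative only.
final class LaxWildcardSketch {

    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        // A known element: text content plus an optional name attribute.
        + "<xs:element name='str'>"
        + "<xs:complexType><xs:simpleContent>"
        + "<xs:extension base='xs:string'>"
        + "<xs:attribute name='name' type='xs:string'/>"
        + "</xs:extension>"
        + "</xs:simpleContent></xs:complexType>"
        + "</xs:element>"
        // A plugin-like element whose children are matched by a lax wildcard.
        + "<xs:element name='plugin'>"
        + "<xs:complexType><xs:sequence>"
        + "<xs:any processContents='lax' minOccurs='0' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns true if xml validates against XSD, false on a validation error. */
    static boolean isValid(String xml) throws SAXException, IOException {
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document violates the schema
        }
    }
}
```

With this schema, isValid("<plugin><anything foo='1'/></plugin>") accepts the unknown element, while isValid("<plugin><str name='a'><b/></str></plugin>") rejects the document because <str> is known and its model does not permit child elements.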

All the Solr tests but one pass with the patch, which means that the configuration in the Solr example, as well as the various test configurations in solr/src/test-files/solr/conf, are all valid according to the schema.  The exception is one solrconfig.xml with luceneMatchVersion=4.0; I think this should be LUCENE_40?  The patch also includes one new test of an invalid schema; it probably should have a few more.

However, my knowledge of Solr configuration options is far from encyclopedic - I spent a while with the documentation and examples - and there are almost certainly additional configuration options out there that are in use and should be accounted for in the "standard" schema, e.g. some elements that should accept any attribute but currently don't.

In general I expect the schema could be evolved to be looser in some areas and perhaps tighter in others.

To help with that, I created some ant rules to convert the schema from Relax NG Compact syntax to XML Schema.  I find Relax NG easier to maintain, but including runtime validation support for it would require a large jar to be added to Solr.  The patch adds a dev-tools/schema directory containing config.rnc, which is the source schema, and a build.xml that compiles config.xsd from it using the trang.jar library and copies the result into a few places in the Solr source tree.

Some TODOs:

It might be better to have separate schema files for separate configuration documents - this way the decision to validate could be made on a per-file basis, rather than globally for all configuration.

There is no model for <highlighting> in the schema - it's just a big wildcard right now.


> schema definition for configuration files
> -----------------------------------------
>
>                 Key: SOLR-1758
>                 URL: https://issues.apache.org/jira/browse/SOLR-1758
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Jorg Heymans
>         Attachments: config-validation-20110523.patch
>
>
> A schema definition would be able to spot the subtle error in the config below:
> {code}
>     <dataSource name="ora" driver="oracle.jdbc.OracleDriver" url="...." />
>     <datasource name="orablob" type="FieldStreamDataSource" />
>     <document name="mydoc">
>         <entity dataSource="ora" name="meta" query="select id, filename, bytes from documents" >            
>             <field column="ID" name="id" />
>             <field column="FILENAME" name="filename" />
>             <entity dataSource="orablob" processor="TikaEntityProcessor" url="bytes" dataField="meta.BYTES">
>               <field column="text" name="mainDocument"/>
>             </entity>
>          </entity>
>      </document>
> {code}
> Also, many xml editors support auto completion based on schema definition so it would be easier to create configuration without constantly having to refer to javadoc or samples from the distribution.
> This applies equally to schema.xml and solr-config.xml
