Posted to solr-user@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/07/02 21:53:51 UTC

Confirming doc change for Wiki for schema / plugins config

There's a particular confusion I've had with the Solr schema and plugins.
Though this stuff is "obvious" to the gurus, looking around I guess I wasn't
alone in my confusion.

I believe I understand it now and wanted to capture that on the Wiki, but
just double checking.... and maybe the gurus would have some additional
comments?


Two Syntaxes AND Two Plugin Sets

There is an abbreviated syntax for specifying plugins in the schema, but
there is a more powerful syntax that is preferred.

Also, Solr supports its own Solr-specific plugins and is also compatible with
Lucene plugins.  Solr plugins use the more modern, longer syntax, but
Lucene plugins generally must use the abbreviated syntax OR use a custom
adapter class.

These two differences tend to coincide.  Solr plugins use the longer, more
powerful syntax, whereas Lucene plugins tend to use the shorter syntax (or
an adapter, see below).


Two Syntaxes for Defining Field Type Plugins:

Abbreviated Syntax:
<fieldType name=... class=...>
    <analyzer class="....SomeAnalyzer" />
    <!-- Do not put additional plugins here -->
</fieldType>

Modern Syntax:
<fieldType name=... class=...>
    <analyzer>
        <tokenizer class="....SomeTokenizer" />
        <filter class="....SomeFilter" />
        <!-- other filters ... -->
    </analyzer>
</fieldType>

Of course you can have multiple <analyzer> blocks in the newer syntax, one
for index time and one for search.  And the filters can have options, etc.
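
As a rough sketch of that index/search split (the field type name and the
particular tokenizer and filters are only illustrative picks, and
synonyms.txt is assumed to exist in your conf directory):

```xml
<fieldType name="text_example" class="solr.TextField">
    <!-- Applied when documents are indexed -->
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <!-- Applied to the query string at search time -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>
```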

This is confusing because the <analyzer> tag can EITHER have a class=
attribute OR nested subelements, usually of type <tokenizer> and <filter>.
You should not do both!  Further, the main <fieldType> element also takes a
class attribute, which is required, but this is a separate class (...could
use some narrative as to why....)


Two Common Sources of Plugins:

When looking at schema configurations you find online, it's very important
to notice the prefixes in the class name.  Classes starting with
"org.apache.solr.analysis." or the shorthand "solr." are Solr specific, and
will use the longhand syntax.  Classes starting with
"org.apache.lucene.analysis." are NOT native Solr plugins and must EITHER
use the short hand syntax (which limits your functionality), or you need to
add a custom adapter class.
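
For instance, assuming the stock whitespace tokenizer factory, I believe
these two declarations name the same class, and only the prefix differs:

```xml
<!-- Shorthand prefix -->
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<!-- Fully qualified name (package assumed from the Solr 1.x releases) -->
<tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" />
```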

This is generally a good thing.  There are quite a few Lucene plugins out
there, and Solr can use any of them "out of the box" without the need for
breaking out a Java compiler.  However, when used in this compatibility
mode, you give up some functionality.

And you can't just use the longer syntax with the Lucene plugins. The
advanced syntax isn't directly compatible (at this time).  If you want the
advantages of the long form syntax you need to use a Lucene to Solr adapter
class, often called a "factory" class.


Examples of Right and Wrong Configurations.

Asian language Solr users will often want to use the CJK processor (CJK =
Chinese, Japanese and Korean).  They will typically use the base Lucene
plugin, but in various configurations.

Examples using CJK Plugins:

<!-- Correct Short form using Lucene compatible syntax -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<!-- Incorrect attempt to use long form with Lucene plugins -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  <!-- Wrong: won't be used! -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <!-- ... other filters ... -->
</fieldType>

<!-- Correct Long Form syntax for Lucene plugins THAT HAVE AN ADAPTER -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <!-- This ONLY works if you have an adapter class -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- ... other filters ... -->
  </analyzer>
</fieldType>


There is a nice thread about the adapter class you need (link below).  Later
on, the discussion evolves into whether or not to make an "uber" Lucene
class loader, and the performance impact that might have:

http://www.mail-archive.com/solr-user@lucene.apache.org/msg04487.html



--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Confirming doc change for Wiki for schema / plugins config

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Thu, Jul 2, 2009 at 3:53 PM, Mark Bennett<mb...@ideaeng.com> wrote:
> There is an abbreviated syntax for specifying plugins in the schema, but
> there is a more powerful syntax that is preferred.

I think of it as specifying the Analyzer for a field: one can either
specify a Java Analyzer class (opaque, but good for legacy Analyzer
implementations or implementations that don't even use
Tokenizer/TokenFilter chains), or specify an Analyzer as a Tokenizer
followed by a list of Filters.

I'm still planning on cleaning up the schema for 1.4 - I'll see if the
comments can be made a little clearer.

> This is confusing because the <analyzer> tag can EITHER have a class=
> attribute OR nested subelements, usually of type <tokenizer> and <filter>.
> You should not do both!  Further, the main <fieldType> element also takes a
> class attribute, which is required, but this is a separate class (...could
> use some narrative as to why....)

For polymorphic behavior for everything that falls outside Analyzer.

> Classes starting with
> "org.apache.lucene.analysis." are NOT native Solr plugins and must EITHER
> use the short hand syntax (which limits your functionality), or you need to
> add a custom adapter class.

Yeah, for years I've meant to look into getting this to "just work"
w/o having to create a factory.

FYI - the long-form/short-form is just a classloading thing, and
doesn't relate to factories.  It's only correlated in that something
in the solr namespace should have a factory.

-Yonik
http://www.lucidimagination.com