Posted to solr-user@lucene.apache.org by Alexandre Rafalovitch <ar...@gmail.com> on 2013/03/01 04:32:30 UTC

What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

Hello,

I want to build a unified reference of all the different processors one can
use at Solr's various extension points.

I have written a small tool to extract all implementations
of UpdateRequestProcessorFactory, Analyzer, CharFilterFactory, etc.
(actually, of any root class).

However, I assume that not every Lucene Analyzer derivative can simply be
plugged into Solr.

Is it fair to say that the class must:
*) Derive from the appropriate root (is there a list of ALL the roots?)
*) Be public and not abstract (though a common sub-root could be)
*) Have a default (no-argument) constructor

My preliminary tests seem to indicate this is the case. Am I missing
anything?
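
The three criteria above can be checked with plain Java reflection. What follows is only a rough sketch: the names PluggabilityCheck and looksPluggable are hypothetical, JDK collection classes stand in for Solr roots so that it runs without Solr on the classpath, and it tests nothing beyond the three conditions listed above, which may well not be the full story.

```java
import java.lang.reflect.Modifier;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Solr or Lucene.
final class PluggabilityCheck {

    // True if candidate derives from root, is public and concrete,
    // and exposes a public no-argument constructor.
    static boolean looksPluggable(Class<?> root, Class<?> candidate) {
        if (!root.isAssignableFrom(candidate)) {
            return false;
        }
        int mods = candidate.getModifiers();
        if (!Modifier.isPublic(mods) || Modifier.isAbstract(mods) || candidate.isInterface()) {
            return false;
        }
        try {
            candidate.getConstructor(); // only finds public no-arg constructors
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // JDK classes used as stand-ins for a Solr root and its derivatives.
        System.out.println(looksPluggable(List.class, ArrayList.class));    // concrete, no-arg ctor
        System.out.println(looksPluggable(List.class, AbstractList.class)); // abstract
        System.out.println(looksPluggable(List.class, String.class));       // wrong root
    }
}
```

With a real Solr classpath, the root argument would presumably be something like CharFilterFactory and the candidates would come from scanning the jars.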

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

Re: What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Thanks Jack.

On Thu, Feb 28, 2013 at 11:04 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> The package Javadoc for Solr analysis is a good start:
>
> http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/analysis/package-tree.html
>

Actually, this is representative of why I am writing my own utility. That
package tree does not make it easy to see all the derived classes, as they
are hidden behind multiple levels of abstraction. I am not saying it is
terribly hard. Still, for a non-Java programmer who is just moving beyond
Solr as a black box and trying to understand what can be plugged into the
various configurations to improve their results, it is non-trivial the
first couple of times. This is especially so because it is not just the
class name that matters but also which jar needs to be added to the
library statement.
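
As an aside on the jar question: for a class that is already loaded, the JVM can usually report where it came from. This is a minimal sketch, assuming no security manager gets in the way; JarLocator and locationOf are hypothetical names, not part of my tool or of Solr.

```java
import java.security.CodeSource;

// Hypothetical helper for illustration; not part of Solr or Lucene.
final class JarLocator {

    // Returns the location (typically a jar or directory URL) a class was
    // loaded from, or a placeholder when the JVM reports no code source,
    // as is the case for JDK core classes such as String.
    static String locationOf(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        System.out.println(locationOf(String.class));
        System.out.println(locationOf(JarLocator.class));
    }
}
```

For a class loaded from one of the Solr jars, the returned URL should point at that jar, which is the piece of information needed for the library statement.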

My (preliminary) output for the CharFilters looks like this:
 -CharFilterFactory (example/solr-webapp/webapp/WEB-INF/lib/lucene-analyzers-common-4.1.0.jar/org.apache.lucene.analysis.util)
     HTMLStripCharFilterFactory (example/solr-webapp/webapp/WEB-INF/lib/lucene-analyzers-common-4.1.0.jar/org.apache.lucene.analysis.charfilter)
     MappingCharFilterFactory (example/solr-webapp/webapp/WEB-INF/lib/lucene-analyzers-common-4.1.0.jar/org.apache.lucene.analysis.charfilter)
     PersianCharFilterFactory (example/solr-webapp/webapp/WEB-INF/lib/lucene-analyzers-common-4.1.0.jar/org.apache.lucene.analysis.fa)
     JapaneseIterationMarkCharFilterFactory (example/solr-webapp/webapp/WEB-INF/lib/lucene-analyzers-kuromoji-4.1.0.jar/org.apache.lucene.analysis.ja)
     PatternReplaceCharFilterFactory (example/solr-webapp/webapp/WEB-INF/lib/lucene-analyzers-common-4.1.0.jar/org.apache.lucene.analysis.pattern)
     LegacyHTMLStripCharFilterFactory (dist/solr-core-4.1.0.jar/org.apache.solr.analysis)
     MockCharFilterFactory (dist/solr-test-framework-4.1.0.jar/org.apache.solr.analysis)

And (part of) URP tree:
 -UpdateRequestProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
     UIMAUpdateRequestProcessorFactory (dist/solr-uima-4.1.0.jar/org.apache.solr.uima.processor)
     -AbstractDefaultValueUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         DefaultValueUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         TimestampUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         UUIDUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
     CloneFieldUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
     DistributedUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
     -FieldMutatingUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         ConcatFieldUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         CountFieldValuesUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         FieldLengthUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
         -FieldValueSubsetUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
             FirstFieldValueUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
             LastFieldValueUpdateProcessorFactory (dist/solr-core-4.1.0.jar/org.apache.solr.update.processor)
....

A leading - marks an abstract class. I also use * (none shown here) to mark
classes without an empty constructor (hence my original question).
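
To make the marker convention concrete, here is a stripped-down sketch of how such a tree can be rendered. It is not my actual tool: HierarchyPrinter, marker and render are hypothetical names, the JDK number classes stand in for Solr classes, and it only follows direct superclass links within a supplied list rather than scanning jars.

```java
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of Solr or Lucene.
final class HierarchyPrinter {

    // "-" marks an abstract class, "*" a concrete class that lacks
    // a public no-argument constructor, "" everything else.
    static String marker(Class<?> c) {
        if (Modifier.isAbstract(c.getModifiers())) {
            return "-";
        }
        try {
            c.getConstructor();
            return "";
        } catch (NoSuchMethodException e) {
            return "*";
        }
    }

    // Appends parent, then recursively every class in known whose direct
    // superclass is parent, indenting four spaces per level.
    static void render(Class<?> parent, List<Class<?>> known, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("    ");
        }
        out.append(marker(parent)).append(parent.getSimpleName()).append('\n');
        for (Class<?> c : known) {
            if (c.getSuperclass() == parent) {
                render(c, known, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-ins: Number is abstract, Integer has no no-arg constructor.
        List<Class<?>> known = Arrays.<Class<?>>asList(Integer.class, AtomicInteger.class);
        StringBuilder out = new StringBuilder();
        render(Number.class, known, 0, out);
        System.out.print(out);
    }
}
```

Running it prints -Number, then *Integer and AtomicInteger indented beneath it. A class whose intermediate parent is missing from the list would be dropped, so a real tool has to collect the full chain first.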



> Especially the AbstractAnalysisFactory:
>
> http://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html
>
This is useful and confirms my 'empty-constructor' assumption.


> Also, look at the various "factories" in solrconfig.xml for other Solr
> extension points. Including search components, spellcheckers, etc.

Will do. I was just wondering if there was a semi-comprehensive list. But I
can build it iteratively.

 Regards,
   Alex.


> -- Jack Krupansky
>

Re: What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

Posted by Jack Krupansky <ja...@basetechnology.com>.
The package Javadoc for Solr analysis is a good start:

http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/analysis/package-tree.html

Especially the AbstractAnalysisFactory:

http://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html

Also, look at the various "factories" in solrconfig.xml for other Solr
extension points, including search components, spellcheckers, etc.

-- Jack Krupansky



Re: What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Has this logic (default constructor or version flag) changed due to
LUCENE-4877? I reran my tool and suddenly a huge number of factories
have acquired a new constructor (e.g. MappingCharFilterFactory).

Regards,
   Alex.


On Wed, Mar 6, 2013 at 2:49 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : *) Have a default empty constructor
> :
> : My preliminary tests seem to indicate this is the case. Am I missing
> : anything.
>
> Any analyzer that has an empty constructor *or* a constructor that takes in
> a Lucene "Version" object may be specified.
>
> I've updated the wiki to make this more clear...
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
>
> For CharFilters, Tokenizers, TokenFilters: they must have a factory of the
> appropriate type (CharFilterFactory, TokenizerFactory, TokenFilterFactory)
>
>
> -Hoss

Re: What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

Posted by Chris Hostetter <ho...@fucit.org>.
: *) Have a default empty constructor
: 
: My preliminary tests seem to indicate this is the case. Am I missing
: anything.

Any analyzer that has an empty constructor *or* a constructor that takes in
a Lucene "Version" object may be specified.

I've updated the wiki to make this more clear...
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema

For CharFilters, Tokenizers, TokenFilters: they must have a factory of the 
appropriate type (CharFilterFactory, TokenizerFactory, TokenFilterFactory)
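
The analyzer rule above (an empty constructor or one that takes a Version) can be approximated with reflection. This is only a sketch and not the code Solr actually runs: AnalyzerConstructorCheck and hasUsableConstructor are hypothetical names, and FakeVersion is a stand-in for Lucene's Version class so the example compiles without Lucene on the classpath.

```java
import java.lang.reflect.Constructor;

// Hypothetical helper for illustration; not part of Solr or Lucene.
final class AnalyzerConstructorCheck {

    // Stand-in for Lucene's Version class (assumption, see lead-in).
    static final class FakeVersion {}

    // Stand-in "analyzers" covering the three interesting cases.
    static class NoArgAnalyzer { public NoArgAnalyzer() {} }
    static class VersionedAnalyzer { public VersionedAnalyzer(FakeVersion v) {} }
    static class OtherAnalyzer { public OtherAnalyzer(String name) {} }

    // True if c has a public constructor taking either no arguments
    // or exactly one argument of type versionType.
    static boolean hasUsableConstructor(Class<?> c, Class<?> versionType) {
        for (Constructor<?> ctor : c.getConstructors()) {
            Class<?>[] params = ctor.getParameterTypes();
            if (params.length == 0) {
                return true;
            }
            if (params.length == 1 && params[0] == versionType) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasUsableConstructor(NoArgAnalyzer.class, FakeVersion.class));
        System.out.println(hasUsableConstructor(VersionedAnalyzer.class, FakeVersion.class));
        System.out.println(hasUsableConstructor(OtherAnalyzer.class, FakeVersion.class));
    }
}
```

Against a real classpath, the second argument would be the Lucene Version class itself. Note that this check applies to Analyzer classes only; CharFilters, Tokenizers and TokenFilters go through their factories instead.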


-Hoss