Posted to java-user@lucene.apache.org by Vihari Piratla <vi...@gmail.com> on 2015/01/12 08:51:08 UTC

Custom tokenizer

Hi,
I am trying to implement a custom tokenizer for my application, and I have
a few questions about it.
1. Is there a way to give an existing analyzer (say, EnglishAnalyzer) a
custom tokenizer, so that it uses that tokenizer instead of, say,
StandardTokenizer?
2. Why are analyzers such as StandardAnalyzer and EnglishAnalyzer declared
final? Because of this, I cannot extend them.

Thank you.
--
V

RE: Custom tokenizer

Posted by Uwe Schindler <uw...@thetaphi.de>.
> Thanks for the reply.
> 
> Hmm, I understand.
> I know about AnalyzerWrapper, but that is not what I am looking for.
> 
> I also know about cloning and overriding. I want my analyzer to behave
> exactly like EnglishAnalyzer, and right now I am copying the code from
> EnglishAnalyzer to mimic its behavior, which is a dirty solution.
> Is there any other proper solution to this problem?

NO.

Analyzers that are provided by Lucene have a fixed configuration (a combination of Tokenizer and TokenFilters) that won't change unless the matchVersion differs (which is documented in the Javadocs). The reason for this is: if you have indexed with a given analyzer, you have to use it unmodified whenever you update or search the index, otherwise the results of those operations are undefined. So when you upgrade Lucene, every Analyzer should return exactly the same results; otherwise all users would need to rebuild their indexes even for minor version upgrades.

Also, think of the Analyzers provided by Lucene as "example" code. What counts here is the combination of Tokenizer and TokenFilters, which is freely configurable. The ones provided by Lucene are useful for common cases, but whenever you have custom requirements, you have to define your Analyzer *completely* yourself. This is also what Solr and Elasticsearch users do in their config files.
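For example, a completely self-defined Analyzer that roughly mirrors EnglishAnalyzer's default chain but lets you plug in a Tokenizer of your choice could look like the following minimal sketch. It is written against the Lucene 4.10.3 API linked below; the class name MyEnglishAnalyzer is made up, and the real EnglishAnalyzer's stem-exclusion handling is left out:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class MyEnglishAnalyzer extends Analyzer {

  private static final Version MATCH_VERSION = Version.LUCENE_4_10_3;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Swap this line for your own Tokenizer subclass, e.g. new MyTokenizer(reader).
    Tokenizer source = new StandardTokenizer(MATCH_VERSION, reader);
    // The rest roughly mirrors EnglishAnalyzer's chain: standard post-processing,
    // possessive stripping, lowercasing, English stop words, Porter stemming.
    TokenStream result = new StandardFilter(MATCH_VERSION, source);
    result = new EnglishPossessiveFilter(MATCH_VERSION, result);
    result = new LowerCaseFilter(MATCH_VERSION, result);
    result = new StopFilter(MATCH_VERSION, result, EnglishAnalyzer.getDefaultStopSet());
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}

Such an analyzer can then be passed to IndexWriterConfig exactly like EnglishAnalyzer, and because the class is entirely your own, you are free to change the Tokenizer line without being bound by the backwards-compatibility guarantees described above:

IndexWriterConfig config =
    new IndexWriterConfig(Version.LUCENE_4_10_3, new MyEnglishAnalyzer());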

Uwe

> Thank you.
> 
> On Mon, Jan 12, 2015 at 1:36 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> 
> > Hi,
> >
> > Extending an existing Analyzer is not useful, because it is just a
> > factory that returns a TokenStream instance to consumers. If you want
> > to change the Tokenizer of an existing Analyzer, just clone it and
> > rewrite its
> > createComponents() method, see the example in the Javadocs:
> >
> > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html
> >
> > If you want to add additional TokenFilters to the chain, you can do
> > this with AnalyzerWrapper (
> >
> > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/AnalyzerWrapper.html),
> > but this does not work with Tokenizers, because
> > those are instantiated before the TokenFilters which depend on them,
> > so changing the Tokenizer afterwards is impossible.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: Vihari Piratla [mailto:viharipiratla@gmail.com]
> > > Sent: Monday, January 12, 2015 8:51 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Custom tokenizer
> > >
> > > Hi,
> > > I am trying to implement a custom tokenizer for my application, and I
> > > have a few questions about it.
> > > 1. Is there a way to give an existing analyzer (say, EnglishAnalyzer) a
> > > custom tokenizer, so that it uses that tokenizer instead of, say,
> > > StandardTokenizer?
> > > 2. Why are analyzers such as StandardAnalyzer and EnglishAnalyzer
> > > declared final? Because of this, I cannot extend them.
> > >
> > > Thank you.
> > > --
> > > V
> >
> >
> >
> 
> 
> --
> V




Re: Custom tokenizer

Posted by Vihari Piratla <vi...@gmail.com>.
Thanks for the reply.

Hmm, I understand.
I know about AnalyzerWrapper, but that is not what I am looking for.

I also know about cloning and overriding. I want my analyzer to behave
exactly like EnglishAnalyzer, and right now I am copying the code from
EnglishAnalyzer to mimic its behavior, which is a dirty solution.
Is there any other proper solution to this problem?

Thank you.

On Mon, Jan 12, 2015 at 1:36 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi,
>
> Extending an existing Analyzer is not useful, because it is just a factory
> that returns a TokenStream instance to consumers. If you want to change the
> Tokenizer of an existing Analyzer, just clone it and rewrite its
> createComponents() method, see the example in the Javadocs:
> http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html
>
> If you want to add additional TokenFilters to the chain, you can do this
> with AnalyzerWrapper (
> http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/AnalyzerWrapper.html),
> but this does not work with Tokenizers, because those are instantiated
> before the TokenFilters which depend on them, so changing the Tokenizer
> afterwards is impossible.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Vihari Piratla [mailto:viharipiratla@gmail.com]
> > Sent: Monday, January 12, 2015 8:51 AM
> > To: java-user@lucene.apache.org
> > Subject: Custom tokenizer
> >
> > Hi,
> > I am trying to implement a custom tokenizer for my application, and I have
> > a few questions about it.
> > 1. Is there a way to give an existing analyzer (say, EnglishAnalyzer) a
> > custom tokenizer, so that it uses that tokenizer instead of, say,
> > StandardTokenizer?
> > 2. Why are analyzers such as StandardAnalyzer and EnglishAnalyzer declared
> > final? Because of this, I cannot extend them.
> >
> > Thank you.
> > --
> > V
>
>
>


-- 
V

RE: Custom tokenizer

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Extending an existing Analyzer is not useful, because it is just a factory that returns a TokenStream instance to consumers. If you want to change the Tokenizer of an existing Analyzer, just clone it and rewrite its createComponents() method, see the example in the Javadocs: http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html

If you want to add additional TokenFilters to the chain, you can do this with AnalyzerWrapper (http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/AnalyzerWrapper.html), but this does not work with Tokenizers, because those are instantiated before the TokenFilters which depend on them, so changing the Tokenizer afterwards is impossible.
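As a rough sketch of that wrapping approach (against the 4.10.3 API above; the class name FoldingEnglishAnalyzer and the choice of ASCIIFoldingFilter are only illustrative), appending one extra TokenFilter to EnglishAnalyzer's chain could look like this:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.util.Version;

public final class FoldingEnglishAnalyzer extends AnalyzerWrapper {

  private final Analyzer delegate = new EnglishAnalyzer(Version.LUCENE_4_10_3);

  public FoldingEnglishAnalyzer() {
    super(Analyzer.GLOBAL_REUSE_STRATEGY);
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName) {
    return delegate;
  }

  @Override
  protected TokenStreamComponents wrapComponents(String fieldName,
      TokenStreamComponents components) {
    // Keep the Tokenizer that EnglishAnalyzer created, and append one more
    // TokenFilter to the end of its existing filter chain.
    TokenStream extended = new ASCIIFoldingFilter(components.getTokenStream());
    return new TokenStreamComponents(components.getTokenizer(), extended);
  }
}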

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Vihari Piratla [mailto:viharipiratla@gmail.com]
> Sent: Monday, January 12, 2015 8:51 AM
> To: java-user@lucene.apache.org
> Subject: Custom tokenizer
> 
> Hi,
> I am trying to implement a custom tokenizer for my application, and I have
> a few questions about it.
> 1. Is there a way to give an existing analyzer (say, EnglishAnalyzer) a
> custom tokenizer, so that it uses that tokenizer instead of, say,
> StandardTokenizer?
> 2. Why are analyzers such as StandardAnalyzer and EnglishAnalyzer declared
> final? Because of this, I cannot extend them.
> 
> Thank you.
> --
> V

