Posted to java-user@lucene.apache.org by Steven Schlansker <st...@likeness.com> on 2013/06/12 01:52:59 UTC

Seemingly very difficult to wrap an Analyzer with CharFilter

Hi everyone,

I am trying to add a CharFilter to my Analyzer.  I started with a StandardAnalyzer wrapped with an ASCIIFoldingFilter.  Then I realized that it does not handle searches for names that include punctuation well: for example, I want a PrefixQuery "pf" to match "P.F. Chang's", or "zaras" to match "Zara's".

It seems that the easiest plan of attack here is to filter out all punctuation before analysis.  Per the Analyzer package documentation, that means I should use a CharFilter.
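
To make the goal concrete, here is a minimal sketch of the normalization, using only the JDK and no Lucene classes.  The names PunctuationNormalizer, normalize and prefixMatches are made up for illustration, and the startsWith check is only a stand-in for what a PrefixQuery would do against an indexed term.  It assumes ASCII punctuation; typographic apostrophes would need a broader pattern.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class PunctuationNormalizer {
    // \p{Punct} covers ASCII punctuation only (hypothetical choice for this sketch).
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    /** Strips punctuation and lower-cases the text. */
    public static String normalize(String text) {
        return PUNCT.matcher(text).replaceAll("").toLowerCase(Locale.ROOT);
    }

    /** Stand-in for a prefix query: both sides go through the same normalization. */
    public static boolean prefixMatches(String query, String name) {
        return normalize(name).startsWith(normalize(query));
    }

    public static void main(String[] args) {
        System.out.println(prefixMatches("pf", "P.F. Chang's"));  // prints "true"
        System.out.println(prefixMatches("zaras", "Zara's"));     // prints "true"
    }
}
```

The important part is that the same normalize step runs on both the indexed name and the query text, which is what doing this inside the Analyzer would give me for free.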

However, it seems next to impossible to actually insert a CharFilter into the analyzer!

The JavaDoc for Analyzer.initReader says "Override this if you want to insert a CharFilter".

If my code extends Analyzer, I can override initReader, but I cannot delegate createComponents to my base StandardAnalyzer, as it is protected.  I cannot override tokenStream to delegate to my base analyzer either, because it is final.  So a subclass of Analyzer seemingly cannot use another Analyzer to do its dirty work.

There is an AnalyzerWrapper class that seems perfect for what I want!  I can provide a base analyzer and only override the pieces that I want.  Except … initReader is overridden already to delegate to the base analyzer, and this override is "final"!  Bummer!

I guess I could put my Analyzer in the org.apache.lucene.analysis package so that it can access the protected createComponents method, but this seems like a disgustingly hacky way to bypass the public API that I really should use.

Am I missing something glaring here?  How can I amend a StandardAnalyzer to use a custom CharFilter?

Thanks for any guidance,
Steven


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Seemingly very difficult to wrap an Analyzer with CharFilter

Posted by Steven Schlansker <st...@likeness.com>.
On Jun 12, 2013, at 5:26 PM, Michael Sokolov <ms...@safaribooksonline.com> wrote:

> On 6/12/2013 7:02 PM, Steven Schlansker wrote:
>> On Jun 12, 2013, at 3:44 PM, Michael Sokolov <ms...@safaribooksonline.com> wrote:
>> 
>>> You may not have noticed that CharFilter extends Reader.  The expected pattern here is that you chain instances together -- your CharFilter should act as *input* to the Analyzer, I think.  Don't think in terms of extending these analysis classes (except the base ones designed for it): compose them so that each consumes the one before it
>>> 
>> Hi Mike,
>> 
>> Hm, that may work out.  I am a little surprised because I thought the intention is that you set the Analyzer up as part of the configuration, and when you add documents, the analyzer takes care of all text processing.  In particular this means that now I have to ensure that the same transformation is done at query time, and I thought the analyzer abstraction was supposed to avoid this.
>> 
>> But if this is how it should be done, it could work.  Thanks for the pointer.
>> 
>> Steven
>> 
>> 
> Um I'm sorry I was in a hurry and forgot to think... I went back and looked at my code and found the pattern was different from what I was thinking.  I have:
> 
> public final class DefaultAnalyzer extends Analyzer {
> 
>    @Override
>    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>        Tokenizer tokenizer = new StandardTokenizer(IndexConfiguration.LUCENE_VERSION, reader);
>        TokenStream tokenStream =  new LowerCaseFilter(IndexConfiguration.LUCENE_VERSION, tokenizer);
>        // ASCIIFoldingFilter
>        // Stemming
>        return new TokenStreamComponents(tokenizer, tokenStream);
>    }
> 
> }
> 
> You were exactly right that subclassing Analyzer and overriding the initReader is the way to go.
> The composition I was talking about can happen among filters.  I guess you have to duplicate the internals of StandardAnalyzer, but I don't think there's all that much in there?

You are right, it is not that hard.  It is only that my goal was to have "a StandardAnalyzer with a CharFilter" and I hate unnecessarily duplicating code :-)

But it seems that this is my only course of action.

> 
> I used AnalyzerWrapper for something -- um switching between multiple analyzers based on the input.  But it doesn't allow you to do anything with the internals of the analyzer(s) it wraps.

Yeah, this is a little unfortunate.  Just being able to override initReader would be nice.

Thanks for the pointers,
Steven




Re: Seemingly very difficult to wrap an Analyzer with CharFilter

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 6/12/2013 7:02 PM, Steven Schlansker wrote:
> On Jun 12, 2013, at 3:44 PM, Michael Sokolov <ms...@safaribooksonline.com> wrote:
>
>> You may not have noticed that CharFilter extends Reader.  The expected pattern here is that you chain instances together -- your CharFilter should act as *input* to the Analyzer, I think.  Don't think in terms of extending these analysis classes (except the base ones designed for it): compose them so that each consumes the one before it
>>
> Hi Mike,
>
> Hm, that may work out.  I am a little surprised because I thought the intention is that you set the Analyzer up as part of the configuration, and when you add documents, the analyzer takes care of all text processing.  In particular this means that now I have to ensure that the same transformation is done at query time, and I thought the analyzer abstraction was supposed to avoid this.
>
> But if this is how it should be done, it could work.  Thanks for the pointer.
>
> Steven
>
>
Sorry, I was in a hurry and didn't think it through.  I went back and
looked at my code and found that the pattern was different from what I
remembered.  I have:

public final class DefaultAnalyzer extends Analyzer {

     @Override
     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
         Tokenizer tokenizer = new StandardTokenizer(IndexConfiguration.LUCENE_VERSION, reader);
         TokenStream tokenStream = new LowerCaseFilter(IndexConfiguration.LUCENE_VERSION, tokenizer);
         // ASCIIFoldingFilter
         // Stemming
         return new TokenStreamComponents(tokenizer, tokenStream);
     }

}

You were exactly right that subclassing Analyzer and overriding
initReader is the way to go.
The composition I was talking about can happen among filters.  I guess
you have to duplicate the internals of StandardAnalyzer, but I don't
think there's all that much in there?

I used AnalyzerWrapper for something else: switching between multiple
analyzers based on the input.  But it doesn't allow you to do anything
with the internals of the analyzer(s) it wraps.

-Mike



Re: Seemingly very difficult to wrap an Analyzer with CharFilter

Posted by Steven Schlansker <st...@likeness.com>.
On Jun 12, 2013, at 3:44 PM, Michael Sokolov <ms...@safaribooksonline.com> wrote:

> You may not have noticed that CharFilter extends Reader.  The expected pattern here is that you chain instances together -- your CharFilter should act as *input* to the Analyzer, I think.  Don't think in terms of extending these analysis classes (except the base ones designed for it): compose them so that each consumes the one before it
> 

Hi Mike,

Hm, that may work out.  I am a little surprised, because I thought the intention was that you set the Analyzer up as part of the configuration, and when you add documents, the analyzer takes care of all text processing.  In particular, this means that I now have to ensure that the same transformation is applied at query time, and I thought the analyzer abstraction was supposed to avoid this.

But if this is how it should be done, it could work.  Thanks for the pointer.

Steven






Re: Seemingly very difficult to wrap an Analyzer with CharFilter

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
You may not have noticed that CharFilter extends Reader.  The expected
pattern here is that you chain instances together -- your CharFilter
should act as *input* to the Analyzer, I think.  Don't think in terms of
extending these analysis classes (except the base ones designed for it):
compose them so that each consumes the one before it.
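
As a rough illustration of the "a filter is just a Reader wrapping another Reader" idea, here is a sketch that uses only java.io and no Lucene classes.  PunctuationStrippingReader and drain are hypothetical names, and the set of dropped characters is an assumption (ASCII punctuation only).  A real Lucene CharFilter would also have to handle offset correction, which this stand-in ignores.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class PunctuationStrippingReader extends FilterReader {
    // Characters to drop (assumption: ASCII punctuation only).
    private static final String DROPPED = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    public PunctuationStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = in.read();                       // pull from the wrapped Reader
        } while (c != -1 && DROPPED.indexOf(c) >= 0);
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = in.read(cbuf, off, len);
            if (n == -1) {
                return -1;                       // end of the wrapped stream
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {        // compact kept chars in place
                char c = cbuf[off + i];
                if (DROPPED.indexOf(c) < 0) {
                    cbuf[off + kept++] = c;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was punctuation: read again rather than return 0.
        }
    }

    /** Reads a Reader to the end; helper for the demo only. */
    public static String drain(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[16];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader chained = new PunctuationStrippingReader(new StringReader("P.F. Chang's"));
        System.out.println(drain(chained));      // prints "PF Changs"
    }
}
```

Whatever consumes the outer Reader never sees the punctuation, and further wrappers can be stacked the same way.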

-Mike


