Posted to java-user@lucene.apache.org by TimF <ti...@timflanders.com> on 2007/03/27 00:12:58 UTC

Custom Analyzer Help please

I would like to be able to get terms from my data that are a combination of
two existing analyzers.
I would like this for both posting and searching of various fields.
An example of the data might be as follows:
Hello XY&Z Corporation - abc@example.com
I would like the following terms to come out of the analyzer:
[hello] [xy&z] [corporation] [abc@example] [com]  // this is the StandardAnalyzer output
as well as
[xy] [z] [abc] [example]

Essentially, I want the StandardAnalyzer output, but then I want to run the
StopAnalyzer on the terms that come out of the StandardAnalyzer. Basically I
would like to be able to search against part of the "special" word or the
whole "special" word, where special word contains tokens for things like
email and part numbers, etc...

I know the answer is that I have to create a custom analyzer that combines
the standard and stop analyzers, and I have tried... but I just cannot
figure out how to do this.

I have read through the LIA book and looked through the samples for keyword
and per-field analyzers, but they just don't do it.

Anyone have any samples that do this kind of thing?
Thanks,
Tim
--
View this message in context: http://www.nabble.com/Custom-Analyzer-Help-please-tf3469904.html#a9682794
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Custom Analyzer Help please

Posted by Grant Ingersoll <gr...@gmail.com>.

OK, gotcha. I now see what you mean. StandardAnalyzer uses the
StandardTokenizer, whereas StopAnalyzer uses the LowerCaseTokenizer,
which divides text at non-letters. What you most likely will need to
do is create a Tokenizer that outputs the original token and then also
outputs the parts of it that the LowerCaseTokenizer would produce. Have
a look at the TokenStream API. Essentially, you need to implement the
next() method for your new Tokenizer. You could probably just have your
tokenizer wrap the other two: use StandardTokenizer to get your
first-level tokens, then, given a Token, run its text through the
LowerCaseTokenizer and add any tokens its next() returns to the stream.
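
To make the wrapping idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene classes: the class name ExpandingTokens, its next() method returning a String (null at end of input), and the helper splitAtNonLetters are all hypothetical stand-ins, and the first-level tokens are passed in as a list rather than produced by a real StandardTokenizer. It only shows the queueing logic a custom next() would need.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: emits each first-level token, then its letter-only parts. */
class ExpandingTokens {
    // Stands in for the output of a first-level tokenizer (an assumption, not Lucene).
    private final Iterator<String> firstLevel;
    // Sub-tokens that still have to be returned before reading the next first-level token.
    private final Deque<String> pending = new ArrayDeque<>();

    ExpandingTokens(List<String> firstLevelTokens) {
        this.firstLevel = firstLevelTokens.iterator();
    }

    /** Returns the next token, or null when the input is exhausted. */
    String next() {
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        if (!firstLevel.hasNext()) {
            return null;
        }
        String token = firstLevel.next().toLowerCase(Locale.ROOT);
        List<String> parts = splitAtNonLetters(token);
        // Queue the parts only if splitting produced something other than the token itself.
        boolean sameAsToken = parts.size() == 1 && parts.get(0).equals(token);
        if (!parts.isEmpty() && !sameAsToken) {
            pending.addAll(parts);
        }
        return token;
    }

    /** Splits text into maximal runs of letters, dropping everything else. */
    static List<String> splitAtNonLetters(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                run.append(c);
            } else if (run.length() > 0) {
                parts.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        return parts;
    }

    /** Convenience helper: pulls every token out of the stream into a list. */
    static List<String> drain(List<String> firstLevelTokens) {
        ExpandingTokens stream = new ExpandingTokens(firstLevelTokens);
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // First-level tokens taken from the example in the original question.
        System.out.println(drain(Arrays.asList("Hello", "XY&Z", "Corporation", "abc@example", "com")));
        // prints [hello, xy&z, xy, z, corporation, abc@example, abc, example, com]
    }
}
```

A real Tokenizer would also have to decide on offsets and position increments for the extra tokens; the sketch ignores both.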

Once you have your Tokenizer working, you can wrap it in your new
Analyzer and apply the other filters as you see fit.

If you have "Lucene In Action", have a look at Chapter 4 for more  
details on how Tokenizers and TokenFilters work.

HTH,
Grant

On Mar 28, 2007, at 11:18 AM, TimF wrote:

>
> Grant,
> Thanks for your reply and the pointer to the custom code sample. I will be
> checking into that today. I did delve into the src for the OOTB analyzers
> and was aware of what they did. Still, the StandardAnalyzer does not do what
> I want. The real difference between my needs and the results of the
> StandardAnalyzer is that what I want is the union of the StandardAnalyzer
> and the StopAnalyzer. If you refer back to my original example...
>
> An example of the data might be as follows:
>    Hello XY&Z Corporation - abc@example.com
> I would like the following terms to come out of the analyzer:
>  [hello]  [xy&z]  [corporation] [abc@example] [com]  //this is the
> StandardAnalyzer output
> as well as
>   [xy] [z]  [abc] [example]
>
> I figured that creating a custom analyzer is the only way to do that, but
> unfortunately I am not that familiar with how the analyzers "really" work
> internally (I am more of a mathematician than a lexicon).
>
> If you have any other thoughts or ideas I would love to hear.
> Thanks,
> Tim
>
> Grant Ingersoll-6 wrote:
>>
>> So, I think the answer is that StandardAnalyzer already has what you
>> state you want.  Is it, perhaps, that certain stopwords that you are
>> interested in are not currently being stopped?
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


Re: Custom Analyzer Help please

Posted by TimF <ti...@timflanders.com>.

Grant,
Thanks for your reply and the pointer to the custom code sample. I will be
checking into that today. I did delve into the src for the OOTB analyzers
and was aware of what they did. Still, the StandardAnalyzer does not do what
I want. The real difference between my needs and the results of the
StandardAnalyzer is that what I want is the union of the StandardAnalyzer
and the StopAnalyzer. If you refer back to my original example...

An example of the data might be as follows:
   Hello XY&Z Corporation - abc@example.com 
I would like the following terms to come out of the analyzer:
 [hello]  [xy&z]  [corporation] [abc@example] [com]  //this is the
StandardAnalyzer output
as well as
  [xy] [z]  [abc] [example] 

I figured that creating a custom analyzer is the only way to do that, but
unfortunately I am not that familiar with how the analyzers "really" work
internally (I am more of a mathematician than a lexicon).

If you have any other thoughts or ideas, I would love to hear them.
Thanks,
Tim

Grant Ingersoll-6 wrote:
> 
> So, I think the answer is that StandardAnalyzer already has what you  
> state you want.  Is it, perhaps, that certain stopwords that you are  
> interested in are not currently being stopped?
> 


Re: Custom Analyzer Help please

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Tim,

From the StandardAnalyzer code, the TokenStream looks like:

   /** Constructs a {@link StandardTokenizer} filtered by a {@link
       StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
   public TokenStream tokenStream(String fieldName, Reader reader) {
     TokenStream result = new StandardTokenizer(reader);
     result = new StandardFilter(result);
     result = new LowerCaseFilter(result);
     result = new StopFilter(result, stopSet);
     return result;
   }

Whereas StopAnalyzer looks like:

   /** Filters LowerCaseTokenizer with StopFilter. */
   public TokenStream tokenStream(String fieldName, Reader reader) {
     return new StopFilter(new LowerCaseTokenizer(reader), stopWords);
   }

So, I think the answer is that StandardAnalyzer already has what you  
state you want.  Is it, perhaps, that certain stopwords that you are  
interested in are not currently being stopped?
Also, there is a whole section of examples on how to write Analyzers  
in the contrib/analyzers section of the source code.
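
For readers who have not seen how such a chain fits together, the following is a rough, Lucene-free sketch of the StopAnalyzer shape shown above: a tokenizer that emits lowercased runs of letters, wrapped by a filter that drops stop words. The names StopChainSketch, Source, LetterSource and StopSource are invented for illustration, and the three-word stop list is an assumption, not Lucene's default set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopChainSketch {
    /** Hypothetical stand-in for a token stream: next() returns null at end of input. */
    interface Source {
        String next();
    }

    /** Emits lowercased runs of letters, i.e. it divides text at non-letters. */
    static class LetterSource implements Source {
        private final String text;
        private int pos = 0;

        LetterSource(String text) {
            this.text = text;
        }

        public String next() {
            // Skip anything that is not a letter.
            while (pos < text.length() && !Character.isLetter(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return null;
            }
            int start = pos;
            while (pos < text.length() && Character.isLetter(text.charAt(pos))) {
                pos++;
            }
            return text.substring(start, pos).toLowerCase(Locale.ROOT);
        }
    }

    /** Wraps another Source and silently skips tokens found in the stop set. */
    static class StopSource implements Source {
        private final Source input;
        private final Set<String> stopWords;

        StopSource(Source input, Set<String> stopWords) {
            this.input = input;
            this.stopWords = stopWords;
        }

        public String next() {
            for (String t = input.next(); t != null; t = input.next()) {
                if (!stopWords.contains(t)) {
                    return t;
                }
            }
            return null;
        }
    }

    /** Builds the chain (filter around tokenizer) and collects its output. */
    static List<String> analyze(String text, Set<String> stopWords) {
        Source chain = new StopSource(new LetterSource(text), stopWords);
        List<String> out = new ArrayList<>();
        for (String t = chain.next(); t != null; t = chain.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative stop list only; not the list Lucene ships with.
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "and"));
        System.out.println(analyze("The XY&Z Corporation - abc@example.com", stops));
        // prints [xy, z, corporation, abc, example, com]
    }
}
```

The point of the sketch is only the layering: each stage pulls from the one it wraps, which is the same pattern the two tokenStream() methods above follow.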

-Grant

On Mar 26, 2007, at 6:12 PM, TimF wrote:

>
> I would like to be able to get terms from my data that are a combination of
> two existing analyzers.
> I would like this for both posting and searching of various fields.
> An example of the data might be as follows:
>    Hello XY&Z Corporation - abc@example.com
> I would like the following terms to come out of the analyzer:
>  [hello]  [xy&z]  [corporation] [abc@example] [com]  //this is the
> StandardAnalyzer output
> as well as
>   [xy] [z]  [abc] [example]
>
> Essentially, I want the StandardAnalyzer output, but then I want to run the
> StopAnalyzer on the terms that come out of the StandardAnalyzer. Basically I
> would like to be able to search against part of the "special" word or the
> whole "special" word, where special word contains tokens for things like
> email and part numbers, etc...
>
> I know the answer is that I have to create a custom analyzer that combines
> the standard and stop analyzers, and I have tried... but I just cannot
> figure out how to do this.
>
> I have read through the LIA book and looked through the samples for keyword
> and perfield analyzers, but they just don't do it.
>
> Anyone have any samples that do this kind of thing?
> Thanks,
> Tim

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


