Posted to java-user@lucene.apache.org by TimF <ti...@timflanders.com> on 2007/03/27 00:12:58 UTC
Custom Analyzer Help please
I would like to be able to get terms from my data that are a combination of
two existing analyzers.
I would like this for both posting and searching of various fields.
An example of the data might be as follows:
Hello XY&Z Corporation - abc@example.com
I would like the following terms to come out of the analyzer:
[hello] [xy&z] [corporation] [abc@example] [com] // this is the StandardAnalyzer output
as well as
[xy] [z] [abc] [example]
Essentially, I want the StandardAnalyzer output, but then I want to run the
StopAnalyzer on the terms that come out of the StandardAnalyzer. Basically I
would like to be able to search against part of the "special" word or the
whole "special" word, where special word contains tokens for things like
email and part numbers, etc...
I know the answer is that I have to create a custom analyzer that combines
the standard and stop analyzers, and I have tried... but I just cannot
figure out how to do this.
I have read through the LIA book and looked through the samples for keyword
and per-field analyzers, but they just don't do it.
Anyone have any samples that do this kind of thing?
Thanks,
Tim
--
View this message in context: http://www.nabble.com/Custom-Analyzer-Help-please-tf3469904.html#a9682794
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Custom Analyzer Help please
Posted by Grant Ingersoll <gr...@gmail.com>.
OK, gotcha. I now see what you mean. StandardAnalyzer uses the
StandardTokenizer, whereas StopAnalyzer uses the LowerCaseTokenizer,
which divides text at non-letters. What you most likely will need to
do is create a Tokenizer that outputs the original token, and outputs
the parts of it based on the LowerCaseTokenizer. Have a look at the
TokenStream API. Essentially, you need to implement the next()
method for your new Tokenizer. You probably could just have your
tokenizer wrap the other two, by using StandardTokenizer to get your
first level tokens, then, given a Token, run it through the
LowerCaseTokenizer to see if it has any values for next(), which can
be added to the stream.
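The idea described above (emit each first-level token, then emit its letter-only parts when they differ from it) can be sketched in plain Java. This is only an illustration of the control flow, not Lucene code: the class SubTokenSketch and its methods are hypothetical, firstLevelTokens is a crude whitespace-based stand-in for StandardTokenizer, and letterRuns is a stand-in for LowerCaseTokenizer. Stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical plain-Java sketch; it does not use or imitate the Lucene API. */
final class SubTokenSketch {

    /** Stand-in for the first-level tokenizer: split on whitespace and lowercase. */
    static List<String> firstLevelTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Skip pieces that contain no letter at all, such as a lone "-".
            if (raw.chars().anyMatch(Character::isLetter)) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Stand-in for a letter-based tokenizer: maximal runs of letters, lowercased. */
    static List<String> letterRuns(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                run.append(Character.toLowerCase(c));
            } else if (run.length() > 0) {
                parts.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        return parts;
    }

    /** Emits each whole token, followed by its parts when they add something new. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : firstLevelTokens(text)) {
            out.add(token);
            List<String> parts = letterRuns(token);
            // A single part identical to the token would only be a duplicate.
            boolean sameAsToken = parts.size() == 1 && parts.get(0).equals(token);
            if (!sameAsToken) {
                out.addAll(parts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello XY&Z Corporation - abc@example.com"));
    }
}
```

Because the whitespace stand-in keeps abc@example.com as one token, the whole-token output here will not match StandardTokenizer exactly. In a real Tokenizer the same loop would live inside next(), buffering the parts of the current token and returning them one call at a time before moving on to the next first-level token.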
Once you have your Tokenizer working, you can wrap it in your new
Analyzer and add the other filters as you see fit.
If you have "Lucene In Action", have a look at Chapter 4 for more
details on how Tokenizers and TokenFilters work.
HTH,
Grant
On Mar 28, 2007, at 11:18 AM, TimF wrote:
>
> Grant,
> Thanks for your reply and the pointer to the custom code sample. I will be
> checking into that today. I did delve into the src for the OOTB analyzers
> and was aware of what they did. Still, the StandardAnalyzer does not do what
> I want. The real difference between my needs and the results of the
> StandardAnalyzer is that what I want is the union of the StandardAnalyzer
> and the StopAnalyzer. If you refer back to my original example...
>
> An example of the data might be as follows:
> Hello XY&Z Corporation - abc@example.com
> I would like the following terms to come out of the analyzer:
> [hello] [xy&z] [corporation] [abc@example] [com] // this is the StandardAnalyzer output
> as well as
> [xy] [z] [abc] [example]
>
> I figured that creating a custom analyzer is the only way to do that, but
> unfortunately I am not that familiar with how the analyzers "really" work
> internally (I am more of a mathematician than a lexicographer).
>
> If you have any other thoughts or ideas I would love to hear.
> Thanks,
> Tim
>
> Grant Ingersoll-6 wrote:
>>
>> So, I think the answer is that StandardAnalyzer already has what you
>> state you want. Is it, perhaps, that certain stopwords that you are
>> interested in are not currently being stopped?
>>
>
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
Re: Custom Analyzer Help please
Posted by TimF <ti...@timflanders.com>.
Grant,
Thanks for your reply and the pointer to the custom code sample. I will be
checking into that today. I did delve into the src for the OOTB analyzers
and was aware of what they did. Still, the StandardAnalyzer does not do what
I want. The real difference between my needs and the results of the
StandardAnalyzer is that what I want is the union of the StandardAnalyzer
and the StopAnalyzer. If you refer back to my original example...
An example of the data might be as follows:
Hello XY&Z Corporation - abc@example.com
I would like the following terms to come out of the analyzer:
[hello] [xy&z] [corporation] [abc@example] [com] // this is the StandardAnalyzer output
as well as
[xy] [z] [abc] [example]
I figured that creating a custom analyzer is the only way to do that, but
unfortunately I am not that familiar with how the analyzers "really" work
internally (I am more of a mathematician than a lexicographer).
If you have any other thoughts or ideas I would love to hear.
Thanks,
Tim
Grant Ingersoll-6 wrote:
>
> So, I think the answer is that StandardAnalyzer already has what you
> state you want. Is it, perhaps, that certain stopwords that you are
> interested in are not currently being stopped?
>
--
View this message in context: http://www.nabble.com/Custom-Analyzer-Help-please-tf3469904.html#a9716016
Re: Custom Analyzer Help please
Posted by Grant Ingersoll <gs...@apache.org>.
Hi Tim,
From the StandardAnalyzer code, the TokenStream looks like:
/** Constructs a {@link StandardTokenizer} filtered by a {@link
    StandardFilter}, a {@link LowerCaseFilter} and a {@link
    StopFilter}. */
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
}
Whereas StopAnalyzer looks like:
/** Filters LowerCaseTokenizer with StopFilter. */
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StopFilter(new LowerCaseTokenizer(reader), stopWords);
}
So, I think the answer is that StandardAnalyzer already has what you
state you want. Is it, perhaps, that certain stopwords that you are
interested in are not currently being stopped?
Also, there is a whole section of examples on how to write Analyzers
in the contrib/analyzers section of the source code.
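Both tokenStream methods quoted above follow the same pattern: a tokenizer produces tokens and each filter wraps the stage before it. As a rough illustration of that wrapping pattern only, here is a plain-Java sketch over Iterator<String>. The class FilterChainSketch and its methods are hypothetical and do not use the Lucene API; letterTokens loosely mimics a letter-based tokenizer and stopFilter loosely mimics a stop-word filter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical plain-Java sketch of a tokenizer wrapped by a filter. */
final class FilterChainSketch {

    /** Source stage: lowercased runs of letters, split at every non-letter. */
    static Iterator<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!run.isEmpty()) {
                tokens.add(run);
            }
        }
        return tokens.iterator();
    }

    /** Filter stage: wraps another stage and silently drops stop words. */
    static Iterator<String> stopFilter(Iterator<String> input, Set<String> stopWords) {
        return new Iterator<String>() {
            private String pending = advance();

            // Pulls from the wrapped stage until a non-stop word turns up.
            private String advance() {
                while (input.hasNext()) {
                    String token = input.next();
                    if (!stopWords.contains(token)) {
                        return token;
                    }
                }
                return null;
            }

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public String next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                String result = pending;
                pending = advance();
                return result;
            }
        };
    }

    /** Drains the chain "letterTokens -> stopFilter" into a list. */
    static List<String> run(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        Iterator<String> chain = stopFilter(letterTokens(text), stopWords);
        while (chain.hasNext()) {
            out.add(chain.next());
        }
        return out;
    }
}
```

The point of the sketch is that changing what comes out of an analyzer means either swapping the source stage or adding another wrapping stage; the rest of the chain stays untouched.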
-Grant
On Mar 26, 2007, at 6:12 PM, TimF wrote:
>
> I would like to be able to get terms from my data that are a combination of
> two existing analyzers.
> I would like this for both posting and searching of various fields.
> An example of the data might be as follows:
> Hello XY&Z Corporation - abc@example.com
> I would like the following terms to come out of the analyzer:
> [hello] [xy&z] [corporation] [abc@example] [com] // this is the StandardAnalyzer output
> as well as
> [xy] [z] [abc] [example]
>
> Essentially, I want the StandardAnalyzer output, but then I want to run the
> StopAnalyzer on the terms that come out of the StandardAnalyzer. Basically I
> would like to be able to search against part of the "special" word or the
> whole "special" word, where special word contains tokens for things like
> email and part numbers, etc...
>
> I know the answer is that I have to create a custom analyzer that combines
> the standard and stop analyzers, and I have tried... but I just cannot
> figure out how to do this.
>
> I have read through the LIA book and looked through the samples for keyword
> and per-field analyzers, but they just don't do it.
>
> Anyone have any samples that do this kind of thing?
> Thanks,
> Tim
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ