Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2007/08/09 16:36:44 UTC
special handling of certain terms with embedded periods
Is there a good way to handle the following scenario?

I have certain terms with embedded periods that I want to leave intact (not split at the periods). For example, in my application a particular skill might be SAP.FIN (SAP financial), and it should not be split into SAP and FIN. Is there a way to specify a list of terms such as these which should not be split?

I am currently using my own "SynonymAnalyzer", for which the token stream looks like the code below (pretty standard, I think), and where engine is a custom SynonymEngine where I provide the synonyms. Is there a typical way to handle this situation?
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SnowballFilter(
        new SynonymFilter(
            new StopFilter(
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(reader))),
                StandardAnalyzer.STOP_WORDS),
            engine),
        "English");
    return result;
}
Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com
Re: special handling of certain terms with embedded periods
Posted by Mark Miller <ma...@gmail.com>.
Donna L Gresh wrote:
>
> But your point about the StandardAnalyzer being slow is
> well-taken, and I'll keep that in mind.
A new StandardAnalyzer that is 6x faster was recently committed on the
trunk. Should be in next release.
- Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: special handling of certain terms with embedded periods
Posted by Donna L Gresh <gr...@us.ibm.com>.
thanks.
In this case it actually looks like I was trying to solve a problem
that doesn't exist (not an unusual occurrence in my experience)
since the StandardAnalyzer does not appear to split the terms
if the period has no white space following. I was a bit misled by
the additional complication that I am using the MoreLikeThis
class to construct the query, and it seemed to be dropping the
SAP.FIN term, apparently because it actually never appears in
my index to be searched, only in my input queries. In fact I may
decide to do some acronym expansion of this to allow it to
match things that *do* appear in my index.
But your point about the StandardAnalyzer being slow is
well-taken, and I'll keep that in mind. Also, the straightforward
substitution before indexing and searching is a reasonable
approach to keep in mind.
Thanks-
Donna
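The acronym expansion mentioned above could be sketched roughly as follows. This is only an illustration in plain Java, not the SynonymEngine from the original post: the class name AcronymExpander, its expand method, and the sample mapping of SAP.FIN to "SAP" and "financial" are all hypothetical. The idea is that a query term such as SAP.FIN is kept and also expanded into words that are more likely to occur in the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: expands known acronyms into extra terms.
class AcronymExpander {
    // Assumed lookup table; real entries would come from the application.
    private final Map<String, List<String>> expansions;

    AcronymExpander(Map<String, List<String>> expansions) {
        this.expansions = expansions;
    }

    // Returns the original token followed by any expansion words registered for it.
    List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        out.addAll(expansions.getOrDefault(token, List.of()));
        return out;
    }

    public static void main(String[] args) {
        AcronymExpander expander = new AcronymExpander(
            Map.of("SAP.FIN", List.of("SAP", "financial")));
        System.out.println(expander.expand("SAP.FIN")); // [SAP.FIN, SAP, financial]
        System.out.println(expander.expand("Java"));    // [Java]
    }
}
```

The lookup here is case-sensitive, so where it sits relative to a lowercasing step matters. In an analyzer chain like the one in the original post, the table would need lowercase keys if it runs after LowerCaseFilter.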
Re: special handling of certain terms with embedded periods
Posted by Erick Erickson <er...@gmail.com>.
Some possibilities...
- Write your own tokenizer and/or filter. If you alter the StandardTokenizer BNF grammar, you'll have to maintain your changes across later releases.
- Use some simple transformations on the input *before* tokenizing.
- There's been some discussion that StandardAnalyzer (and, I assume, the Standard* beasts) is slower than the other analyzers, so you may be better off eschewing them.
Best
Erick
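The "simple transformations before tokenizing" idea above might look something like the sketch below. It is plain Java with no Lucene dependency. The class name ProtectedTermRewriter and the placeholder scheme (replacing each period with the letters "dot") are assumptions made for illustration. Any placeholder would do as long as the tokenizer in use keeps it as a single token and the same rewrite is applied at both index time and query time.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step: rewrites protected terms before the text reaches the analyzer.
class ProtectedTermRewriter {
    // One compiled pattern per protected term, mapped to its replacement string.
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    ProtectedTermRewriter(List<String> protectedTerms) {
        for (String term : protectedTerms) {
            // Match the literal term only at word boundaries, so "XSAP.FINAL" is left alone.
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b");
            // Assumed placeholder scheme: "SAP.FIN" becomes "SAPdotFIN" (letters only).
            String placeholder = term.replace(".", "dot");
            rules.put(pattern, Matcher.quoteReplacement(placeholder));
        }
    }

    String rewrite(String text) {
        String result = text;
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            result = rule.getKey().matcher(result).replaceAll(rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        ProtectedTermRewriter rewriter = new ProtectedTermRewriter(List.of("SAP.FIN"));
        // prints: Skills: SAPdotFIN. Also SAP and FIN.
        System.out.println(rewriter.rewrite("Skills: SAP.FIN. Also SAP and FIN."));
    }
}
```

The trade-off is that the indexed term is the placeholder rather than the original string, so anything that displays terms back to users would need the reverse mapping.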
Re: special handling of certain terms with embedded periods
Posted by karl wettin <ka...@gmail.com>.
On 9 Aug 2007, at 16:36, Donna L Gresh wrote:
> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For example in my application a
> particular skill might be SAP.FIN (SAP financial), and it should not be
> split into SAP and FIN. Is there a way to specify a list of terms such as
> these which should not be split?
Updating the standard analyzer BNF to allow terms with punctuation is not a big deal. If there is a list of terms you want to allow, you would handle them in a TokenFilter. See StandardTokenizer and StandardFilter.
You might save a couple of clock ticks by implementing a BNF rule rather than a filter, though.
--
karl
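As a rough illustration of the filter-based approach, the sketch below shows only the core logic such a filter might apply: rejoining two adjacent tokens when their dotted combination is on a list of protected terms. It deliberately uses plain Java lists rather than Lucene's TokenFilter API, and the class name ProtectedTermMerger and its merge method are hypothetical. It also assumes the tokens have already been lowercased and that protected terms have exactly two parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the logic of a token filter; not a Lucene class.
class ProtectedTermMerger {
    // Rejoins adjacent tokens with a period when the joined form is a protected term.
    static List<String> merge(List<String> tokens, Set<String> protectedTerms) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String joined = tokens.get(i) + "." + tokens.get(i + 1);
                if (protectedTerms.contains(joined)) {
                    out.add(joined);
                    i += 2; // both halves consumed
                    continue;
                }
            }
            out.add(tokens.get(i));
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("sap", "fin", "consultant");
        // prints: [sap.fin, consultant]
        System.out.println(merge(tokens, Set.of("sap.fin")));
    }
}
```

One limitation of rejoining after the split is that the filter can no longer tell "SAP.FIN" from "SAP FIN" in the source text. A grammar rule in the tokenizer avoids that ambiguity, since the term is never split in the first place.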