
Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2007/08/09 16:36:44 UTC

special handling of certain terms with embedded periods

Is there a good way to handle the following scenario?

I have certain terms with embedded periods that I want to leave intact
(not split at the periods). For example, in my application a particular
skill might be SAP.FIN (SAP financial), and it should not be split into
SAP and FIN. Is there a way to specify a list of such terms that should
not be split? I am currently using my own "SynonymAnalyzer", whose token
stream is shown below (pretty standard, I think); engine is a custom
SynonymEngine through which I provide the synonyms.
Is there a typical way to handle this situation?

public TokenStream tokenStream(String fieldName, Reader reader) {
    // Innermost first: tokenize, normalize tokens, lower-case,
    // remove stop words, add synonyms, then stem.
    TokenStream result = new SnowballFilter(
        new SynonymFilter(
            new StopFilter(
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(reader))),
                StandardAnalyzer.STOP_WORDS),
            engine),
        "English");
    return result;
}

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com

Re: special handling of certain terms with embedded periods

Posted by Mark Miller <ma...@gmail.com>.

Donna L Gresh wrote:
>
> But your point about the StandardAnalyzer being slow is 
> well-taken, and I'll keep that in mind. 
A new StandardAnalyzer that is 6x faster was recently committed on the
trunk. It should be in the next release.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: special handling of certain terms with embedded periods

Posted by Donna L Gresh <gr...@us.ibm.com>.

Thanks.
In this case it actually looks like I was trying to solve a problem
that doesn't exist (not an unusual occurrence in my experience)
since the StandardAnalyzer does not appear to split the terms
if the period has no white space following. I was a bit misled by
the additional complication that I am using the MoreLikeThis
class to construct the query, and it seemed to be dropping the
SAP.FIN term, apparently because it actually never appears in
my index to be searched, only in my input queries. In fact I may
decide to do some acronym expansion of this to allow it to
match things that *do* appear in my index.

But your point about the StandardAnalyzer being slow is
well-taken, and I'll keep that in mind. Also, the straightforward
substitution before indexing and searching is a reasonable
approach to keep in mind.

Thanks-
Donna

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com

"Erick Erickson" <er...@gmail.com> 
08/09/2007 12:09 PM

Some possibilities...
- write your own tokenizer and/or filter. If you alter your BNF,
  you'll have to maintain it in later releases.
- use some simple transformations for the input *before* tokenizing.
- there's been some discussion that StandardAnalyzer (and, I assume,
  the Standard* beasts) are slower than the other analyzers, so you
  may be better off eschewing them.

Best
Erick


Re: special handling of certain terms with embedded periods

Posted by Erick Erickson <er...@gmail.com>.

Some possibilities...
- write your own tokenizer and/or filter. If you alter your BNF,
  you'll have to maintain it in later releases.
- use some simple transformations for the input *before* tokenizing.
- there's been some discussion that StandardAnalyzer (and, I assume,
  the Standard* beasts) are slower than the other analyzers, so you
  may be better off eschewing them.
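
As an illustration of the second option (transforming the input before
tokenizing), here is a minimal sketch in plain Java. It is not Lucene
API: ProtectedTermRewriter, rewrite, and the period-free replacement form
(SAP.FIN -> sapfin) are all assumptions made for the example. The same
rewrite would have to be applied to both the indexed text and the query
text so that the two sides still match.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): rewrites protected terms before analysis. */
final class ProtectedTermRewriter {
    private final Pattern pattern;

    ProtectedTermRewriter(List<String> protectedTerms) {
        StringBuilder alternation = new StringBuilder();
        for (String term : protectedTerms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each term so its period matches literally, not "any character".
            alternation.append(Pattern.quote(term));
        }
        // Match whole terms only: no word character directly before or after.
        pattern = Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)",
                Pattern.CASE_INSENSITIVE);
    }

    /** Replaces each protected term with a period-free, lower-cased form. */
    String rewrite(String text) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group().replace(".", "").toLowerCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(token));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The rewritten token still passes through the rest of the analysis chain,
so a stemmer or synonym step may alter it further.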

Best
Erick



Re: special handling of certain terms with embedded periods

Posted by karl wettin <ka...@gmail.com>.

On 9 Aug 2007, at 16:36, Donna L Gresh wrote:

> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave
> them intact (not split at the periods). For example in my application
> a particular skill might be SAP.FIN (SAP financial), and it should not
> be split into SAP and FIN. Is there a way to specify a list of terms
> such as these which should not be split?

Updating the standard analyzer BNF to allow terms with punctuation is
not a big deal. If there is a list of terms you want to allow, you would
handle them in a TokenFilter. See StandardTokenizer and StandardFilter.

You might save a couple of clock ticks by implementing a BNF rule rather
than a filter though.
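
To illustrate the filter approach, here is a rough sketch of the
re-joining logic in plain Java. It works on a list of strings rather than
on Lucene Token objects, so it is a stand-in for the body of a
TokenFilter, not an actual one; ProtectedTermJoiner and join are
hypothetical names. It assumes the tokenizer has already split SAP.FIN
into two tokens, and it handles two-part terms only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a TokenFilter; operates on plain strings. */
final class ProtectedTermJoiner {
    // Protected terms, stored lower-cased for case-insensitive lookup.
    private final Set<String> protectedTerms = new HashSet<>();

    ProtectedTermJoiner(Collection<String> terms) {
        for (String term : terms) {
            protectedTerms.add(term.toLowerCase(Locale.ROOT));
        }
    }

    /** Re-joins adjacent tokens whose period-joined form is a protected term. */
    List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String candidate = (tokens.get(i) + "." + tokens.get(i + 1))
                        .toLowerCase(Locale.ROOT);
                if (protectedTerms.contains(candidate)) {
                    out.add(candidate); // emit one joined token
                    i += 2;             // and skip both pieces
                    continue;
                }
            }
            out.add(tokens.get(i));
            i++;
        }
        return out;
    }
}
```

Note that a join done after tokenization cannot tell "sap.fin" from
"sap fin" in the original text, since the punctuation is gone by then;
a rule in the tokenizer grammar would not have that limitation.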


-- 
karl
