Posted to java-user@lucene.apache.org by Bob Mason <bm...@library.ucsf.edu> on 2005/10/24 21:18:15 UTC
Is there a way to get absolutely exact phrase matching (no stop words, etc)
We have a large body of documents that have XML and OCR text embedded
within one of the XML fields.
Searches such as "group effect"
are returning hits for documents that include text like the following:
...group of ~a- The effect...
because, I take it, stop words like 'of' and 'the' and punctuation
are ignored. Is there anything I can do about this other
than writing an alternative to the StandardAnalyzer?
thanks,
Bob Mason
UCSF Tobacco Industry Digital Library
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Is there a way to get absolutely exact phrase matching (no stop words, etc)
Posted by Steven Rowe <sa...@syr.edu>.
Hi Bob,
StandardAnalyzer filters the token stream created by StandardTokenizer
through StandardFilter, LowerCaseFilter, and then StopFilter. Unless
you supply a stoplist to the StandardAnalyzer constructor, you get the
default set of English stopwords, from StopAnalyzer:
    public static final String[] ENGLISH_STOP_WORDS = {
      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "s", "such",
      "t", "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
    };
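To illustrate why the stoplist matters here, below is a minimal plain-Java
sketch. It is not Lucene code: the class StopListSketch, its methods, the
abbreviated stop set, and the crude regex tokenization are all hypothetical
stand-ins chosen for illustration. It only models the effect of dropping
stopwords versus keeping every token (as an empty stoplist would) on a
phrase check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of stopword removal; NOT the Lucene API.
final class StopListSketch {

    // Abbreviated, hypothetical stand-in for the default English stoplist.
    static final Set<String> DEFAULT_STOPS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Lowercase the text, split on non-alphanumeric runs, drop stopwords.
    // (A crude approximation of tokenizer + lowercase filter + stop filter.)
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    // True if the phrase terms occur consecutively in the token list.
    static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        return Collections.indexOfSubList(tokens, phrase) >= 0;
    }

    public static void main(String[] args) {
        String text = "...group of ~a- The effect...";
        List<String> phrase = Arrays.asList("group", "effect");

        // Stopwords removed: "group" and "effect" end up side by side.
        System.out.println(containsPhrase(analyze(text, DEFAULT_STOPS), phrase)); // true

        // Empty stoplist: "of", "a", "the" stay between them.
        System.out.println(containsPhrase(analyze(text, new HashSet<String>()), phrase)); // false
    }
}
```

With a real index, the same analyzer would need to be used both when
indexing and when parsing queries for this to behave consistently.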
One approach to the problem you're seeing is to advance the token
position in StopFilter with each stopword encountered, so that phrase
queries like
"group effect"
will fail to match against
"...group of ~a- The effect..."
because the positions for tokens "group" and "effect" would not be adjacent.
(My naive reading of StandardTokenizer.jj, the JavaCC grammar used to
create StandardTokenizer.java, is that "~a-" will generate a single
token "a", which will then be filtered out by StopFilter.)
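The position bookkeeping described above can be modelled in a few lines of
plain Java. This is only a sketch under stated assumptions, not the actual
StopFilter code: PositionGapSketch, Token, filter and adjacent are
hypothetical names, and the input is assumed to be already tokenized and
lowercased. With keepGaps set to false the removed stopwords leave no hole,
so "group" and "effect" become neighbours; with keepGaps set to true each
removed stopword still advances the position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of token positions around removed stopwords; NOT Lucene code.
final class PositionGapSketch {

    // A surviving token together with its absolute position in the field.
    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    // Drops stopwords. If keepGaps is true, every dropped stopword still
    // advances the position of the next surviving token by one.
    static List<Token> filter(List<String> words, Set<String> stops, boolean keepGaps) {
        List<Token> out = new ArrayList<>();
        int position = -1; // the first surviving token with no gap lands at 0
        int skipped = 0;   // stopwords dropped since the last surviving token
        for (String word : words) {
            if (stops.contains(word)) {
                skipped++;
                continue;
            }
            position += keepGaps ? 1 + skipped : 1;
            skipped = 0;
            out.add(new Token(word, position));
        }
        return out;
    }

    // True if some occurrence of `second` sits exactly one position after `first`.
    static boolean adjacent(List<Token> tokens, String first, String second) {
        for (Token a : tokens) {
            for (Token b : tokens) {
                if (a.text.equals(first) && b.text.equals(second)
                        && b.position == a.position + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("group", "of", "a", "the", "effect");
        Set<String> stops = new HashSet<>(Arrays.asList("of", "a", "the"));

        // No gaps: group=0, effect=1, so the phrase looks exact.
        System.out.println(adjacent(filter(words, stops, false), "group", "effect")); // true

        // Gaps kept: group=0, effect=4, so the phrase no longer matches.
        System.out.println(adjacent(filter(words, stops, true), "group", "effect")); // false
    }
}
```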
A patch implementing this approach was actually applied to
StopFilter.java in late 2003, but was reverted shortly afterward,
because this approach conflicts with the QueryParser and PhraseQuery
implementations.
See Doug Cutting's description of the problem with the position
increment modification approach here:
<http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200312.mbox/%3c3FCFB3CA.9000103@lucene.com%3e>
See a colored diff of StopFilter.java, just before and after the
position increment modification patch was reverted, here:
<http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopFilter.java?rev=150152&r1=150150&r2=150152&diff_format=h>
This modification is simple and straightforward. You could make the
same changes to a local copy of StopFilter (call it PosIncrStopFilter),
then create and use a StandardAnalyzer clone that uses PosIncrStopFilter
instead of StopFilter.
Good luck,
Steve Rowe