Posted to java-user@lucene.apache.org by Peter Posselt Vestergaard <pp...@hotmail.com> on 2004/12/20 15:24:07 UTC

analyzer effecting phrases?

Hi
I am building an index of texts, each related to a unique id. The unique ids
might contain a number of underscores, which makes the StandardAnalyzer
shorten them once it sees a second underscore in a row. Furthermore, many
of the texts I am indexing are in Italian, so the removal of 'trivial' words
done by the StandardAnalyzer is not necessarily meaningful for these texts.
Therefore I am instead using an analyzer made from the WhitespaceTokenizer
and the LowerCaseFilter.
This works fine for me until I try searching for a phrase. I am searching
for a simple phrase containing two words, with double quotes around it. I
have found the phrase in one of the texts, so I know it should return at
least one result, but none is found. If I remove the double quotes and
search for the two words with AND between them, I do find the story.
Can anyone tell me if this is an obvious (side) effect of not using the
StandardAnalyzer? And is there a better solution to my problem than using
this very simple analyzer?
Best regards
Peter Vestergaard
PS: I use the same analyzer for both searching and indexing (of course).

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: analyzer effecting phrases?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

When searching for phrases, what matters is the position of each
token/word extracted by the Analyzer.
WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
positional information. Is there nothing else in your Analyzer?

In any case, the following should help you see what your Analyzer is
doing:
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis
You can augment the code there to provide positional information, too.

Otis
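
[Editor's sketch, not part of the original reply.] To illustrate why token
positions matter for a phrase query but not for an AND query, here is a
minimal plain-Java sketch. It does not use the Lucene API: the class name
PositionDump and its methods are hypothetical, and analyze() is only an
assumed stand-in for a WhitespaceTokenizer + LowerCaseFilter chain (split on
whitespace, lowercase, nothing else). The sample text is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PositionDump {

    // Hypothetical stand-in for WhitespaceTokenizer + LowerCaseFilter:
    // split on whitespace and lowercase each token. This is not Lucene code.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A phrase matches only if its tokens occur at consecutive positions.
    public static boolean phraseMatches(List<String> doc, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= doc.size(); start++) {
            if (doc.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    // An AND query only needs every term to occur somewhere in the text.
    public static boolean allTermsPresent(List<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    public static void main(String[] args) {
        List<String> doc = analyze("La Storia di Roma");

        // Dump each token with its position, as the reply suggests doing
        // for the real Analyzer.
        for (int pos = 0; pos < doc.size(); pos++) {
            System.out.println(pos + ": [" + doc.get(pos) + "]");
        }

        List<String> query = analyze("storia roma");
        System.out.println("AND match:    " + allTermsPresent(doc, query));
        System.out.println("phrase match: " + phraseMatches(doc, query));
    }
}
```

In this sketch "storia" and "roma" both occur, so the AND check succeeds, but
they are not adjacent, so the phrase check fails. Dumping the real Analyzer's
tokens and positions in the same way should show whether the two query words
actually end up as adjacent, identical tokens in the index.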
