Posted to java-user@lucene.apache.org by hugo burm <hu...@xs4all.nl> on 2002/02/13 17:32:07 UTC

How does Lucene handle phrases containing words that are not indexed?

How does Lucene handle phrases (literals) containing words that are not
indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
(the Lucene demo, my own 120000 XML documents, Cocoon search), and in all
cases a search for the phrase "a specification" also finds documents that
contain "the specification" (or "D. Washington" instead of
"G. Washington").

Of course you can change the indexing behaviour so that nothing is treated
as a stopword and all one-letter words and numbers are indexed. But that
seems a bad approach. A better approach: 1) find all indexed words in the
phrase and use them to find all documents containing these words; 2) check
the occurrence of the phrase by opening the original document. I am
wondering: does Lucene perform step 2)? Of course this step burns some CPU
cycles.
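
To make the two steps concrete, here is a minimal sketch in plain Java. It
does not use the Lucene API: the stop list, the class name and the method
names are all assumptions made for illustration, and step 1 scans the text
directly where a real system would consult the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PhraseVerifySketch {
    // Hypothetical stop list; a real analyzer's list would differ.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    // Lowercase the text and split it on non-alphanumerics, keeping every word.
    static List<String> allWords(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Step 1: the document contains every non-stop word of the phrase.
    // (Stands in for the index lookup; word order is ignored here.)
    static boolean isCandidate(String phrase, String document) {
        Set<String> docWords = new HashSet<>(allWords(document));
        for (String w : allWords(phrase)) {
            if (!STOP_WORDS.contains(w) && !docWords.contains(w)) {
                return false;
            }
        }
        return true;
    }

    // Step 2: the full phrase, stop words included, occurs as consecutive
    // words in the original text.
    static boolean containsExactPhrase(String phrase, String document) {
        List<String> p = allWords(phrase);
        List<String> d = allWords(document);
        return !p.isEmpty() && Collections.indexOfSubList(d, p) >= 0;
    }

    public static void main(String[] args) {
        String query = "a specification";
        String[] docs = {
            "This is a specification of the format.",
            "Read the specification first."
        };
        for (String doc : docs) {
            System.out.println("candidate=" + isCandidate(query, doc)
                    + " exact=" + containsExactPhrase(query, doc) + " : " + doc);
        }
    }
}
```

Both sample documents pass step 1, but only the first passes step 2, which
is the distinction asked about above. The cost is that step 2 has to read
and re-tokenize the original text of every candidate.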

Hugo

hugob@xs4all.nl


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: How does Lucene handle phrases containing words that are not indexed?

Posted by Julien Nioche <ju...@lingway.com>.

By the way, I was wondering: is there any Analyzer that uses the following
constructor?
  public Token(String text, int start, int end, String typ)

Maybe it would be interesting to build an analyzer that recognizes
punctuation marks and keeps them in the index as Tokens with a given type
(say, "punctuation")?
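
As a rough illustration of that idea, here is a self-contained sketch of a
tokenizer that emits punctuation marks as typed tokens. TypedToken is a
hypothetical stand-in that only mirrors the four constructor arguments
quoted above; it is not Lucene's Token class, and the type names "word" and
"punctuation" as well as the set of recognized marks are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationTokenizerSketch {
    // Hypothetical token: text, start/end character offsets, and a type.
    static final class TypedToken {
        final String text;
        final int start;
        final int end;
        final String type;

        TypedToken(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Group 1: a run of letters/digits. Group 2: a single punctuation mark.
    static final Pattern TOKEN =
            Pattern.compile("([\\p{L}\\p{N}]+)|([.,;:!?])");

    static List<TypedToken> tokenize(String input) {
        List<TypedToken> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            boolean isWord = m.group(1) != null;
            String text = isWord ? m.group(1).toLowerCase(Locale.ROOT) : m.group(2);
            out.add(new TypedToken(text, m.start(), m.end(),
                    isWord ? "word" : "punctuation"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (TypedToken t : tokenize("It is not personal. My computer hates me...")) {
            System.out.println(t.type + " [" + t.start + "," + t.end + ") " + t.text);
        }
    }
}
```

Note that in this sketch every punctuation mark takes up a token position,
so gaps between words are counted differently than in an index that drops
punctuation.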

The advantage is that this information could be used by the
SloppyPhraseScorer.phraseFreq() method to keep a PhraseQuery from matching
across a punctuation mark. Since PhraseQueries are used for compound words
(e.g. "personal computer") with a given slop value (say 3), it would be
great not to match things such as "It is not personal. My computer hates
me...".

One solution would be to set a slop value of zero, but that is not possible
in my case (I use a module that generates compound terms with slop values in
order to handle morphological variations, e.g. in French "gestion de la
casse" and "gestion des casses", which are represented by "gestion casse"^3
and "gestion casses"^3).

This involves creating a subclass of PhraseQuery, or modifying it by adding
a boolean, and modifying the phraseFreq() method so that it checks that
there is no Token with a punctuation type within the scope of the slop.
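
The check described here can be sketched outside Lucene as follows. This is
not the real SloppyPhraseScorer logic: Tok, matches() and the
blockOnPunctuation flag are hypothetical names, only in-order two-word
phrases are handled, and slop is simplified to "at most this many tokens in
between", with punctuation tokens counted as positions.

```java
import java.util.Arrays;
import java.util.List;

public class PunctuationAwarePhraseSketch {
    // Hypothetical token carrying only its text and its type.
    static final class Tok {
        final String text;
        final String type;

        Tok(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    static Tok word(String text) {
        return new Tok(text, "word");
    }

    static Tok punct(String text) {
        return new Tok(text, "punctuation");
    }

    static boolean isWord(Tok t, String text) {
        return "word".equals(t.type) && t.text.equals(text);
    }

    // True if `first` is followed by `second` with at most `slop` tokens in
    // between. With blockOnPunctuation set, a punctuation token between the
    // two words cancels the match.
    static boolean matches(List<Tok> doc, String first, String second,
                           int slop, boolean blockOnPunctuation) {
        for (int i = 0; i < doc.size(); i++) {
            if (!isWord(doc.get(i), first)) {
                continue;
            }
            // j - i - 1 is the number of tokens between positions i and j.
            for (int j = i + 1; j < doc.size() && j - i - 1 <= slop; j++) {
                Tok t = doc.get(j);
                if (blockOnPunctuation && "punctuation".equals(t.type)) {
                    break;
                }
                if (isWord(t, second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> doc = Arrays.asList(
                word("it"), word("is"), word("not"), word("personal"), punct("."),
                word("my"), word("computer"), word("hates"), word("me"));
        System.out.println(matches(doc, "personal", "computer", 3, false));
        System.out.println(matches(doc, "personal", "computer", 3, true));
    }
}
```

With the flag off, the sample sentence matches "personal computer" at slop
3; with the flag on, the full stop between the two words blocks the match.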

What do you think about it? Has anyone already tried something in that
direction? Does it imply heavy changes?

Hugo: maybe you could store your stopwords as tokens with a different type?

