Posted to dev@lucene.apache.org by Ahmed El-dawy <as...@gmail.com> on 2005/10/03 09:47:14 UTC

Reordering search results

Hello,
  I have made a new Analyzer that does the following:
1- Remove common prefixes and suffixes. For example, uncommon will be
converted to common.
2- Change some letters in the words with common spelling mistakes. For
example, wellcome will be changed to welcome.
3- Stop words are removed.

I have no problem with the analyzer. My problem is that, because of
all these changes, the search results are less accurate. I need a way
to reorder the results such that:
1- Documents whose words are closer to the original search terms get a
higher score. For example, if I search for "wellcome",
Document("wellcome") must rank above Document("welcome").
2- Documents in which the search terms are close to each other get a
higher score. For example, if I search for "welcome there",
Document("welcome there") must rank above Document("welcome all
there"). Note that "all" is a stop word in my implementation.

Which classes should I extend, or which interfaces should I implement,
to accomplish this?
I have tried loading all the search results into memory and reordering
them myself, but loading them took too long.
I need the reordering to be done by Lucene as part of producing the
search results (Hits).

Thanks in advance

--
regards,
Ahmed Saad

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Reordering search results

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Oct 6, 2005, at 8:28 AM, Ahmed El-dawy wrote:
> Thanks for your help.
> I used PhraseQuery to boost close terms. I have an idea for stop
> words, but I don't know if it has any drawbacks. I can index a dummy
> Token in place of each stop word. This token will never be searched
> but it will be counted as a Token and will make a space between words.
> Does this solution have any drawbacks?

There is no need to index a dummy token to make a space.  You can
simply set the position increment on the 2nd token to 2, which
means 2 positions past the last one.  The default is 1, meaning
successive positions.
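
To make the position arithmetic concrete, here is a minimal plain-Java
sketch. It is a model only, not Lucene code: PositionGapSketch, Tok,
analyze and positions are hypothetical names invented for illustration.
In a real TokenFilter the equivalent step would be calling
setPositionIncrement on the token that follows the removed stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of position increments; none of these types are Lucene classes.
class PositionGapSketch {

    /** A surviving token and its distance from the previous surviving token. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Drops stop words, adding 1 to the next token's increment per dropped word. */
    static List<Tok> analyze(String text, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        int increment = 1; // default: successive positions
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (stopWords.contains(word)) {
                increment++; // leave a hole instead of indexing a dummy token
                continue;
            }
            out.add(new Tok(word, increment));
            increment = 1;
        }
        return out;
    }

    /** Absolute positions: running sum of the increments, starting from -1. */
    static int[] positions(List<Tok> tokens) {
        int[] result = new int[tokens.size()];
        int position = -1;
        for (int i = 0; i < tokens.size(); i++) {
            position += tokens.get(i).positionIncrement;
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("all"));
        System.out.println(Arrays.toString(positions(analyze("welcome there", stop))));
        System.out.println(Arrays.toString(positions(analyze("welcome all there", stop))));
    }
}
```

With "all" as the only stop word, "welcome there" ends up at positions
0 and 1, while "welcome all there" ends up at positions 0 and 2, so the
two documents are no longer identical in the index.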

     Erik





Re: Reordering search results

Posted by Ahmed El-dawy <as...@gmail.com>.

Thanks for your help.
I used PhraseQuery to boost close terms. I have an idea for stop
words, but I don't know if it has any drawbacks. I can index a dummy
Token in place of each stop word. This token will never be searched
but it will be counted as a Token and will make a space between words.
Does this solution have any drawbacks?


On 10/3/05, Joaquin Delgado <jo...@oracle.com> wrote:
> Chris, you may consider using a modified version of the Nutch analysis
> (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/package-summary.html)
> which has a very slick treatment of stopwords. Please refer to chapter
> 4, page 145 of "Lucene in Action", written by Erik and Otis, for some
> details about the Nutch implementation.
>
> -- J.D.
>
> Erik Hatcher wrote:
>
> >
> > On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
> >
> >>> 1- Words in Document that are more close to original search terms  have
> >>> a larger Score. For example, if I was searching for "wellcome",
> >>> Document("wellcome") must be better than Document("welcome")
> >>>
> >>
> >> I'm just "thinking out loud" here, but some ideas that come to mind
> >> are:  Index both the original text (with spelling errors), and the
> >> spelling-corrected text.  When you search, search on both the
> >> corrected text, and in a non-required query clause search on the
> >> uncorrected text, maybe boosted down a bit.  This way, if the spelling
> >> was correct, it will match both the original term and the corrected
> >> term (since they're the same), but a document with a misspelling would
> >> match only the corrected term.  You'll have to experiment with boosts
> >> and relevance/rankings here.
> >>
> >> Another idea is, if you know the number of misspellings made at
> >> indexing time (it seems like you do), then boost documents based on
> >> the number of spelling errors -- higher boost factor for fewer errors.
> >
> >
> > Another tip is that the score is based on term frequency - so when
> > tokenizing correctly spelled words, emit multiple copies of them to
> > weight the score towards them.
> >
> >>> 2- Documents that have search terms close to each other, have a  larger
> >>> Score. For example, if I was searching for "welcome there",
> >>> Document("welcome there") must be better than Document("welcome all
> >>> there"). Note that "all" is a stop word in my implementation.
> >>>
> >>
> >> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores higher for
> >> terms that are closer together.  You can construct the PhraseQuery
> >> yourself (programmatically), or QueryParser takes it as:
> >>
> >> "welcome there"~99999
> >>
> >> (with the quotes)  99999 is the slop factor, which means to accept
> >> documents where "welcome" is within 99999 positions from "there".
> >
> >
> > The issue is that "all" is a stop word, though.  The StopFilter does
> > not leave a hole when stop words are removed, so indexing "welcome
> > all there" is exactly the same as indexing "welcome there" as far as
> > the index is concerned.  I started to address this situation in the
> > 1.4.x Lucene releases but it introduced a backward incompatible issue
> > so we reverted.  Care must be taken on the Query side of things -
> > PhraseQuery did not deal with anything but term position increments
> > of 1, but this has been addressed in the latest codebase (in
> > Subversion).
> >
> > I built a PositionalStopFilter for the Analysis chapter of "Lucene in
> > Action" and discussed these details there - it is available in the
> > code .zip at http://www.lucenebook.com
> >
> >     Erik
> >


--
regards,
Ahmed Saad


Re: Reordering search results

Posted by Joaquin Delgado <jo...@oracle.com>.

Chris, you may consider using a modified version of the Nutch analysis 
(http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/package-summary.html) 
which has a very slick treatment of stopwords. Please refer to chapter
4, page 145 of "Lucene in Action", written by Erik and Otis, for some
details about the Nutch implementation.

-- J.D.

Erik Hatcher wrote:

>
> On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>
>>> 1- Words in Document that are more close to original search terms  have
>>> a larger Score. For example, if I was searching for "wellcome",
>>> Document("wellcome") must be better than Document("welcome")
>>>
>>
>> I'm just "thinking out loud" here, but some ideas that come to mind
>> are:  Index both the original text (with spelling errors), and the
>> spelling-corrected text.  When you search, search on both the
>> corrected text, and in a non-required query clause search on the
>> uncorrected text, maybe boosted down a bit.  This way, if the spelling
>> was correct, it will match both the original term and the corrected
>> term (since they're the same), but a document with a misspelling would
>> match only the corrected term.  You'll have to experiment with boosts
>> and relevance/rankings here.
>>
>> Another idea is, if you know the number of misspellings made at
>> indexing time (it seems like you do), then boost documents based on
>> the number of spelling errors -- higher boost factor for fewer errors.
>
>
> Another tip is that the score is based on term frequency - so when
> tokenizing correctly spelled words, emit multiple copies of them to
> weight the score towards them.
>
>>> 2- Documents that have search terms close to each other, have a  larger
>>> Score. For example, if I was searching for "welcome there",
>>> Document("welcome there") must be better than Document("welcome all
>>> there"). Note that "all" is a stop word in my implementation.
>>>
>>
>> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores higher for
>> terms that are closer together.  You can construct the PhraseQuery
>> yourself (programmatically), or QueryParser takes it as:
>>
>> "welcome there"~99999
>>
>> (with the quotes)  99999 is the slop factor, which means to accept
>> documents where "welcome" is within 99999 positions from "there".
>
>
> The issue is that "all" is a stop word, though.  The StopFilter does  
> not leave a hole when stop words are removed, so indexing "welcome  
> all there" is exactly the same as indexing "welcome there" as far as  
> the index is concerned.  I started to address this situation in the  
> 1.4.x Lucene releases but it introduced a backward incompatible issue  
> so we reverted.  Care must be taken on the Query side of things -  
> PhraseQuery did not deal with anything but term position increments  
> of 1, but this has been addressed in the latest codebase (in  
> Subversion).
>
> I built a PositionalStopFilter for the Analysis chapter of "Lucene in
> Action" and discussed these details there - it is available in the
> code .zip at http://www.lucenebook.com
>
>     Erik
>



Re: Reordering search results

Posted by Chris Lamprecht <cl...@gmail.com>.

> > "welcome there"~99999
> >
>
> The issue is that "all" is a stop word, though.  The StopFilter does
> not leave a hole when stop words are removed, so indexing "welcome
> all there" is exactly the same as indexing "welcome there" as far as
> the index is concerned.  I started to address this situation in the
> 1.4.x Lucene releases but it introduced a backward incompatible issue
> so we reverted.  Care must be taken on the Query side of things -
> PhraseQuery did not deal with anything but term position increments
> of 1, but this has been addressed in the latest codebase (in
> Subversion).

Good point.  I've been using 1.9 for so long now I think of it as the
"latest version" :)   For what it's worth, 1.9 has been completely
stable in (high volume) production, and performance is better too
(mostly due to the new BooleanScorer, I think).

-chris


Re: Reordering search results

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
>> 1- Words in Document that are more close to original search terms  
>> have
>> a larger Score. For example, if I was searching for "wellcome",
>> Document("wellcome") must be better than Document("welcome")
>>
>
> I'm just "thinking out loud" here, but some ideas that come to mind
> are:  Index both the original text (with spelling errors), and the
> spelling-corrected text.  When you search, search on both the
> corrected text, and in a non-required query clause search on the
> uncorrected text, maybe boosted down a bit.  This way, if the spelling
> was correct, it will match both the original term and the corrected
> term (since they're the same), but a document with a misspelling would
> match only the corrected term.  You'll have to experiment with boosts
> and relevance/rankings here.
>
> Another idea is, if you know the number of misspellings made at
> indexing time (it seems like you do), then boost documents based on
> the number of spelling errors -- higher boost factor for fewer errors.

Another tip is that the score is based on term frequency - so when
tokenizing correctly spelled words, emit multiple copies of them to
weight the score towards them.
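
As a rough illustration of that tip, the plain-Java sketch below emits
each correctly spelled word more than once and compares the resulting
term-frequency factor. Everything in it is hypothetical: the
CORRECTIONS table stands in for the custom analyzer's rules, the
copiesForCorrect parameter is invented, and the square-root factor is
an assumption modelled on how Lucene's default similarity is commonly
described, not the full scoring formula.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "emit extra copies of correctly spelled words"; not Lucene code.
class TermFrequencyWeightSketch {

    // Hypothetical correction table standing in for the analyzer's spelling rules.
    static final Map<String, String> CORRECTIONS = new HashMap<>();
    static {
        CORRECTIONS.put("wellcome", "welcome");
    }

    /** Emits the corrected form once for a misspelling, copiesForCorrect times otherwise. */
    static List<String> analyze(String text, int copiesForCorrect) {
        List<String> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String fixed = CORRECTIONS.getOrDefault(word, word);
            int copies = fixed.equals(word) ? copiesForCorrect : 1;
            for (int i = 0; i < copies; i++) {
                out.add(fixed);
            }
        }
        return out;
    }

    /** Assumed term-frequency factor: square root of the raw count. */
    static double tf(List<String> tokens, String term) {
        int freq = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                freq++;
            }
        }
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        System.out.println(tf(analyze("welcome", 2), "welcome"));
        System.out.println(tf(analyze("wellcome", 2), "welcome"));
    }
}
```

In a real analyzer the extra copies would probably be emitted with a
position increment of 0 so that they stack on the same position and do
not shift the positions that phrase queries rely on.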

>> 2- Documents that have search terms close to each other, have a  
>> larger
>> Score. For example, if I was searching for "welcome there",
>> Document("welcome there") must be better than Document("welcome all
>> there"). Note that "all" is a stop word in my implementation.
>>
>
> PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores higher for
> terms that are closer together.  You can construct the PhraseQuery
> yourself (programmatically), or QueryParser takes it as:
>
> "welcome there"~99999
>
> (with the quotes)  99999 is the slop factor, which means to accept
> documents where "welcome" is within 99999 positions from "there".

The issue is that "all" is a stop word, though.  The StopFilter does  
not leave a hole when stop words are removed, so indexing "welcome  
all there" is exactly the same as indexing "welcome there" as far as  
the index is concerned.  I started to address this situation in the  
1.4.x Lucene releases but it introduced a backward incompatible issue  
so we reverted.  Care must be taken on the Query side of things -  
PhraseQuery did not deal with anything but term position increments  
of 1, but this has been addressed in the latest codebase (in  
Subversion).

I built a PositionalStopFilter for the Analysis chapter of "Lucene in
Action" and discussed these details there - it is available in the
code .zip at http://www.lucenebook.com

     Erik



Re: Reordering search results

Posted by Chris Lamprecht <cl...@gmail.com>.

Hi Ahmed,

> 2- Change some letters in the words with common spelling mistakes. For
> example, wellcome will be changed to welcome.

Sounds pretty cool

> 1- Words in Document that are more close to original search terms have
> a larger Score. For example, if I was searching for "wellcome",
> Document("wellcome") must be better than Document("welcome")

I'm just "thinking out loud" here, but some ideas that come to mind
are:  Index both the original text (with spelling errors), and the
spelling-corrected text.  When you search, search on both the
corrected text, and in a non-required query clause search on the
uncorrected text, maybe boosted down a bit.  This way, if the spelling
was correct, it will match both the original term and the corrected
term (since they're the same), but a document with a misspelling would
match only the corrected term.  You'll have to experiment with boosts
and relevance/rankings here.

Another idea is, if you know the number of misspellings made at
indexing time (it seems like you do), then boost documents based on
the number of spelling errors -- higher boost factor for fewer errors.
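
Both ideas can be sketched in a few lines of plain Java. This is a toy
scoring model, not Lucene's scoring: the CORRECTIONS table, the
originalBoost value and the 1 / (1 + errors) formula are all
assumptions picked for illustration, and each document is reduced to a
single word to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a required "corrected" clause plus an optional "original" clause.
class DualFieldScoreSketch {

    // Hypothetical correction table standing in for the analyzer's spelling rules.
    static final Map<String, String> CORRECTIONS = new HashMap<>();
    static {
        CORRECTIONS.put("wellcome", "welcome");
    }

    static String correct(String word) {
        return CORRECTIONS.getOrDefault(word, word);
    }

    /**
     * Required clause: the corrected forms must match, otherwise the score is 0.
     * Optional clause: an exact match on the uncorrected text adds originalBoost.
     */
    static double score(String docWord, String queryWord, double originalBoost) {
        if (!correct(docWord).equals(correct(queryWord))) {
            return 0.0;
        }
        double score = 1.0;
        if (docWord.equals(queryWord)) {
            score += originalBoost;
        }
        return score;
    }

    /** Assumed document boost that shrinks as the number of spelling errors grows. */
    static double errorBoost(int spellingErrors) {
        return 1.0 / (1 + spellingErrors);
    }

    public static void main(String[] args) {
        System.out.println(score("wellcome", "wellcome", 0.5));
        System.out.println(score("welcome", "wellcome", 0.5));
        System.out.println(errorBoost(0) + " vs " + errorBoost(2));
    }
}
```

For the query "wellcome", both documents satisfy the required clause,
but only Document("wellcome") also matches the optional clause on the
uncorrected text, so it scores higher. The actual boost values would
need the experimentation mentioned above.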

> 2- Documents that have search terms close to each other, have a larger
> Score. For example, if I was searching for "welcome there",
> Document("welcome there") must be better than Document("welcome all
> there"). Note that "all" is a stop word in my implementation.

PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores higher for
terms that are closer together.  You can construct the PhraseQuery
yourself (programmatically), or QueryParser takes it as:

"welcome there"~99999

(with the quotes)  99999 is the slop factor, which means to accept
documents where "welcome" is within 99999 positions from "there".
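
To show why a large slop still favors closer terms, here is a
simplified plain-Java model. It is not Lucene's phrase scorer: it only
handles two terms in order, and the 1 / (distance + 1) weight is an
assumption based on how the default similarity's sloppy-match factor
is usually described. The class and method names are hypothetical.

```java
// Simplified model of sloppy phrase matching for two in-order terms; not Lucene code.
class SloppyPhraseSketch {

    /**
     * Returns 0 when "first ... second" does not occur within the slop,
     * otherwise 1 / (distance + 1) for the closest in-order pair, where
     * distance is the number of positions between the two terms.
     */
    static double sloppyWeight(String[] docTokens, String first, String second, int slop) {
        int best = -1;
        for (int i = 0; i < docTokens.length; i++) {
            if (!docTokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < docTokens.length; j++) {
                if (!docTokens[j].equals(second)) {
                    continue;
                }
                int distance = j - i - 1;
                if (distance <= slop && (best < 0 || distance < best)) {
                    best = distance;
                }
            }
        }
        return best < 0 ? 0.0 : 1.0 / (best + 1);
    }

    public static void main(String[] args) {
        String[] close = "welcome there".split(" ");
        String[] far = "welcome to everyone over there".split(" ");
        System.out.println(sloppyWeight(close, "welcome", "there", 99999));
        System.out.println(sloppyWeight(far, "welcome", "there", 99999));
    }
}
```

With a slop of 99999 both documents match, but the adjacent pair gets
the full weight while the pair separated by three words gets a quarter
of it. With a small slop such as 1, the second document would not match
at all.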

-chris
