You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Daniel B. Davis" <db...@smart.net> on 2004/02/15 21:49:17 UTC

'Sponsored' links

I am a newbie to Lucene, and this is my first serious posting
to Lucene-user.

This is to solicit comment upon the problem of supplying
a "sponsored links" capability within Lucene. This capability
would not affect at all which documents are returned by a query,
but would cause any 'sponsored' documents present among the
results to be displayed before other documents in the list
returned.

I have looked over the correspondence in Lucene-user, but
not found anything addressing this topic; if I have missed it,
please tell me where and when, and ignore the rest of this.

It seems to me that there are three ways to achieve the
capability:

1. Preset boost values for 'sponsored' documents, with an
    implied burden of reindexing when sponsors are modified.

2. Post-qualify documents present in the hit list for their
    sponsorship status, building a new hit list.

3. Modify the query to search using both the full query as
    an unsponsored boolean clause with the default boost value,
    and for each sponsor, to repeat the full query ANDed with
    that sponsor with the appropriate boost value.

Are there other strategies not considered?

Assuming a small list of sponsors (10 or fewer), and low
volatility amongst the sponsors (1 change / month or less)
which method is best?

I have been pursuing method #1, almost to the exclusion of
the others, but have encountered an unknown difficulty in the
implementation (separate posting).  In particular, while it is clear
that #3 is doable, I know nothing about the searching burden
added by multiplying the user's query by one plus the count of
sponsors.

Regarding #3, if my understanding is right, then:
    Sponsors name: s1, s2, s3 ...
             words or phrases: s1w1, s1w2, ... , s2w1, s2w2, ... , s3w1 ...
             boost values: s1v, s2v, s3v

    then given query q as user input, form:
             q
             or (q and (s1w1 | s1w2 | s1w3 | ...)^s1v)
             or (q and (s2w1 | s2w2 ...)^s2v)
             or (q and (s3w1 ...)^s3v)
Is this correct?

Does the strategy of search identify any kind of intermediate
sublist to speed up searching? (But then it would start to
resemble #2.)

Rolling ones own for #2 would run query q, and get the
HitCollector. Separately running queries for each of:
             s1w1 | s1w2 | s1w3 | ...,
             s2w1 | s2w2 ...
             s3w1 ...
and merge each hit collector with the one from query q.
(Just AND the bitsets???) Lastly adjust scores and form
a new composite HitCollecter.  By this time I have told
everyone much more than I know.

Stray thought:-- can HitCollectors be cached at application init?

There are many other questions regarding details of implementation,
but their proper place is another communication.

Just by preparing this document for dissemination has helped
greatly.  All and any comments are much appreciated.

Thank you all.



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: 'Sponsored' links

Posted by "Daniel B. Davis" <db...@smart.net>.

At 02:27 PM 2/16/2004 -0800, you wrote:

>Daniel B. Davis wrote:
>>Are there other strategies not considered?
>
>Why not store sponsored documents in a separate index, separately 
>searched, whose results are placed above those from the non-sponsored 
>documents?
>
>Doug

What would that do?  Each separate index would need to be searched for 
eligible documents.  The idea is to find all qualifying documents held at 
the site, then to prefer certain ones by displaying them before others.  I 
understand Lucene does this naturally using the score achieved by the 
search process. I understand that 'boost' affects that score, and thus, the 
display order. Is this incorrect or inaccurate?  What is gained by a 
separate index?  One separate index? A separate index for each 
sponsor?  What if the same document has multiple sponsors?  In #1 and #2 of 
the original discussion, the maximum sponsored value could be used to set 
the boost, but at different times.

>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: 'Sponsored' links

Posted by Doug Cutting <cu...@apache.org>.

Daniel B. Davis wrote:
> Are there other strategies not considered?

Why not store sponsored documents in a separate index, separately 
searched, whose results are placed above those from the non-sponsored 
documents?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org