Posted to solr-user@lucene.apache.org by Chris Harris <ry...@gmail.com> on 2009/11/13 21:48:17 UTC

Making search results more stable as index is updated

If documents are being added to and removed from an index (and commits
are being issued) while a user is searching, then the experience of
paging through search results using the obvious Solr mechanism
(&start=100&rows=10) may be disorienting for the user. For one
example, by the time the user clicks "next page" for the first time, a
document that they saw on page 1 may have been pushed onto page 2.
(This may be especially pronounced if docs are being sorted by date.)

I'm wondering what are the best options available for presenting a
more stable set of search results to users in such cases. The obvious
candidates to me are:

#1: Cache results in the user session of the web tier. (In particular,
maybe just cache the uniqueKey of each matching document.)

  Pro: Simple
  Con: May require capping the # of search results in order to make
the initial query (which now has Solr rows param >> web pageSize)
fast enough. For example, maybe it's only practical to cache the first
500 records.
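
For concreteness, here's a rough SolrJ sketch of #1 (the "id" uniqueKey
field, the 500 cap, and the class/method names are all just
placeholders, not anything Solr prescribes):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SessionResultCache {
        private static final int CAP = 500; // cap on cached results

        // Run the user's query once, fetching only uniqueKeys; the caller
        // stores the returned list in the HTTP session.
        public static List<String> fetchKeys(SolrClient solr, String userQuery)
                throws Exception {
            SolrQuery q = new SolrQuery(userQuery);
            q.setStart(0);
            q.setRows(CAP);    // rows >> web page size
            q.setFields("id"); // only the uniqueKey field
            QueryResponse rsp = solr.query(q);
            List<String> keys = new ArrayList<String>();
            for (SolrDocument doc : rsp.getResults()) {
                keys.add((String) doc.getFieldValue("id"));
            }
            return keys;
        }

        // Page n (0-based) of the cached keys; no further searching needed.
        public static List<String> page(List<String> keys, int n, int pageSize) {
            int from = Math.min(n * pageSize, keys.size());
            int to = Math.min(from + pageSize, keys.size());
            return keys.subList(from, to);
        }
    }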

#2: Create some kind of per-user results cache in Solr. (One simple
implementation idea: You could make your Solr search handler take a
userid parameter, and cache each user's last search in a special
per-user results cache. You then also provide an API that says, "give
me records n through m of userid #1334's last search". For your
subsequent queries, you consult the latter API rather than redoing
your search. Because Lucene docids are unstable across commits and
such, I think this means caching the uniqueKey of each matching
document. This in turn means looking up the uniqueKey of each matching
document at search time. It also means you can't use the existing Solr
caches, but need to make a new one.)

  Pro: Maybe faster than #1?? (Saves on data transfer between Solr and
web tier, at least during the initial query.)
  Con: More complicated than #1.
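
To make the cache part of #2 concrete, it might look something like the
sketch below (this is independent of how it would get wired into a
custom search handler, and all the names are made up):

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class PerUserResultsCache {
        private final Map<String, List<String>> lastSearch;

        public PerUserResultsCache(final int maxUsers) {
            // Keep only each user's most recent result set (as uniqueKeys),
            // LRU-evicting whole users so the cache stays bounded.
            this.lastSearch = Collections.synchronizedMap(
                new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(
                            Map.Entry<String, List<String>> eldest) {
                        return size() > maxUsers;
                    }
                });
        }

        // Replace the user's cached result set after a fresh search.
        public void put(String userId, List<String> uniqueKeys) {
            lastSearch.put(userId, uniqueKeys);
        }

        // "Give me records n (inclusive) through m (exclusive) of this
        // user's last search."
        public List<String> slice(String userId, int n, int m) {
            List<String> keys = lastSearch.get(userId);
            if (keys == null) {
                return Collections.emptyList(); // nothing cached; re-query
            }
            return keys.subList(Math.min(n, keys.size()),
                                Math.min(m, keys.size()));
        }
    }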

#3: Use filter queries to attempt to make your subsequent queries (for
page 2, page 3, etc.) return results consistent with your original
query. (One idea is to give each document a docAddedTimestamp field,
which would have precision down to the millisecond or something. On
your initial query, you could note the current time, T. Then for the
subsequent queries you add a filter query for docAddedTimestamp<=T.
With a trie date field this filter should be fast, and it should keep
any docs newly added after T from showing up in the user's search
results as they page through them. However, it won't
necessarily protect you from docs that were *reindexed* (i.e. re-add a
doc with the same uniqueKey as an existing doc) or docs that were
deleted.)

  Pro: Doesn't require a new cache, and no cap on # of search results
  Con: Maybe doesn't provide total stability.
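
In SolrJ terms, the pinned follow-up page queries might look roughly
like this (docAddedTimestamp is the hypothetical field from above):

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PinnedPaging {
        // Page queries pinned to time T, where T was noted when the
        // user's initial query ran.
        public static QueryResponse page(SolrClient solr, String userQuery,
                                         Date t, int page, int pageSize)
                throws Exception {
            // Solr range queries expect ISO-8601 UTC timestamps.
            SimpleDateFormat iso =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
            iso.setTimeZone(TimeZone.getTimeZone("UTC"));

            SolrQuery q = new SolrQuery(userQuery);
            // Inclusive upper bound, i.e. docAddedTimestamp <= T.
            q.addFilterQuery("docAddedTimestamp:[* TO " + iso.format(t) + "]");
            q.setStart(page * pageSize);
            q.setRows(pageSize);
            return solr.query(q);
        }
    }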

Any feedback on these options? Are there other ideas to consider?

Thanks,
Chris

Re: Making search results more stable as index is updated

Posted by Chris Hostetter <ho...@fucit.org>.
: If documents are being added to and removed from an index (and commits
: are being issued) while a user is searching, then the experience of
: paging through search results using the obvious Solr mechanism
: (&start=100&rows=10) may be disorienting for the user. For one
: example, by the time the user clicks "next page" for the first time, a
: document that they saw on page 1 may have been pushed onto page 2.
: (This may be especially pronounced if docs are being sorted by date.)

FWIW: I've found that in practice this doesn't confuse users as much as you 
might think.  People understand that data changes, especially when dealing 
with webpages, so they don't tend to freak out when results "shift" as 
long as it's clear why (i.e. the total number of results 
increases/decreases).

in the age of Twitter, providing consistent results to a user during their 
entire session may actually frustrate them more than having inconsistent 
results as the data changes ... if they *know* that updates are happening 
frequently on the backend, but their searches all look static, that can be 
an even more unpleasant user experience.

: #1: Cache results in the user session of the web tier. (In particular,
: maybe just cache the uniqueKey of each matching document.)

if it's really important to you, that would be my suggestion.  it has the 
advantage of requiring session affinity only at the highest level, without 
attempting to push it down into Solr.

:   Con: May require capping the # of search results in order to make
: the initial query (which now has Solr rows param >> web pageSize)
: fast enough. For example, maybe it's only practical to cache the first
: 500 records.

that's pretty much all that's practical for paginated search anyway.

: per-user results cache. You then also provide an API that says, "give
: me records n through m of userid #1334's last search". For your
: subsequent queries, you consult the latter API rather than redoing
: your search. Because Lucene docids are unstable across commits and
: such, I think this means caching the uniqueKey of each matching
: document. This in turn means looking up the uniqueKey of each matching

this still doesn't handle the case of documents getting deleted though ... 
now you have the uniqueKey of a doc you can't show anyway.
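
one way to paper over that is to re-fetch each page of cached keys at 
render time and just skip any key that no longer resolves ... roughly 
(assuming "id" is the uniqueKey field and keys contain no quote chars):

    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrDocumentList;

    public class PageRenderer {
        // Fetch the docs behind one page of cached keys; docs deleted
        // since the keys were cached simply come back missing.
        public static SolrDocumentList fetch(SolrClient solr,
                                             List<String> pageKeys)
                throws Exception {
            if (pageKeys.isEmpty()) {
                return new SolrDocumentList();
            }
            StringBuilder q = new StringBuilder("id:(");
            for (int i = 0; i < pageKeys.size(); i++) {
                if (i > 0) q.append(" OR ");
                q.append('"').append(pageKeys.get(i)).append('"');
            }
            q.append(')');
            SolrQuery query = new SolrQuery(q.toString());
            query.setRows(pageKeys.size());
            return solr.query(query).getResults();
        }
    }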




-Hoss


Re: Making search results more stable as index is updated

Posted by Lance Norskog <go...@gmail.com>.
This is one case where permanent caches are interesting. Another case
is highlighting: in some cases highlighting takes a lot of work, and
this work is not cached.

It might be a cleaner architecture to have session-maintaining code in
a separate front-end app, and leave Solr session-free.




-- 
Lance Norskog
goksron@gmail.com