Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2013/11/19 02:34:19 UTC
[jira] [Commented] (SOLR-5463) Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")

    [ https://issues.apache.org/jira/browse/SOLR-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826056#comment-13826056 ] 

Hoss Man commented on SOLR-5463:
--------------------------------


I've been reading up on the internals of IndexSearcher.searchAfter and the associated PagingFieldCollector (as well as some of the problems encountered in SOLR-1726), and I'm not convinced it would be a slam dunk to try and use them directly in Solr:

* IndexSearcher.searchAfter/PagingFieldCollector relies on the "client" (ie: Solr) passing back the FieldDoc of the last doc returned, and has expectations that the (lucene) docid contained in that FieldDoc will be meaningful
** We could perhaps serialize a representation of the "last" FieldDoc to include in the response of each request, and then deserialize that into a suitable imposter object on the "searchAfter" request -- but there is still the problem of the internal docid, which will be misleading in a multishard distributed Solr setup
* There are a variety of code paths in SolrIndexSearcher for executing searches, and it's not immediately obvious (to me) if/when it would make sense to augment each of those paths with PagingFieldCollector (see yonik's comment in SOLR-1726 about faceting).

With that in mind, the approach I'm going to pursue (largely for my own sanity) is:

* Attempt a minimally invasive straw man implementation of "searchAfter" type functionality that works in distributed mode -- ideally w/o modifying any existing Solr code.
* Use this straw man implementation to sanity check that the end user API is useful
* Build up good comprehensive (passing) tests against this straw man
* Circle back and revisit the implementation details looking for opportunities to:
** refactor to eliminate duplicated or near-duplicate code
** improve performance

My current idea is to implement this straw man solution using a new SearchComponent that would run _after_ QueryComponent, along the lines of...

* prepare:
** No-Op unless "searchAfter" param is specified
*** Use some marker value to mean "first page"
** assert that start==0 (doesn't make sense when using searchAfter)
** assert that uniqueKey is one of the sort fields (to ensure consistent ordering)
** if searchAfter param value indicates this is not the first request: 
*** deserialize the token into a list of sort values
*** add a new PostFilter that restricts to documents based on those values and the sort directions (same basic logic as PagingFieldCollector)
* process:
** No-Op unless "searchAfter" param is specified
** do nothing if this is a shard request
** for regular old single node solr requests: serialize the sort values of the last doc in the Doc List (that QueryComponent has already built) and put it in the response as the "next" searchAfter token
* finishStage:
** No-Op unless "searchAfter" param is specified and stage is "DONE"
** serialize the sort values of the last doc in the Doc List (that QueryComponent already merged) and put it in the response as the "next" searchAfter token
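
To make the token and filter steps above a bit more concrete, here is a rough, self-contained sketch of the two core pieces: round-tripping the last doc's sort values through an opaque token, and deciding whether a candidate doc sorts strictly after those values given the sort directions. Everything in it is hypothetical: the class and method names, the "*" first-page marker, and the length-prefixed Base64 token format are assumptions made for illustration, not existing Solr or Lucene APIs. It also treats every sort value as a String, where a real implementation would need type-aware encoding and comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical sketch; none of these names exist in Solr or Lucene. */
final class SearchAfterSketch {

    /** Assumed marker value meaning "first page"; '*' never occurs in a Base64 token. */
    static final String FIRST_PAGE = "*";

    /** Serialize the last doc's sort values into an opaque, URL-safe token. */
    static String encodeToken(List<String> sortValues) {
        StringBuilder sb = new StringBuilder();
        for (String v : sortValues) {
            // Length-prefix each value so it may contain any character.
            sb.append(v.length()).append(':').append(v);
        }
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    /** Deserialize a token produced by encodeToken back into sort values. */
    static List<String> decodeToken(String token) {
        String raw = new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos < raw.length()) {
            int colon = raw.indexOf(':', pos);
            int len = Integer.parseInt(raw.substring(pos, colon));
            values.add(raw.substring(colon + 1, colon + 1 + len));
            pos = colon + 1 + len;
        }
        return values;
    }

    /**
     * True if a doc with the given sort values sorts strictly after the last
     * doc of the previous page. descending[i] is the direction of sort clause i.
     */
    static boolean isAfter(List<String> candidate, List<String> last, boolean[] descending) {
        for (int i = 0; i < last.size(); i++) {
            // Swap the operands for descending clauses instead of negating the result.
            int cmp = descending[i]
                    ? last.get(i).compareTo(candidate.get(i))
                    : candidate.get(i).compareTo(last.get(i));
            if (cmp != 0) {
                return cmp > 0;
            }
        }
        // All values equal: this is the last doc itself, which has already been returned.
        return false;
    }

    public static void main(String[] args) {
        boolean[] descending = {true, false}; // e.g. sort=name desc, id asc
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("b", "1"), Arrays.asList("b", "2"), Arrays.asList("a", "3"));
        String token = encodeToken(docs.get(1)); // pretend page one ended at (b, 2)
        List<String> last = decodeToken(token);
        for (List<String> doc : docs) {
            if (isAfter(doc, last, descending)) {
                System.out.println("next page: " + doc);
            }
        }
    }
}
```

The final "return false" is why the uniqueKey needs to be one of the sort fields: without a tie-breaker, two distinct docs could have identical sort values, and the second one would be skipped when it falls on a page boundary.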




> Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5463
>                 URL: https://issues.apache.org/jira/browse/SOLR-5463
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>
> I'd like to revisit a solution to the problem of "deep paging" in Solr, leveraging an HTTP based API similar to how IndexSearcher.searchAfter works at the lucene level: require the clients to provide back a token indicating the sort values of the last document seen on the previous "page".  This is similar to the "cursor" model I've seen in several other REST APIs that support "pagination" over large sets of results (notably the twitter API and its "since_id" param) except that we'll want something that works with arbitrary multi-level sort criteria that can be either ascending or descending.
> SOLR-1726 laid some initial ground work here and was committed quite a while ago, but the key bit of argument parsing to leverage it was commented out due to some problems (see comments in that issue).  It's also somewhat out of date at this point: at the time it was committed, IndexSearcher only supported searchAfter for simple scores, not arbitrary field sorts; and the params added in SOLR-1726 suffer from this limitation as well.
> ---
> I think it would make sense to start fresh with a new issue with a focus on ensuring that we have deep paging which:
> * supports arbitrary field sorts in addition to sorting by score
> * works in distributed mode


