Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2013/11/22 17:32:38 UTC

[jira] [Updated] (SOLR-5463) Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")

     [ https://issues.apache.org/jira/browse/SOLR-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-5463:
---------------------------

    Attachment: SOLR-5463__straw_man.patch

bq. I disagree. The fieldDoc only contains the values that were sorted on. This is what is minimal and necessary to do paging

FieldDoc subclasses ScoreDoc which includes the internal docid -- and PagingFieldCollector does look at it.  But as you say: as long as we include uniqueKey in the fields (which i already mentioned) then the docid in the FieldDoc shouldn't matter since (i think?) it's only used as a tie breaker.
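To make the tie-breaker point concrete, here is a minimal sketch in plain Java. It does not use Lucene or Solr: the `Doc` class, the `price` field and `nextPage` are all hypothetical names invented for illustration. The "cursor" is the last document the client saw, and the next page is everything that sorts strictly after it, with uniqueKey breaking ties so no internal docid is needed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: none of these types exist in Lucene or Solr.
final class CursorPagingSketch {

    /** A document reduced to one sort value plus its uniqueKey. */
    static final class Doc {
        final String id;   // stands in for the uniqueKey field
        final long price;  // stands in for a numeric sort field
        Doc(String id, long price) { this.id = id; this.price = price; }
    }

    /** Sort by price ascending, then by uniqueKey so ties have a stable order. */
    static final Comparator<Doc> ORDER =
        Comparator.<Doc>comparingLong(d -> d.price).thenComparing(d -> d.id);

    /**
     * Returns up to pageSize docs that sort strictly after the cursor doc.
     * A null cursor means "start from the beginning".
     */
    static List<Doc> nextPage(List<Doc> index, Doc after, int pageSize) {
        List<Doc> sorted = new ArrayList<>(index);
        sorted.sort(ORDER);
        List<Doc> page = new ArrayList<>();
        for (Doc d : sorted) {
            if (page.size() >= pageSize) break;
            // Skip everything at or before the cursor position.
            if (after != null && ORDER.compare(d, after) <= 0) continue;
            page.add(d);
        }
        return page;
    }
}
```

Because the cursor is a set of sort values rather than a position in a frozen result set, a document deleted between two requests simply never shows up on a later page, which is the behavior wanted when paging without a searcher lease.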

bq. If solr wants to avoid lucene docids for some reason (e.g. because it does not yet implement searcher leases) ...

I'm glad you brought up searcher leases, because i wanted to mention it before but i forgot...

* I have no idea how to even try to implement searcher leases in a sane way in a distributed solr setup, given that we want clients to be able to hit any replica on subsequent requests.
* For my use cases, I actively do *NOT* want a searcher lease when doing deep paging: if documents matching my search, but on high pages i have not loaded yet, get deleted from the index, i don't want them included in the results once i get to that page just because they were a match X minutes ago when my search started.

I think what makes the most sense is to ensure we can support deep paging w/o searcher leases, and then if/when searcher leases are supported people who want both can have both.

----

I'm attaching my current progress with a straw man impl + tests.  It includes the basic functionality & tests for doing deep paging on a single node solr setup using numeric sorts.

There are an absurd number of nocommits in this patch: most of them are in the impl and i'm not worried about them because i'm hoping the impl can ultimately be thrown out; some are in the test because of additional tests i want to write; some are in the test because of silly limitations in the impl.

Only one class of nocommits really concerns me at this point and that's the issue of dealing with String sorts -- the way Solr's distributed sorting code deals with fields that use SortField.Type.STRING (and presumably SortField.Type.STRING_VAL) results in the coordinator node having a String object even though the underlying FieldComparator expects/uses BytesRef as the comparison value.

I could probably hack around this, and convert the Strings back to BytesRef myself in the DeepPaging code -- but this actually smells like a more fundamental problem we should address.  It seems to be the same root problem that sarowe has been looking into in SOLR-5354 in order to play nicer with custom FieldTypes: safely "serializing" the true sort object (regardless of what it is) between shards->coordinator, and then deserializing it & using the *real* FieldComparator for each field to do the aggregated sorting of the docs from each shard.
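As a rough picture of the String-to-bytes "hack" mentioned above, here is a stdlib-only sketch. It is not the patch and does not call Lucene: `toSortBytes` and `compareUnsigned` are hypothetical helpers, and the sketch assumes the comparison value is the UTF-8 encoding of the string, ordered byte by byte as unsigned values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: these helpers are not part of Lucene or Solr.
final class SortValueBytesSketch {

    /** Assumption: the comparison value is the UTF-8 encoding of the string. */
    static byte[] toSortBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Lexicographic comparison treating each byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        // A shorter array that is a prefix of the longer one sorts first.
        return a.length - b.length;
    }
}
```

The type matters for more than tidiness: `String.compareTo` orders by UTF-16 code units, while unsigned comparison of UTF-8 bytes orders by code point, so the two can disagree for characters outside the Basic Multilingual Plane. Comparing on the coordinator with a different notion of order than the shards used is one way merged results could come out wrong.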

----

In any case, my next step is to get some distributed tests set up and working against this straw man impl, and then dig into throwing away the straw man impl and trying to replace it with PagingFieldCollector -- possibly with a side diversion to help sarowe fix the underlying problems in SOLR-5354 first.


> Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5463
>                 URL: https://issues.apache.org/jira/browse/SOLR-5463
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>         Attachments: SOLR-5463__straw_man.patch
>
>
> I'd like to revisit a solution to the problem of "deep paging" in Solr, leveraging an HTTP based API similar to how IndexSearcher.searchAfter works at the lucene level: require the clients to provide back a token indicating the sort values of the last document seen on the previous "page".  This is similar to the "cursor" model I've seen in several other REST APIs that support "pagination" over large sets of results (notably the twitter API and its "since_id" param) except that we'll want something that works with arbitrary multi-level sort criteria that can be either ascending or descending.
> SOLR-1726 laid some initial groundwork here and was committed quite a while ago, but the key bit of argument parsing to leverage it was commented out due to some problems (see comments in that issue).  It's also somewhat out of date at this point: at the time it was committed, IndexSearcher only supported searchAfter for simple scores, not arbitrary field sorts; and the params added in SOLR-1726 suffer from this limitation as well.
> ---
> I think it would make sense to start fresh with a new issue with a focus on ensuring that we have deep paging which:
> * supports arbitrary field sorts in addition to sorting by score
> * works in distributed mode
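For illustration, the "token indicating the sort values of the last document seen" from the description could look something like the sketch below. The wire format is entirely hypothetical (a count followed by length-prefixed strings, Base64 URL-encoded) and is not what the patch or Solr uses. It only shows that a multi-level list of sort values can travel through an HTTP param as one opaque string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical illustration only: this token format is invented for the sketch.
final class CursorTokenSketch {

    /** Packs the sort values of the last doc seen into a URL-safe token. */
    static String encode(List<String> sortValues) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(sortValues.size());
            for (String v : sortValues) {
                out.writeUTF(v); // length-prefixed, so no separator escaping is needed
            }
            out.flush();
            return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reverses encode(): returns the sort values in their original order. */
    static List<String> decode(String token) {
        try {
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(Base64.getUrlDecoder().decode(token)));
            int count = in.readInt();
            List<String> values = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                values.add(in.readUTF());
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A client would treat the token as opaque and hand it back unchanged on the next request. The sketch carries every value as a string, which sidesteps the question of how to serialize the true sort object for each field type.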


