Posted to solr-user@lucene.apache.org by Bram Van Dam <br...@intix.eu> on 2014/12/22 10:59:35 UTC

SolrCloud & Paging on large indexes

Hi folks,

If I understand things correctly, you can use paging & sorting in a 
SolrCloud environment. However, if I request the first 10 documents, a 
distributed query will be launched to all shards requesting the top 10, 
and the resulting (Shards * 10) documents will then be sorted so that only 
the top 10 are returned.

This is fine.

But I'm a little worried when going beyond the first page ... This 
becomes (Page * shards * 10). I'm worried that in a 50 billion document 
setup paging will just explode.
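
To put rough numbers on the formula above, here is a minimal sketch of the arithmetic. It is not Solr code: the function name, the 1-based page convention, and the shard count of 20 in the usage loop are assumptions made purely for illustration.

```python
def entries_to_merge(page: int, rows: int, shards: int) -> int:
    """Estimate how many entries the coordinating node merges for a 1-based page."""
    if page < 1 or rows < 1 or shards < 1:
        raise ValueError("page, rows and shards must all be >= 1")
    # Each shard has to return its own top (page * rows) entries, because any
    # of them could end up on the requested page after the global merge.
    per_shard = page * rows
    return per_shard * shards


if __name__ == "__main__":
    # Hypothetical cluster with 20 shards and 10 results per page.
    for page in (1, 500, 76662):
        print(page, entries_to_merge(page, rows=10, shards=20))
```

The growth is linear in the page number, so the cost of page 1 says little about the cost of page 500.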

Does anyone have any experience with paging on large cloud setups? 
Positive or negative? Or can anyone offer some reassurances or words of 
caution with this approach?

Or should I tell my users that they can never go beyond Page X (which is 
fine if the alternative is hellfire and brimstone)?

Thanks,

  - Bram

Re: SolrCloud & Paging on large indexes

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Bram,

Make sure you have checked the documentation:
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: SolrCloud & Paging on large indexes

Posted by heaven <ah...@gmail.com>.
It would be cool to have the ability to get not only the next page cursor,
but the next several page cursors, or a set of cursors for a given window, so
we can draw page numbers. Not sure about the last page though.



--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Paging-on-large-indexes-tp4175535p4176044.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: SolrCloud & Paging on large indexes

Posted by Erick Erickson <er...@gmail.com>.
> Nobody will hit next 499 times, but a lot of our users skip to the last page quite often. Maybe I should make *that* as hard as possible. Hmm

Right. I'd actually argue that providing a "last page" link in this situation is

1) useless to the user, I mean what's the point? Curiosity? If it really _must_
be supported, Toke's approach is sneaky and elegant. Sort in reverse order and
give them the first page ;).

2) dangerous as you well know...

> several orders of magnitude larger than what was tested
> there, so I'm still a bit worried.

I sympathize, but somebody has to be first ;). Besides, the
current situation is untenable from what you're saying...

Good luck!
Erick


Re: SolrCloud & Paging on large indexes

Posted by Bram Van Dam <br...@intix.eu>.
On 12/23/2014 04:07 PM, Toke Eskildsen wrote:
> The beauty of the cursor is that it has little to no overhead, relative to a standard top-X sorted search. A standard search uses a sliding window over the full result set, as does a cursor-search. Same amount of work. It is just a question of limits for the window.

That is very good to hear. Thanks.

>> Nobody will hit next 499 times, but a lot of our users skip to the last
>> page quite often. Maybe I should make *that* as hard as possible. Hmm.
>
> Issue a search with sort in reverse order, then reverse the returned list of documents?

Sneaky. I like it. But in the end we're simply getting rid of the 
"last"-button. Solves a lot of issues. If you have a billion search results, 
you might as well refine your criteria!

  - Bram


RE: SolrCloud & Paging on large indexes

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Bram Van Dam [bram.vandam@intix.eu] wrote:

[Solr cursors]

> Oh thanks, that's a pretty interesting read. The scale we're
> investigating is several orders of magnitude larger than what was tested
> there, so I'm still a bit worried.

The beauty of the cursor is that it has little to no overhead, relative to a standard top-X sorted search. A standard search uses a sliding window over the full result set, as does a cursor-search. Same amount of work. It is just a question of limits for the window.

> The largest index I currently have access to is
> about a billion documents in size. Paging there is a nightmare, but the
> Solr version is too old to support cursors so I'm afraid I can't offer
> any useful data.

Non-cursor paging in Solr uses a sliding window sort with a heap that contains all documents up to the paging number. A heap is a very fine thing for sliding window sort, as long as it is small. But performance drops to horrible levels when it gets large as it is extremely RAM-cache unfriendly.
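
The difference in heap size can be illustrated with a small simulation. This is only a sketch and not Solr's implementation: plain integers stand in for sort values, they are assumed to be unique (much as a cursor sort in Solr needs a unique tiebreaker field), and both function names are made up for the example.

```python
import heapq
from typing import Iterable, List, Optional


def page_by_offset(values: Iterable[int], start: int, rows: int) -> List[int]:
    """Offset paging: collect the smallest (start + rows) values, then drop `start`."""
    # The selection has to track up to (start + rows) entries, so memory and
    # sorting work grow with the paging depth.
    window = heapq.nsmallest(start + rows, values)
    return window[start:]


def page_by_cursor(values: Iterable[int], after: Optional[int], rows: int) -> List[int]:
    """Cursor paging: ignore everything at or before the cursor, keep `rows` values."""
    candidates = (v for v in values if after is None or v > after)
    # The selection never tracks more than `rows` entries, however deep we are.
    return heapq.nsmallest(rows, candidates)


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8]
    print(page_by_offset(data, start=2, rows=2))
    print(page_by_cursor(data, after=2, rows=2))
```

Both calls walk the full input once and return the same page; only the number of entries held while doing so differs.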

> Does anyone have any performance data on multi-billion-document indexes?

Sorry, no. I could do a test on our 7-billion-document index, but it would have to wait until the end of January.

> Nobody will hit next 499 times, but a lot of our users skip to the last
> page quite often. Maybe I should make *that* as hard as possible. Hmm.

Issue a search with sort in reverse order, then reverse the returned list of documents?
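
A minimal sketch of that trick, under a few assumptions: the sort string consists of simple "field direction" clauses (no function queries containing commas), the total hit count is already known from an earlier response, and all three helper names are hypothetical.

```python
from typing import List, Tuple


def flip_sort(sort: str) -> str:
    """Flip every direction in a simple sort string such as 'price asc, id asc'."""
    flipped = []
    for clause in sort.split(","):
        field, direction = clause.strip().rsplit(None, 1)
        new_direction = "desc" if direction.lower() == "asc" else "asc"
        flipped.append(f"{field} {new_direction}")
    return ", ".join(flipped)


def last_page_request(sort: str, rows: int, num_found: int) -> Tuple[str, int]:
    """Return the (sort, rows) to request as the *first* page of the reversed order."""
    # The last page may be partial: 95 hits at 10 per page leaves 5 documents.
    remainder = num_found % rows
    return flip_sort(sort), (remainder or min(rows, num_found))


def restore_order(docs: List[dict]) -> List[dict]:
    """Reverse the returned documents so they read in the original sort order."""
    return list(reversed(docs))


if __name__ == "__main__":
    print(last_page_request("score desc, id asc", rows=10, num_found=95))
```

The request itself stays a cheap first-page query; only the sort direction and the final list order change.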

- Toke Eskildsen

Re: SolrCloud & Paging on large indexes

Posted by Bram Van Dam <br...@intix.eu>.
On 12/22/2014 04:27 PM, Erick Erickson wrote:
> Have you read Hossman's blog here?
> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#referrer=solr.pl

Oh thanks, that's a pretty interesting read. The scale we're 
investigating is several orders of magnitude larger than what was tested 
there, so I'm still a bit worried.

> Because if you're trying this and _still_ getting bad performance we
> need to know.

I'll definitely keep you posted when our test results on larger indexes 
(~50 billion documents) come in, but this sadly won't be any time soon 
(infrastructure sucks). The largest index I currently have access to is 
about a billion documents in size. Paging there is a nightmare, but the 
Solr version is too old to support cursors so I'm afraid I can't offer 
any useful data.

Does anyone have any performance data on multi-billion-document indexes? 
With or without SolrCloud?

> Bram:
> One minor pedantic clarification: the first round-trip only returns
> the id and sort criteria (score by default), not the whole document.
> The effect is the same, though: as you page deeper into the corpus,
> each shard returns rows * (pageNum + 1) entries by default (with
> pageNum counted from 0). Even worse, each node itself has to _sort_
> that many entries. Then a second call is made to get the page-worth
> of docs.

I was trying to keep it short and sweet, but yes, that's the way I think 
it works ;-)

> That said, though, it's pretty easy to argue that the 500th page is
> pretty useless, nobody will ever hit the "next page" button 499 times.

Nobody will hit next 499 times, but a lot of our users skip to the last 
page quite often. Maybe I should make *that* as hard as possible. Hmm.

Thanks for the tips!

  - Bram

Re: SolrCloud & Paging on large indexes

Posted by Erick Erickson <er...@gmail.com>.
Have you read Hossman's blog here?
https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#referrer=solr.pl

And how to use it is described here:
http://wiki.apache.org/solr/CommonQueryParameters#Deep_paging_with_cursorMark
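
For reference, a cursor loop might look roughly like the sketch below. The cursorMark parameter, its starting value "*", and the nextCursorMark response field are the documented ones; the helper names, the way the select URL is passed in, and the default page size are assumptions for this example. The sort must include the uniqueKey field as a tiebreaker.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator


def http_fetch(select_url: str) -> Callable[[Dict[str, str]], dict]:
    """Build a fetch function for a Solr select URL supplied by the caller."""
    def fetch(params: Dict[str, str]) -> dict:
        query = urllib.parse.urlencode(params)
        with urllib.request.urlopen(f"{select_url}?{query}") as response:
            return json.load(response)
    return fetch


def iterate_with_cursor(fetch: Callable[[Dict[str, str]], dict],
                        query: str, sort: str, rows: int = 100) -> Iterator[dict]:
    """Yield every matching document, one cursor page at a time."""
    cursor = "*"  # starting value for the first request
    while True:
        result = fetch({
            "q": query,
            "sort": sort,  # must end with the uniqueKey field, e.g. "score desc, id asc"
            "rows": str(rows),
            "wt": "json",
            "cursorMark": cursor,
        })
        yield from result["response"]["docs"]
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means the end was reached
            return
        cursor = next_cursor
```

Passing the fetch function in keeps the loop testable without a running Solr instance; in real use it would be something like iterate_with_cursor(http_fetch(url), "*:*", "id asc"), with url pointing at the collection's select handler.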

Because if you're trying this and _still_ getting bad performance we
need to know.

Bram:
One minor pedantic clarification: the first round-trip only returns
the id and sort criteria (score by default), not the whole document.
The effect is the same, though: as you page deeper into the corpus,
each shard returns rows * (pageNum + 1) entries by default (with
pageNum counted from 0). Even worse, each node itself has to _sort_
that many entries. Then a second call is made to get the page-worth
of docs.

About telling your users not to page past N... up to you, especially
if the deep paging stuff works as advertised (and I have no reason to
believe it doesn't).

That said, though, it's pretty easy to argue that the 500th page is
pretty useless, nobody will ever hit the "next page" button 499 times.

The different use-case, though, is when people want to return the
entire corpus for whatever reason and _must_ page through to the
end....

Best,
Erick


Re: SolrCloud & Paging on large indexes

Posted by Bram Van Dam <br...@intix.eu>.
On 12/22/2014 12:47 PM, heaven wrote:
> I have had very bad experiences with pagination on collections larger than a
> few million documents. Pagination becomes very, very slow. Just tried to
> switch to page 76662 and it took almost 30 seconds.

Yeah that's pretty much my experience, and I think SolrCloud would only 
exacerbate the problem (due to increased complexity of sorting). If 
there's no silver bullet to be found, I guess I'll just have to disable 
paging on large data sets -- which is fine, really, who the hell browses 
through 50 billion documents anyway? That's what search is for, right?

Thx,

  - Bram


Re: SolrCloud & Paging on large indexes

Posted by heaven <ah...@gmail.com>.
I have had very bad experiences with pagination on collections larger than a
few million documents. Pagination becomes very, very slow. Just tried to
switch to page 76662 and it took almost 30 seconds.

Solr now supports cursors which work fast and are useful for exports and
some data processing, but I don't see how I can use those to draw page
numbers and allow users to paginate through large data sets.


