Posted to dev@lucene.apache.org by "Bill Bell (JIRA)" <ji...@apache.org> on 2010/11/03 03:26:26 UTC

[jira] Created: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Performance of start= and rows= parameters are exponentially slow with large data sets
--------------------------------------------------------------------------------------

                 Key: SOLR-2218
                 URL: https://issues.apache.org/jira/browse/SOLR-2218
             Project: Solr
          Issue Type: Improvement
          Components: Build
    Affects Versions: 1.4.1
            Reporter: Bill Bell


This applies to large data sets, > 10M rows.

Setting start=<large number> and rows=<large number> is slow, and gets slower the farther you get from start=0 with a complex query. Random sorting makes this even slower.

I would like to make looping through large data sets faster. It would be nice if we could pass a pointer to the result set to loop over, or support very large rows=<number> values.

Something like:
rows=1000
start=0
spointer=string_my_query_1

Then, within some interval (say 5 minutes), I can reference the same pointer to continue the loop:
Something like:
rows=1000
start=1000
spointer=string_my_query_1

What do you think? Since the data set is so large, the cache is not helping.
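To make the proposal concrete, here is a rough client-side sketch of how such a loop might look. Note that spointer is only the parameter proposed above, not an existing Solr feature, and the hostname is a placeholder:

{code}
import json
import urllib.parse
import urllib.request

BASE = "http://hostname/solr/select"  # placeholder host

def fetch_page(start, rows=1000):
    # "spointer" is the cursor parameter proposed in this issue; the
    # server would hold the result set open between requests.
    params = urllib.parse.urlencode({
        "q": "*:*", "wt": "json", "fl": "id",
        "rows": rows, "start": start,
        "spointer": "string_my_query_1",
    })
    with urllib.request.urlopen(BASE + "?" + params) as resp:
        return json.load(resp)["response"]["docs"]

start = 0
while True:
    docs = fetch_page(start)
    if not docs:
        break
    start += len(docs)  # within the timeout, the same pointer is reused
{code}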






[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928895#action_12928895 ] 

Lance Norskog commented on SOLR-2218:
-------------------------------------

The search returns many things, including a Solr issue with this title: "Enable sort by docid".





[jira] [Commented] (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "jess canabou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049020#comment-13049020 ] 

jess canabou commented on SOLR-2218:
------------------------------------

Hi all,

I'm a bit confused by this thread, but I think I have the same (or almost the same) issue. I'm searching an index with over 7,000,000 documents, using the start and rows parameters to query 30,000 records at a time, and the query times get increasingly large the further into the result set I get. Unlike Bill, I do not care about scores or relevancy, and I'm having difficulty understanding whether _docid_ is a suitable solution to my problem. Is there something I can simply tack onto the end of my query to speed it up? From what I understand, it should not be necessary to sort all the rows that come before the chunk of data I'm querying.
My query looks like this:
http://hostname/solr/select/?q=blablabla&version=2.2&start=4000000&rows=30000&indent=on&fl=<bunch of fields>

Any help would be greatly appreciated :)



[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Peter Karich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928536#action_12928536 ] 

Peter Karich commented on SOLR-2218:
------------------------------------

Lance, would you mind explaining this in a bit more detail? :-)

If I haven't misunderstood what Bill is requesting, the idea is to fetch all (or a lot of) documents from Solr even when the data set is very large. This would be very useful, IMHO.



[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Bill Bell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928913#action_12928913 ] 

Bill Bell commented on SOLR-2218:
---------------------------------

Lance, 

I know how to do that. That is not the issue. Let me explain again.

This is a performance issue.

When you page "deeply" through results, the queries get SLOWER and SLOWER.

1. http://hostname/solr/select?fl=id&start=0&rows=1000&q=*:*
<int name="QTime">2</int>

2. http://hostname/solr/select?fl=id&start=10000&rows=1000&q=*:*
<int name="QTime">8</int>

3. http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*
<int name="QTime">38</int>

It keeps getting slower!!

We need it to be consistently fast at QTIME=2.

Any solutions?
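A quick way to reproduce the measurement (the hostname and the *:* query are taken from the examples above; the rest is a sketch):

{code}
import json
import urllib.parse
import urllib.request

BASE = "http://hostname/solr/select"  # placeholder host

# Walk start= upward and print the QTime Solr reports for each page.
for start in (0, 10000, 20000, 40000, 80000):
    params = urllib.parse.urlencode({
        "q": "*:*", "fl": "id", "rows": 1000,
        "start": start, "wt": "json",
    })
    with urllib.request.urlopen(BASE + "?" + params) as resp:
        qtime = json.load(resp)["responseHeader"]["QTime"]
    print("start=%d: QTime=%d ms" % (start, qtime))
{code}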





[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928457#action_12928457 ] 

Lance Norskog commented on SOLR-2218:
-------------------------------------

There is a workaround for this called _docid_. 

http://www.lucidimagination.com/search/?q=_docid_#/p:solr
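Applied to the queries above, the workaround looks like this (sketch; _docid_ sorts by the internal Lucene document id, so Solr skips score-based collection entirely, but the order is index order rather than relevance):

http://hostname/solr/select?q=*:*&fl=id&start=20000&rows=1000&sort=_docid_+asc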



[jira] Issue Comment Edited: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Bill Bell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976606#action_12976606 ] 

Bill Bell edited comment on SOLR-2218 at 1/2/11 11:35 PM:
----------------------------------------------------------

Hoss,

So what you are saying is instead of:

1. http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*&sort=id asc

I should use:

LAST_ID=20000
1. http://hostname/solr/select?fl=id&rows=1000&q=*:*&sort=id asc&fq=id:[<LAST_ID> TO *]

This should definitely be faster. Unfortunately, I need the results by highest score. Does fq support score?

SCORE=5.6
1. http://hostname/solr/select?fl=id,score&rows=1000&q=*:*&sort=score desc&fq=id:[0 TO <SCORE>]

Thoughts?








[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978958#action_12978958 ] 

Hoss Man commented on SOLR-2218:
--------------------------------

bq. Unfortunately, I need the results by highest score. Does fq support score?

As I mentioned...

bq. if you are sorting on score this becomes trickier, but should be possible using the "frange" parser with the "query" function

I think something like...

{code}
LAST_SCORE=5.6
...?q=...&fq={!frange u=5.6}query($q)&sort=score+desc
{code}

...should work (but you have the issue of docs with identical scores to worry about -- something that's not a problem with uniqueIds)
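For completeness, a sketch of the full loop this implies (hostname, query, and field names are placeholders; as noted, ties in score can cause repeated or skipped documents, so the client deduplicates by id):

{code}
import json
import urllib.parse
import urllib.request

BASE = "http://hostname/solr/select"  # placeholder host
QUERY = "*:*"                         # placeholder query

seen, last_score = set(), None
while True:
    params = {"q": QUERY, "fl": "id,score", "rows": 1000,
              "sort": "score desc", "wt": "json"}
    if last_score is not None:
        # Only collect docs scoring at or below the last score seen, so
        # Solr never re-collects the pages we have already consumed.
        params["fq"] = "{!frange u=%s}query($q)" % last_score
    url = BASE + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        docs = json.load(resp)["response"]["docs"]
    new_docs = [d for d in docs if d["id"] not in seen]
    if not new_docs:
        break
    seen.update(d["id"] for d in new_docs)
    last_score = min(d["score"] for d in docs)
{code}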



[jira] Issue Comment Edited: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Bill Bell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976606#action_12976606 ] 

Bill Bell edited comment on SOLR-2218 at 1/2/11 11:40 PM:
----------------------------------------------------------

Hoss,

So what you are saying is instead of:

1. http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*&sort=id asc

I should use:

LAST_ID=20000
1. http://hostname/solr/select?fl=id&rows=1000&q=*:*&sort=id asc&fq=id:[<LAST_ID> TO *]

This should definitely be faster. Unfortunately, I need the results by highest score. Does fq support score?

SCORE=5.6
1. http://hostname/solr/select?fl=id,score&rows=1000&q=*:*&sort=score desc&fq=score:[0 TO <SCORE>]

Thoughts?

I get an error when using fq=score:...

HTTP ERROR 400
Problem accessing /solr/provs/select. Reason: 

    undefined field score







Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by Grant Ingersoll <gs...@apache.org>.
The weird thing is, all of our collectors are, IMO, optimized for the non-paging scenario, when I would venture to guess that the very large majority of users out there do paging.  AFAICT, about the only people who don't page are those doing deep, downstream analysis that requires retrieving hundreds or thousands (or more) of results at a time (I've seen as much as a million used in production) as part of a batch job.

See https://issues.apache.org/jira/browse/LUCENE-2215 and https://issues.apache.org/jira/browse/SOLR-1726 for the issues tracking this.

-Grant

On Jan 8, 2011, at 7:11 AM, Earwin Burrfoot wrote:

> Technically, if your docs have an unbiased (with regard to their sort
> value) distribution across shards, you can fetch far fewer than the top
> N docs from each shard.


Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / Cominvent<ja...@cominvent.com> wrote:
>> The problem with a large "start" is probably worse when sharding is involved. Does anyone know how the shard component goes about fetching start=1000000&rows=10 from, say, 10 shards? Does it have to merge sorted lists of 1,000,010 doc ids from each shard, which is the worst case?
>
> Yep, that's how it works today.
>

Technically, if your docs have an unbiased (with regard to their sort
value) distribution across shards, you can fetch far fewer than the top
N docs from each shard.
I played with the idea, and it worked for me. Though I later dropped
the optimization, as it complicated things somewhat and my users aren't
querying gazillions of docs that often.


-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / Cominvent
<ja...@cominvent.com> wrote:
> The problem with a large "start" is probably worse when sharding is involved. Does anyone know how the shard component goes about fetching start=1000000&rows=10 from, say, 10 shards? Does it have to merge sorted lists of 1,000,010 doc ids from each shard, which is the worst case?

Yep, that's how it works today.

-Yonik
http://www.lucidimagination.com
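To make the cost concrete, a sketch of what the coordinator effectively does today (the fetch callback and tuple shape are illustrative; the point is that every shard must return start+rows candidates):

{code}
import heapq

def merged_page(shards, start, rows, fetch):
    # fetch(shard, n) -> n (sort_value, doc_id) pairs, sorted ascending.
    # Each shard must return its top start+rows candidates, since any of
    # them could fall inside the requested global window.
    per_shard = [fetch(shard, start + rows) for shard in shards]
    merged = heapq.merge(*per_shard)         # global sort order
    return list(merged)[start:start + rows]  # drop the first `start`

# With start=1000000, rows=10 and 10 shards, each shard ships 1,000,010
# candidates and the coordinator merges ~10,000,100 entries to keep 10.
{code}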



Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
The problem with a large "start" is probably worse when sharding is involved. Does anyone know how the shard component goes about fetching start=1000000&rows=10 from, say, 10 shards? Does it have to merge sorted lists of 1,000,010 doc ids from each shard, which is the worst case?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. nov. 2010, at 20.22, Hoss Man (JIRA) wrote:

> You should be able to get equivalent functionality by reducing the number of collected documents -- instead of increasing the start param, add a filter on the sort field indicating that you only want documents with a field value higher (or lower, if using "desc" sort) than the last document encountered so far. (If you are sorting on score this becomes trickier, but should be possible using the "frange" parser with the "query" function.)


[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930723#action_12930723 ] 

Hoss Man commented on SOLR-2218:
--------------------------------

The performance gets slower as start increases because, in order to give you rows N...M sorted by score, Solr must collect the top M documents (in sorted order). Lance's point is that if you use "sort=_docid_+asc", this collection of the top-ranking documents in sorted order doesn't have to happen.

If you have to use sorting, keep in mind that the decrease in performance as the "start" param increases without bound is primarily driven by the number of documents that have to be collected/compared on the sort field -- something that wouldn't change if you had a named cursor (you would just be paying that cost up front instead of per request).

You should be able to get equivalent functionality by reducing the number of collected documents -- instead of increasing the start param, add a filter on the sort field indicating that you only want documents with a field value higher (or lower, if using "desc" sort) than the last document encountered so far.  (If you are sorting on score this becomes trickier, but should be possible using the "frange" parser with the "query" function.)
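A sketch of that pattern for an ascending sort on a unique id field (hostname and field name are placeholders; the inclusive range returns the boundary document again, so the client skips it):

{code}
import json
import urllib.parse
import urllib.request

BASE = "http://hostname/solr/select"  # placeholder host

last_id = None
while True:
    params = {"q": "*:*", "fl": "id", "rows": 1000,
              "sort": "id asc", "wt": "json"}
    if last_id is not None:
        # Only collect docs at or beyond the last id seen; start stays
        # at 0, so the collected set stays small however deep we page.
        params["fq"] = "id:[%s TO *]" % last_id
    url = BASE + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        docs = json.load(resp)["response"]["docs"]
    ids = [d["id"] for d in docs if d["id"] != last_id]  # skip boundary
    if not ids:
        break
    last_id = ids[-1]
{code}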



[jira] [Resolved] (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved SOLR-2218.
-----------------------------------

    Resolution: Duplicate

Dup of SOLR-1726
                


[jira] Issue Comment Edited: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Bill Bell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976606#action_12976606 ] 

Bill Bell edited comment on SOLR-2218 at 1/2/11 11:38 PM:
----------------------------------------------------------

Hoss,

So what you are saying is instead of:

1. http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*&sort=id asc

I should use:

LAST_ID=20000
1. http://hostname/solr/select?fl=id&rows=1000&q=*:*&sort=id asc&fq=id:[<LAST_ID> TO *]

This should definitely be faster. Unfortunately, I need the results by highest score. Does fq support score?

SCORE=5.6
1. http://hostname/solr/select?fl=id,score&rows=1000&q=*:*&sort=score desc&fq=score:[0 TO <SCORE>]

Thoughts?








[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Bill Bell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928674#action_12928674 ] 

Bill Bell commented on SOLR-2218:
---------------------------------

Lance,

Can you point me directly to the document on Lucid's website? That search returns a Luke handler page, which is not what I am asking about.

1. I have a query that returns thousands of results.
2. I want to return fl=id, start=1000, rows=1000, and as I move start farther from 0, the results slow down substantially.
3. I need the results to come back quickly even when start=10000, since I am looping across all the results.




[jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

Posted by "Bill Bell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976606#action_12976606 ] 

Bill Bell commented on SOLR-2218:
---------------------------------

Hoss,

So what you are saying is instead of:

1. http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*&sort=id asc

I should use:

LAST_ID=20000
1. http://hostname/solr/select?fl=id&rows=1000&q=*:*&sort=id asc&fq=id:[<LAST_ID> TO *]

This should definitely be faster. Unfortunately, I need the results by highest score. Does fq support score?

SCORE=5.6
1. http://hostname/solr/select?fl=id,score&rows=1000&q=*:*&sort=score desc&fq=id:[0 to <SCORE>]

Thoughts?





