Posted to solr-user@lucene.apache.org by Salman Ansari <sa...@gmail.com> on 2015/10/09 12:59:52 UTC

Solr Pagination

Hi guys,

I have been working with Solr and Solr.NET for some time on a big project
that requires around 300M documents. I have run into an issue and am
highlighting it here in case you have any comments:

As mentioned here (
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results),
cursors were introduced to solve the problem of deep pagination. However, I
was not able to find an example of handling page navigation properly with
multiple users. For example, when a user navigates from page 1 to page 2,
does the front end need to store the next cursor at each query? What about
going back to a previous page: do we need to store all cursors navigated so
far on the client side? Any comments/samples on how pagination should be
handled properly using cursors?

Regards,
Salman
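
A minimal SolrJ sketch of the flow being asked about (the thread itself
uses Solr.NET; the collection and field names are borrowed from later in
the thread, and the rest is an assumption rather than anyone's actual
implementation): since a cursorMark only moves forward, a simple way to
support a "previous page" button is to remember, per user session, the
cursor that started each page.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPager {
        public static void main(String[] args) throws Exception {
            // recent SolrJ; older releases construct new HttpSolrClient(url) instead
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/sabr102").build();

            // pageCursors.get(i) is the cursor that fetches page i, so going
            // back to an earlier page is just a lookup in this list
            List<String> pageCursors = new ArrayList<>();
            pageCursors.add(CursorMarkParams.CURSOR_MARK_START);  // page 0 starts at "*"

            SolrQuery q = new SolrQuery("content_text:Football");
            q.setRows(10);
            q.setSort("score", SolrQuery.ORDER.desc);
            q.addSort("id", SolrQuery.ORDER.desc);  // cursors require the uniqueKey as a tiebreaker

            for (int page = 0; page < 3; page++) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, pageCursors.get(page));
                QueryResponse rsp = solr.query(q);
                System.out.println("page " + page + ": " + rsp.getResults().size() + " docs");
                pageCursors.add(rsp.getNextCursorMark());  // cursor that starts the next page
            }
            solr.close();
        }
    }

Since a cursorMark is an opaque string tied only to the query and sort -
there is no server-side cursor state - each user's cursor list can safely
live in that user's front-end session.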

Re: Solr Pagination

Posted by Erick Erickson <er...@gmail.com>.
In a word, "no". I once doubled the JVM requirements
by changing just the query. You have to prototype. Here's
a blog on the subject:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/



On Wed, Oct 28, 2015 at 11:06 AM, Salman Ansari <sa...@gmail.com> wrote:
> I have already indexed all the documents in Solr and am not indexing
> anymore, so the problem I am running into occurs after all the documents
> are indexed. I am using SolrCloud with two shards and two replicas per
> shard, but on the same machine. Is there anywhere I can look at the
> relation between index size and machine specs and their effect on Solr
> query performance?
>
> Regards,
> Salman
>
> On Mon, Oct 26, 2015 at 5:55 PM, Upayavira <uv...@odoko.co.uk> wrote:
>
>>
>>
>> On Sun, Oct 25, 2015, at 05:43 PM, Salman Ansari wrote:
>> > Thanks guys for your responses.
>> >
>> > That's a very very large cache size.  It is likely to use a VERY large
>> > amount of heap, and autowarming up to 4096 entries at commit time might
>> > take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
>> > index core with 70 million documents, each filterCache entry is at least
>> > 8.75 million bytes.  Multiply that by 16384, and a completely full cache
>> > would need about 140GB of heap memory.  4096 entries will require 35GB.
>> >  I don't think this cache is actually storing that many entries, or you
>> > would most certainly be running into OutOfMemoryError exceptions.
>> >
>> > True; however, I tried the default filterCache at the beginning and the
>> > problem was still there, so I don't think that is how I should improve
>> > my Solr's performance. Moreover, as you mentioned, with this
>> > configuration I should be running out of memory, but that did not
>> > happen. Do you think my Solr has not picked up the latest configs? I
>> > have restarted Solr, btw.
>> >
>> > Lately I have been trying different ways to improve this: I created a
>> > brand new index on the same machine using 2 shards with only a few
>> > entries (about 5), and performance was excellent; I sometimes got
>> > results back in 42 ms. What concerns me is that maybe I am loading too
>> > much into one index, and that is what is killing performance. Is there a
>> > recommended index size or document count that I should be looking at to
>> > tune this? Any ideas other than increasing the memory size, which I
>> > have already tried?
>>
>> The optimal index size is down to the size of segments on disk. New
>> segments are created when hard commits occur, and existing on-disk
>> segments may get merged in the background when the segment count gets
>> too high. Now, if those on-disk segments get too large, copying them
>> around at merge time can get prohibitive, especially if your index is
>> changing frequently.
>>
>> Splitting such an index into shards is one approach to dealing with this
>> issue.
>>
>> Upayavira
>>

Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
I have already indexed all the documents in Solr and am not indexing
anymore, so the problem I am running into occurs after all the documents
are indexed. I am using SolrCloud with two shards and two replicas per
shard, but on the same machine. Is there anywhere I can look at the
relation between index size and machine specs and their effect on Solr
query performance?

Regards,
Salman

On Mon, Oct 26, 2015 at 5:55 PM, Upayavira <uv...@odoko.co.uk> wrote:

>
>
> On Sun, Oct 25, 2015, at 05:43 PM, Salman Ansari wrote:
> > Thanks guys for your responses.
> >
> > That's a very very large cache size.  It is likely to use a VERY large
> > amount of heap, and autowarming up to 4096 entries at commit time might
> > take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
> > index core with 70 million documents, each filterCache entry is at least
> > 8.75 million bytes.  Multiply that by 16384, and a completely full cache
> > would need about 140GB of heap memory.  4096 entries will require 35GB.
> >  I don't think this cache is actually storing that many entries, or you
> > would most certainly be running into OutOfMemoryError exceptions.
> >
> > True; however, I tried the default filterCache at the beginning and the
> > problem was still there, so I don't think that is how I should improve
> > my Solr's performance. Moreover, as you mentioned, with this
> > configuration I should be running out of memory, but that did not
> > happen. Do you think my Solr has not picked up the latest configs? I
> > have restarted Solr, btw.
> >
> > Lately I have been trying different ways to improve this: I created a
> > brand new index on the same machine using 2 shards with only a few
> > entries (about 5), and performance was excellent; I sometimes got
> > results back in 42 ms. What concerns me is that maybe I am loading too
> > much into one index, and that is what is killing performance. Is there a
> > recommended index size or document count that I should be looking at to
> > tune this? Any ideas other than increasing the memory size, which I
> > have already tried?
>
> The optimal index size is down to the size of segments on disk. New
> segments are created when hard commits occur, and existing on-disk
> segments may get merged in the background when the segment count gets
> too high. Now, if those on-disk segments get too large, copying them
> around at merge time can get prohibitive, especially if your index is
> changing frequently.
>
> Splitting such an index into shards is one approach to dealing with this
> issue.
>
> Upayavira
>

Re: Solr Pagination

Posted by Upayavira <uv...@odoko.co.uk>.

On Sun, Oct 25, 2015, at 05:43 PM, Salman Ansari wrote:
> Thanks guys for your responses.
> 
> That's a very very large cache size.  It is likely to use a VERY large
> amount of heap, and autowarming up to 4096 entries at commit time might
> take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
> index core with 70 million documents, each filterCache entry is at least
> 8.75 million bytes.  Multiply that by 16384, and a completely full cache
> would need about 140GB of heap memory.  4096 entries will require 35GB.
>  I don't think this cache is actually storing that many entries, or you
> would most certainly be running into OutOfMemoryError exceptions.
> 
> True; however, I tried the default filterCache at the beginning and the
> problem was still there, so I don't think that is how I should improve my
> Solr's performance. Moreover, as you mentioned, with this configuration I
> should be running out of memory, but that did not happen. Do you think my
> Solr has not picked up the latest configs? I have restarted Solr, btw.
> 
> Lately I have been trying different ways to improve this: I created a
> brand new index on the same machine using 2 shards with only a few entries
> (about 5), and performance was excellent; I sometimes got results back in
> 42 ms. What concerns me is that maybe I am loading too much into one
> index, and that is what is killing performance. Is there a recommended
> index size or document count that I should be looking at to tune this?
> Any ideas other than increasing the memory size, which I have already
> tried?

The optimal index size is down to the size of segments on disk. New
segments are created when hard commits occur, and existing on-disk
segments may get merged in the background when the segment count gets
too high. Now, if those on-disk segments get too large, copying them
around at merge time can get prohibitive, especially if your index is
changing frequently.

Splitting such an index into shards is one approach to dealing with this
issue.

Upayavira

Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
Thanks guys for your responses.

That's a very very large cache size.  It is likely to use a VERY large
amount of heap, and autowarming up to 4096 entries at commit time might
take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
index core with 70 million documents, each filterCache entry is at least
8.75 million bytes.  Multiply that by 16384, and a completely full cache
would need about 140GB of heap memory.  4096 entries will require 35GB.
 I don't think this cache is actually storing that many entries, or you
would most certainly be running into OutOfMemoryError exceptions.

True; however, I tried the default filterCache at the beginning and the
problem was still there, so I don't think that is how I should improve my
Solr's performance. Moreover, as you mentioned, with this configuration I
should be running out of memory, but that did not happen. Do you think my
Solr has not picked up the latest configs? I have restarted Solr, btw.

Lately I have been trying different ways to improve this: I created a brand
new index on the same machine using 2 shards with only a few entries (about
5), and performance was excellent; I sometimes got results back in 42 ms.
What concerns me is that maybe I am loading too much into one index, and
that is what is killing performance. Is there a recommended index size or
document count that I should be looking at to tune this? Any ideas other
than increasing the memory size, which I have already tried?


Regards,
Salman

On Thu, Oct 22, 2015 at 9:18 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> On Wed, 2015-10-14 at 10:17 +0200, Jan Høydahl wrote:
> > I have not benchmarked various numbers of segments at different sizes
> > on different HW etc., so my hunch could very well be wrong in Salman’s
> > case. I don’t know how frequently his data is updated either.
> >
> > Have you done #segments benchmarking for your huge datasets?
>
> Only informally. However, the guys at UKWA run a similar scale index and
> have done multiple segment-count-oriented tests. They have not published
> a report, but there are measurements & graphs at
> https://github.com/ukwa/shine/tree/master/python/test-logs
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>

Re: Solr Pagination

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2015-10-14 at 10:17 +0200, Jan Høydahl wrote:
> I have not benchmarked various numbers of segments at different sizes
> on different HW etc., so my hunch could very well be wrong in Salman’s case.
> I don’t know how frequently his data is updated either.
> 
> Have you done #segments benchmarking for your huge datasets?

Only informally. However, the guys at UKWA run a similar scale index and
have done multiple segment-count-oriented tests. They have not published
a report, but there are measurements & graphs at
https://github.com/ukwa/shine/tree/master/python/test-logs

- Toke Eskildsen, State and University Library, Denmark



Re: Solr Pagination

Posted by Jan Høydahl <ja...@cominvent.com>.
I have not benchmarked various numbers of segments at different sizes
on different HW etc., so my hunch could very well be wrong in Salman’s case.
I don’t know how frequently his data is updated either.

Have you done #segments benchmarking for your huge datasets?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 12. okt. 2015 kl. 12.56 skrev Toke Eskildsen <te...@statsbiblioteket.dk>:
> 
> On Mon, 2015-10-12 at 10:05 +0200, Jan Høydahl wrote:
>> What you do when you call optimize is to force Lucene to merge all
>> those 35M docs into ONE SINGLE index segment. You get better HW
>> utilization if you let Lucene/Solr automatically handle merging,
>> meaning you’ll have around 10 smaller segments that are faster to
>> search across than one huge segment.
> 
> As individual Lucene/Solr shard searches are very much single threaded,
> the single segment version should be faster. Have you observed
> otherwise?
> 
> 
> Optimization is a fine feature if one’s workflow is batch-oriented with
> sufficiently long pauses between index updates. Nightly index updates
> with few active users at that time could be an example.
> 
> - Toke Eskildsen, State and University Library, Denmark
> 
> 


Re: Solr Pagination

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2015-10-12 at 10:05 +0200, Jan Høydahl wrote:
> What you do when you call optimize is to force Lucene to merge all
> those 35M docs into ONE SINGLE index segment. You get better HW
> utilization if you let Lucene/Solr automatically handle merging,
> meaning you’ll have around 10 smaller segments that are faster to
> search across than one huge segment.

As individual Lucene/Solr shard searches are very much single threaded,
the single segment version should be faster. Have you observed
otherwise?


Optimization is a fine feature if one’s workflow is batch-oriented with
sufficiently long pauses between index updates. Nightly index updates
with few active users at that time could be an example.

- Toke Eskildsen, State and University Library, Denmark



Re: Solr Pagination

Posted by Jan Høydahl <ja...@cominvent.com>.
Salman,

You say that you optimized your index from Admin. You should not do that, however strange it sounds.
70M docs on 2 shards means 35M docs per shard. What you do when you call optimize is to force Lucene
to merge all those 35M docs into ONE SINGLE index segment. You get better HW utilization if you let
Lucene/Solr automatically handle merging, meaning you’ll have around 10 smaller segments that are faster to
search across than one huge segment.
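The "around 10 smaller segments" comes from the default merge policy; in a
Solr 5.x solrconfig.xml the relevant defaults can be spelled out like this
(shown for illustration only - not something Jan posted, and the defaults
already behave this way without any configuration):

    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>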

Your cache settings are way too high. Remember, “size” here is the number of *entries*, not the number of bytes.
Start with, say, 100, then let the system run for a while with a realistic query load, and
determine from the cache statistics whether you have a high hit rate (the cache is useful) and
a high eviction rate (which could indicate that you would benefit from an increase).
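As a concrete starting point along those lines (the initialSize and
autowarmCount values here are assumptions, not numbers Jan gave):

    <filterCache class="solr.FastLRUCache"
                 size="100"
                 initialSize="100"
                 autowarmCount="0"/>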

I would not concern myself with high paging offsets unless there is something very special about your
use case that justifies spending much energy on it. People just don’t page beyond page 10 :)
and if they do, you should focus on improving relevancy first - unless you have a very special use case...

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 11. okt. 2015 kl. 06.54 skrev Shawn Heisey <ap...@elyograg.org>:
> 
> On 10/10/2015 2:55 AM, Salman Ansari wrote:
>> Thanks Shawn for your response. Based on that
>> 1) Can you please direct me where I can get more information about cold
>> shard vs hot shard?
> 
> I don't know of any information out there about hot/cold shards.  I can
> describe it, though:
> 
> A split point is determined.  Everything older than the split point gets
> divided by some method (usually hashing) between multiple cold shards.
> Everything newer than the split point goes into the hot shard.  For my
> index, there is only one hot shard, but it is possible to have multiple
> hot shards.
> 
> On some interval (nightly in my index), the split point is adjusted and
> documents are moved from the hot shard to the cold shards according to
> that split point.  The hot shard is typically a lot smaller than the
> cold shards, which helps increase indexing speed for new documents.
> 
> I am not using SolrCloud. I manage all my own sharding. There is no
> capability included in SolrCloud that can do hot/cold sharding.
> 
>> 2)  That 10GB number assumes there's no other software on the machine, like
>> a database server or a webserver.
>> Yes the machine is dedicated for Solr
>> 
>> 3) How much index data is on the machine?
>> I have 3 collections: 2 for testing (together they do not exceed 1M
>> documents) and the main collection that I am querying now, which contains
>> around 69M. I have distributed all my collections into 2 shards, each with
>> 2 replicas. The consumption on the hard disk is about 40GB.
> 
> That sounds like a recipe for a performance problem, although I am not
> certain why the problem persisted after increasing the memory.  Perhaps
> it has something to do with the filterCache, which I will get to further
> down.
> 
>> 4) A memory size of 14GB would be unusual for a physical machine, and makes me
>> wonder if you're using virtual machines
>> Yes, I am using a virtual machine, as bare metal would be difficult in my
>> case since all of our data center is in the cloud. I can increase its
>> capacity, though. While testing some edge cases on Solr, I noticed in the
>> Solr admin that memory sometimes reaches its limit (14GB RAM, and 4GB JVM)
> 
> This is how operating systems and Java are designed to work.  When
> things are running well, all of physical memory might be allocated, and
> the heap will become full on a semi-regular basis.  If it *stays* full,
> that usually means it needs to be larger.  The admin UI is a poor tool
> for watching JVM memory usage.
> 
>> 5) Just to confirm, I have combined the lessons from
>> 
>> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
>> AND
>> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>> 
>> to come up with the following settings
>> 
>> FilterCache
>> 
>>    <filterCache class="solr.FastLRUCache"
>>                 size="16384"
>>                 initialSize="4096"
>>                 autowarmCount="4096"/>
> 
> That's a very very large cache size.  It is likely to use a VERY large
> amount of heap, and autowarming up to 4096 entries at commit time might
> take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
> index core with 70 million documents, each filterCache entry is at least
> 8.75 million bytes.  Multiply that by 16384, and a completely full cache
> would need about 140GB of heap memory.  4096 entries will require 35GB.
> I don't think this cache is actually storing that many entries, or you
> would most certainly be running into OutOfMemoryError exceptions.
> 
>>    <documentCache class="solr.LRUCache"
>>                   size="16384"
>>                   initialSize="16384"
>>                   autowarmCount="0"/>
>> 
>> NewSearcher and FirstSearcher
>> 
>> <listener event="newSearcher" class="solr.QuerySenderListener">
>>      <arr name="queries">
>>           <lst><str name="q">*</str><str name="sort">score desc id
>> desc</str></lst>
>>      </arr>
>>    </listener>
>>    <listener event="firstSearcher" class="solr.QuerySenderListener">
>>      <arr name="queries">
>> <lst> <str name="q">*</str> <str name="sort">score desc id desc</str> </lst>
>>        <!-- seed common facets and filter queries -->
>>        <lst> <str name="q">*</str>
>>              <str name="facet.field">category</str>        </lst>
>>      </arr>
>>    </listener>
>> 
>> Will this make Solr use more cache and prepopulate it?
> 
> The newSearcher entry will result in one entry in the queryResultCache,
> and an unknown number of entries in the documentCache -- that depends on
> the "rows" parameter on the /select handler (defaults to 10) and the
> queryResultMaxDocsCached parameter.
> 
> The firstSearcher entry does two queries, but because the "q" parameter
> is identical on them, it will only result in one entry in the
> queryResultCache.  One of them has facet.field, but you did not include
> facet=true, so the facet query will not actually be run.  Without the
> facet query, the filterCache will not be populated.
> 
> I think the design intent for newSearcher and firstSearcher is to load
> critical index data into the OS disk cache.  It's not so much about
> warming the Solr caches as it is about priming the system as a whole.
> 
> Note that the wildcard query you are running (q=*) is relatively slow,
> but is an excellent choice for a warming query, because it actually
> reads every single term from the default field.  Because of how slow
> this query can run, setting useColdSearcher to true is recommended.
> 
> Thanks,
> Shawn
> 


Re: Solr Pagination

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/10/2015 2:55 AM, Salman Ansari wrote:
> Thanks Shawn for your response. Based on that
> 1) Can you please direct me where I can get more information about cold
> shard vs hot shard?

I don't know of any information out there about hot/cold shards.  I can
describe it, though:

A split point is determined.  Everything older than the split point gets
divided by some method (usually hashing) between multiple cold shards.
Everything newer than the split point goes into the hot shard.  For my
index, there is only one hot shard, but it is possible to have multiple
hot shards.

On some interval (nightly in my index), the split point is adjusted and
documents are moved from the hot shard to the cold shards according to
that split point.  The hot shard is typically a lot smaller than the
cold shards, which helps increase indexing speed for new documents.

I am not using SolrCloud. I manage all my own sharding. There is no
capability included in SolrCloud that can do hot/cold sharding.
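In code, the routing described above could look roughly like this (a sketch
with assumed names - not Shawn's actual implementation):

    // Route a document to the hot shard or one of the cold shards.
    // Documents newer than the split point go to the hot shard; older
    // ones are divided among the cold shards by hashing the unique id.
    String pickShard(long docTimestamp, String docId,
                     long splitPoint, int numColdShards) {
        if (docTimestamp >= splitPoint) {
            return "hot";
        }
        int bucket = Math.floorMod(docId.hashCode(), numColdShards);
        return "cold-" + bucket;
    }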

> 2)  That 10GB number assumes there's no other software on the machine, like
> a database server or a webserver.
> Yes the machine is dedicated for Solr
> 
> 3) How much index data is on the machine?
> I have 3 collections: 2 for testing (together they do not exceed 1M
> documents) and the main collection that I am querying now, which contains
> around 69M. I have distributed all my collections into 2 shards, each with
> 2 replicas. The consumption on the hard disk is about 40GB.

That sounds like a recipe for a performance problem, although I am not
certain why the problem persisted after increasing the memory.  Perhaps
it has something to do with the filterCache, which I will get to further
down.

> 4) A memory size of 14GB would be unusual for a physical machine, and makes me
> wonder if you're using virtual machines
> Yes, I am using a virtual machine, as bare metal would be difficult in my
> case since all of our data center is in the cloud. I can increase its
> capacity, though. While testing some edge cases on Solr, I noticed in the
> Solr admin that memory sometimes reaches its limit (14GB RAM, and 4GB JVM)

This is how operating systems and Java are designed to work.  When
things are running well, all of physical memory might be allocated, and
the heap will become full on a semi-regular basis.  If it *stays* full,
that usually means it needs to be larger.  The admin UI is a poor tool
for watching JVM memory usage.

> 5) Just to confirm, I have combined the lessons from
> 
> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
> AND
> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
> 
> to come up with the following settings
> 
> FilterCache
> 
>     <filterCache class="solr.FastLRUCache"
>                  size="16384"
>                  initialSize="4096"
>                  autowarmCount="4096"/>

That's a very very large cache size.  It is likely to use a VERY large
amount of heap, and autowarming up to 4096 entries at commit time might
take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
index core with 70 million documents, each filterCache entry is at least
8.75 million bytes.  Multiply that by 16384, and a completely full cache
would need about 140GB of heap memory.  4096 entries will require 35GB.
 I don't think this cache is actually storing that many entries, or you
would most certainly be running into OutOfMemoryError exceptions.
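Spelling out that arithmetic (one bit per document in each cached filter;
the class wrapper is just so the snippet runs standalone):

    public class FilterCacheMath {
        public static void main(String[] args) {
            long maxDoc = 70_000_000L;           // documents in the core
            double bytesPerEntry = maxDoc / 8.0; // one bit per doc = 8.75 million bytes
            System.out.printf("16384 entries: ~%.0f GB%n",
                    bytesPerEntry * 16384 / 1e9); // prints ~143 GB
            System.out.printf(" 4096 entries: ~%.0f GB%n",
                    bytesPerEntry * 4096 / 1e9);  // prints ~36 GB
        }
    }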

>     <documentCache class="solr.LRUCache"
>                    size="16384"
>                    initialSize="16384"
>                    autowarmCount="0"/>
> 
> NewSearcher and FirstSearcher
> 
> <listener event="newSearcher" class="solr.QuerySenderListener">
>       <arr name="queries">
>            <lst><str name="q">*</str><str name="sort">score desc id
> desc</str></lst>
>       </arr>
>     </listener>
>     <listener event="firstSearcher" class="solr.QuerySenderListener">
>       <arr name="queries">
> <lst> <str name="q">*</str> <str name="sort">score desc id desc</str> </lst>
>         <!-- seed common facets and filter queries -->
>         <lst> <str name="q">*</str>
>               <str name="facet.field">category</str>        </lst>
>       </arr>
>     </listener>
> 
> Will this make Solr use more cache and prepopulate it?

The newSearcher entry will result in one entry in the queryResultCache,
and an unknown number of entries in the documentCache -- that depends on
the "rows" parameter on the /select handler (defaults to 10) and the
queryResultMaxDocsCached parameter.

The firstSearcher entry does two queries, but because the "q" parameter
is identical on them, it will only result in one entry in the
queryResultCache.  One of them has facet.field, but you did not include
facet=true, so the facet query will not actually be run.  Without the
facet query, the filterCache will not be populated.
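For reference, a firstSearcher entry that would actually run the facet
query, and so seed the filterCache, is the original listener with
facet=true added (everything else unchanged):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">*</str>
              <str name="facet">true</str>
              <str name="facet.field">category</str> </lst>
      </arr>
    </listener>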

I think the design intent for newSearcher and firstSearcher is to load
critical index data into the OS disk cache.  It's not so much about
warming the Solr caches as it is about priming the system as a whole.

Note that the wildcard query you are running (q=*) is relatively slow,
but is an excellent choice for a warming query, because it actually
reads every single term from the default field.  Because of how slow
this query can run, setting useColdSearcher to true is recommended.
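In solrconfig.xml that is a single element inside the <query> section:

    <useColdSearcher>true</useColdSearcher>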

Thanks,
Shawn


Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
Regarding the Solr performance issue I was facing, I upgraded my Solr
machine to have:
8 cores
56 GB RAM
8 GB JVM

However, unfortunately, I am still getting delays. I have run

* the query "Football" with start=0 and rows=10 and it took around 7.329
seconds
* the query "Football" with start=1000 and rows=10 and it took around
21.994 seconds

I can see in the Solr admin that the RAM and JVM are not being utilized to
the maximum - not even half, or a quarter. How do I push data into the cache
once Solr starts? And is pushing data into the cache the right strategy to
solve the issue?

Appreciate your comments.

Regards,
Salman



On Sat, Oct 10, 2015 at 11:55 AM, Salman Ansari <sa...@gmail.com>
wrote:

> Thanks Shawn for your response. Based on that
> 1) Can you please direct me where I can get more information about cold
> shard vs hot shard?
>
> 2)  That 10GB number assumes there's no other software on the machine,
> like a database server or a webserver.
> Yes the machine is dedicated for Solr
>
> 3) How much index data is on the machine?
> I have 3 collections: 2 for testing (together they do not exceed 1M
> documents) and the main collection that I am querying now, which contains
> around 69M. I have distributed all my collections into 2 shards, each with
> 2 replicas. The consumption on the hard disk is about 40GB.
>
> 4) A memory size of 14GB would be unusual for a physical machine, and
> makes me wonder if you're using virtual machines
> Yes, I am using a virtual machine, as bare metal would be difficult in my
> case since all of our data center is in the cloud. I can increase its
> capacity, though. While testing some edge cases on Solr, I noticed in the
> Solr admin that memory sometimes reaches its limit (14GB RAM, and 4GB JVM)
>
> 5) Just to confirm, I have combined the lessons from
>
> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
> AND
> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>
> to come up with the following settings
>
> FilterCache
>
>     <filterCache class="solr.FastLRUCache"
>                  size="16384"
>                  initialSize="4096"
>                  autowarmCount="4096"/>
>
> DocumentCache
>
>     <documentCache class="solr.LRUCache"
>                    size="16384"
>                    initialSize="16384"
>                    autowarmCount="0"/>
>
> NewSearcher and FirstSearcher
>
> <listener event="newSearcher" class="solr.QuerySenderListener">
>       <arr name="queries">
>            <lst><str name="q">*</str><str name="sort">score desc id
> desc</str></lst>
>       </arr>
>     </listener>
>     <listener event="firstSearcher" class="solr.QuerySenderListener">
>       <arr name="queries">
> <lst> <str name="q">*</str> <str name="sort">score desc id desc</str>
> </lst>
>         <!-- seed common facets and filter queries -->
>         <lst> <str name="q">*</str>
>               <str name="facet.field">category</str>        </lst>
>       </arr>
>     </listener>
>
> Will this make Solr use more cache and prepopulate it?
>
> Regards,
> Salman
>
>
>
>
> On Sat, Oct 10, 2015 at 5:10 AM, Shawn Heisey <ap...@elyograg.org> wrote:
>
>> On 10/9/2015 1:39 PM, Salman Ansari wrote:
>>
>> > INFO  - 2015-10-09 18:46:17.953; [c:sabr102 s:shard1 r:core_node2
>> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
>> > [sabr102_shard1_replica1] webapp=/solr path=/select
>> > params={start=0&q=(content_text:Football)&rows=10} hits=24408 status=0
>> > QTime=3391
>>
>> Over 3 seconds for a query like this definitely sounds like there's a
>> problem.
>>
>> > INFO  - 2015-10-09 18:47:04.727; [c:sabr102 s:shard1 r:core_node2
>> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
>> > [sabr102_shard1_replica1] webapp=/solr path=/select
>> > params={start=1000&q=(content_text:Football)&rows=10} hits=24408
>> status=0
>> > QTime=21569
>>
>> Adding a start value of 1000 increases QTime by a factor of more than
>> 6?  Even more evidence of a performance problem.
>>
>> For comparison purposes, I did a couple of simple queries on a large
>> index of mine.  Here are the response headers showing the QTime value
>> and all the parameters (except my shard URLs) for each query:
>>
>>   "responseHeader": {
>>     "status": 0,
>>     "QTime": 1253,
>>     "params": {
>>       "df": "catchall",
>>       "spellcheck.maxCollationEvaluations": "2",
>>       "spellcheck.dictionary": "default",
>>       "echoParams": "all",
>>       "spellcheck.maxCollations": "5",
>>       "q.op": "AND",
>>       "shards.info": "true",
>>       "spellcheck.maxCollationTries": "2",
>>       "rows": "70",
>>       "spellcheck.extendedResults": "false",
>>       "shards": "REDACTED SEVEN SHARD URLS",
>>       "shards.tolerant": "true",
>>       "spellcheck.onlyMorePopular": "false",
>>       "facet.method": "enum",
>>       "spellcheck.count": "9",
>>       "q": "catchall:carriage",
>>       "indent": "true",
>>       "wt": "json",
>>       "_": "1444420900498"
>>     }
>>
>>
>>   "responseHeader": {
>>     "status": 0,
>>     "QTime": 176,
>>     "params": {
>>       "df": "catchall",
>>       "spellcheck.maxCollationEvaluations": "2",
>>       "spellcheck.dictionary": "default",
>>       "echoParams": "all",
>>       "spellcheck.maxCollations": "5",
>>       "q.op": "AND",
>>       "shards.info": "true",
>>       "spellcheck.maxCollationTries": "2",
>>       "rows": "70",
>>       "spellcheck.extendedResults": "false",
>>       "shards": "REDACTED SEVEN SHARD URLS",
>>       "shards.tolerant": "true",
>>       "spellcheck.onlyMorePopular": "false",
>>       "facet.method": "enum",
>>       "spellcheck.count": "9",
>>       "q": "catchall:wibble",
>>       "indent": "true",
>>       "wt": "json",
>>       "_": "1444421001024"
>>     }
>>
>> The first query had a numFound of 120906, the second a numFound of 32.
>> When I re-executed the first query (the one with a QTime of 1253) so it
>> would use the Solr caches, QTime was 17.
>>
>> This is an index that has six cold shards with 38.8 million documents
>> each and a hot shard with 1.5 million documents.  Total document count
>> for the index is over 234 million documents, and the total size of the
>> index is about 272GB.  Each copy of the index has its shards split
>> between two servers that each have 64GB of RAM, with an 8GB max Java
>> heap.  I do not have enough memory to cache all the index contents in
>> RAM, but I can get a little less than half of it in the cache -- each
>> machine has about 56GB of cache available and contains around 135GB of
>> index data.  The index data is stored on a RAID10 array with six SATA
>> disks, so it's fairly fast, but nowhere near as fast as SSD.
>>
>> You've already mentioned the SolrPerformanceProblems wiki page that I
>> wrote, which is where I would normally send you for more information.
>> You said that your machine has 14GB of RAM and 4GB is allocated to Solr,
>> leaving about 10GB for caching.  That 10GB number assumes there's no
>> other software on the machine, like a database server or a webserver.
>> How much index data is on the machine?  You need to count all the Solr
>> cores.  If the "10GB for caching" figure is accurate, then more than
>> about 20GB of index data means you might need more memory.  If it's more
>> than about 40GB of index data, you definitely need more memory.
>>
>> A memory size of 14GB would be unusual for a physical machine, and makes
>> me wonder if you're using virtual machines.  Bare metal is always going
>> to offer better performance than a VM.  Another potential problem with
>> VMs is that the host system might have its memory oversubscribed -- the
>> total amount of memory in the host machine might be less than the total
>> amount of memory allocated to all the running virtual machines.  Solr
>> performance will be terrible if VM memory is oversubscribed.
>>
>> Thanks,
>> Shawn
>>
>>
>

Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
Thanks Shawn for your response. Based on that
1) Can you please direct me where I can get more information about cold
shard vs hot shard?

2)  That 10GB number assumes there's no other software on the machine, like
a database server or a webserver.
Yes the machine is dedicated for Solr

3) How much index data is on the machine?
I have 3 collections: 2 for testing (together they do not exceed 1M
documents) and the main collection that I am querying now, which contains
around 69M. I have distributed all my collections into 2 shards, each with
2 replicas. The consumption on the hard disk is about 40GB.

4) A memory size of 14GB would be unusual for a physical machine, and makes me
wonder if you're using virtual machines
Yes, I am using a virtual machine, as bare metal would be difficult in my
case since all of our data center is in the cloud. I can increase its
capacity, though. While testing some edge cases on Solr, I noticed in the
Solr admin that memory sometimes reaches its limit (14GB RAM, and 4GB JVM)

5) Just to confirm, I have combined the lessons from

http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
AND
https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

to come up with the following settings

FilterCache

    <filterCache class="solr.FastLRUCache"
                 size="16384"
                 initialSize="4096"
                 autowarmCount="4096"/>

DocumentCache

    <documentCache class="solr.LRUCache"
                   size="16384"
                   initialSize="16384"
                   autowarmCount="0"/>

NewSearcher and FirstSearcher

<listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
           <lst><str name="q">*</str><str name="sort">score desc id
desc</str></lst>
      </arr>
    </listener>
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
<lst> <str name="q">*</str> <str name="sort">score desc id desc</str> </lst>
        <!-- seed common facets and filter queries -->
        <lst> <str name="q">*</str>
              <str name="facet.field">category</str>        </lst>
      </arr>
    </listener>

Will this make Solr use more cache and prepopulate it?

Regards,
Salman




On Sat, Oct 10, 2015 at 5:10 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/9/2015 1:39 PM, Salman Ansari wrote:
>
> > INFO  - 2015-10-09 18:46:17.953; [c:sabr102 s:shard1 r:core_node2
> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
> > [sabr102_shard1_replica1] webapp=/solr path=/select
> > params={start=0&q=(content_text:Football)&rows=10} hits=24408 status=0
> > QTime=3391
>
> Over 3 seconds for a query like this definitely sounds like there's a
> problem.
>
> > INFO  - 2015-10-09 18:47:04.727; [c:sabr102 s:shard1 r:core_node2
> > x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
> > [sabr102_shard1_replica1] webapp=/solr path=/select
> > params={start=1000&q=(content_text:Football)&rows=10} hits=24408 status=0
> > QTime=21569
>
> Adding a start value of 1000 increases QTime by a factor of more than
> 6?  Even more evidence of a performance problem.
>
> For comparison purposes, I did a couple of simple queries on a large
> index of mine.  Here are the response headers showing the QTime value
> and all the parameters (except my shard URLs) for each query:
>
>   "responseHeader": {
>     "status": 0,
>     "QTime": 1253,
>     "params": {
>       "df": "catchall",
>       "spellcheck.maxCollationEvaluations": "2",
>       "spellcheck.dictionary": "default",
>       "echoParams": "all",
>       "spellcheck.maxCollations": "5",
>       "q.op": "AND",
>       "shards.info": "true",
>       "spellcheck.maxCollationTries": "2",
>       "rows": "70",
>       "spellcheck.extendedResults": "false",
>       "shards": "REDACTED SEVEN SHARD URLS",
>       "shards.tolerant": "true",
>       "spellcheck.onlyMorePopular": "false",
>       "facet.method": "enum",
>       "spellcheck.count": "9",
>       "q": "catchall:carriage",
>       "indent": "true",
>       "wt": "json",
>       "_": "1444420900498"
>     }
>
>
>   "responseHeader": {
>     "status": 0,
>     "QTime": 176,
>     "params": {
>       "df": "catchall",
>       "spellcheck.maxCollationEvaluations": "2",
>       "spellcheck.dictionary": "default",
>       "echoParams": "all",
>       "spellcheck.maxCollations": "5",
>       "q.op": "AND",
>       "shards.info": "true",
>       "spellcheck.maxCollationTries": "2",
>       "rows": "70",
>       "spellcheck.extendedResults": "false",
>       "shards": "REDACTED SEVEN SHARD URLS",
>       "shards.tolerant": "true",
>       "spellcheck.onlyMorePopular": "false",
>       "facet.method": "enum",
>       "spellcheck.count": "9",
>       "q": "catchall:wibble",
>       "indent": "true",
>       "wt": "json",
>       "_": "1444421001024"
>     }
>
> The first query had a numFound of 120906, the second a numFound of 32.
> When I re-executed the first query (the one with a QTime of 1253) so it
> would use the Solr caches, QTime was 17.
>
> This is an index that has six cold shards with 38.8 million documents
> each and a hot shard with 1.5 million documents.  Total document count
> for the index is over 234 million documents, and the total size of the
> index is about 272GB.  Each copy of the index has its shards split
> between two servers that each have 64GB of RAM, with an 8GB max Java
> heap.  I do not have enough memory to cache all the index contents in
> RAM, but I can get a little less than half of it in the cache -- each
> machine has about 56GB of cache available and contains around 135GB of
> index data.  The index data is stored on a RAID10 array with six SATA
> disks, so it's fairly fast, but nowhere near as fast as SSD.
>
> You've already mentioned the SolrPerformanceProblems wiki page that I
> wrote, which is where I would normally send you for more information.
> You said that your machine has 14GB of RAM and 4GB is allocated to Solr,
> leaving about 10GB for caching.  That 10GB number assumes there's no
> other software on the machine, like a database server or a webserver.
> How much index data is on the machine?  You need to count all the Solr
> cores.  If the "10GB for caching" figure is accurate, then more than
> about 20GB of index data means you might need more memory.  If it's more
> than about 40GB of index data, you definitely need more memory.
>
> A memory size of 14GB would be unusual for a physical machine, and makes
> me wonder if you're using virtual machines.  Bare metal is always going
> to offer better performance than a VM.  Another potential problem with
> VMs is that the host system might have its memory oversubscribed -- the
> total amount of memory in the host machine might be less than the total
> amount of memory allocated to all the running virtual machines.  Solr
> performance will be terrible if VM memory is oversubscribed.
>
> Thanks,
> Shawn
>
>

Re: Solr Pagination

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/9/2015 1:39 PM, Salman Ansari wrote:

> INFO  - 2015-10-09 18:46:17.953; [c:sabr102 s:shard1 r:core_node2
> x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
> [sabr102_shard1_replica1] webapp=/solr path=/select
> params={start=0&q=(content_text:Football)&rows=10} hits=24408 status=0
> QTime=3391

Over 3 seconds for a query like this definitely sounds like there's a
problem.

> INFO  - 2015-10-09 18:47:04.727; [c:sabr102 s:shard1 r:core_node2
> x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
> [sabr102_shard1_replica1] webapp=/solr path=/select
> params={start=1000&q=(content_text:Football)&rows=10} hits=24408 status=0
> QTime=21569

Adding a start value of 1000 increases QTime by a factor of more than
6?  Even more evidence of a performance problem.

For comparison purposes, I did a couple of simple queries on a large
index of mine.  Here are the response headers showing the QTime value
and all the parameters (except my shard URLs) for each query:

  "responseHeader": {
    "status": 0,
    "QTime": 1253,
    "params": {
      "df": "catchall",
      "spellcheck.maxCollationEvaluations": "2",
      "spellcheck.dictionary": "default",
      "echoParams": "all",
      "spellcheck.maxCollations": "5",
      "q.op": "AND",
      "shards.info": "true",
      "spellcheck.maxCollationTries": "2",
      "rows": "70",
      "spellcheck.extendedResults": "false",
      "shards": "REDACTED SEVEN SHARD URLS",
      "shards.tolerant": "true",
      "spellcheck.onlyMorePopular": "false",
      "facet.method": "enum",
      "spellcheck.count": "9",
      "q": "catchall:carriage",
      "indent": "true",
      "wt": "json",
      "_": "1444420900498"
    }


  "responseHeader": {
    "status": 0,
    "QTime": 176,
    "params": {
      "df": "catchall",
      "spellcheck.maxCollationEvaluations": "2",
      "spellcheck.dictionary": "default",
      "echoParams": "all",
      "spellcheck.maxCollations": "5",
      "q.op": "AND",
      "shards.info": "true",
      "spellcheck.maxCollationTries": "2",
      "rows": "70",
      "spellcheck.extendedResults": "false",
      "shards": "REDACTED SEVEN SHARD URLS",
      "shards.tolerant": "true",
      "spellcheck.onlyMorePopular": "false",
      "facet.method": "enum",
      "spellcheck.count": "9",
      "q": "catchall:wibble",
      "indent": "true",
      "wt": "json",
      "_": "1444421001024"
    }

The first query had a numFound of 120906, the second a numFound of 32. 
When I re-executed the first query (the one with a QTime of 1253) so it
would use the Solr caches, QTime was 17.

This is an index that has six cold shards with 38.8 million documents
each and a hot shard with 1.5 million documents.  Total document count
for the index is over 234 million documents, and the total size of the
index is about 272GB.  Each copy of the index has its shards split
between two servers that each have 64GB of RAM, with an 8GB max Java
heap.  I do not have enough memory to cache all the index contents in
RAM, but I can get a little less than half of it in the cache -- each
machine has about 56GB of cache available and contains around 135GB of
index data.  The index data is stored on a RAID10 array with six SATA
disks, so it's fairly fast, but nowhere near as fast as SSD.

You've already mentioned the SolrPerformanceProblems wiki page that I
wrote, which is where I would normally send you for more information. 
You said that your machine has 14GB of RAM and 4GB is allocated to Solr,
leaving about 10GB for caching.  That 10GB number assumes there's no
other software on the machine, like a database server or a webserver. 
How much index data is on the machine?  You need to count all the Solr
cores.  If the "10GB for caching" figure is accurate, then more than
about 20GB of index data means you might need more memory.  If it's more
than about 40GB of index data, you definitely need more memory.

A memory size of 14GB would be unusual for a physical machine, and makes
me wonder if you're using virtual machines.  Bare metal is always going
to offer better performance than a VM.  Another potential problem with
VMs is that the host system might have its memory oversubscribed -- the
total amount of memory in the host machine might be less than the total
amount of memory allocated to all the running virtual machines.  Solr
performance will be terrible if VM memory is oversubscribed.

Thanks,
Shawn


Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
> Thanks Erick for your response. If you find pagination is not the main
> culprit, what other factors do you guys suggest I need to tweak to test
> that?
Well, is basic search slow? What are your response times for plain
un-warmed top-20 searches?

I have restarted Solr and I have tried running a query "Football" on Solr
and here are the results
for start=0, rows=10 it took around 3.391 seconds
for start=1000, rows=10 it took around 21.569 seconds *(btw, after trying
the query a second time, it took around 332 ms - could you explain this
behavior?)*
I am not quite sure what you mean by an un-warmed search, but I do have
autowarming configured for the filterCache.
btw, here is the log for both queries, and it looks like it does indeed
take Solr that long to answer

INFO  - 2015-10-09 18:46:17.937; [c:sabr102 s:shard2 r:core_node1
x:sabr102_shard2_replica1] org.apache.solr.core.SolrCore;
[sabr102_shard2_replica1] webapp=/solr path=/select
params={ids=592367114956177408,590296378955407362,585347065619750912,584382847948951552&distrib=false&wt=javabin&version=2&rows=10&df=text&shard.url=http://
[MySolrIP]:8983/solr/sabr102_shard2_replica1/|http://[MySolrIP]:7574/solr/sabr102_shard2_replica2/&NOW=1444416374563&start=0&shards.purpose=64&q=(content_text:Football)&isShard=true&preferLocalShards=false}
status=0 QTime=13
INFO  - 2015-10-09 18:46:17.953; [c:sabr102 s:shard1 r:core_node2
x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
[sabr102_shard1_replica1] webapp=/solr path=/select
params={start=0&q=(content_text:Football)&rows=10} hits=24408 status=0
QTime=3391

INFO  - 2015-10-09 18:46:43.207; [c:sabr102 s:shard2 r:core_node1
x:sabr102_shard2_replica1] org.apache.solr.core.SolrCore;
[sabr102_shard2_replica1] webapp=/solr path=/select
params={distrib=false&wt=javabin&version=2&rows=1010&df=text&fl=id&fl=score&shard.url=http://
[MySolrIP]:8983/solr/sabr102_shard2_replica1/|http://[MySolrIP]:7574/solr/sabr102_shard2_replica2/&NOW=1444416403161&start=0&shards.purpose=4&q=(content_text:Football)&isShard=true&fsv=true&preferLocalShards=false}
hits=12198 status=0 QTime=32
INFO  - 2015-10-09 18:47:04.727; [c:sabr102 s:shard1 r:core_node2
x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
[sabr102_shard1_replica1] webapp=/solr path=/select
params={start=1000&q=(content_text:Football)&rows=10} hits=24408 status=0
QTime=21569


> As I mentioned, by navigating to 20000 results using start and row I
> am getting time out from Solr.NET and I need a way to fix that.
You still haven't answered my question: Do your users actually need to page
that far?

No, they do not need to navigate that far, but I was checking the edge
cases. Moreover, based on my previous query results, even navigating to the
100th page (start=1000, since each page has 10 results - which users can
easily do via the query string in the URL, or by jumping several pages at
once in the UI, as I expose 10 pages at a time like Google or LinkedIn)
gives performance results that are not promising.

It shows that the shard-searches themselves are not what is slowing you
down. Are the returned documents very large? Try setting fl=id,score and
see if it brings response times below 1 second.
I have around 50-60 fields per document in the schema, but not all of them get
populated for each document. The main field that I am searching on is
called content_text but that is usually small.
I have tried running the following query on Solr
http://[MySolrMachine]:8983/solr/sabr102/select?q=(content_text:Football)&start=1000&rows=10&fl=id,score

and it took around 13.567 seconds *(the same happens here: after running
the query a second time, it took around 244 ms)*
The log shows that it did take Solr that long

INFO  - 2015-10-09 18:54:44.271; [c:sabr102 s:shard1 r:core_node2
x:sabr102_shard1_replica1] org.apache.solr.core.SolrCore;
[sabr102_shard1_replica1] webapp=/solr path=/select
params={fl=id,score&start=1000&q=(content_text:Football)&rows=10}
hits=24408 status=0 QTime=13567

INFO  - 2015-10-09 19:02:41.732; [c:sabr102 s:shard2 r:core_node1
x:sabr102_shard2_replica1] org.apache.solr.core.SolrCore;
[sabr102_shard2_replica1] webapp=/solr path=/select
params={distrib=false&wt=javabin&version=2&rows=1010&df=text&fl=id&fl=score&shard.url=http://
[MySolrIP]:8983/solr/sabr102_shard2_replica1/|http://[MySolrIP]:7574/solr/sabr102_shard2_replica2/&NOW=1444417361716&start=0&shards.purpose=4&q=(content_text:Football)&isShard=true&fsv=true&preferLocalShards=false}
hits=12198 status=0 QTime=9

*Why is it that shard1 is taking so much longer than shard2?*

I do note that one of your queries has rows=1010, a typo?
No, that was not a typo.

Try again with rows=0&start=1000 to see if it's something weird with getting
the stored data, but that's highly doubtful.
I have tried the query "Salman" with rows=0, start=1000 and it took around
13.819 seconds.

I think the only real way to get to the bottom of it will be to slap a
profiler on it and see where the time is being spent.
Can you direct me to a good profiler for Solr?

Regards,
Salman

On Fri, Oct 9, 2015 at 8:02 PM, Erick Erickson <er...@gmail.com>
wrote:

> OK, this makes very little sense. The individual queries are taking < 100ms
> yet the total response is 29 seconds. I do note that one of your
> queries has rows=1010, a typo?
>
> Anyway, not at all sure what's going on here. If these are gigantic files
> you're
> returning, then it could be decompressing time, unlikely but possible.
>
> Try again with rows=0&start=1000 to see if it's something weird with
> getting
> the stored data, but that's highly doubtful.
>
> I think the only real way to get to the bottom of it will be to slap a
> profiler
> on it and see where the time is being spent.
>
> Best,
> Erick
>
> On Fri, Oct 9, 2015 at 9:53 AM, Toke Eskildsen <te...@statsbiblioteket.dk>
> wrote:
> > Salman Ansari <sa...@gmail.com> wrote:
>> Thanks Erick for your response. If you find pagination is not the main
> >> culprit, what other factors do you guys suggest I need to tweak to test
> >> that?
> >
> > Well, is basic search slow? What are your response times for plain
> > un-warmed top-20 searches?
> >
> >> As I mentioned, by navigating to 20000 results using start and row I
> >> am getting time out from Solr.NET and I need a way to fix that.
> >
> > You still haven't answered my question: Do your users actually need to
> > page that far?
> >
> >
> > Again: I know there can be 10 million results. Why would your users need
> > to page through all of them? Why would they need to page through just the
> > first 1000? What are they trying to achieve?
> >
> > If they used it automatically for full export of the result set, then I
> > can understand it, but you talked about next & previous page, which
> > indicates that this is a manual process. A manual process that requires
> > clicking next 1000 times is a severe indicator that something can be done
> > differently.
> >
> > - Toke Eskildsen
>

Re: Solr Pagination

Posted by Erick Erickson <er...@gmail.com>.
OK, this makes very little sense. The individual queries are taking < 100ms
yet the total response is 29 seconds. I do note that one of your
queries has rows=1010, a typo?

Anyway, not at all sure what's going on here. If these are gigantic files you're
returning, then it could be decompressing time, unlikely but possible.

Try again with rows=0&start=1000 to see if it's something weird with getting
the stored data, but that's highly doubtful.

I think the only real way to get to the bottom of it will be to slap a profiler
on it and see where the time is being spent.

Best,
Erick

On Fri, Oct 9, 2015 at 9:53 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
> Salman Ansari <sa...@gmail.com> wrote:
>> Thanks Erick for your response. If you find pagination is not the main
>> culprit, what other factors do you guys suggest I need to tweak to test
>> that?
>
> Well, is basic search slow? What are your response times for plain un-warmed top-20 searches?
>
>> As I mentioned, by navigating to 20000 results using start and row I
>> am getting time out from Solr.NET and I need a way to fix that.
>
> You still haven't answered my question: Do your users actually need to page that far?
>
>
> Again: I know there can be 10 million results. Why would your users need to page through all of them? Why would they need to page through just the first 1000? What are they trying to achieve?
>
> If they used it automatically for full export of the result set, then I can understand it, but you talked about next & previous page, which indicates that this is a manual process. A manual process that requires clicking next 1000 times is a severe indicator that something can be done differently.
>
> - Toke Eskildsen

Re: Solr Pagination

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Salman Ansari <sa...@gmail.com> wrote:
> Thanks Erick for your response. If you find pagination is not the main
> culprit, what other factors do you guys suggest I need to tweak to test
> that?

Well, is basic search slow? What are your response times for plain un-warmed top-20 searches?

> As I mentioned, by navigating to 20000 results using start and row I
> am getting time out from Solr.NET and I need a way to fix that.

You still haven't answered my question: Do your users actually need to page that far?


Again: I know there can be 10 million results. Why would your users need to page through all of them? Why would they need to page through just the first 1000? What are they trying to achieve?

If they used it automatically for full export of the result set, then I can understand it, but you talked about next & previous page, which indicates that this is a manual process. A manual process that requires clicking next 1000 times is a severe indicator that something can be done differently.

- Toke Eskildsen

Re: Solr Pagination

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Salman Ansari <sa...@gmail.com> wrote:

> As for the logs, I searched for "Salman" with rows=10 and start=1000 and it
> took about 29 seconds to complete. However, it took less at each shard as
> shown in the log file

> [...] QTime=91
> [...] QTime=4

> the search in the second shard started AFTER 29 seconds. Any logic behind
> what I am seeing here?

It shows that the shard-searches themselves are not what is slowing you down. Are the returned documents very large? Try setting fl=id,score and see if it brings response times below 1 second.

- Toke Eskildsen

Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
I agree 10B will not be residing on the same machine :)

About the other issue you raised: while submitting the query to Solr I was
keeping a close eye on RAM and JVM consumption in the Solr Admin, and for the
queries at the beginning that were taking most of the time, neither RAM nor
the JVM was hitting its limit, so I doubt that is the problem. For reference, I
did have an issue with the JVM raising an "Out of Memory" exception when it
was around 500MB, but then I raised the machine capacity to 14GB RAM and 4GB
of JVM heap. I have read here (
https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache) that
for best performance I should be able to fit my entire collection in
memory. Does that sound reasonable?

As for the logs, I searched for "Salman" with rows=10 and start=1000 and it
took about 29 seconds to complete. However, it took much less at each shard, as
shown in the log file:

INFO  - 2015-10-09 16:43:39.170; [c:sabr102 s:shard1 r:core_node4
x:sabr102_shard1_replica2] org.apache.solr.core.SolrCore;
[sabr102_shard1_replica2] webapp=/solr path=/select
params={distrib=false&wt=javabin&version=2&rows=1010&df=text&fl=id&fl=score&shard.url=http://
[MySolrIP]:8983/solr/sabr102_shard1_replica1/|
http://100.114.184.37:7574/solr/sabr102_shard1_replica2/&NOW=1444409019061&start=0&shards.purpose=4&q=(content_text:Salman)&isShard=true&fsv=true&preferLocalShards=false}
hits=1819 status=0 QTime=91

INFO  - 2015-10-09 16:44:08.116; [c:sabr102 s:shard1 r:core_node4
x:sabr102_shard1_replica2] org.apache.solr.core.SolrCore;
[sabr102_shard1_replica2] webapp=/solr path=/select
params={ids=584673511333089281,584680513887010816,584697461744111616,584668540118044672,583299685516984320&distrib=false&wt=javabin&version=2&rows=10&df=text&shard.url=
http://100.114.184.37:8983/solr/sabr102_shard1_replica1/|http://[MySolrIP]:7574/solr/sabr102_shard1_replica2/&NOW=1444409019061&start=1000&shards.purpose=64&q=(content_text:Salman)&isShard=true&preferLocalShards=false}
status=0 QTime=4

The search in the second shard started AFTER 29 seconds. Any logic behind
what I am seeing here?

Moreover, I do understand that everyone's needs are different and I do need
to prototype, but there must be strategies to follow even when prototyping;
that is what I am hoping to hear from you and the community. My
concurrent user count is not that high, but I do have a good amount of data to
be stored/indexed in Solr, and if even one user cannot execute
queries efficiently, that will be problematic.

Regards,
Salman

On Fri, Oct 9, 2015 at 7:06 PM, Erick Erickson <er...@gmail.com>
wrote:

> bq: 10GB JVM as mentioned here...and they were getting 140 ms response
> time for 10 Billion documents
>
> This simply could _not_ work in a single shard as there's a hard 2B
> doc limit per shard. On slide 14
> it states "both collections are sharded". They are not fitting 10B
> docs in 10G of JVM on a single
> machine. Trust me on this ;). The slides do not state how many shards
> they've
> split their collection into, but I suspect it's a bunch. Each
> application is different enough that the
> numbers wouldn't translate anyway...
>
> 70M docs can fit on a single shard with quite good response time, but
> YMMV. You simply
> have to experiment. Here's a long blog on the subject:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Start with a profiler and see where you're spending your time. My
> first guess is that
> you're spending a lot of CPU cycles in garbage collection. This
> sometimes happens
> when you are running near your JVM limit: a GC kicks in and recovers a
> tiny bit of memory
> and then initiates another GC cycle immediately. Turn on GC logging
> and take a look
> at the stats provided, see:
> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>
> Tens of seconds is entirely unexpected though. Do the Solr logs point
> to anything happening?
>
> Best,
> Erick
>
> On Fri, Oct 9, 2015 at 8:51 AM, Salman Ansari <sa...@gmail.com>
> wrote:
> > Thanks Erick for your response. If you find pagination is not the main
> > culprit, what other factors do you guys suggest I tweak to test
> > that? As I mentioned, by navigating to 20000 results using start and rows I
> > am getting a timeout from Solr.NET and I need a way to fix that.
> >
> > You suggested that a 4GB JVM is not enough; I have seen MapQuest going with
> > a 10GB JVM as mentioned here
> >
> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
> > and they were getting 140 ms response times for 10 billion documents. Not
> > sure how many shards they had, though. With data of around 70M documents,
> > what do you guys suggest: how many shards should I use, and how much
> > should I dedicate for RAM and JVM?
> >
> > Regards,
> > Salman
> >
> > On Fri, Oct 9, 2015 at 6:37 PM, Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> I think paging is something of a red herring. You say:
> >>
> >> bq: but still I get delays of around 16 seconds and sometimes even more.
> >>
> >> Even for a start of 1,000, this is ridiculously long for Solr. All
> >> you're really saving
> >> here is keeping a record of the id and score for a list 1,000 cells
> >> long (or even
> >> 20,000 assuming 1,000 pages and 20 docs/page). That's somewhat wasteful,
> >> but it's still hard to believe it's responsible for what you're seeing.
> >>
> >> Having 4G of RAM for 70M docs is very little memory, assuming this is on
> >> a single shard.
> >>
> >> So my suspicion is that you have something fundamentally slow about
> >> your system, the additional overhead shouldn't be as large as you're
> >> reporting.
> >>
> >> And I'll second Toke's comment. It's very rare that users see anything
> >> _useful_ by navigating that deep. Make them hit next next next and
> they'll
> >> tire out way before that.
> >>
> >> Cursor mark's sweet spot is handling some kind of automated process that
> >> goes through the whole result set. It'll work for what you're trying
> >> to do though.
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Oct 9, 2015 at 8:27 AM, Salman Ansari <sa...@gmail.com>
> >> wrote:
> >> > Is this a real problem or a worry? Do you have users that page really
> >> deep
> >> > and if so, have you considered other mechanisms for delivering what
> they
> >> > need?
> >> >
> >> > The issue is that currently I have around 70M documents and some generic
> >> > queries are resulting in lots of pages. Now if I try deep navigation (to
> >> > page# 1000 for example), a lot of times the query takes so long that
> >> > Solr.NET throws an operation timeout exception. The first page is relatively
> >> > faster to load, but it still takes around a few seconds as well. After reading
> >> > some documentation I realized that cursors could help, and they do. I have
> >> > tried the following to get better performance:
> >> >
> >> > 1) Used cursors instead of start and rows
> >> > 2) Increased the RAM on my Solr machine to 14GB
> >> > 3) Increased the JVM on that machine to 4GB
> >> > 4) Increased the filterCache
> >> > 5) Increased the documentCache
> >> > 6) Ran Optimize from the Solr Admin
> >> >
> >> > but still I get delays of around 16 seconds and sometimes even more.
> >> > What other mechanisms do you suggest I should use to handle this
> issue?
> >> >
> >> > While pagination is faster than increasing the start parameter, the
> >> > difference is small as long as you stay below a start of 1000. 10K
> might
> >> > also work for you. Do your users page beyond that?
> >> > I can limit users not to go beyond 10K, but I still think at that level
> >> > cursors will be much faster than increasing the start variable, as explained
> >> > here (https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> >> > ). Have you tried both ways on your collection, and did they give you
> >> > similar results?
> >> >
> >> > On Fri, Oct 9, 2015 at 5:20 PM, Toke Eskildsen <
> te@statsbiblioteket.dk>
> >> > wrote:
> >> >
> >> >> Salman Ansari <sa...@gmail.com> wrote:
> >> >>
> >> >> [Pagination with cursors]
> >> >>
> >> >> > For example, what happens if the user navigates from page 1 to
> page 2,
> >> >> > does the front end  need to store the next cursor at each query?
> >> >>
> >> >> Yes.
> >> >>
> >> >> > What about going to a previous page, do we need to store all
> cursors
> >> >> > that have been navigated up to now at the client side?
> >> >>
> >> >> Yes, if you want to provide that functionality.
> >> >>
> >> >> Is this a real problem or a worry? Do you have users that page really
> >> deep
> >> >> and if so, have you considered other mechanisms for delivering what
> they
> >> >> need?
> >> >>
> >> >> While pagination is faster than increasing the start parameter, the
> >> >> difference is small as long as you stay below a start of 1000. 10K
> might
> >> >> also work for you. Do your users page beyond that?
> >> >>
> >> >> - Toke Eskildsen
> >> >>
> >>
>

Re: Solr Pagination

Posted by Erick Erickson <er...@gmail.com>.
bq: 10GB JVM as mentioned here...and they were getting 140 ms response
time for 10 Billion documents

This simply could _not_ work in a single shard as there's a hard 2B
doc limit per shard. On slide 14
it states "both collections are sharded". They are not fitting 10B
docs in 10G of JVM on a single
machine. Trust me on this ;). The slides do not state how many shards they've
split their collection into, but I suspect it's a bunch. Each
application is different enough that the
numbers wouldn't translate anyway...

70M docs can fit on a single shard with quite good response time, but
YMMV. You simply
have to experiment. Here's a long blog on the subject:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Start with a profiler and see where you're spending your time. My
first guess is that
you're spending a lot of CPU cycles in garbage collection. This
sometimes happens
when you are running near your JVM limit: a GC kicks in and recovers a
tiny bit of memory
and then initiates another GC cycle immediately. Turn on GC logging
and take a look
at the stats provided, see:
https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
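
For reference, a sketch of the GC-logging flags on the Oracle/OpenJDK 7 and 8
JVMs of this era (the log path is a placeholder; in a stock Solr 5.x install
they would typically go into GC_LOG_OPTS in bin/solr.in.sh):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log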

Tens of seconds is entirely unexpected though. Do the Solr logs point
to anything happening?

Best,
Erick

On Fri, Oct 9, 2015 at 8:51 AM, Salman Ansari <sa...@gmail.com> wrote:
> Thanks Erick for your response. If you find pagination is not the main
> culprit, what other factors do you guys suggest I tweak to test
> that? As I mentioned, by navigating to 20000 results using start and rows I
> am getting a timeout from Solr.NET and I need a way to fix that.
>
> You suggested that a 4GB JVM is not enough; I have seen MapQuest going with
> a 10GB JVM as mentioned here
> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
> and they were getting 140 ms response times for 10 billion documents. Not
> sure how many shards they had, though. With data of around 70M documents,
> what do you guys suggest: how many shards should I use, and how much
> should I dedicate for RAM and JVM?
>
> Regards,
> Salman
>
> On Fri, Oct 9, 2015 at 6:37 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> I think paging is something of a red herring. You say:
>>
>> bq: but still I get delays of around 16 seconds and sometimes even more.
>>
>> Even for a start of 1,000, this is ridiculously long for Solr. All
>> you're really saving
>> here is keeping a record of the id and score for a list 1,000 cells
>> long (or even
>> 20,000 assuming 1,000 pages and 20 docs/page). That's somewhat wasteful,
>> but it's still hard to believe it's responsible for what you're seeing.
>>
>> Having 4G of RAM for 70M docs is very little memory, assuming this is on
>> a single shard.
>>
>> So my suspicion is that you have something fundamentally slow about
>> your system, the additional overhead shouldn't be as large as you're
>> reporting.
>>
>> And I'll second Toke's comment. It's very rare that users see anything
>> _useful_ by navigating that deep. Make them hit next next next and they'll
>> tire out way before that.
>>
>> Cursor mark's sweet spot is handling some kind of automated process that
>> goes through the whole result set. It'll work for what you're trying
>> to do though.
>>
>> Best,
>> Erick
>>
>> On Fri, Oct 9, 2015 at 8:27 AM, Salman Ansari <sa...@gmail.com>
>> wrote:
>> > Is this a real problem or a worry? Do you have users that page really
>> deep
>> > and if so, have you considered other mechanisms for delivering what they
>> > need?
>> >
>> > The issue is that currently I have around 70M documents and some generic
>> > queries are resulting in lots of pages. Now if I try deep navigation (to
>> > page# 1000 for example), a lot of times the query takes so long that
>> > Solr.NET throws an operation timeout exception. The first page is relatively
>> > faster to load, but it still takes around a few seconds as well. After reading
>> > some documentation I realized that cursors could help, and they do. I have
>> > tried the following to get better performance:
>> >
>> > 1) Used cursors instead of start and rows
>> > 2) Increased the RAM on my Solr machine to 14GB
>> > 3) Increased the JVM on that machine to 4GB
>> > 4) Increased the filterCache
>> > 5) Increased the documentCache
>> > 6) Ran Optimize from the Solr Admin
>> >
>> > but still I get delays of around 16 seconds and sometimes even more.
>> > What other mechanisms do you suggest I should use to handle this issue?
>> >
>> > While pagination is faster than increasing the start parameter, the
>> > difference is small as long as you stay below a start of 1000. 10K might
>> > also work for you. Do your users page beyond that?
>> > I can limit users not to go beyond 10K, but I still think at that level
>> > cursors will be much faster than increasing the start variable, as explained
>> > here (https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
>> > ). Have you tried both ways on your collection, and did they give you
>> > similar results?
>> >
>> > On Fri, Oct 9, 2015 at 5:20 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
>> > wrote:
>> >
>> >> Salman Ansari <sa...@gmail.com> wrote:
>> >>
>> >> [Pagination with cursors]
>> >>
>> >> > For example, what happens if the user navigates from page 1 to page 2,
>> >> > does the front end  need to store the next cursor at each query?
>> >>
>> >> Yes.
>> >>
>> >> > What about going to a previous page, do we need to store all cursors
>> >> > that have been navigated up to now at the client side?
>> >>
>> >> Yes, if you want to provide that functionality.
>> >>
>> >> Is this a real problem or a worry? Do you have users that page really
>> deep
>> >> and if so, have you considered other mechanisms for delivering what they
>> >> need?
>> >>
>> >> While pagination is faster than increasing the start parameter, the
>> >> difference is small as long as you stay below a start of 1000. 10K might
>> >> also work for you. Do your users page beyond that?
>> >>
>> >> - Toke Eskildsen
>> >>
>>

Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
Thanks Erick for your response. If you find pagination is not the main
culprit, what other factors do you guys suggest I tweak to test
that? As I mentioned, by navigating to 20000 results using start and rows I
am getting a timeout from Solr.NET and I need a way to fix that.

You suggested that a 4GB JVM is not enough; I have seen MapQuest going with
a 10GB JVM as mentioned here
http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
and they were getting 140 ms response times for 10 billion documents. Not
sure how many shards they had, though. With data of around 70M documents,
what do you guys suggest: how many shards should I use, and how much
should I dedicate for RAM and JVM?

Regards,
Salman

On Fri, Oct 9, 2015 at 6:37 PM, Erick Erickson <er...@gmail.com>
wrote:

> I think paging is something of a red herring. You say:
>
> bq: but still I get delays of around 16 seconds and sometimes even more.
>
> Even for a start of 1,000, this is ridiculously long for Solr. All
> you're really saving
> here is keeping a record of the id and score for a list 1,000 cells
> long (or even
> 20,000 assuming 1,000 pages and 20 docs/page). That's somewhat wasteful,
> but it's still hard to believe it's responsible for what you're seeing.
>
> Having 4G of RAM for 70M docs is very little memory, assuming this is on
> a single shard.
>
> So my suspicion is that you have something fundamentally slow about
> your system, the additional overhead shouldn't be as large as you're
> reporting.
>
> And I'll second Toke's comment. It's very rare that users see anything
> _useful_ by navigating that deep. Make them hit next next next and they'll
> tire out way before that.
>
> Cursor mark's sweet spot is handling some kind of automated process that
> goes through the whole result set. It'll work for what you're trying
> to do though.
>
> Best,
> Erick
>
> On Fri, Oct 9, 2015 at 8:27 AM, Salman Ansari <sa...@gmail.com>
> wrote:
> > Is this a real problem or a worry? Do you have users that page really
> deep
> > and if so, have you considered other mechanisms for delivering what they
> > need?
> >
> > The issue is that currently I have around 70M documents and some generic
> > queries are resulting in lots of pages. Now if I try deep navigation (to
> > page# 1000 for example), a lot of times the query takes so long that
> > Solr.NET throws an operation timeout exception. The first page is relatively
> > faster to load, but it still takes around a few seconds as well. After reading
> > some documentation I realized that cursors could help, and they do. I have
> > tried the following to get better performance:
> >
> > 1) Used cursors instead of start and rows
> > 2) Increased the RAM on my Solr machine to 14GB
> > 3) Increased the JVM on that machine to 4GB
> > 4) Increased the filterCache
> > 5) Increased the documentCache
> > 6) Ran Optimize from the Solr Admin
> >
> > but still I get delays of around 16 seconds and sometimes even more.
> > What other mechanisms do you suggest I should use to handle this issue?
> >
> > While pagination is faster than increasing the start parameter, the
> > difference is small as long as you stay below a start of 1000. 10K might
> > also work for you. Do your users page beyond that?
> > I can limit users not to go beyond 10K, but I still think at that level
> > cursors will be much faster than increasing the start variable, as explained
> > here (https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> > ). Have you tried both ways on your collection, and did they give you
> > similar results?
> >
> > On Fri, Oct 9, 2015 at 5:20 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
> > wrote:
> >
> >> Salman Ansari <sa...@gmail.com> wrote:
> >>
> >> [Pagination with cursors]
> >>
> >> > For example, what happens if the user navigates from page 1 to page 2,
> >> > does the front end  need to store the next cursor at each query?
> >>
> >> Yes.
> >>
> >> > What about going to a previous page, do we need to store all cursors
> >> > that have been navigated up to now at the client side?
> >>
> >> Yes, if you want to provide that functionality.
> >>
> >> Is this a real problem or a worry? Do you have users that page really
> deep
> >> and if so, have you considered other mechanisms for delivering what they
> >> need?
> >>
> >> While pagination is faster than increasing the start parameter, the
> >> difference is small as long as you stay below a start of 1000. 10K might
> >> also work for you. Do your users page beyond that?
> >>
> >> - Toke Eskildsen
> >>
>

Re: Solr Pagination

Posted by Erick Erickson <er...@gmail.com>.
I think paging is something of a red herring. You say:

bq: but still I get delays of around 16 seconds and sometimes even more.

Even for a start of 1,000, this is ridiculously long for Solr. All
you're really saving
here is keeping a record of the id and score for a list 1,000 cells
long (or even
20,000 assuming 1,000 pages and 20 docs/page). That's somewhat wasteful,
but it's still hard to believe it's responsible for what you're seeing.

Having 4G of RAM for 70M docs is very little memory, assuming this is on
a single shard.

So my suspicion is that you have something fundamentally slow about
your system, the additional overhead shouldn't be as large as you're
reporting.

And I'll second Toke's comment. It's very rare that users see anything
_useful_ by navigating that deep. Make them hit next next next and they'll
tire out way before that.

Cursor mark's sweet spot is handling some kind of automated process that
goes through the whole result set. It'll work for what you're trying
to do though.
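
For that automated case, a minimal SolrJ sketch (the URL, query, and sort
field are placeholders; the same loop can be written with Solr.NET): keep
re-querying with the returned cursor until it stops changing.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorSweep {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/collection1");
            String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
            while (true) {
                SolrQuery q = new SolrQuery("*:*");
                q.setRows(1000);
                // Cursors require a deterministic sort ending on the uniqueKey.
                q.setSort(SolrQuery.SortClause.asc("id"));
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);
                // Process rsp.getResults() here.
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) {
                    break; // an unchanged cursor means the sweep is done
                }
                cursor = next;
            }
            client.close();
        }
    }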

Best,
Erick

On Fri, Oct 9, 2015 at 8:27 AM, Salman Ansari <sa...@gmail.com> wrote:
> Is this a real problem or a worry? Do you have users that page really deep
> and if so, have you considered other mechanisms for delivering what they
> need?
>
> The issue is that currently I have around 70M documents and some generic
> queries are resulting in lots of pages. Now if I try deep navigation (to
> page# 1000 for example), a lot of times the query takes so long that
> Solr.NET throws an operation timeout exception. The first page is relatively
> faster to load, but it still takes around a few seconds as well. After reading
> some documentation I realized that cursors could help, and they do. I have
> tried the following to get better performance:
>
> 1) Used cursors instead of start and rows
> 2) Increased the RAM on my Solr machine to 14GB
> 3) Increased the JVM on that machine to 4GB
> 4) Increased the filterCache
> 5) Increased the documentCache
> 6) Ran Optimize from the Solr Admin
>
> but still I get delays of around 16 seconds and sometimes even more.
> What other mechanisms do you suggest I should use to handle this issue?
>
> While pagination is faster than increasing the start parameter, the
> difference is small as long as you stay below a start of 1000. 10K might
> also work for you. Do your users page beyond that?
> I can limit users not to go beyond 10K, but I still think at that level
> cursors will be much faster than increasing the start variable, as explained
> here (https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> ). Have you tried both ways on your collection, and did they give you
> similar results?
>
> On Fri, Oct 9, 2015 at 5:20 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
> wrote:
>
>> Salman Ansari <sa...@gmail.com> wrote:
>>
>> [Pagination with cursors]
>>
>> > For example, what happens if the user navigates from page 1 to page 2,
>> > does the front end  need to store the next cursor at each query?
>>
>> Yes.
>>
>> > What about going to a previous page, do we need to store all cursors
>> > that have been navigated up to now at the client side?
>>
>> Yes, if you want to provide that functionality.
>>
>> Is this a real problem or a worry? Do you have users that page really deep
>> and if so, have you considered other mechanisms for delivering what they
>> need?
>>
>> While pagination is faster than increasing the start parameter, the
>> difference is small as long as you stay below a start of 1000. 10K might
>> also work for you. Do your users page beyond that?
>>
>> - Toke Eskildsen
>>

Re: Solr Pagination

Posted by Salman Ansari <sa...@gmail.com>.
Is this a real problem or a worry? Do you have users that page really deep
and if so, have you considered other mechanisms for delivering what they
need?

The issue is that currently I have around 70M documents and some generic
queries are resulting in lots of pages. Now if I try deep navigation (to
page# 1000 for example), a lot of times the query takes so long that
Solr.NET throws an operation timeout exception. The first page is relatively
faster to load, but it still takes around a few seconds as well. After reading
some documentation I realized that cursors could help, and they do. I have
tried the following to get better performance:

1) Used cursors instead of start and rows
2) Increased the RAM on my Solr machine to 14GB
3) Increased the JVM on that machine to 4GB
4) Increased the filterCache
5) Increased the documentCache
6) Ran Optimize from the Solr Admin

but still I get delays of around 16 seconds and sometimes even more.
What other mechanisms do you suggest I should use to handle this issue?

While pagination is faster than increasing the start parameter, the
difference is small as long as you stay below a start of 1000. 10K might
also work for you. Do your users page beyond that?
I can limit users not to go beyond 10K, but I still think at that level
cursors will be much faster than increasing the start variable, as explained
here (https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
). Have you tried both ways on your collection, and did they give you
similar results?

On Fri, Oct 9, 2015 at 5:20 PM, Toke Eskildsen <te...@statsbiblioteket.dk>
wrote:

> Salman Ansari <sa...@gmail.com> wrote:
>
> [Pagination with cursors]
>
> > For example, what happens if the user navigates from page 1 to page 2,
> > does the front end  need to store the next cursor at each query?
>
> Yes.
>
> > What about going to a previous page, do we need to store all cursors
> > that have been navigated up to now at the client side?
>
> Yes, if you want to provide that functionality.
>
> Is this a real problem or a worry? Do you have users that page really deep
> and if so, have you considered other mechanisms for delivering what they
> need?
>
> While pagination is faster than increasing the start parameter, the
> difference is small as long as you stay below a start of 1000. 10K might
> also work for you. Do your users page beyond that?
>
> - Toke Eskildsen
>

Re: Solr Pagination

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Salman Ansari <sa...@gmail.com> wrote:

[Pagination with cursors]

> For example, what happens if the user navigates from page 1 to page 2,
> does the front end  need to store the next cursor at each query?

Yes.

> What about going to a previous page, do we need to store all cursors
> that have been navigated up to now at the client side?

Yes, if you want to provide that functionality.
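
A minimal SolrJ sketch of that bookkeeping (the collection URL, query, and
page size are placeholders; Solr.NET exposes the same cursorMark parameter
through its own API). The idea is to remember the cursorMark that produced
each page, so "previous" becomes a lookup instead of a re-walk from the start:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class PagedCursorSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/collection1");

            // cursors.get(n) holds the cursorMark that fetches page n.
            List<String> cursors = new ArrayList<>();
            cursors.add(CursorMarkParams.CURSOR_MARK_START); // "*" = page 0

            // Fetch page 0 and remember the cursor for page 1; repeat on "next".
            String next = fetchPage(client, cursors.get(0));
            cursors.add(next);
            // "Previous page" is then just fetchPage(client, cursors.get(pageNo)).

            client.close();
        }

        static String fetchPage(HttpSolrClient client, String cursorMark)
                throws Exception {
            SolrQuery q = new SolrQuery("content_text:Salman");
            q.setRows(10);
            // Cursors require a deterministic sort ending on the uniqueKey.
            q.setSort(SolrQuery.SortClause.asc("id"));
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(q);
            // Render rsp.getResults() to the user here.
            return rsp.getNextCursorMark();
        }
    }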

Is this a real problem or a worry? Do you have users that page really deep and if so, have you considered other mechanisms for delivering what they need? 

While pagination is faster than increasing the start parameter, the difference is small as long as you stay below a start of 1000. 10K might also work for you. Do your users page beyond that?

- Toke Eskildsen