Posted to solr-user@lucene.apache.org by Raghuveer Kancherla <ra...@aplopio.com> on 2009/11/26 18:48:11 UTC

Retrieving large num of docs

Hi,
I am using Solr1.4 for searching through half a million documents. The
problem is, I want to retrieve nearly 200 documents for each search query.
The query time in Solr logs is showing 0.02 seconds and I am fairly happy
with that. However Solr is taking a long time (4 to 5 secs) to return the
results (I think it is because of the number of docs I am requesting). I
tried returning only the id's (unique key) without any other stored fields,
but it is not helping me improve the response times (time to return the id's
of matching documents).
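
The kind of request I am making (200 rows, with the response limited to the
unique key via the fl parameter) looks roughly like this; the "id" field
name, host, and port here are just placeholders:

http://localhost:8983/solr/select/?q=some+query&start=0&rows=200&fl=id
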
I understand that retrieving 200 documents for each search term is
impractical in most scenarios, but I don't have any other option. Any pointers
on how to improve the response times will be a great help.

Thanks,
 Raghu

Re: Retrieving large num of docs

Posted by Andrey Klochkov <ak...@griddynamics.com>.
Hi Raghu

Let me describe our use case in more detail. That will probably clarify
things.

The usual use case for Lucene/Solr is retrieving a small portion of the
result set (10-20 documents). In our case we need to read the whole result
set, and this creates a huge load on the Lucene index, meaning a lot of IO.
Keep in mind that we have a large number of stored fields in the index.

In our case there's one thing that makes things simpler: our index is so
small that we can fit every document in cache. This means that even if we
retrieve all documents for every result set, we don't retrieve them from the
Lucene index, and performance should then be OK. But here we've got 2
problems:

1. Solr caches Lucene's Document instances, and when retrieving the whole
result set it recreates SolrDocument instances every time. This creates a
load on the CPU and in particular on Java GC.
2. EmbeddedSolrServer converts the whole response into a byte array and then
restores it back, converting Lucene's Documents and DocLists to Solr's
SolrDocument and SolrDocumentList instances. This creates additional load on
the CPU and GC.

We patched Solr to eliminate those things and that fixed our performance
problems.

I think that if you don't have all your documents in caches, and/or you
don't use stored fields and retrieve only the ID field, then those
improvements probably won't help you.

I suggest you first find your bottlenecks. Look at IO, memory usage, etc.
Using a profiler is the best approach. You can probably use some of the tools
from Lucid Imagination for profiling.

On Sat, Nov 28, 2009 at 4:47 PM, Raghuveer Kancherla <
raghuveer.kancherla@aplopio.com> wrote:

> Hi Andrew,
> I applied the patch you suggested. I am not finding any significant changes
> in the response times.
> I am wondering if I forgot some important configuration setting etc.
> Here is what I did:
>
>   1. Wrote a small program using solrj to use EmbeddedSolrServer (most of
>   the code is from the solr wiki) and run the server on an index of ~700k
> docs
>   and note down the avg response time
>   2. Applied the SOLR-797.patch to the source code of Solr1.4
>   3. Compiled the source code and rebuilt the jar files.
>   4. Reran step 1 using the new jar files.
>
> Am I supposed to make any other config changes in order to see the
> performance jump that you are able to achieve?
>
> Thanks a lot,
> Raghu
>
>
> On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN <io...@yahoo.com> wrote:
>
> > > Hi Andrew,
> > > We are running solr using its http interface from python.
> > > From the resources
> > > I could find, EmbeddedSolrServer is possible only if I am
> > > using solr from a
> > > java program.  It will be useful to understand if a
> > > significant part of the
> > > performance increase is due to bypassing HTTP before going
> > > down this path.
> > >
> > > In the mean time I am trying my luck with the other
> > > suggestions. Can you
> > > share the patch that helps cache solr documents instead of
> > > lucene documents?
> >
> > Maybe these links can help
> > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> > http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr
> >
> > how often do you update your index?
> > is your index optimized?
> > configuring caching can also help:
> >
> > http://wiki.apache.org/solr/SolrCaching
> > http://wiki.apache.org/solr/SolrPerformanceFactors
> >
> >
> >
> >
> >
> >
>



-- 
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics

Re: Retrieving large num of docs

Posted by Raghuveer Kancherla <ra...@aplopio.com>.
Hi Andrew,
I applied the patch you suggested. I am not finding any significant changes
in the response times.
I am wondering if I forgot some important configuration setting etc.
Here is what I did:

   1. Wrote a small program using solrj to use EmbeddedSolrServer (most of
   the code is from the solr wiki), ran the server on an index of ~700k docs,
   and noted down the avg response time (a rough sketch of such a program
   follows this list)
   2. Applied the SOLR-797.patch to the source code of Solr1.4
   3. Compiled the source code and rebuilt the jar files.
   4. Reran step 1 using the new jar files.
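
A minimal sketch of the step-1 program, assuming Solr 1.4-era SolrJ, a solr
home under ./solr with a single default core, and a unique key field named
"id" (the query string is just a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class EmbeddedQueryTimer {
    public static void main(String[] args) throws Exception {
        // Boot an embedded core from solr.solr.home, as in the solr wiki example
        System.setProperty("solr.solr.home", "solr");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "");

        // Ask for 200 rows but only the unique key field
        SolrQuery query = new SolrQuery("java j2ee");
        query.setRows(200);
        query.setFields("id");

        long start = System.currentTimeMillis();
        QueryResponse rsp = server.query(query);
        long elapsed = System.currentTimeMillis() - start;

        System.out.println("QTime=" + rsp.getQTime() + "ms, wall clock="
                + elapsed + "ms, docs returned=" + rsp.getResults().size());

        container.shutdown();
    }
}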

Am I supposed to make any other config changes in order to see the performance
jump that you are able to achieve?

Thanks a lot,
Raghu


On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN <io...@yahoo.com> wrote:

> > Hi Andrew,
> > We are running solr using its http interface from python.
> > From the resources
> > I could find, EmbeddedSolrServer is possible only if I am
> > using solr from a
> > java program.  It will be useful to understand if a
> > significant part of the
> > performance increase is due to bypassing HTTP before going
> > down this path.
> >
> > In the mean time I am trying my luck with the other
> > suggestions. Can you
> > share the patch that helps cache solr documents instead of
> > lucene documents?
>
> Maybe these links can help
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr
>
> how often do you update your index?
> is your index optimized?
> configuring caching can also help:
>
> http://wiki.apache.org/solr/SolrCaching
> http://wiki.apache.org/solr/SolrPerformanceFactors
>
>
>
>
>
>

Re: Retrieving large num of docs

Posted by AHMET ARSLAN <io...@yahoo.com>.
> Hi Andrew,
> We are running solr using its http interface from python.
> From the resources
> I could find, EmbeddedSolrServer is possible only if I am
> using solr from a
> java program.  It will be useful to understand if a
> significant part of the
> performance increase is due to bypassing HTTP before going
> down this path.
> 
> In the mean time I am trying my luck with the other
> suggestions. Can you
> share the patch that helps cache solr documents instead of
> lucene documents?

Maybe these links can help
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

How often do you update your index?
Is your index optimized?
Configuring caching can also help:

http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SolrPerformanceFactors
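
For the caching links above, the relevant knobs in solrconfig.xml look
roughly like this; the sizes and window value below are illustrative
assumptions, not recommendations:

<query>
  <!-- caches ordered lists of matching doc ids per query; keep the window at
       least as large as the page size you request (e.g. 200-300 rows) -->
  <queryResultWindowSize>300</queryResultWindowSize>
  <queryResultCache class="solr.LRUCache" size="512"
                    initialCapacity="512" autowarmCount="0"/>
  <!-- caches stored fields of recently fetched documents -->
  <documentCache class="solr.LRUCache" size="512"
                 initialCapacity="512" autowarmCount="0"/>
</query>
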




      

Re: Retrieving large num of docs

Posted by Raghuveer Kancherla <ra...@aplopio.com>.
Hi Andrew,
We are running Solr using its HTTP interface from Python. From the resources
I could find, EmbeddedSolrServer is possible only if I am using Solr from a
Java program. It will be useful to understand whether a significant part of the
performance increase is due to bypassing HTTP before going down this path.

In the meantime I am trying my luck with the other suggestions. Can you
share the patch that helps cache Solr documents instead of Lucene documents?


On a different note, I am wondering why it takes 4-5 seconds for Solr
to return the IDs of ranked documents when it can rank the results in about
20 milliseconds. Am I missing something here?

Thanks,
Raghu



On Fri, Nov 27, 2009 at 2:15 AM, Andrey Klochkov <aklochkov@griddynamics.com
> wrote:

> Hi
>
> We obtain ALL documents for every query; the index size is about 50k. We use
> a number of stored fields. Often the result set size is several thousand
> docs.
>
> We performed the following things to make it faster:
>
> 1. Use EmbeddedSolrServer
> 2. Patch Solr to avoid unnecessary marshalling while using
> EmbeddedSolrServer (there's an issue in Solr's JIRA)
> 3. Patch Solr to cache SolrDocument instances instead of Lucene's Document
> instances. I was going to share this patch, but then decided that our usage
> of Solr is not common and this functionality is useless in most cases
> 4. We have all documents in cache
> 5. In fact our index is stored in a data grid, not a file system. But as
> tests showed, this is not important because standard FSDirectory is faster
> if you have enough RAM free for OS caches.
>
> These changes improved the performance very much, so in the end we have
> performance comparable (about 3-5 times slower) to the "proper" Solr usage
> (obtaining first 20 documents).
>
> To get more details on how different Solr components perform we injected
> perf4j statements into key points in the code. And a profiler was helpful
> too.
>
> Hope it helps somehow.
>
> On Thu, Nov 26, 2009 at 8:48 PM, Raghuveer Kancherla <
> raghuveer.kancherla@aplopio.com> wrote:
>
> > Hi,
> > I am using Solr1.4 for searching through half a million documents. The
> > problem is, I want to retrieve nearly 200 documents for each search
> query.
> > The query time in Solr logs is showing 0.02 seconds and I am fairly happy
> > with that. However Solr is taking a long time (4 to 5 secs) to return the
> > results (I think it is because of the number of docs I am requesting). I
> > tried returning only the id's (unique key) without any other stored
> fields,
> > but it is not helping me improve the response times (time to return the
> > id's
> > of matching documents).
> > I understand that retrieving 200 documents for each search term is
> > impractical in most scenarios, but I don't have any other option. Any
> > pointers on how to improve the response times will be a great help.
> >
> > Thanks,
> >  Raghu
> >
>
>
>
> --
> Andrew Klochkov
> Senior Software Engineer,
> Grid Dynamics
>

Re: Retrieving large num of docs

Posted by Andrey Klochkov <ak...@griddynamics.com>.
Hi

We obtain ALL documents for every query; the index size is about 50k. We use
a number of stored fields. Often the result set size is several thousand
docs.

We performed the following things to make it faster:

1. Use EmbeddedSolrServer
2. Patch Solr to avoid unnecessary marshalling while using
EmbeddedSolrServer (there's an issue in Solr's JIRA)
3. Patch Solr to cache SolrDocument instances instead of Lucene's Document
instances. I was going to share this patch, but then decided that our usage
of Solr is not common and this functionality is useless in most cases.
4. We have all documents in cache.
5. In fact our index is stored in a data grid, not a file system. But as
tests showed, this is not important because standard FSDirectory is faster if
you have enough RAM free for OS caches.

These changes improved the performance very much, so in the end we have
performance comparable (about 3-5 times slower) to the "proper" Solr usage
(obtaining first 20 documents).

To get more detail on how different Solr components perform, we injected
perf4j statements into key points in the code. A profiler was helpful
too.
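
For reference, a perf4j "statement" is just a stop watch around the block
being measured; a minimal sketch (the tag name is made up):

import org.perf4j.LoggingStopWatch;
import org.perf4j.StopWatch;

public class TimedBlock {
    public static void main(String[] args) throws InterruptedException {
        // LoggingStopWatch prints the tag and elapsed millis when stop() is called
        StopWatch watch = new LoggingStopWatch("solr.fetchStoredFields");
        Thread.sleep(50); // stand-in for the code being measured
        watch.stop();
    }
}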

Hope it helps somehow.

On Thu, Nov 26, 2009 at 8:48 PM, Raghuveer Kancherla <
raghuveer.kancherla@aplopio.com> wrote:

> Hi,
> I am using Solr1.4 for searching through half a million documents. The
> problem is, I want to retrieve nearly 200 documents for each search query.
> The query time in Solr logs is showing 0.02 seconds and I am fairly happy
> with that. However Solr is taking a long time (4 to 5 secs) to return the
> results (I think it is because of the number of docs I am requesting). I
> tried returning only the id's (unique key) without any other stored fields,
> but it is not helping me improve the response times (time to return the
> id's
> of matching documents).
> I understand that retrieving 200 documents for each search term is
> impractical in most scenarios, but I don't have any other option. Any
> pointers on how to improve the response times will be a great help.
>
> Thanks,
>  Raghu
>



-- 
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics

Re: Retrieving large num of docs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Strange.  Did you ever figure out the source of the performance difference?

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Raghuveer Kancherla <ra...@aplopio.com>
> To: solr-user@lucene.apache.org
> Sent: Sat, December 5, 2009 12:05:49 PM
> Subject: Re: Retrieving large num of docs
> 
> Hi Otis,
> I think my experiments are not conclusive about reduction in search time. I
> was playing around with various configurations to reduce the time to
> retrieve documents from Solr. What I am sure of is that after changing the two
> multi-valued text fields from stored to un-stored, retrieval time (query time
> + time to load the stored fields) became very fast. I was expecting the
> lazyFieldLoading setting in solrconfig to take care of this, but apparently
> it is not working as expected.
> 
> Out of curiosity, I removed these 2 fields from the index (this time I am
> not even indexing them) and my search time got better (10 times better).
> However, I am still trying to isolate the reason for the search time
> reduction. It may be because there are 2 fewer fields to search in, because of
> the reduction in the size of the index, or maybe something else. I am not
> sure if lazyFieldLoading has any part in explaining this.
> 
> - Raghu
> 
> 
> 
> On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic 
> > wrote:
> 
> > Hm, hm, interesting.  I was looking into something like this the other day
> > (BIG indexed+stored text fields).  After seeing enableLazyFieldLoading=true
> > in solrconfig and after seeing "fl" didn't include those big fields, I
> > though "hm, so Lucene/Solr will not be pulling those large fields from disk,
> > OK".
> >
> > You are saying that this may not be true based on your experiment?
> > And what I'm calling your "experiment" means that you reindexed the same
> > data, but without the 2 multi-valued text fields ... and that was the only
> > change you made, and you got roughly a 10x search performance improvement?
> >
> > Sorry for repeating your words, just trying to confirm and understand.
> >
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Raghuveer Kancherla 
> > > To: solr-user@lucene.apache.org
> > > Sent: Thu, December 3, 2009 8:43:16 AM
> > > Subject: Re: Retrieving large num of docs
> > >
> > > Hi Hoss,
> > >
> > > I was experimenting with various queries to solve this problem and in one
> > > such test I remember that requesting only the ID did not change the
> > > retrieval time. To be sure, I tested it again using the curl command
> > today
> > > and it confirms my previous observation.
> > >
> > > Also, enableLazyFieldLoading setting is set to true in my solrconfig.
> > >
> > > Another general observation (off topic) is that having a moderately large
> > > multi-valued text field (~200 entries) in the index seems to slow down the
> > > search significantly. I removed the 2 multi-valued text fields from my
> > > index and my search got ~10 times faster. :)
> > >
> > > - Raghu
> > >
> > >
> > > On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:
> > >
> > > >
> > > > : I think I solved the problem of retrieving 300 docs per request for
> > now.
> > > > The
> > > > : problem was that I was storing 2 moderately large multivalued text
> > fields
> > > > : though I was not retrieving them during search time.  I reindexed all
> > my
> > > > : data without storing these fields. Now the response time (time for
> > Solr
> > > > to
> > > > : return the http response) is very close to the QTime Solr is showing
> > in
> > > > the
> > > >
> > > > Hmmm....
> > > >
> > > > two comments:
> > > >
> > > > 1) the example URL from your previous mail...
> > > >
> > > > : >
> > > >
> > >
> > 
> http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
> > > >
> > > > ...doesn't match your earlier statement that you are only returning the
> > > > id field (there is no "fl" param in that URL) ... are you certain you
> > > > weren't returning those large stored fields in the response?
> > > >
> > > > 2) assuming you were actually using an fl param to limit the fields,
> > > > make sure you have this setting in your solrconfig.xml...
> > > >
> > > >    <enableLazyFieldLoading>true</enableLazyFieldLoading>
> > > >
> > > > ...that should make it pretty fast to return only a few fields of each
> > > > document, even if you do have some jumbo stored fields that aren't being
> > > > returned.
> > > >
> > > >
> > > >
> > > > -Hoss
> > > >
> > > >
> >
> >


Re: Retrieving large num of docs

Posted by Raghuveer Kancherla <ra...@aplopio.com>.
Hi Otis,
I think my experiments are not conclusive about reduction in search time. I
was playing around with various configurations to reduce the time to
retrieve documents from Solr. What I am sure of is that after changing the two
multi-valued text fields from stored to un-stored, retrieval time (query time +
time to load the stored fields) became very fast. I was expecting the
lazyFieldLoading setting in solrconfig to take care of this, but apparently
it is not working as expected.

Out of curiosity, I removed these 2 fields from the index (this time I am
not even indexing them) and my search time got better (10 times better).
However, I am still trying to isolate the reason for the search time
reduction. It may be because there are 2 fewer fields to search in, because of
the reduction in the size of the index, or maybe something else. I am not
sure if lazyFieldLoading has any part in explaining this.

- Raghu



On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com
> wrote:

> Hm, hm, interesting.  I was looking into something like this the other day
> (BIG indexed+stored text fields).  After seeing enableLazyFieldLoading=true
> in solrconfig and after seeing "fl" didn't include those big fields, I
> though "hm, so Lucene/Solr will not be pulling those large fields from disk,
> OK".
>
> You are saying that this may not be true based on your experiment?
> And what I'm calling your "experiment" means that you reindexed the same
> data, but without the 2 multi-valued text fields ... and that was the only
> change you made, and you got roughly a 10x search performance improvement?
>
> Sorry for repeating your words, just trying to confirm and understand.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> ----- Original Message ----
> > From: Raghuveer Kancherla <ra...@aplopio.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thu, December 3, 2009 8:43:16 AM
> > Subject: Re: Retrieving large num of docs
> >
> > Hi Hoss,
> >
> > I was experimenting with various queries to solve this problem and in one
> > such test I remember that requesting only the ID did not change the
> > retrieval time. To be sure, I tested it again using the curl command
> today
> > and it confirms my previous observation.
> >
> > Also, enableLazyFieldLoading setting is set to true in my solrconfig.
> >
> > Another general observation (off topic) is that having a moderately large
> > multi-valued text field (~200 entries) in the index seems to slow down the
> > search significantly. I removed the 2 multi-valued text fields from my
> > index and my search got ~10 times faster. :)
> >
> > - Raghu
> >
> >
> > On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:
> >
> > >
> > > : I think I solved the problem of retrieving 300 docs per request for
> now.
> > > The
> > > : problem was that I was storing 2 moderately large multivalued text
> fields
> > > : though I was not retrieving them during search time.  I reindexed all
> my
> > > : data without storing these fields. Now the response time (time for
> Solr
> > > to
> > > : return the http response) is very close to the QTime Solr is showing
> in
> > > the
> > >
> > > Hmmm....
> > >
> > > two comments:
> > >
> > > 1) the example URL from your previous mail...
> > >
> > > : >
> > >
> >
> http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
> > >
> > > ...doesn't match your earlier statement that you are only returning the
> > > id field (there is no "fl" param in that URL) ... are you certain you
> > > weren't returning those large stored fields in the response?
> > >
> > > 2) assuming you were actually using an fl param to limit the fields,
> > > make sure you have this setting in your solrconfig.xml...
> > >
> > >    <enableLazyFieldLoading>true</enableLazyFieldLoading>
> > >
> > > ...that should make it pretty fast to return only a few fields of each
> > > document, even if you do have some jumbo stored fields that aren't being
> > > returned.
> > >
> > >
> > >
> > > -Hoss
> > >
> > >
>
>

Re: Retrieving large num of docs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hm, hm, interesting.  I was looking into something like this the other day (BIG indexed+stored text fields).  After seeing enableLazyFieldLoading=true in solrconfig and after seeing "fl" didn't include those big fields, I thought "hm, so Lucene/Solr will not be pulling those large fields from disk, OK".

You are saying that this may not be true based on your experiment?
And what I'm calling your "experiment" means that you reindexed the same data, but without the 2 multi-valued text fields ... and that was the only change you made, and you got roughly a 10x search performance improvement?

Sorry for repeating your words, just trying to confirm and understand.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Raghuveer Kancherla <ra...@aplopio.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, December 3, 2009 8:43:16 AM
> Subject: Re: Retrieving large num of docs
> 
> Hi Hoss,
> 
> I was experimenting with various queries to solve this problem and in one
> such test I remember that requesting only the ID did not change the
> retrieval time. To be sure, I tested it again using the curl command today
> and it confirms my previous observation.
> 
> Also, enableLazyFieldLoading setting is set to true in my solrconfig.
> 
> Another general observation (off topic) is that having a moderately large
> multi-valued text field (~200 entries) in the index seems to slow down the
> search significantly. I removed the 2 multi-valued text fields from my index
> and my search got ~10 times faster. :)
> 
> - Raghu
> 
> 
> On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:
> 
> >
> > : I think I solved the problem of retrieving 300 docs per request for now.
> > The
> > : problem was that I was storing 2 moderately large multivalued text fields
> > : though I was not retrieving them during search time.  I reindexed all my
> > : data without storing these fields. Now the response time (time for Solr
> > to
> > : return the http response) is very close to the QTime Solr is showing in
> > the
> >
> > Hmmm....
> >
> > two comments:
> >
> > 1) the example URL from your previous mail...
> >
> > : >
> > 
> http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
> >
> > ...doesn't match your earlier statement that you are only returning the id
> > field (there is no "fl" param in that URL) ... are you certain you weren't
> > returning those large stored fields in the response?
> >
> > 2) assuming you were actually using an fl param to limit the fields, make
> > sure you have this setting in your solrconfig.xml...
> >
> >    <enableLazyFieldLoading>true</enableLazyFieldLoading>
> >
> > ...that should make it pretty fast to return only a few fields of each
> > document, even if you do have some jumbo stored fields that aren't being
> > returned.
> >
> >
> >
> > -Hoss
> >
> >


Re: Retrieving large num of docs

Posted by Raghuveer Kancherla <ra...@aplopio.com>.
Hi Hoss,

I was experimenting with various queries to solve this problem and in one
such test I remember that requesting only the ID did not change the
retrieval time. To be sure, I tested it again using the curl command today
and it confirms my previous observation.

Also, enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large
multi-valued text field (~200 entries) in the index seems to slow down the
search significantly. I removed the 2 multi-valued text fields from my index
and my search got ~10 times faster. :)

- Raghu


On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter <ho...@fucit.org>wrote:

>
> : I think I solved the problem of retrieving 300 docs per request for now.
> The
> : problem was that I was storing 2 moderately large multivalued text fields
> : though I was not retrieving them during search time.  I reindexed all my
> : data without storing these fields. Now the response time (time for Solr
> to
> : return the http response) is very close to the QTime Solr is showing in
> the
>
> Hmmm....
>
> two comments:
>
> 1) the example URL from your previous mail...
>
> : >
> http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
>
> ...doesn't match your earlier statement that you are only returning the id
> field (there is no "fl" param in that URL) ... are you certain you weren't
> returning those large stored fields in the response?
>
> 2) assuming you were actually using an fl param to limit the fields, make
> sure you have this setting in your solrconfig.xml...
>
>    <enableLazyFieldLoading>true</enableLazyFieldLoading>
>
> ...that should make it pretty fast to return only a few fields of each
> document, even if you do have some jumbo stored fields that aren't being
> returned.
>
>
>
> -Hoss
>
>

Re: Retrieving large num of docs

Posted by Chris Hostetter <ho...@fucit.org>.
: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time.  I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm....

two comments:

1) the example URL from your previous mail...

: > http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id
field (there is no "fl" param in that URL) ... are you certain you weren't
returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make 
sure you have this setting in your solrconfig.xml...

    <enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each
document, even if you do have some jumbo stored fields that aren't being
returned.
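
For example, limiting that same request to just the unique key field
(assuming it is named "id") only requires appending an fl parameter:

http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python&fl=id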



-Hoss


Re: Retrieving large num of docs

Posted by Raghuveer Kancherla <ra...@aplopio.com>.
Hi Hoss/Andrew,
I think I solved the problem of retrieving 300 docs per request for now. The
problem was that I was storing 2 moderately large multi-valued text fields
even though I was not retrieving them at search time. I reindexed all my
data without storing these fields. Now the response time (time for Solr to
return the HTTP response) is very close to the QTime Solr is showing in the
logs.
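
For reference, the schema.xml change amounts to flipping stored to false on
those fields and reindexing; the field name and type below are only
placeholders:

<field name="ResumeSkills" type="text" indexed="true" stored="false"
       multiValued="true"/>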

Thanks for all the help,
Raghu


On Mon, Nov 30, 2009 at 11:37 AM, Raghuveer Kancherla <
raghuveer.kancherla@aplopio.com> wrote:

> Thanks Hoss,
> In my previous mail, I was measuring the system time difference between
> sending a (http) request and receiving a response. This was being run on a
> (different) client machine
>
> Like you suggested, I tried to time the response on the server itself as
> follows:
>
> $ /usr/bin/time -p curl -sS -o solr.out "
> http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
> "
> real 3.49
>
> user 0.00
> sys 0.00
>
> The query time in solr log shows me Qtime=600
> size of solr.out is 843 kB.
>
> As you've mentioned, Solr shouldn't give these kinds of numbers for 300
> docs, and we're quite perplexed as to what's going on.
>
> Thanks,
> Raghu
>
>
>
>
> On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter <hossman_lucene@fucit.org
> > wrote:
>
>>
>> : I am using Solr1.4 for searching through half a million documents. The
>> : problem is, I want to retrieve nearly 200 documents for each search
>> query.
>> : The query time in Solr logs is showing 0.02 seconds and I am fairly
>> happy
>> : with that. However Solr is taking a long time (4 to 5 secs) to return
>> the
>> : results (I think it is because of the number of docs I am requesting). I
>> : tried returning only the id's (unique key) without any other stored
>> fields,
>> : but it is not helping me improve the response times (time to return the
>> id's
>> : of matching documents).
>>
>> What exactly does your request URL look like, and how exactly are you
>> timing the total response time?
>>
>> 200 isn't a very big number for the rows param -- people who want to get
>> 100K documents back in their response at a time may have problems, but 200
>> is not that big.
>>
>> so like i said: how exactly are you timing things?
>>
>> My guess: it's more likely that network overhead or the performance of
>> your client code (reading the data off the wire) is causing your timing
>> code to seem slow, than it is that Solr is taking 5 seconds to write out
>> those document IDs.
>>
>> I suspect if you try hitting the same exact URL using curl via localhost,
>> you'll see the total response time be a lot less than 5 seconds.
>>
>> Here's an example of a query that asks solr to return *every* field from
>> 500 documents, in the XML format.  And these are not small documents...
>>
>> $ /usr/bin/time -p curl -sS -o /tmp/solr.out "
>> http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on
>> "
>> real 0.07
>> user 0.00
>> sys 0.00
>> [chrish@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
>> 1.6M    /tmp/solr.out
>>
>> ...that's 1.6 MB of 500 Solr documents with all of their fields in
>> verbose XML format (including indenting) fetched in 70ms.
>>
>> If it's taking 5 seconds for you to get just the ids of 200 docs, you've
>> got a problem somewhere and i'm 99% certain it's not in Solr.
>>
>> what does a similar "time curl" command for your URL look like when you
>> run it on your solr server?
>>
>>
>> -Hoss
>>
>>
>

Re: Retrieving large num of docs

Posted by Raghuveer Kancherla <ra...@aplopio.com>.
Thanks Hoss,
In my previous mail, I was measuring the system time difference between
sending an (HTTP) request and receiving a response. This was being run on a
(different) client machine.

Like you suggested, I tried to time the response on the server itself as
follows:

$ /usr/bin/time -p curl -sS -o solr.out "
http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
"
real 3.49
user 0.00
sys 0.00

The query time in solr log shows me Qtime=600
size of solr.out is 843 kB.

As you've mentioned, Solr shouldn't give these kinds of numbers for 300 docs,
and we're quite perplexed as to what's going on.

Thanks,
Raghu



On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter
<ho...@fucit.org>wrote:

>
> : I am using Solr1.4 for searching through half a million documents. The
> : problem is, I want to retrieve nearly 200 documents for each search
> query.
> : The query time in Solr logs is showing 0.02 seconds and I am fairly happy
> : with that. However Solr is taking a long time (4 to 5 secs) to return the
> : results (I think it is because of the number of docs I am requesting). I
> : tried returning only the id's (unique key) without any other stored
> fields,
> : but it is not helping me improve the response times (time to return the
> id's
> : of matching documents).
>
> What exactly does your request URL look like, and how exactly are you
> timing the total response time?
>
> 200 isn't a very big number for the rows param -- people who want to get
> 100K documents back in their response at a time may have problems, but 200
> is not that big.
>
> so like i said: how exactly are you timing things?
>
> My guess: it's more likely that network overhead or the performance of
> your client code (reading the data off the wire) is causing your timing
> code to seem slow, than it is that Solr is taking 5 seconds to write out
> those document IDs.
>
> I suspect if you try hitting the same exact URL using curl via localhost,
> you'll see the total response time be a lot less than 5 seconds.
>
> Here's an example of a query that asks solr to return *every* field from
> 500 documents, in the XML format.  And these are not small documents...
>
> $ /usr/bin/time -p curl -sS -o /tmp/solr.out "
> http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on
> "
> real 0.07
> user 0.00
> sys 0.00
> [chrish@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
> 1.6M    /tmp/solr.out
>
> ...that's 1.6 MB of 500 Solr documents with all of their fields in
> verbose XML format (including indenting) fetched in 70ms.
>
> If it's taking 5 seconds for you to get just the ids of 200 docs, you've
> got a problem somewhere and i'm 99% certain it's not in Solr.
>
> what does a similar "time curl" command for your URL look like when you
> run it on your solr server?
>
>
> -Hoss
>
>

Re: Retrieving large num of docs

Posted by Chris Hostetter <ho...@fucit.org>.
: I am using Solr1.4 for searching through half a million documents. The
: problem is, I want to retrieve nearly 200 documents for each search query.
: The query time in Solr logs is showing 0.02 seconds and I am fairly happy
: with that. However Solr is taking a long time (4 to 5 secs) to return the
: results (I think it is because of the number of docs I am requesting). I
: tried returning only the id's (unique key) without any other stored fields,
: but it is not helping me improve the response times (time to return the id's
: of matching documents).

What exactly does your request URL look like, and how exactly are you 
timing the total response time?

200 isn't a very big number for the rows param -- people who want to get 
100K documents back in their response at a time may have problems, but 200 
is not that big.

so like i said: how exactly are you timing things?

My guess: it's more likely that network overhead or the performance of 
your client code (reading the data off the wire) is causing your timing 
code to seem slow, than it is that Solr is taking 5 seconds to write out 
those document IDs.

I suspect if you try hitting the same exact URL using curl via localhost, 
you'll see the total response time be a lot less than 5 seconds.

Here's an example of a query that asks solr to return *every* field from 
500 documents, in the XML format.  And these are not small documents...

$ /usr/bin/time -p curl -sS -o /tmp/solr.out "http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on"
real 0.07
user 0.00
sys 0.00
[chrish@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out 
1.6M    /tmp/solr.out

...that's 1.6 MB of 500 Solr documents with all of their fields in 
verbose XML format (including indenting) fetched in 70ms.

If it's taking 5 seconds for you to get just the ids of 200 docs, you've 
got a problem somewhere and i'm 99% certain it's not in Solr.

what does a similar "time curl" command for your URL look like when you 
run it on your solr server?


-Hoss