Posted to solr-user@lucene.apache.org by Peter Karich <pe...@yahoo.de> on 2010/08/10 15:54:04 UTC

Improve Query Time For Large Index

Hi,

I have 5 million small documents (tweets), about 3 GB of index, and the slave
replicates from the master every 10-15 minutes, so the index is
optimized before querying. We are using Solr 1.4.1 (patched with
SOLR-1624) via SolrJ.

Search is slow (>2 s) for common terms, which hit more than 2 million
docs, and acceptable (<0.5 s) for other terms. For those numbers I don't
use highlighting or facets. I am using the schema below [1], and from the
Luke handler I know that numTerms is roughly 20 million. Queries for
common terms stay slow no matter how often I retry them (no cache
improvements).

How can I improve the query time for the common terms without using
Distributed Search [2]?

Regards,
Peter.


[1]
<field name="id" type="tlong" indexed="true" stored="true"
required="true" />
<field name="date" type="tdate" indexed="true" stored="true" />
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true"
               termVectors="true" termPositions="true" termOffsets="true"/>

[2]
http://wiki.apache.org/solr/DistributedSearch
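[Editor's sketch] A side note on why document frequency matters here: a one-term query has to score every document in the term's postings list, so the work grows roughly linearly with the number of matching docs. A toy illustration with made-up numbers (plain Python, nothing Solr-specific):

```python
# Made-up document frequencies, roughly matching the numbers in this thread.
postings = {
    "http": range(2_000_000),   # common term: ~2 million matching docs
    "karussell": range(500),    # rare term: 500 matching docs
}

def docs_scored(term):
    """A one-term query must visit (and score) every posting of the term."""
    return sum(1 for _ in postings[term])

ratio = docs_scored("http") // docs_scored("karussell")
print(ratio)  # 4000 -> the common term is ~4000x more work per query
```

This is why the thread's later suggestions (caching, CommonGrams, avoiding accidental phrase queries) all target the high-frequency terms specifically.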


Re: Improve Query Time For Large Index

Posted by Peter Karich <pe...@yahoo.de>.
Hi Tom!

> Hi Peter,
>
> Can you give a few more examples of slow queries?  
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
>   

I am experimenting with one word queries only at the moment.

> If one-word queries are your slow queries, then CommonGrams won't help.  CommonGrams will only help with phrase queries.
>   

Hmm, OK.

> How are you using termvectors? 
Yes, they are enabled (see the termVectors attributes in my schema).

> That may be slowing things down.  I don't have experience with termvectors, so someone else on the list might speak to that.
>   

OK. But for highlighting I'll need them, since they speed it up (a lot).


> When you say the query time for common terms stays slow, do you mean if you re-issue the exact query, the second query is not faster?  That seems very strange. 

Yes, indeed. The queryResultCache gets no hits at all. Strange.

>  You might restart Solr and send a first query (the first query always takes a relatively long time). Then pick one of your slow queries and send it twice. The second time you send the query, it should be much faster due to the Solr caches, and you should be able to see the cache hit in the Solr admin panel. If you send the exact query a second time (without enough intervening queries to evict data from the cache), the Solr queryResultCache should get hit and you should see a response time in the 0.01-5 millisecond range.
>   

That's not the case. The second query is only a few milliseconds faster
(it stays >2 s), and I'm not sure what I am doing wrong. The other three
caches have a good hit ratio, but the queryResultCache stays at 0. For the
queryResultCache I am using:
<queryResultCache class="solr.LRUCache" size="400" initialSize="400"
autowarmCount="400"/>

Even doubling that size didn't make the hit ratio > 0.

> How much memory is on the machine?  If your bottleneck is disk i/o for frequent terms, then you want to make sure you have enough memory for the OS disk cache.  
>   

Yes, there should be enough memory for the OS disk cache.

> I assume that http is not in your stopwords.

exactly.


> CommonGrams will only help with phrase queries. CommonGrams was committed and is in Solr 1.4.  If you decide to use CommonGrams you definitely need to re-index and you also need to use both the index time filter and the query time filter.  Your index will be larger.
>
> <fieldType name="foo" ...>
> <analyzer type="index">
> <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
> </analyzer>
>
> <analyzer type="query">
> <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
> </analyzer>
> </fieldType>
>   

Thanks, I will try that once I have solved the current issue :-)
And thanks for all your answers; I will now experiment with my setup
in more detail ...

Kind regards,
Peter.
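[Editor's sketch] For readers puzzling over the hit ratio of 0: the expected behavior of an LRU query-result cache can be sketched in a few lines (plain Python, only mimicking the idea behind solr.LRUCache, not its implementation). A repeated identical key must hit unless intervening entries evicted it:

```python
from collections import OrderedDict

class ToyLRUCache:
    """Minimal LRU cache mimicking the idea behind solr.LRUCache."""
    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.size:
            self.data.popitem(last=False)  # evict least recently used

cache = ToyLRUCache(size=2)
cache.put("q=http", "results-for-http")
cache.get("q=http")        # hit: identical repeated key
cache.put("q=foo", "r1")
cache.put("q=bar", "r2")   # capacity 2, so this evicts "q=http"
cache.get("q=http")        # miss after eviction
print(cache.hits, cache.lookups)  # 1 2
```

So a hit ratio stuck at exactly 0 usually means the cache is being invalidated or bypassed rather than being too small — for example, the 10-15 minute replication in this setup opens a new searcher, which empties the query-result cache each time.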





Re: Improve Query Time For Large Index

Posted by Robert Muir <rc...@gmail.com>.
exactly!

On Thu, Aug 12, 2010 at 5:26 AM, Peter Karich <pe...@yahoo.de> wrote:

> Hi Robert!
>
> > Since the example given was "http" being slow, it's worth mentioning
> > that if queries are "one word" URLs [for example
> > http://lucene.apache.org] these will actually form slow phrase queries
> > by default.
> >
>
> do you mean that http://lucene.apache.org will be split up into "http
> lucene apache org" and Solr will perform a phrase query?
>
> Regards,
> Peter.
>



-- 
Robert Muir
rcmuir@gmail.com

Re: Improve Query Time For Large Index

Posted by Peter Karich <pe...@yahoo.de>.
Hi Robert!

>  Since the example given was "http" being slow, it's worth mentioning that if
> queries are "one word" URLs [for example http://lucene.apache.org] these
> will actually form slow phrase queries by default.
>   

do you mean that http://lucene.apache.org will be split up into "http
lucene apache org" and Solr will perform a phrase query?

Regards,
Peter.

Re: Improve Query Time For Large Index

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Aug 11, 2010 at 11:47 AM, Burton-West, Tom <tb...@umich.edu>wrote:

> Hi Peter,
>
> Can you give a few more examples of slow queries?
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
> If one word queries are your slow queries, than CommonGrams won't help.
>  CommonGrams will only help with phrase queries.
>

Since the example given was "http" being slow, it's worth mentioning that
if queries are "one word" URLs [for example http://lucene.apache.org],
these will actually form slow phrase queries by default.

Because your content is very tiny documents, it's probably good to disable
this: the phrases likely won't help the results at all, but they make
things unbearably slow. In Solr 3.x and trunk, you can disable these
automatic phrase queries in schema.xml with autoGeneratePhraseQueries="false":

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="false">

Then the system won't form phrase queries unless the user explicitly puts
double quotes around it.
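[Editor's sketch] Robert's point can be illustrated with a crude approximation of what a word-splitting analyzer does to a one-word URL (plain Python; the real Lucene tokenizer is more sophisticated, but the splitting behavior is the same in spirit):

```python
import re

def rough_tokenize(text):
    """Crude approximation of an analyzer breaking a URL into word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())

tokens = rough_tokenize("http://lucene.apache.org")
print(tokens)  # ['http', 'lucene', 'apache', 'org']

# With automatic phrase-query generation (the pre-3.x default), a multi-token
# result from a single query "word" becomes the phrase query
# "http lucene apache org" -- which needs position data and is slow when the
# first token ("http") is one of the most common terms in the index.
```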

-- 
Robert Muir
rcmuir@gmail.com

RE: Improve Query Time For Large Index

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Peter,

If you aren't getting any queryResultCache hits even when the exact query is repeated, something is very wrong.  I'd suggest first getting the query result cache working, and then moving on to look at other possible bottlenecks.

What are your settings for queryResultWindowSize and queryResultMaxDocsCached?

Following up on Robert's point, you might also try to run a few queries in the admin interface with the debug flag on to see if the query parser is creating phrase queries (assuming you have queries like http://foo.bar.baz).  The debug/explain will indicate whether the parsed query is a PhraseQuery.

Tom



-----Original Message-----
From: Peter Karich [mailto:peathal@yahoo.de] 
Sent: Thursday, August 12, 2010 5:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Improve Query Time For Large Index

Hi Tom,

I tried again with:
  <queryResultCache class="solr.LRUCache" size="10000" initialSize="10000"
        autowarmCount="10000"/>

and even now the hit ratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.



Re: Improve Query Time For Large Index

Posted by Peter Karich <pe...@yahoo.de>.
Hi Tom,

I tried again with:
  <queryResultCache class="solr.LRUCache" size="10000" initialSize="10000"
        autowarmCount="10000"/>

and even now the hit ratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.
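[Editor's sketch] As a rule-of-thumb check on the memory split discussed in this thread: the common guideline is to leave enough RAM outside the JVM heap for the OS to cache the whole index. The machine sizes below are hypothetical; only the 2 GB heap and 3 GB index come from Peter's numbers:

```python
def os_cache_headroom(total_ram_gb, jvm_heap_gb, index_gb):
    """RAM left for the OS disk cache after the heap, minus the index size.

    A negative result means the index cannot stay fully cached by the OS,
    so frequent-term queries will keep hitting the disk.
    """
    left_for_os = total_ram_gb - jvm_heap_gb
    return left_for_os - index_gb

# e.g. a hypothetical 8 GB box with a 2 GB heap and a 3 GB index:
print(os_cache_headroom(8, 2, 3))  # 3 -> the whole index fits in the OS cache
# on a hypothetical 4 GB box the index no longer fits:
print(os_cache_headroom(4, 2, 3))  # -1
```

Note that even with plenty of headroom, the 10-15 minute optimize/replicate cycle rewrites the index files, so the OS cache has to be re-warmed after every sync.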



-- 
http://karussell.wordpress.com/


RE: Improve Query Time For Large Index

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Peter,

Can you give a few more examples of slow queries?  
Are they phrase queries? Boolean queries? prefix or wildcard queries?
If one-word queries are your slow queries, then CommonGrams won't help.  CommonGrams will only help with phrase queries.

How are you using termvectors?  That may be slowing things down.  I don't have experience with termvectors, so someone else on the list might speak to that.

When you say the query time for common terms stays slow, do you mean that if you re-issue the exact query, the second query is not faster?  That seems very strange.  You might restart Solr and send a first query (the first query always takes a relatively long time).  Then pick one of your slow queries and send it twice.  The second time you send the query, it should be much faster due to the Solr caches, and you should be able to see the cache hit in the Solr admin panel.  If you send the exact query a second time (without enough intervening queries to evict data from the cache), the Solr queryResultCache should get hit and you should see a response time in the 0.01-5 millisecond range.

What settings are you using for your Solr caches?

How much memory is on the machine?  If your bottleneck is disk i/o for frequent terms, then you want to make sure you have enough memory for the OS disk cache.  

I assume that http is not in your stopwords.  CommonGrams will only help with phrase queries.
CommonGrams was committed and is in Solr 1.4.  If you decide to use CommonGrams you definitely need to re-index, and you also need to use both the index-time filter and the query-time filter.  Your index will be larger.

<fieldType name="foo" ...>
<analyzer type="index">
<filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
</analyzer>

<analyzer type="query">
<filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
</analyzer>
</fieldType>



Tom
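[Editor's sketch] What CommonGrams does can be approximated in a few lines of plain Python. The real CommonGramsFilter works on Lucene token streams with position attributes, and the query-time variant emits only the grams for phrases, but the core idea is gluing common words to their neighbors so a phrase query can match one rare combined term instead of walking the huge postings list of the common word:

```python
COMMON = {"the", "of", "http"}  # in Solr, this would come from the words file

def common_grams(tokens):
    """Index-time view: original tokens plus bigrams anchored on common words."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in COMMON or tokens[i + 1] in COMMON):
            out.append(tok + "_" + tokens[i + 1])  # the "common gram"
    return out

print(common_grams(["the", "cat", "sat"]))
# ['the', 'the_cat', 'cat', 'sat']
```

A phrase query for "the cat" can then be answered from the postings of the single term "the_cat", whose document frequency is far lower than that of "the". This is also why both analyzers must agree and why a full re-index is required.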


Re: Improve Query Time For Large Index

Posted by Peter Karich <pe...@yahoo.de>.
Hi Tom,

my index is around 3 GB and I am using 2 GB RAM for the JVM, although
some more is available.
If I look at the RAM usage while a slow query runs (via jvisualvm),
I see that only 750 MB of the JVM heap is used.

> Can you give us some examples of the slow queries?

For example, the empty query solr/select?q= takes very long,
as does solr/select?q=http, where 'http' is the most common term.

> Are you using stop words?  

Yes, a lot. I put them into stopwords.txt.

> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

This looks interesting. I read through
https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
I only need to enable it via:

<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

Right? Do I need to re-index?

Regards,
Peter.



-- 
http://karussell.wordpress.com/


RE: Improve Query Time For Large Index

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Peter,

A few more details about your setup would help list members answer your questions.
How large is your index?  
How much memory is on the machine, and how much is allocated to the JVM?
Besides the Solr caches, Solr and Lucene depend on the operating system's disk cache for caching postings lists, so you need to leave some memory for the OS.  On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files.

Can you give us some examples of the slow queries?  Are you using stop words?  

If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or trying CommonGrams and adding them to the common-words list.  (Details on CommonGrams here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)

Tom Burton-West
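[Editor's sketch] One way to pick the "most frequent terms" for such a list is to pull document frequencies from the index (the Luke handler or the terms component can report top terms) and keep the top N. The frequency data below is a made-up stand-in for what you would actually fetch:

```python
from collections import Counter

# Stand-in for df values you would fetch from the Luke handler / TermsComponent.
term_df = Counter({
    "http": 2_100_000,
    "the": 1_800_000,
    "rt": 900_000,
    "solr": 40_000,
    "lucene": 25_000,
})

def top_common_words(df, n):
    """The n highest-df terms, one per line, ready for a common-words file."""
    return [term for term, _ in df.most_common(n)]

print("\n".join(top_common_words(term_df, 3)))
# http
# the
# rt
```

Whether a term belongs in stopwords (dropped entirely) or in the CommonGrams list (kept, but phrase-optimized) depends on whether users still need to search for it on its own.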
