Posted to solr-user@lucene.apache.org by Chantal Ackermann <ch...@btelligent.de> on 2009/07/31 14:04:03 UTC

mergeFactor / indexing speed

Dear all,

I want to find out which settings give the best full index performance 
for my setup.
Therefore, I have been running a small index (less than 20k documents) 
with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken ">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken ">0:11:44.441</str>
Tomcat restart
<str name="Time taken ">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it 
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old ATA 
disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The 
solrconfig.xml that is displayed in the admin application is the 
up-to-date view on the file system. I tested that. But it's not 
necessarily what the current Solr core is using, is it?
Is there a way to check on the mergeFactor actually in use (while the 
index is running)?
2. I changed the mergeFactor in both available settings (default and 
main index) in the solrconfig.xml file of the core I am reindexing - see 
the config sketch below. Is that the correct place? Should a change in 
performance be noticeable when increasing from 10 to 100? Or is the change 
not perceivable if the requests for data take far longer than the 
indexing itself?
3. Do I have to increase rumBufferSizeMB if I increase mergeFactor? (Or 
some other setting?)
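
For reference, the two sections I mean look roughly like this in my 
solrconfig.xml (a sketch with illustrative values; as far as I understand, 
the <mainIndex> values override <indexDefaults>):

<indexDefaults>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>

<mainIndex>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</mainIndex>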

(I am still trying to get profiling information on how much application 
time is eaten up by db connection/requests/processing.
The root entity query takes about 20ms on average. The child entity query 
takes less than 10ms.
I have my custom entity processor running on the child entity that 
populates the map using a multi-row result set. I have also attached one 
regex and one script transformer.)

Thank you for any tips!
Chantal



-- 
Chantal Ackermann

Re: 99.9% uptime requirement

Posted by Walter Underwood <wu...@wunderwood.org>.
For 99.9%, run three copies behind a load balancer. That allows you to  
take one down for upgrade, and still be fault-tolerant.

wunder

On Aug 3, 2009, at 10:46 AM, Robert Petersen wrote:

> So then would the 'right' thing to do be to run it under something  
> like
> Daemontools so it bounces back up on a crash?  Do any other people use
> this approach or is there something better to make it come back up?
>
> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache  
> sizes?
> Are these settings reasonable?  With a small index I have been getting
> some great hit-rates.
> <ramBufferSizeMB>1024</ramBufferSizeMB>
>
> <filterCache      class="solr.FastLRUCache"      size="350000"
> initialSize="512"      autowarmCount="80"/>
> <queryResultCache class="solr.LRUCache"      size="512000000"
> initialSize="512"      autowarmCount="80"/>
> <documentCache    class="solr.FastLRUCache"      size="512000"
> initialSize="512"      autowarmCount="0"/>
>
> Thanks
> Robi
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement
>
> Robi,
>
> Solr is indeed very stable.  However, it can crash and I've seen it
> crash.  Or rather, I should say I've seen the JVM that runs Solr  
> crash.
> For instance, if you have a servlet container with a number of  
> webapps,
> one of which is Solr, and one of which has a memory leak, I believe  
> all
> webapps will suffer and "crash".  And even if you have just Solr in  
> your
> servlet container, it can OOM, say if you specify overly large  
> caches or
> too frequent commits, etc.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Robert Petersen <ro...@buy.com>
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 12:18:55 PM
>> Subject: 99.9% uptime requirement
>>
>> Hi all,
>>
>> My solr project powers almost all the pages in our site and so needs
> to
>> be up period.  My question is what can I do to ensure that happens?
>> Does solr ever crash, assuming reasonable load conditions and no
> extreme
>> index sizes?
>>
>> I saw some comments about running solr under daemontools in order to
> get
>> an auto-restart on crashes.  From what I have seen so far in my
> limited
>> experience, solr is very stable and never crashes (so far).  Does
> anyone
>> else have this requirement and if so how do they deal with it?  Is
>> anyone else running solr under daemontools in a production site?
>>
>> Thanks for any input you might have,
>> Robi
>


Re: 99.9% uptime requirement

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Thu, Aug 6, 2009 at 4:10 AM, Robert Petersen <ro...@buy.com> wrote:

> Maintenance Questions:  In a two slave one master setup where the two
> slaves are behind load balancers what happens if I have to restart solr?
> If I have to restart solr say for a schema update where I have added a
> new field then what is the recommended procedure?
>
> If I can guarantee no commits or optimizes happen on the master during
> the schema update so no new snapshots become available then can I safely
> leave rsyncd enabled?  When I stop and start a slave server, should I
> first pull it out of the load balancers list or will solr gracefully
> release connections as it shuts down so no searches are lost?
>

We pull slaves out of the load balancer, wait for 15-20 seconds and then
stop the tomcat process.


>
> What do you guys do to push out updates?
>

Disable the cron job on all slaves (which calls snappuller).
Update schema on master and re-index.
For each slave: Take it out of rotation, stop tomcat, update the schema,
start tomcat, call snappuller, start cron.

This is now a piece of cake with the Java-based replication in Solr 1.4,
which also supports replicating configuration without downtime.

-- 
Regards,
Shalin Shekhar Mangar.

Re: 99.9% uptime requirement

Posted by Walter Underwood <wu...@wunderwood.org>.
Design so that you can handle the load with one server down (N+1  
sizing), then take one server out for any maintenance. Simple and  
works fine.

wunder

On Aug 6, 2009, at 9:25 AM, Robert Petersen wrote:

> Here is another idea.  With solr multicore you can dynamically spin up
> extra cores and bring them online.  I'm not sure how well this would
> work for us since we have hard coded the names of the cores we are
> hitting in our config files.
>
> -----Original Message-----
> From: Brian Klippel [mailto:brian@theport.com]
> Sent: Thursday, August 06, 2009 8:38 AM
> To: solr-user@lucene.apache.org
> Subject: RE: 99.9% uptime requirement
>
> You could create a new "working" core, then call the swap command once
> it is ready.  Then remove the work core and delete the appropriate  
> index
> folder at your convenience.
>
>
> -----Original Message-----
> From: Robert Petersen [mailto:robertpe@buy.com]
> Sent: Wednesday, August 05, 2009 6:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: 99.9% uptime requirement
>
> Maintenance Questions:  In a two slave one master setup where the two
> slaves are behind load balancers what happens if I have to restart  
> solr?
> If I have to restart solr say for a schema update where I have added a
> new field then what is the recommended procedure?
>
> If I can guarantee no commits or optimizes happen on the master during
> the schema update so no new snapshots become available then can I  
> safely
> leave rsyncd enabled?  When I stop and start a slave server, should I
> first pull it out of the load balancers list or will solr gracefully
> release connections as it shuts down so no searches are lost?
>
> What do you guys do to push out updates?
>
> Thanks for any thoughts,
> Robi
>
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Tuesday, August 04, 2009 8:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement
>
> Right. You don't get to 99.9% by assuming that an 8 hour outage is OK.
> Design for continuous uptime, with plans for how long it takes to
> patch around a single point of failure. For example, if your load
> balancer is a single point of failure, make sure that you can redirect
> the front end servers to a single Solr server in much less than 8  
> hours.
>
> Also, think about your SLA. Can the search index be more than 8 hours
> stale? How quickly do you need to be able to replace a failed indexing
> server? You might be able to run indexing locally on each search
> server if they are lightly loaded.
>
> wunder
>
> On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote:
>
>> On Mon, 3 Aug 2009 13:15:44 -0700
>> "Robert Petersen" <ro...@buy.com> wrote:
>>
>>> Thanks all, I figured there would be more talk about daemontools if
>>> there
>>> were really a need.  I appreciate the input and for starters we'll
>>> put two
>>> slaves behind a load balancer and grow it from there.
>>>
>>
>> Robert,
>> not taking away from daemon tools, but daemon tools won't help you
>> if your
>> whole server goes down.
>>
>> don't put all your eggs in one basket - several
>> servers, load balancer (hardware load balancers x 2, haproxy, etc)
>>
>> and sure, use daemon tools to keep your services running within each
>> server...
>>
>> B
>> _________________________
>> {Beto|Norberto|Numard} Meijome
>>
>> "Why do you sit there looking like an envelope without any address
>> on it?"
>> Mark Twain
>>
>> I speak for myself, not my employer. Contents may be hot. Slippery
>> when wet.
>> Reading disclaimers makes you go blind. Writing them is worse. You
>> have been
>> Warned.
>>
>


RE: 99.9% uptime requirement

Posted by Robert Petersen <ro...@buy.com>.
Here is another idea.  With solr multicore you can dynamically spin up
extra cores and bring them online.  I'm not sure how well this would
work for us since we have hard coded the names of the cores we are
hitting in our config files.

-----Original Message-----
From: Brian Klippel [mailto:brian@theport.com] 
Sent: Thursday, August 06, 2009 8:38 AM
To: solr-user@lucene.apache.org
Subject: RE: 99.9% uptime requirement

You could create a new "working" core, then call the swap command once
it is ready.  Then remove the work core and delete the appropriate index
folder at your convenience.


-----Original Message-----
From: Robert Petersen [mailto:robertpe@buy.com] 
Sent: Wednesday, August 05, 2009 6:41 PM
To: solr-user@lucene.apache.org
Subject: RE: 99.9% uptime requirement

Maintenance Questions:  In a two slave one master setup where the two
slaves are behind load balancers what happens if I have to restart solr?
If I have to restart solr say for a schema update where I have added a
new field then what is the recommended procedure?

If I can guarantee no commits or optimizes happen on the master during
the schema update so no new snapshots become available then can I safely
leave rsyncd enabled?  When I stop and start a slave server, should I
first pull it out of the load balancers list or will solr gracefully
release connections as it shuts down so no searches are lost?

What do you guys do to push out updates?

Thanks for any thoughts,
Robi


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Tuesday, August 04, 2009 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Right. You don't get to 99.9% by assuming that an 8 hour outage is OK.  
Design for continuous uptime, with plans for how long it takes to  
patch around a single point of failure. For example, if your load  
balancer is a single point of failure, make sure that you can redirect  
the front end servers to a single Solr server in much less than 8 hours.

Also, think about your SLA. Can the search index be more than 8 hours  
stale? How quickly do you need to be able to replace a failed indexing  
server? You might be able to run indexing locally on each search  
server if they are lightly loaded.

wunder

On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote:

> On Mon, 3 Aug 2009 13:15:44 -0700
> "Robert Petersen" <ro...@buy.com> wrote:
>
>> Thanks all, I figured there would be more talk about daemontools if  
>> there
>> were really a need.  I appreciate the input and for starters we'll  
>> put two
>> slaves behind a load balancer and grow it from there.
>>
>
> Robert,
> not taking away from daemon tools, but daemon tools won't help you  
> if your
> whole server goes down.
>
> don't put all your eggs in one basket - several
> servers, load balancer (hardware load balancers x 2, haproxy, etc)
>
> and sure, use daemon tools to keep your services running within each  
> server...
>
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> "Why do you sit there looking like an envelope without any address  
> on it?"
>  Mark Twain
>
> I speak for myself, not my employer. Contents may be hot. Slippery  
> when wet.
> Reading disclaimers makes you go blind. Writing them is worse. You  
> have been
> Warned.
>


RE: 99.9% uptime requirement

Posted by Brian Klippel <br...@theport.com>.
You could create a new "working" core, then call the swap command once
it is ready.  Then remove the work core and delete the appropriate index
folder at your convenience.
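
For example, assuming the cores are named "live" and "working", the swap
itself is a single CoreAdmin call:

/admin/cores?action=SWAP&core=live&other=working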


-----Original Message-----
From: Robert Petersen [mailto:robertpe@buy.com] 
Sent: Wednesday, August 05, 2009 6:41 PM
To: solr-user@lucene.apache.org
Subject: RE: 99.9% uptime requirement

Maintenance Questions:  In a two slave one master setup where the two
slaves are behind load balancers what happens if I have to restart solr?
If I have to restart solr say for a schema update where I have added a
new field then what is the recommended procedure?

If I can guarantee no commits or optimizes happen on the master during
the schema update so no new snapshots become available then can I safely
leave rsyncd enabled?  When I stop and start a slave server, should I
first pull it out of the load balancers list or will solr gracefully
release connections as it shuts down so no searches are lost?

What do you guys do to push out updates?

Thanks for any thoughts,
Robi


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Tuesday, August 04, 2009 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Right. You don't get to 99.9% by assuming that an 8 hour outage is OK.  
Design for continuous uptime, with plans for how long it takes to  
patch around a single point of failure. For example, if your load  
balancer is a single point of failure, make sure that you can redirect  
the front end servers to a single Solr server in much less than 8 hours.

Also, think about your SLA. Can the search index be more than 8 hours  
stale? How quickly do you need to be able to replace a failed indexing  
server? You might be able to run indexing locally on each search  
server if they are lightly loaded.

wunder

On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote:

> On Mon, 3 Aug 2009 13:15:44 -0700
> "Robert Petersen" <ro...@buy.com> wrote:
>
>> Thanks all, I figured there would be more talk about daemontools if  
>> there
>> were really a need.  I appreciate the input and for starters we'll  
>> put two
>> slaves behind a load balancer and grow it from there.
>>
>
> Robert,
> not taking away from daemon tools, but daemon tools won't help you  
> if your
> whole server goes down.
>
> don't put all your eggs in one basket - several
> servers, load balancer (hardware load balancers x 2, haproxy, etc)
>
> and sure, use daemon tools to keep your services running within each  
> server...
>
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> "Why do you sit there looking like an envelope without any address  
> on it?"
>  Mark Twain
>
> I speak for myself, not my employer. Contents may be hot. Slippery  
> when wet.
> Reading disclaimers makes you go blind. Writing them is worse. You  
> have been
> Warned.
>


RE: 99.9% uptime requirement

Posted by Robert Petersen <ro...@buy.com>.
Maintenance Questions:  In a two slave one master setup where the two
slaves are behind load balancers what happens if I have to restart solr?
If I have to restart solr say for a schema update where I have added a
new field then what is the recommended procedure?

If I can guarantee no commits or optimizes happen on the master during
the schema update so no new snapshots become available then can I safely
leave rsyncd enabled?  When I stop and start a slave server, should I
first pull it out of the load balancers list or will solr gracefully
release connections as it shuts down so no searches are lost?

What do you guys do to push out updates?

Thanks for any thoughts,
Robi


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Tuesday, August 04, 2009 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Right. You don't get to 99.9% by assuming that an 8 hour outage is OK.  
Design for continuous uptime, with plans for how long it takes to  
patch around a single point of failure. For example, if your load  
balancer is a single point of failure, make sure that you can redirect  
the front end servers to a single Solr server in much less than 8 hours.

Also, think about your SLA. Can the search index be more than 8 hours  
stale? How quickly do you need to be able to replace a failed indexing  
server? You might be able to run indexing locally on each search  
server if they are lightly loaded.

wunder

On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote:

> On Mon, 3 Aug 2009 13:15:44 -0700
> "Robert Petersen" <ro...@buy.com> wrote:
>
>> Thanks all, I figured there would be more talk about daemontools if  
>> there
>> were really a need.  I appreciate the input and for starters we'll  
>> put two
>> slaves behind a load balancer and grow it from there.
>>
>
> Robert,
> not taking away from daemon tools, but daemon tools won't help you  
> if your
> whole server goes down.
>
> don't put all your eggs in one basket - several
> servers, load balancer (hardware load balancers x 2, haproxy, etc)
>
> and sure, use daemon tools to keep your services running within each  
> server...
>
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> "Why do you sit there looking like an envelope without any address  
> on it?"
>  Mark Twain
>
> I speak for myself, not my employer. Contents may be hot. Slippery  
> when wet.
> Reading disclaimers makes you go blind. Writing them is worse. You  
> have been
> Warned.
>


Re: 99.9% uptime requirement

Posted by Walter Underwood <wu...@wunderwood.org>.
Right. You don't get to 99.9% by assuming that an 8 hour outage is OK.  
Design for continuous uptime, with plans for how long it takes to  
patch around a single point of failure. For example, if your load  
balancer is a single point of failure, make sure that you can redirect  
the front end servers to a single Solr server in much less than 8 hours.

Also, think about your SLA. Can the search index be more than 8 hours  
stale? How quickly do you need to be able to replace a failed indexing  
server? You might be able to run indexing locally on each search  
server if they are lightly loaded.

wunder

On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote:

> On Mon, 3 Aug 2009 13:15:44 -0700
> "Robert Petersen" <ro...@buy.com> wrote:
>
>> Thanks all, I figured there would be more talk about daemontools if  
>> there
>> were really a need.  I appreciate the input and for starters we'll  
>> put two
>> slaves behind a load balancer and grow it from there.
>>
>
> Robert,
> not taking away from daemon tools, but daemon tools won't help you  
> if your
> whole server goes down.
>
> don't put all your eggs in one basket - several
> servers, load balancer (hardware load balancers x 2, haproxy, etc)
>
> and sure, use daemon tools to keep your services running within each  
> server...
>
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> "Why do you sit there looking like an envelope without any address  
> on it?"
>  Mark Twain
>
> I speak for myself, not my employer. Contents may be hot. Slippery  
> when wet.
> Reading disclaimers makes you go blind. Writing them is worse. You  
> have been
> Warned.
>


Re: 99.9% uptime requirement

Posted by Norberto Meijome <nu...@gmail.com>.
On Mon, 3 Aug 2009 13:15:44 -0700
"Robert Petersen" <ro...@buy.com> wrote:

> Thanks all, I figured there would be more talk about daemontools if there
> were really a need.  I appreciate the input and for starters we'll put two
> slaves behind a load balancer and grow it from there.
> 

Robert,
not taking away from daemon tools, but daemon tools won't help you if your
whole server goes down.

 don't put all your eggs in one basket - several
servers, load balancer (hardware load balancers x 2, haproxy, etc)

and sure, use daemon tools to keep your services running within each server...

B
_________________________
{Beto|Norberto|Numard} Meijome

"Why do you sit there looking like an envelope without any address on it?"
  Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

RE: 99.9% uptime requirement

Posted by Robert Petersen <ro...@buy.com>.
Thanks all, I figured there would be more talk about daemontools if there were really a need.  I appreciate the input and for starters we'll put two slaves behind a load balancer and grow it from there.

Lovin' Solr So Far!  We were using AltaVista as our search engine... it was sooo 90's!  haha

Thanks again,
Robi

-----Original Message-----
From: Rafał Kuć [mailto:rafal@alud.com.pl] 
Sent: Monday, August 03, 2009 11:00 AM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Hello!

Robert, from my experience with Solr (since 1.2, and running a few 1.4 deployments), Solr does not need any kind of mechanism to ensure it will auto-start on crash, because I haven't seen it crash through any fault of its own. Just make sure you run more than one instance of Solr, and put them behind a proxy or load balancer of some kind. 

-- 
Regards,
Rafał Kuć

> So then would the 'right' thing to do be to run it under something like
> Daemontools so it bounces back up on a crash? Do any other people use
> this approach or is there something better to make it come back up?

> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache sizes?
> Are these settings reasonable? With a small index I have been getting
> some great hit-rates.
> <ramBufferSizeMB>1024</ramBufferSizeMB>

> <filterCache class="solr.FastLRUCache" size="350000"
> initialSize="512" autowarmCount="80"/>
> <queryResultCache class="solr.LRUCache" size="512000000"
> initialSize="512" autowarmCount="80"/>
> <documentCache class="solr.FastLRUCache" size="512000"
> initialSize="512" autowarmCount="0"/>

> Thanks
> Robi

> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement

> Robi,

> Solr is indeed very stable. However, it can crash and I've seen it
> crash. Or rather, I should say I've seen the JVM that runs Solr crash.
> For instance, if you have a servlet container with a number of webapps,
> one of which is Solr, and one of which has a memory leak, I believe all
> webapps will suffer and "crash". And even if you have just Solr in your
> servlet container, it can OOM, say if you specify overly large caches or
> too frequent commits, etc.

> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



> ----- Original Message ----
>> From: Robert Petersen <ro...@buy.com>
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 12:18:55 PM
>> Subject: 99.9% uptime requirement

>> Hi all,

>> My solr project powers almost all the pages in our site and so needs
> to
>> be up period. My question is what can I do to ensure that happens?
>> Does solr ever crash, assuming reasonable load conditions and no
> extreme
>> index sizes?

>> I saw some comments about running solr under daemontools in order to
> get
>> an auto-restart on crashes. From what I have seen so far in my
> limited
>> experience, solr is very stable and never crashes (so far). Does
> anyone
>> else have this requirement and if so how do they deal with it? Is
>> anyone else running solr under daemontools in a production site?

>> Thanks for any input you might have,
>> Robi


Re: 99.9% uptime requirement

Posted by Rafał Kuć <ra...@alud.com.pl>.
Hello!

Robert, from my experience with Solr (since 1.2, and running a few 1.4 deployments), Solr does not need any kind of mechanism to ensure it will auto-start on crash, because I haven't seen it crash through any fault of its own. Just make sure you run more than one instance of Solr, and put them behind a proxy or load balancer of some kind. 

-- 
Regards,
Rafał Kuć

> So then would the 'right' thing to do be to run it under something like
> Daemontools so it bounces back up on a crash? Do any other people use
> this approach or is there something better to make it come back up?

> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache sizes?
> Are these settings reasonable? With a small index I have been getting
> some great hit-rates.
> <ramBufferSizeMB>1024</ramBufferSizeMB>

> <filterCache class="solr.FastLRUCache" size="350000"
> initialSize="512" autowarmCount="80"/>
> <queryResultCache class="solr.LRUCache" size="512000000"
> initialSize="512" autowarmCount="80"/>
> <documentCache class="solr.FastLRUCache" size="512000"
> initialSize="512" autowarmCount="0"/>

> Thanks
> Robi

> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement

> Robi,

> Solr is indeed very stable. However, it can crash and I've seen it
> crash. Or rather, I should say I've seen the JVM that runs Solr crash.
> For instance, if you have a servlet container with a number of webapps,
> one of which is Solr, and one of which has a memory leak, I believe all
> webapps will suffer and "crash". And even if you have just Solr in your
> servlet container, it can OOM, say if you specify overly large caches or
> too frequent commits, etc.

> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



> ----- Original Message ----
>> From: Robert Petersen <ro...@buy.com>
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 12:18:55 PM
>> Subject: 99.9% uptime requirement

>> Hi all,

>> My solr project powers almost all the pages in our site and so needs
> to
>> be up period. My question is what can I do to ensure that happens?
>> Does solr ever crash, assuming reasonable load conditions and no
> extreme
>> index sizes?

>> I saw some comments about running solr under daemontools in order to
> get
>> an auto-restart on crashes. From what I have seen so far in my
> limited
>> experience, solr is very stable and never crashes (so far). Does
> anyone
>> else have this requirement and if so how do they deal with it? Is
>> anyone else running solr under daemontools in a production site?

>> Thanks for any input you might have,
>> Robi


Re: 99.9% uptime requirement

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, daemontools or any kind of home-grown process-watching-and-restarting tool will work.
Regarding those caches - they look too large.
Also, the ramBufferSizeMB is irrelevant on search slaves.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Robert Petersen <ro...@buy.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, August 3, 2009 1:46:21 PM
> Subject: RE: 99.9% uptime requirement
> 
> So then would the 'right' thing to do be to run it under something like
> Daemontools so it bounces back up on a crash?  Do any other people use
> this approach or is there something better to make it come back up?
> 
> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache sizes?
> Are these settings reasonable?  With a small index I have been getting
> some great hit-rates.
> <ramBufferSizeMB>1024</ramBufferSizeMB>
> 
> <filterCache      class="solr.FastLRUCache"      size="350000"
> initialSize="512"      autowarmCount="80"/>
> <queryResultCache class="solr.LRUCache"      size="512000000"
> initialSize="512"      autowarmCount="80"/>
> <documentCache    class="solr.FastLRUCache"      size="512000"
> initialSize="512"      autowarmCount="0"/>
> 
> Thanks
> Robi
> 
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement
> 
> Robi,
> 
> Solr is indeed very stable.  However, it can crash and I've seen it
> crash.  Or rather, I should say I've seen the JVM that runs Solr crash.
> For instance, if you have a servlet container with a number of webapps,
> one of which is Solr, and one of which has a memory leak, I believe all
> webapps will suffer and "crash".  And even if you have just Solr in your
> servlet container, it can OOM, say if you specify overly large caches or
> too frequent commits, etc.
> 
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> ----- Original Message ----
> > From: Robert Petersen 
> > To: solr-user@lucene.apache.org
> > Sent: Friday, July 31, 2009 12:18:55 PM
> > Subject: 99.9% uptime requirement
> > 
> > Hi all,
> > 
> > My solr project powers almost all the pages in our site and so needs
> to
> > be up period.  My question is what can I do to ensure that happens?
> > Does solr ever crash, assuming reasonable load conditions and no
> extreme
> > index sizes?
> > 
> > I saw some comments about running solr under daemontools in order to
> get
> > an auto-restart on crashes.  From what I have seen so far in my
> limited
> > experience, solr is very stable and never crashes (so far).  Does
> anyone
> > else have this requirement and if so how do they deal with it?  Is
> > anyone else running solr under daemontools in a production site?
> > 
> > Thanks for any input you might have,
> > Robi


RE: 99.9% uptime requirement

Posted by Robert Petersen <ro...@buy.com>.
So then would the 'right' thing to do be to run it under something like
Daemontools so it bounces back up on a crash?  Do any other people use
this approach or is there something better to make it come back up?

Speaking of overly large caches, if I have solr running on a machine
with 8GB main memory is it going to hurt to make some huge cache sizes?
Are these settings reasonable?  With a small index I have been getting
some great hit-rates.
<ramBufferSizeMB>1024</ramBufferSizeMB>

<filterCache      class="solr.FastLRUCache"      size="350000"
initialSize="512"      autowarmCount="80"/>
<queryResultCache class="solr.LRUCache"      size="512000000"
initialSize="512"      autowarmCount="80"/>
<documentCache    class="solr.FastLRUCache"      size="512000"
initialSize="512"      autowarmCount="0"/>

Thanks
Robi

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Friday, July 31, 2009 11:37 PM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Robi,

Solr is indeed very stable.  However, it can crash and I've seen it
crash.  Or rather, I should say I've seen the JVM that runs Solr crash.
For instance, if you have a servlet container with a number of webapps,
one of which is Solr, and one of which has a memory leak, I believe all
webapps will suffer and "crash".  And even if you have just Solr in your
servlet container, it can OOM, say if you specify overly large caches or
too frequent commits, etc.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Robert Petersen <ro...@buy.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 12:18:55 PM
> Subject: 99.9% uptime requirement
> 
> Hi all,
> 
> My solr project powers almost all the pages in our site and so needs
to
> be up period.  My question is what can I do to ensure that happens?
> Does solr ever crash, assuming reasonable load conditions and no
extreme
> index sizes?
> 
> I saw some comments about running solr under daemontools in order to
get
> an auto-restart on crashes.  From what I have seen so far in my
limited
> experience, solr is very stable and never crashes (so far).  Does
anyone
> else have this requirement and if so how do they deal with it?  Is
> anyone else running solr under daemontools in a production site?
> 
> Thanks for any input you might have,
> Robi


Re: 99.9% uptime requirement

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Robi,

Solr is indeed very stable.  However, it can crash and I've seen it crash.  Or rather, I should say I've seen the JVM that runs Solr crash.  For instance, if you have a servlet container with a number of webapps, one of which is Solr, and one of which has a memory leak, I believe all webapps will suffer and "crash".  And even if you have just Solr in your servlet container, it can OOM, say if you specify overly large caches or too frequent commits, etc.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Robert Petersen <ro...@buy.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 12:18:55 PM
> Subject: 99.9% uptime requirement
> 
> Hi all,
> 
> My solr project powers almost all the pages in our site and so needs to
> be up period.  My question is what can I do to ensure that happens?
> Does solr ever crash, assuming reasonable load conditions and no extreme
> index sizes?
> 
> I saw some comments about running solr under daemontools in order to get
> an auto-restart on crashes.  From what I have seen so far in my limited
> experience, solr is very stable and never crashes (so far).  Does anyone
> else have this requirement and if so how do they deal with it?  Is
> anyone else running solr under daemontools in a production site?
> 
> Thanks for any input you might have,
> Robi


Re: 99.9% uptime requirement

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: 99.9% uptime requirement
: In-Reply-To: <4A...@btelligent.de>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking





-Hoss


Re: 99.9% uptime requirement

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
we have been using Solr in production for years. The only kind of
crash that we have observed is a JVM crash.

On Fri, Jul 31, 2009 at 9:48 PM, Robert Petersen<ro...@buy.com> wrote:
> Hi all,
>
> My solr project powers almost all the pages in our site and so needs to
> be up period.  My question is what can I do to ensure that happens?
> Does solr ever crash, assuming reasonable load conditions and no extreme
> index sizes?
>
> I saw some comments about running solr under daemontools in order to get
> an auto-restart on crashes.  From what I have seen so far in my limited
> experience, solr is very stable and never crashes (so far).  Does anyone
> else have this requirement and if so how do they deal with it?  Is
> anyone else running solr under daemontools in a production site?
>
> Thanks for any input you might have,
> Robi
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

99.9% uptime requirement

Posted by Robert Petersen <ro...@buy.com>.
Hi all,

My solr project powers almost all the pages in our site and so needs to
be up period.  My question is what can I do to ensure that happens?
Does solr ever crash, assuming reasonable load conditions and no extreme
index sizes?

I saw some comments about running solr under daemontools in order to get
an auto-restart on crashes.  From what I have seen so far in my limited
experience, solr is very stable and never crashes (so far).  Does anyone
else have this requirement and if so how do they deal with it?  Is
anyone else running solr under daemontools in a production site?

Thanks for any input you might have,
Robi

Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
I agree, real bad statistics, actually.

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
>
To me the former appears to be too high and the latter too low (for your
machine configuration). You can safely increase the ramBufferSize (or
maxBufferedDocs) to a higher value.

Couple of things -

   1. The stock solrconfig.xml comes with two sections <indexDefaults> and
   <mainIndex>. Options in the latter override the former. Just make sure that
   you have the right values in the right place.
   2. Do you have too many nested entities inside the DIH's data-config? If
   yes, a database level optimization (creating views, in memory tables ...)
   might hold the answer.
   3. Tried playing around with jdbc parameters in the data source? Setting
   the "batchSize" property to a considerable value might help (see the sketch
   below this list).

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:02 PM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Hi all,
>
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine, now, which is faster and less occupied.
>
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
>
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so far, which
> means at least 1.5 hours for 200k - as fast/slow as before (on the
> less performant machine).
>
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
>  iostat
> Linux 2.6.9-67.ELsmp      08/03/2009
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*
>
> The sourcing database is responding as fast as ever. (I checked that from
> my own machine, and did only a ping from the linux box to the db server.)
>
> Any help, any hint on where to look would be greatly appreciated.
>
>
> Thanks!
> Chantal
>
>
> Chantal Ackermann schrieb:
>
>> Hi again!
>>
>> Thanks for the answer, Grant.
>>
>>  > It could very well be the case that you aren't seeing any merges with
>>  > only 20K docs.  Ultimately, if you really want to, you can look in
>>  > your data.dir and count the files.  If you have indexed a lot and have
>>  > an MF of 100 and haven't done an optimize, you will see a lot more
>>  > index files.
>>
>> Do you mean that 20k is not representative enough to test those settings?
>> I've chosen the smaller data set so that the index can run completely
>> but doesn't take too long at the same time.
>> If it would be faster to begin with, I could use a larger data set, of
>> course. I still can't believe that 11 minutes is normal (I haven't
>> managed to make it run faster or slower than that, that duration is very
>> stable).
>>
>> It "feels kinda" slow to me...
>> Out of your experience - what would you expect as duration for an index
>> with:
>> - 21 fields, some using a text type with 6 filters
>> - database access using DataImportHandler with a query of (far) less
>> than 20ms
>> - 2 transformers
>>
>> If I knew that indexing time should be shorter than that, at least, I
>> would know that something is definitely wrong with what I am doing or
>> with the environment I am using.
>>
>>  > Likely, but not guaranteed.  Typically, larger merge factors are good
>>  > for batch indexing, but a lot of that has changed with Lucene's new
>>  > background merger, such that I don't know if it matters as much
>> anymore.
>>
>> Ok. I also read some posting where it basically said that the default
>> parameters are ok. And one shouldn't mess around with them.
>>
>> The thing is that our current search setup uses Lucene directly, and the
>> indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
>> fields are different, the complete setup is different. But it will be
>> hard to advertise a new implementation/setup where indexing is three
>> times slower - unless I can give some reasons why that is.
>>
>> The full index should be fairly fast because the backing data is updated
>> every few hours. I want to put in place an incremental/partial update as
>> main process, but full indexing might have to be done at certain times
>> if data has changed completely, or the schema has to be changed/extended.
>>
>>  > No, those are separate things.  The ramBufferSizeMB (although, I like
>>  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>>  > Lucene holds in memory before it has to flush.  MF controls how many
>>  > segments are on disk
>>
>> alas! the rum. I had that typo on the commandline before. that's my
>> subconscious telling me what I should do when I get home, tonight...
>>
>> So, increasing ramBufferSize should lead to higher memory usage,
>> shouldn't it? I'm not seeing that. :-(
>>
>> I'll try once more with MF 10 and a higher rum... well, you know... ;-)
>>
>> Cheers,
>> Chantal
>>
>> Grant Ingersoll schrieb:
>>
>>> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
>>>
>>>  Dear all,
>>>>
>>>> I want to find out which settings give the best full index
>>>> performance for my setup.
>>>> Therefore, I have been running a small index (less than 20k
>>>> documents) with a mergeFactor of 10 and 100.
>>>> In both cases, indexing took about 11.5 min:
>>>>
>>>> mergeFactor: 10
>>>> <str name="Time taken ">0:11:46.792</str>
>>>> mergeFactor: 100
>>>> /admin/cores?action=RELOAD
>>>> <str name="Time taken ">0:11:44.441</str>
>>>> Tomcat restart
>>>> <str name="Time taken ">0:11:34.143</str>
>>>>
>>>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
>>>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
>>>> ATA disk).
>>>>
>>>>
>>>> Now, I have three questions:
>>>>
>>>> 1. How can I check which mergeFactor is really being used? The
>>>> solrconfig.xml that is displayed in the admin application is the up-
>>>> to-date view on the file system. I tested that. But it's not
>>>> necessarily what the current SOLR core is using, isn't it?
>>>> Is there a way to check on the actually used mergeFactor (while the
>>>> index is running)?
>>>>
>>> It could very well be the case that you aren't seeing any merges with
>>> only 20K docs.  Ultimately, if you really want to, you can look in
>>> your data.dir and count the files.  If you have indexed a lot and have
>>> an MF of 100 and haven't done an optimize, you will see a lot more
>>> index files.
>>>
>>>
>>>  2. I changed the mergeFactor in both available settings (default and
>>>> main index) in the solrconfig.xml file of the core I am reindexing.
>>>> That is the correct place? Should a change in performance be
>>>> noticeable when increasing from 10 to 100? Or is the change not
>>>> perceivable if the requests for data are taking far longer than all
>>>> the indexing itself?
>>>>
>>> Likely, but not guaranteed.  Typically, larger merge factors are good
>>> for batch indexing, but a lot of that has changed with Lucene's new
>>> background merger, such that I don't know if it matters as much anymore.
>>>
>>>
>>>  3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>>>> (Or some other setting?)
>>>>
>>> No, those are separate things.  The ramBufferSizeMB (although, I like
>>> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>>> Lucene holds in memory before it has to flush.  MF controls how many
>>> segments are on disk
>>>
>>>  (I am still trying to get profiling information on how much
>>>> application time is eaten up by db connection/requests/processing.
>>>> The root entity query is about (average) 20ms. The child entity
>>>> query is less than 10ms.
>>>> I have my custom entity processor running on the child entity that
>>>> populates the map using a multi-row result set. I have also attached
>>>> one regex and one script transformer.)
>>>>
>>>> Thank you for any tips!
>>>> Chantal
>>>>
>>>>
>>>>
>>>> --
>>>> Chantal Ackermann
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
>
>

Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Thanks for the tip, Shalin. I'm happy with 6 indexes running in parallel 
and completing in less than 10min, right now, but I'll have look anyway.


Shalin Shekhar Mangar schrieb:
> On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <
> chantal.ackermann@btelligent.de> wrote:
> 
>> Juhu, great news, guys. I merged my child entity into the root entity, and
>> changed the custom entityprocessor to handle the additional columns
>> correctly.
>> And - indexing 160k documents now takes 5min instead of 1.5h!
>>
> 
> I'm a little late to the party but you may also want to look at
> CachedSqlEntityProcessor.
> 
> --
> Regards,
> Shalin Shekhar Mangar.

Re: mergeFactor / indexing speed

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Juhu, great news, guys. I merged my child entity into the root entity, and
> changed the custom entityprocessor to handle the additional columns
> correctly.
> And - indexing 160k documents now takes 5min instead of 1.5h!
>

I'm a little late to the party but you may also want to look at
CachedSqlEntityProcessor.
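
A rough sketch of how it is wired into data-config.xml (entity and column
names here are made up):

<entity name="item" query="select id, title from item">
  <entity name="feature" processor="CachedSqlEntityProcessor"
          query="select item_id, description from feature"
          where="item_id=item.id"/>
</entity>

It runs the child query once and caches the rows, instead of issuing one
query per parent row.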

-- 
Regards,
Shalin Shekhar Mangar.

Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> And - indexing 160k documents now takes 5min instead of 1.5h!
>
Awesome! It works for all!

(Now I can go relaxed on vacation. :-D )
>
Take me along!

Cheers
Avlesh

On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Juhu, great news, guys. I merged my child entity into the root entity, and
> changed the custom entityprocessor to handle the additional columns
> correctly.
> And - indexing 160k documents now takes 5min instead of 1.5h!
>
> (Now I can go relaxed on vacation. :-D )
>
>
> Conclusion:
> In my case performance was so bad because of constantly querying a database
> on a different machine (network traffic + db query per document).
>
>
> Thanks for all your help!
> Chantal
>
>
> Avlesh Singh schrieb:
>
>> does DIH call commit periodically, or are things done in one big batch?
>>>
>>>  AFAIK, one big batch.
>>
>
> Yes. There is no index available once the full-import has started (and the
> searcher has no cache; otherwise it still reads from that). There is no
> data (i.e. in the Admin/Luke frontend) visible until the import has finished
> successfully.
>

Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Juhu, great news, guys. I merged my child entity into the root entity, 
and changed the custom entityprocessor to handle the additional columns 
correctly.
And - indexing 160k documents now takes 5min instead of 1.5h!

(Now I can go relaxed on vacation. :-D )


Conclusion:
In my case performance was so bad because of constantly querying a 
database on a different machine (network traffic + db query per document).


Thanks for all your help!
Chantal


Avlesh Singh schrieb:
>> does DIH call commit periodically, or are things done in one big batch?
>>
> AFAIK, one big batch.

Yes. There is no index available once the full-import has started (and the 
searcher has no cache; otherwise it still reads from that). There is no 
data (i.e. in the Admin/Luke frontend) visible until the import has 
finished successfully.

Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> does DIH call commit periodically, or are things done in one big batch?
>
AFAIK, one big batch.

Cheers
Avlesh

On Thu, Aug 6, 2009 at 11:23 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Mon, Aug 3, 2009 at 12:32 PM, Chantal
> Ackermann<ch...@btelligent.de> wrote:
> > avg-cpu:  %user   %nice    %sys %iowait   %idle
> >           1.23    0.00    0.03    0.03   98.71
> >
> > Basically, it is doing very little? *scratch*
>
> How often is commit being called?  (a  Lucene commit sync's all of the
> index files so a crash won't result in a corrupted index... this can
> be costly).
>
> Guys - does DIH call commit periodically, or are things done in one big
> batch?
> Chantal - is autocommit configured in solrconfig.xml?
>
> -Yonik
> http://www.lucidimagination.com
>

Re: mergeFactor / indexing speed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Aug 3, 2009 at 12:32 PM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*

How often is commit being called?  (a  Lucene commit sync's all of the
index files so a crash won't result in a corrupted index... this can
be costly).

Guys - does DIH call commit periodically, or are things done in one big batch?
Chantal - is autocommit configured in solrconfig.xml?
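If it is, it shows up as an autoCommit block inside <updateHandler> in
solrconfig.xml; a sketch with illustrative thresholds:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>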

-Yonik
http://www.lucidimagination.com

Re: mergeFactor / indexing speed

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is some initial feedback:

- mergeFactor of 1000 seems crazy.  mergeFactor is probably not your problem.  I'd go back to default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- pinging the DB won't tell you much about the DB server's performance - ssh to the machine and check its CPU load, memory usage, disk IO

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?


Otis 
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Chantal Ackermann <ch...@btelligent.de>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Monday, August 3, 2009 12:32:12 PM
> Subject: Re: mergeFactor / indexing speed
> 
> Hi all,
> 
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine, now, which is faster and less occupied.
> 
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
> 
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so far, 
> which means at least 1.5 hours for 200k - as fast/slow as 
> before (on the less performant machine).
> 
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
>   iostat
> Linux 2.6.9-67.ELsmp      08/03/2009
> 
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>             1.23    0.00    0.03    0.03   98.71
> 
> Basically, it is doing very little? *scratch*
> 
> The sourcing database is responding as fast as ever. (I checked that 
> from my own machine, and did only a ping from the linux box to the db 
> server.)
> 
> Any help, any hint on where to look would be greatly appreciated.
> 
> 
> Thanks!
> Chantal
> 
> 
> Chantal Ackermann schrieb:
> > Hi again!
> >
> > Thanks for the answer, Grant.
> >
> >  > It could very well be the case that you aren't seeing any merges with
> >  > only 20K docs.  Ultimately, if you really want to, you can look in
> >  > your data.dir and count the files.  If you have indexed a lot and have
> >  > an MF of 100 and haven't done an optimize, you will see a lot more
> >  > index files.
> >
> > Do you mean that 20k is not representative enough to test those settings?
> > I've chosen the smaller data set so that the index can run completely
> > but doesn't take too long at the same time.
> > If it would be faster to begin with, I could use a larger data set, of
> > course. I still can't believe that 11 minutes is normal (I haven't
> > managed to make it run faster or slower than that, that duration is very
> > stable).
> >
> > It "feels kinda" slow to me...
> > Out of your experience - what would you expect as duration for an index
> > with:
> > - 21 fields, some using a text type with 6 filters
> > - database access using DataImportHandler with a query of (far) less
> > than 20ms
> > - 2 transformers
> >
> > If I knew that indexing time should be shorter than that, at least, I
> > would know that something is definitely wrong with what I am doing or
> > with the environment I am using.
> >
> >  > Likely, but not guaranteed.  Typically, larger merge factors are good
> >  > for batch indexing, but a lot of that has changed with Lucene's new
> >  > background merger, such that I don't know if it matters as much anymore.
> >
> > Ok. I also read some posting where it basically said that the default
> > parameters are ok. And one shouldn't mess around with them.
> >
> > The thing is that our current search setup uses Lucene directly, and the
> > indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
> > fields are different, the complete setup is different. But it will be
> > hard to advertise a new implementation/setup where indexing is three
> > times slower - unless I can give some reasons why that is.
> >
> > The full index should be fairly fast because the backing data is updated
> > every few hours. I want to put in place an incremental/partial update as
> > main process, but full indexing might have to be done at certain times
> > if data has changed completely, or the schema has to be changed/extended.
> >
> >  > No, those are separate things.  The ramBufferSizeMB (although, I like
> >  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
> >  > Lucene holds in memory before it has to flush.  MF controls how many
> >  > segments are on disk
> >
> > alas! the rum. I had that typo on the commandline before. that's my
> > subconscious telling me what I should do when I get home, tonight...
> >
> > So, increasing ramBufferSize should lead to higher memory usage,
> > shouldn't it? I'm not seeing that. :-(
> >
> > I'll try once more with MF 10 and a higher rum... well, you know... ;-)
> >
> > Cheers,
> > Chantal
> >
> > Grant Ingersoll schrieb:
> >> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
> >>
> >>> Dear all,
> >>>
> >>> I want to find out which settings give the best full index
> >>> performance for my setup.
> >>> Therefore, I have been running a small index (less than 20k
> >>> documents) with a mergeFactor of 10 and 100.
> >>> In both cases, indexing took about 11.5 min:
> >>>
> >>> mergeFactor: 10
> >>> <str name="Time taken ">0:11:46.792</str>
> >>> mergeFactor: 100
> >>> /admin/cores?action=RELOAD
> >>> <str name="Time taken ">0:11:44.441</str>
> >>> Tomcat restart
> >>> <str name="Time taken ">0:11:34.143</str>
> >>>
> >>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
> >>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
> >>> ATA disk).
> >>>
> >>>
> >>> Now, I have three questions:
> >>>
> >>> 1. How can I check which mergeFactor is really being used? The
> >>> solrconfig.xml that is displayed in the admin application is the up-
> >>> to-date view on the file system. I tested that. But it's not
> >>> necessarily what the current SOLR core is using, isn't it?
> >>> Is there a way to check on the actually used mergeFactor (while the
> >>> index is running)?
> >> It could very well be the case that you aren't seeing any merges with
> >> only 20K docs.  Ultimately, if you really want to, you can look in
> >> your data.dir and count the files.  If you have indexed a lot and have
> >> an MF of 100 and haven't done an optimize, you will see a lot more
> >> index files.
> >>
> >>
> >>> 2. I changed the mergeFactor in both available settings (default and
> >>> main index) in the solrconfig.xml file of the core I am reindexing.
> >>> That is the correct place? Should a change in performance be
> >>> noticeable when increasing from 10 to 100? Or is the change not
> >>> perceivable if the requests for data are taking far longer than all
> >>> the indexing itself?
> >> Likely, but not guaranteed.  Typically, larger merge factors are good
> >> for batch indexing, but a lot of that has changed with Lucene's new
> >> background merger, such that I don't know if it matters as much anymore.
> >>
> >>
> >>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
> >>> (Or some other setting?)
> >> No, those are separate things.  The ramBufferSizeMB (although, I like
> >> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
> >> Lucene holds in memory before it has to flush.  MF controls how many
> >> segments are on disk
> >>
> >>> (I am still trying to get profiling information on how much
> >>> application time is eaten up by db connection/requests/processing.
> >>> The root entity query is about (average) 20ms. The child entity
> >>> query is less than 10ms.
> >>> I have my custom entity processor running on the child entity that
> >>> populates the map using a multi-row result set. I have also attached
> >>> one regex and one script transformer.)
> >>>
> >>> Thank you for any tips!
> >>> Chantal
> >>>
> >>>
> >>>
> >>> --
> >>> Chantal Ackermann
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >> using Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>


Re: mergeFactor / indexing speed

Posted by Avlesh Singh <av...@gmail.com>.
>
> Do you think it's possible to return (in the nested entity) rows
> independent of the unique id, and let the processor decide when a document
> is complete?
>
I don't think so.

In my case, I had 9 (JDBC) entities for each document. Most of these
entities returned a single column and a limited number of rows for each document.
I observed a significant improvement in performance by using aggregation in
my parent query. For example, in MySQL I used the group_concat() function to
aggregate all the values (separated by some delimiter) into a single
column of the parent query's resultset. I would then use a RegexTransformer
to split this data on the previously used delimiter and populate a
multi-valued field.
I actually got rid of 5 entities out of 9 in my data-config. It reduced the
import time significantly too.
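
A minimal data-config sketch of that approach (the table and column names here
are made up for illustration; the real ingredients are just MySQL's
group_concat() and the splitBy attribute of the RegexTransformer):

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost/mydb" user="solr" password="***"/>
  <document>
    <entity name="doc" transformer="RegexTransformer"
            query="SELECT d.id, d.title,
                          GROUP_CONCAT(t.tag SEPARATOR '|') AS tags
                   FROM docs d LEFT JOIN tags t ON t.doc_id = d.id
                   GROUP BY d.id, d.title">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- split the concatenated column back into a multi-valued field -->
      <field column="tags" splitBy="\|" name="tag"/>
    </entity>
  </document>
</dataConfig>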

Cheers
Avlesh

On Thu, Aug 6, 2009 at 10:15 PM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Hi all,
>
> to keep this thread up to date... ;-)
>
>
> d) jdbc batch size
> changed to 10. (Was default: 500, then 1000)
>
> The problem with my dih setup is that the root entity query returns a huge
> set (all ids that shall be indexed). A larger fetchsize would be good for
> that query.
> The nested entity, however, returns only up to 9 rows, ever. The constraints
> are so strict (by id) that there is no way that any additional data could be
> pre-fetched.
> (Actually, anyone using DIH with nested entities should run into that
> problem?)
>
> After changing to 10, I cannot see that this low batch size slowed the
> indexer down (significantly).
>
> As I would like to stick with DIH (instead of dumping the data into CSV and
> import it then) here is my question:
>
> Do you think it's possible to return (in the nested entity) rows
> independent of the unique id, and let the processor decide when a document
> is complete?
> The examples in the wiki always use an ID to get the data for the nested
> entity, so I'm not sure it was planned with that in mind. But as I'm already
> handling multiple db rows for one document, it might not be too difficult to
> change to handling the unique id correctly, as well?
> Of course, I would need something like a look ahead to know whether the
> next row is already part of the next document.
>
>
> Cheers,
> Chantal
>
>
>
> Concerning the other settings (just fyi):
>
> a) mergeFactor 10 (and also tried 100)
> I don't think that changed anything to the worse, rather to the better. So,
> I'll stick with 10 from now on.
>
> b) ramBufferSizeMB
> tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not
> sure about 1024. I'll stick to 512.
>
>
>

Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi all,

to keep this thread up to date... ;-)


d) jdbc batch size
changed to 10. (Was default: 500, then 1000)

The problem with my DIH setup is that the root entity query returns a 
huge set (all ids that shall be indexed). A larger fetch size would be 
good for that query.
The nested entity, however, returns only up to 9 rows, ever. The 
constraints are so strict (by id) that there is no way that any 
additional data could be pre-fetched.
(Actually, shouldn't anyone using DIH with nested entities run into this 
problem?)

After changing to 10, I cannot see that this low batch size slowed the 
indexer down (significantly).
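
For illustration, the batch size I am talking about is the batchSize 
attribute on the DIH data source; the driver and url below are just 
placeholders for my real connection settings:

<dataSource type="JdbcDataSource"
            driver="my.jdbc.Driver"
            url="jdbc:mydb://dbhost/mydb"
            user="indexer" password="***"
            batchSize="10"/>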

As I would like to stick with DIH (instead of dumping the data into CSV 
and importing it afterwards), here is my question:

Do you think it's possible to return (in the nested entity) rows 
independent of the unique id, and let the processor decide when a 
document is complete?
The examples in the wiki always use an ID to get the data for the nested 
entity, so I'm not sure it was planned with that in mind. But as I'm 
already handling multiple db rows for one document, it might not be too 
difficult to also handle the unique id correctly?
Of course, I would need something like a look ahead to know whether the 
next row is already part of the next document.


Cheers,
Chantal



Concerning the other settings (just fyi):

a) mergeFactor 10 (and also tried 100)
I don't think that changed anything for the worse, rather for the better. 
So, I'll stick with 10 from now on.

b) ramBufferSizeMB
tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not 
sure about 1024. I'll stick to 512.



Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Avlesh,
hi Otis,
hi Grant,
hi all,


(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM 
before starting IO.

b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.
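
For reference, these are the two sections I mean; a trimmed solrconfig.xml 
sketch (the values shown are just examples):

<indexDefaults>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</indexDefaults>
<mainIndex>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</mainIndex>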

d) JDBC batch size:
I haven't set it. I'll do that.

e) DB server performance:
I agree, ping is definitely not much information. I also ran queries 
against it from my own machine (while the indexer was running), and they 
came back as fast as usual.
Currently, I don't have any login to ssh to that machine, but I'm going 
to try to get one.

f) Network:
I'll definitely need to have a look at that once I have access to the db 
machine.


g) the data

g.1) nested entity in DIH conf
there is only the root and one nested entity. However, that nested 
entity returns multiple rows (about 10) per query. (The number of fetched 
rows is about 10 times the number of processed documents.)

g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (String 
concatenation),
- if a key already exists, it gets the existing value; if that value is a 
list, it adds the new value to that list; if it's not a list, it creates 
one and adds both the old and the new value to it.
I refrained from adding any business logic to that processor. It treats 
all rows alike, no matter whether they hold values that can appear 
multiple times or values that must appear only once.
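
For completeness, the nested entity is wired up roughly like this in the 
data-config (entity name, table and query are shortened placeholders here; 
the column* attributes are the ones the processor reads via 
context.getEntityAttribute()):

<!-- table name and the parent reference ${root.ID} are placeholders -->
<entity name="epgValue"
        processor="EpgValueEntityProcessor"
        transformer="RegexTransformer,script:getMainCategory"
        query="SELECT ID_EPG_DEFINITION, ATT_NAME, EPG_VALUE, EPG_SUBVALUE
               FROM ... WHERE ID_EPG_DEFINITION = '${root.ID}'"
        columnIdEpgDefinition="ID_EPG_DEFINITION"
        columnAttName="ATT_NAME"
        columnEpgValue="EPG_VALUE"
        columnEpgSubvalue="EPG_SUBVALUE">
  ...
</entity>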

g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*"/>
<field column="role" sourceColName="person" 
regex="[^\|]+\|\d+,\d+,\d+,(.*)"/>

- to extract a number from an existing number (bit calculation 
using the script transformer). As that one works on a field that is 
potentially multiValued, it needs to take care of creating and 
populating a list, as well.
<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
	var cat = row.get('cat');
	var mainCat;
	if (cat != null) {
		// check whether cat is an array
		if (cat instanceof java.util.List) {
			var arr = new java.util.ArrayList();
			for (var i=0; i<cat.size(); i++) {
				mainCat = new java.lang.Integer(cat.get(i)>>8);
				if (!arr.contains(mainCat)) {
					arr.add(mainCat);
				}
			}
			row.put('maincat', arr);
		} else { // it is a single value
			var mainCat = new java.lang.Integer(cat>>8);
			row.put('maincat', mainCat);
		}
	}
	return row;
}
]]></script>
(The EpgValueEntityProcessor decides on creating lists on a case by case 
basis: only if a value is specified multiple times for a certain data 
set does it create a list. This is because I didn't want to put any 
complex configuration or business logic into it.)

g.4) fields
the DIH extracts 5 fields from the root entity, 11 fields from the 
nested entity, and the transformers might create an additional 3 (multiValued).
schema.xml defines 21 fields (two additional fields: the timestamp field 
(default="NOW") and a field collecting three other text fields for 
default search (using copy field)):
- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
generateWordParts="0" generateNumberParts="0" catenateWords="0" 
catenateNumbers="0" catenateAll="0" />
</analyzer>
- 4 text_de (one is the field populated by copying from the 3 others):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LengthFilterFactory" min="2" max="5000" />
<filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
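
The "collecting" field mentioned above is filled via copyField in schema.xml; 
schematically (the field names below are placeholders for my three text 
fields and the default search field):

<copyField source="title" dest="text"/>
<copyField source="subtitle" dest="text"/>
<copyField source="description" dest="text"/>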


Thank you for taking your time!
Cheers,
Chantal





************** EpgValueEntityProcessor.java *******************

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

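/**
 * Pivots the multi-row result set of the nested entity into a single DIH row:
 * the ATT_NAME column becomes the map key, EPG_VALUE (combined with
 * EPG_SUBVALUE when present) becomes the value, and repeated keys are
 * collected into a list so they end up as multi-valued fields.
 */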
public class EpgValueEntityProcessor extends SqlEntityProcessor {
	private static final Logger log = Logger
			.getLogger(EpgValueEntityProcessor.class.getName());
	private static final String ATTR_ID_EPG_DEFINITION = 
"columnIdEpgDefinition";
	private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
	private static final String ATTR_COLUMN_EPG_VALUE = "columnEpgValue";
	private static final String ATTR_COLUMN_EPG_SUBVALUE = "columnEpgSubvalue";
	private static final String DEF_ATT_NAME = "ATT_NAME";
	private static final String DEF_EPG_VALUE = "EPG_VALUE";
	private static final String DEF_EPG_SUBVALUE = "EPG_SUBVALUE";
	private static final String DEF_ID_EPG_DEFINITION = "ID_EPG_DEFINITION";
	private String colIdEpgDef = DEF_ID_EPG_DEFINITION;
	private String colAttName = DEF_ATT_NAME;
	private String colEpgValue = DEF_EPG_VALUE;
	private String colEpgSubvalue = DEF_EPG_SUBVALUE;

	@SuppressWarnings("unchecked")
	public void init(Context context) {
		super.init(context);
		colIdEpgDef = context.getEntityAttribute(ATTR_ID_EPG_DEFINITION);
		colAttName = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME);
		colEpgValue = context.getEntityAttribute(ATTR_COLUMN_EPG_VALUE);
		colEpgSubvalue = context.getEntityAttribute(ATTR_COLUMN_EPG_SUBVALUE);
	}

	public Map<String, Object> nextRow() {
		if (rowcache != null)
			return getFromRowCache();
		if (rowIterator == null) {
			String q = getQuery();
			initQuery(resolver.replaceTokens(q));
		}
		Map<String, Object> pivottedRow = new HashMap<String, Object>();
		Map<String, Object> epgValue;
		String attName, value, subvalue;
		Object existingValue, newValue;
		String id = null;
		
		// return null once the end of that data set is reached
		if (!rowIterator.hasNext()) {
			rowIterator = null;
			return null;
		}
		// as long as there is data, iterate over the rows and pivot them
		// return the pivotted row after the last row of data has been reached
		do {
			epgValue = rowIterator.next();
			id = epgValue.get(colIdEpgDef).toString();
			assert id != null;
			if (pivottedRow.containsKey(colIdEpgDef)) {
				assert id.equals(pivottedRow.get(colIdEpgDef));
			} else {
				pivottedRow.put(colIdEpgDef, id);
			}
			attName = (String) epgValue.get(colAttName);
			if (attName == null) {
				log.warning("No value returned for attribute name column "
						+ colAttName);
			}
			value = (String) epgValue.get(colEpgValue);
			subvalue = (String) epgValue.get(colEpgSubvalue);

			// create a single object for value and subvalue
			// if subvalue is not set, use value only, otherwise create string
			// array
			if (subvalue == null || subvalue.trim().length() == 0) {
				newValue = value;
			} else {
				newValue = value + "|" + subvalue;
			}

			// if there is already an entry for that attribute, extend
			// the existing value
			if (pivottedRow.containsKey(attName)) {
				existingValue = pivottedRow.get(attName);
//				newValue = existingValue + " " + newValue;
//				pivottedRow.put(attName, newValue);
				if (existingValue instanceof List) {
					((List) existingValue).add(newValue);
				} else {
					ArrayList v = new ArrayList();
					Collections.addAll(v, existingValue, newValue);
					pivottedRow.put(attName, v);
				}
			} else {
				pivottedRow.put(attName, newValue);
			}
		} while (rowIterator.hasNext());
		
		pivottedRow = applyTransformer(pivottedRow);
		return pivottedRow;
	}

}

Re: mergeFactor / indexing speed

Posted by Grant Ingersoll <gs...@apache.org>.
How big are your documents?  I haven't benchmarked DIH, so I am not  
sure what to expect, but it does seem like something isn't right.  Can  
you fully describe how you are indexing?  Have you done any profiling?

On Aug 3, 2009, at 12:32 PM, Chantal Ackermann wrote:

> Hi all,
>
> I'm still struggling with the index performance. I've moved the  
> indexer
> to a different machine, now, which is faster and less occupied.
>
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
>
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so  
> far. Which means 1,5 hours at least for 200k - which is as fast/slow  
> as before (on the less performant machine).
>
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
> iostat
> Linux 2.6.9-67.ELsmp      08/03/2009
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*
>
> The sourcing database is responding as fast as ever. (I checked that  
> from my own machine, and did only a ping from the linux box to the  
> db server.)
>
> Any help, any hint on where to look would be greatly appreciated.
>
>
> Thanks!
> Chantal
>
>
> Chantal Ackermann schrieb:
>> Hi again!
>>
>> Thanks for the answer, Grant.
>>
>> > It could very well be the case that you aren't seeing any merges  
>> with
>> > only 20K docs.  Ultimately, if you really want to, you can look in
>> > your data.dir and count the files.  If you have indexed a lot and  
>> have
>> > an MF of 100 and haven't done an optimize, you will see a lot more
>> > index files.
>>
>> Do you mean that 20k is not representative enough to test those  
>> settings?
>> I've chosen the smaller data set so that the index can run completely
>> but doesn't take too long at the same time.
>> If it would be faster to begin with, I could use a larger data set,  
>> of
>> course. I still can't believe that 11 minutes is normal (I haven't
>> managed to make it run faster or slower than that, that duration is  
>> very
>> stable).
>>
>> It "feels kinda" slow to me...
>> Out of your experience - what would you expect as duration for an  
>> index
>> with:
>> - 21 fields, some using a text type with 6 filters
>> - database access using DataImportHandler with a query of (far) less
>> than 20ms
>> - 2 transformers
>>
>> If I knew that indexing time should be shorter than that, at least, I
>> would know that something is definitely wrong with what I am doing or
>> with the environment I am using.
>>
>> > Likely, but not guaranteed.  Typically, larger merge factors are  
>> good
>> > for batch indexing, but a lot of that has changed with Lucene's new
>> > background merger, such that I don't know if it matters as much  
>> anymore.
>>
>> Ok. I also read some posting where it basically said that the default
>> parameters are ok. And one shouldn't mess around with them.
>>
>> The thing is that our current search setup uses Lucene directly,  
>> and the
>> indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
>> fields are different, the complete setup is different. But it will be
>> hard to advertise a new implementation/setup where indexing is three
>> times slower - unless I can give some reasons why that is.
>>
>> The full index should be fairly fast because the backing data is  
>> update
>> every few hours. I want to put in place an incremental/partial  
>> update as
>> main process, but full indexing might have to be done at certain  
>> times
>> if data has changed completely, or the schema has to be changed/ 
>> extended.
>>
>> > No, those are separate things.  The ramBufferSizeMB (although, I  
>> like
>> > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many  
>> docs
>> > Lucene holds in memory before it has to flush.  MF controls how  
>> many
>> > segments are on disk
>>
>> alas! the rum. I had that typo on the commandline before. that's my
>> subconscious telling me what I should do when I get home, tonight...
>>
>> So, increasing ramBufferSize should lead to higher memory usage,
>> shouldn't it? I'm not seeing that. :-(
>>
>> I'll try once more with MF 10 and a higher rum... well, you  
>> know... ;-)
>>
>> Cheers,
>> Chantal
>>
>> Grant Ingersoll schrieb:
>>> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
>>>
>>>> Dear all,
>>>>
>>>> I want to find out which settings give the best full index
>>>> performance for my setup.
>>>> Therefore, I have been running a small index (less than 20k
>>>> documents) with a mergeFactor of 10 and 100.
>>>> In both cases, indexing took about 11.5 min:
>>>>
>>>> mergeFactor: 10
>>>> <str name="Time taken ">0:11:46.792</str>
>>>> mergeFactor: 100
>>>> /admin/cores?action=RELOAD
>>>> <str name="Time taken ">0:11:44.441</str>
>>>> Tomcat restart
>>>> <str name="Time taken ">0:11:34.143</str>
>>>>
>>>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But  
>>>> it
>>>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM,  
>>>> old
>>>> ATA disk).
>>>>
>>>>
>>>> Now, I have three questions:
>>>>
>>>> 1. How can I check which mergeFactor is really being used? The
>>>> solrconfig.xml that is displayed in the admin application is the  
>>>> up-
>>>> to-date view on the file system. I tested that. But it's not
>>>> necessarily what the current SOLR core is using, isn't it?
>>>> Is there a way to check on the actually used mergeFactor (while the
>>>> index is running)?
>>> It could very well be the case that you aren't seeing any merges  
>>> with
>>> only 20K docs.  Ultimately, if you really want to, you can look in
>>> your data.dir and count the files.  If you have indexed a lot and  
>>> have
>>> an MF of 100 and haven't done an optimize, you will see a lot more
>>> index files.
>>>
>>>
>>>> 2. I changed the mergeFactor in both available settings (default  
>>>> and
>>>> main index) in the solrconfig.xml file of the core I am reindexing.
>>>> That is the correct place? Should a change in performance be
>>>> noticeable when increasing from 10 to 100? Or is the change not
>>>> perceivable if the requests for data are taking far longer than all
>>>> the indexing itself?
>>> Likely, but not guaranteed.  Typically, larger merge factors are  
>>> good
>>> for batch indexing, but a lot of that has changed with Lucene's new
>>> background merger, such that I don't know if it matters as much  
>>> anymore.
>>>
>>>
>>>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>>>> (Or some other setting?)
>>> No, those are separate things.  The ramBufferSizeMB (although, I  
>>> like
>>> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many  
>>> docs
>>> Lucene holds in memory before it has to flush.  MF controls how many
>>> segments are on disk
>>>
>>>> (I am still trying to get profiling information on how much
>>>> application time is eaten up by db connection/requests/processing.
>>>> The root entity query is about (average) 20ms. The child entity
>>>> query is less than 10ms.
>>>> I have my custom entity processor running on the child entity that
>>>> populates the map using a multi-row result set. I have also  
>>>> attached
>>>> one regex and one script transformer.)
>>>>
>>>> Thank you for any tips!
>>>> Chantal
>>>>
>>>>
>>>>
>>>> --
>>>> Chantal Ackermann
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi all,

I'm still struggling with the index performance. I've moved the indexer
to a different machine now, which is faster and less busy.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour so far, 
which means at least 1.5 hours for 200k - as fast/slow as 
before (on the less powerful machine).

The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
  iostat
Linux 2.6.9-67.ELsmp      08/03/2009

avg-cpu:  %user   %nice    %sys %iowait   %idle
            1.23    0.00    0.03    0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that 
from my own machine, and only did a ping from the Linux box to the db 
server.)

Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:
> Hi again!
>
> Thanks for the answer, Grant.
>
>  > It could very well be the case that you aren't seeing any merges with
>  > only 20K docs.  Ultimately, if you really want to, you can look in
>  > your data.dir and count the files.  If you have indexed a lot and have
>  > an MF of 100 and haven't done an optimize, you will see a lot more
>  > index files.
>
> Do you mean that 20k is not representative enough to test those settings?
> I've chosen the smaller data set so that the index can run completely
> but doesn't take too long at the same time.
> If it would be faster to begin with, I could use a larger data set, of
> course. I still can't believe that 11 minutes is normal (I haven't
> managed to make it run faster or slower than that, that duration is very
> stable).
>
> It "feels kinda" slow to me...
> Out of your experience - what would you expect as duration for an index
> with:
> - 21 fields, some using a text type with 6 filters
> - database access using DataImportHandler with a query of (far) less
> than 20ms
> - 2 transformers
>
> If I knew that indexing time should be shorter than that, at least, I
> would know that something is definitely wrong with what I am doing or
> with the environment I am using.
>
>  > Likely, but not guaranteed.  Typically, larger merge factors are good
>  > for batch indexing, but a lot of that has changed with Lucene's new
>  > background merger, such that I don't know if it matters as much anymore.
>
> Ok. I also read some posting where it basically said that the default
> parameters are ok. And one shouldn't mess around with them.
>
> The thing is that our current search setup uses Lucene directly, and the
> indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
> fields are different, the complete setup is different. But it will be
> hard to advertise a new implementation/setup where indexing is three
> times slower - unless I can give some reasons why that is.
>
> The full index should be fairly fast because the backing data is update
> every few hours. I want to put in place an incremental/partial update as
> main process, but full indexing might have to be done at certain times
> if data has changed completely, or the schema has to be changed/extended.
>
>  > No, those are separate things.  The ramBufferSizeMB (although, I like
>  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>  > Lucene holds in memory before it has to flush.  MF controls how many
>  > segments are on disk
>
> alas! the rum. I had that typo on the commandline before. that's my
> subconscious telling me what I should do when I get home, tonight...
>
> So, increasing ramBufferSize should lead to higher memory usage,
> shouldn't it? I'm not seeing that. :-(
>
> I'll try once more with MF 10 and a higher rum... well, you know... ;-)
>
> Cheers,
> Chantal
>
> Grant Ingersoll schrieb:
>> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
>>
>>> Dear all,
>>>
>>> I want to find out which settings give the best full index
>>> performance for my setup.
>>> Therefore, I have been running a small index (less than 20k
>>> documents) with a mergeFactor of 10 and 100.
>>> In both cases, indexing took about 11.5 min:
>>>
>>> mergeFactor: 10
>>> <str name="Time taken ">0:11:46.792</str>
>>> mergeFactor: 100
>>> /admin/cores?action=RELOAD
>>> <str name="Time taken ">0:11:44.441</str>
>>> Tomcat restart
>>> <str name="Time taken ">0:11:34.143</str>
>>>
>>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
>>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
>>> ATA disk).
>>>
>>>
>>> Now, I have three questions:
>>>
>>> 1. How can I check which mergeFactor is really being used? The
>>> solrconfig.xml that is displayed in the admin application is the up-
>>> to-date view on the file system. I tested that. But it's not
>>> necessarily what the current SOLR core is using, isn't it?
>>> Is there a way to check on the actually used mergeFactor (while the
>>> index is running)?
>> It could very well be the case that you aren't seeing any merges with
>> only 20K docs.  Ultimately, if you really want to, you can look in
>> your data.dir and count the files.  If you have indexed a lot and have
>> an MF of 100 and haven't done an optimize, you will see a lot more
>> index files.
>>
>>
>>> 2. I changed the mergeFactor in both available settings (default and
>>> main index) in the solrconfig.xml file of the core I am reindexing.
>>> That is the correct place? Should a change in performance be
>>> noticeable when increasing from 10 to 100? Or is the change not
>>> perceivable if the requests for data are taking far longer than all
>>> the indexing itself?
>> Likely, but not guaranteed.  Typically, larger merge factors are good
>> for batch indexing, but a lot of that has changed with Lucene's new
>> background merger, such that I don't know if it matters as much anymore.
>>
>>
>>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>>> (Or some other setting?)
>> No, those are separate things.  The ramBufferSizeMB (although, I like
>> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>> Lucene holds in memory before it has to flush.  MF controls how many
>> segments are on disk
>>
>>> (I am still trying to get profiling information on how much
>>> application time is eaten up by db connection/requests/processing.
>>> The root entity query is about (average) 20ms. The child entity
>>> query is less than 10ms.
>>> I have my custom entity processor running on the child entity that
>>> populates the map using a multi-row result set. I have also attached
>>> one regex and one script transformer.)
>>>
>>> Thank you for any tips!
>>> Chantal
>>>
>>>
>>>
>>> --
>>> Chantal Ackermann
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>




Re: mergeFactor / indexing speed

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi again!

Thanks for the answer, Grant.

 > It could very well be the case that you aren't seeing any merges with
 > only 20K docs.  Ultimately, if you really want to, you can look in
 > your data.dir and count the files.  If you have indexed a lot and have
 > an MF of 100 and haven't done an optimize, you will see a lot more
 > index files.

Do you mean that 20k is not representative enough to test those settings?
I've chosen the smaller data set so that the index can run completely 
but doesn't take too long at the same time.
If it would be faster to begin with, I could use a larger data set, of 
course. I still can't believe that 11 minutes is normal (I haven't 
managed to make it run faster or slower than that; the duration is very 
stable).

It "feels kinda" slow to me...
Out of your experience - what would you expect as duration for an index 
with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less 
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I 
would know that something is definitely wrong with what I am doing or 
with the environment I am using.

 > Likely, but not guaranteed.  Typically, larger merge factors are good
 > for batch indexing, but a lot of that has changed with Lucene's new
 > background merger, such that I don't know if it matters as much anymore.

OK. I also read some postings which basically said that the default 
parameters are OK and one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly, and the 
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The 
fields are different, the complete setup is different. But it will be 
hard to advertise a new implementation/setup where indexing is three 
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is updated 
every few hours. I want to put in place an incremental/partial update as 
the main process, but full indexing might have to be done at certain times 
if the data has changed completely, or the schema has to be changed/extended.

 > No, those are separate things.  The ramBufferSizeMB (although, I like
 > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
 > Lucene holds in memory before it has to flush.  MF controls how many
 > segments are on disk

alas! the rum. I had that typo on the commandline before. that's my 
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage, 
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll schrieb:
> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
> 
>> Dear all,
>>
>> I want to find out which settings give the best full index
>> performance for my setup.
>> Therefore, I have been running a small index (less than 20k
>> documents) with a mergeFactor of 10 and 100.
>> In both cases, indexing took about 11.5 min:
>>
>> mergeFactor: 10
>> <str name="Time taken ">0:11:46.792</str>
>> mergeFactor: 100
>> /admin/cores?action=RELOAD
>> <str name="Time taken ">0:11:44.441</str>
>> Tomcat restart
>> <str name="Time taken ">0:11:34.143</str>
>>
>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
>> ATA disk).
>>
>>
>> Now, I have three questions:
>>
>> 1. How can I check which mergeFactor is really being used? The
>> solrconfig.xml that is displayed in the admin application is the up-
>> to-date view on the file system. I tested that. But it's not
>> necessarily what the current SOLR core is using, isn't it?
>> Is there a way to check on the actually used mergeFactor (while the
>> index is running)?
> 
> It could very well be the case that you aren't seeing any merges with
> only 20K docs.  Ultimately, if you really want to, you can look in
> your data.dir and count the files.  If you have indexed a lot and have
> an MF of 100 and haven't done an optimize, you will see a lot more
> index files.
> 
> 
>> 2. I changed the mergeFactor in both available settings (default and
>> main index) in the solrconfig.xml file of the core I am reindexing.
>> That is the correct place? Should a change in performance be
>> noticeable when increasing from 10 to 100? Or is the change not
>> perceivable if the requests for data are taking far longer than all
>> the indexing itself?
> 
> Likely, but not guaranteed.  Typically, larger merge factors are good
> for batch indexing, but a lot of that has changed with Lucene's new
> background merger, such that I don't know if it matters as much anymore.
> 
> 
>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>> (Or some other setting?)
> 
> No, those are separate things.  The ramBufferSizeMB (although, I like
> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
> Lucene holds in memory before it has to flush.  MF controls how many
> segments are on disk
> 
>> (I am still trying to get profiling information on how much
>> application time is eaten up by db connection/requests/processing.
>> The root entity query is about (average) 20ms. The child entity
>> query is less than 10ms.
>> I have my custom entity processor running on the child entity that
>> populates the map using a multi-row result set. I have also attached
>> one regex and one script transformer.)
>>
>> Thank you for any tips!
>> Chantal
>>
>>
>>
>> --
>> Chantal Ackermann
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 

Re: mergeFactor / indexing speed

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:

> Dear all,
>
> I want to find out which settings give the best full index  
> performance for my setup.
> Therefore, I have been running a small index (less than 20k  
> documents) with a mergeFactor of 10 and 100.
> In both cases, indexing took about 11.5 min:
>
> mergeFactor: 10
> <str name="Time taken ">0:11:46.792</str>
> mergeFactor: 100
> /admin/cores?action=RELOAD
> <str name="Time taken ">0:11:44.441</str>
> Tomcat restart
> <str name="Time taken ">0:11:34.143</str>
>
> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it  
> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old  
> ATA disk).
>
>
> Now, I have three questions:
>
> 1. How can I check which mergeFactor is really being used? The  
> solrconfig.xml that is displayed in the admin application is the up- 
> to-date view on the file system. I tested that. But it's not  
> necessarily what the current SOLR core is using, isn't it?
> Is there a way to check on the actually used mergeFactor (while the  
> index is running)?

It could very well be the case that you aren't seeing any merges with  
only 20K docs.  Ultimately, if you really want to, you can look in  
your data.dir and count the files.  If you have indexed a lot and have  
an MF of 100 and haven't done an optimize, you will see a lot more  
index files.


> 2. I changed the mergeFactor in both available settings (default and  
> main index) in the solrconfig.xml file of the core I am reindexing.  
> That is the correct place? Should a change in performance be  
> noticeable when increasing from 10 to 100? Or is the change not  
> perceivable if the requests for data are taking far longer than all  
> the indexing itself?

Likely, but not guaranteed.  Typically, larger merge factors are good  
for batch indexing, but a lot of that has changed with Lucene's new  
background merger, such that I don't know if it matters as much anymore.


> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?  
> (Or some other setting?)

No, those are separate things.  The ramBufferSizeMB (although, I like  
the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs  
Lucene holds in memory before it has to flush.  MF controls how many  
segments are on disk

>
> (I am still trying to get profiling information on how much  
> application time is eaten up by db connection/requests/processing.
> The root entity query is about (average) 20ms. The child entity  
> query is less than 10ms.
> I have my custom entity processor running on the child entity that  
> populates the map using a multi-row result set. I have also attached  
> one regex and one script transformer.)
>
> Thank you for any tips!
> Chantal
>
>
>
> -- 
> Chantal Ackermann

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search