Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2012/11/27 01:48:57 UTC

Solr 4, optimizing while doing other updates?

For Solr 4.0 and higher, is it possible to optimize the index while 
other updates are happening?  Based on some behavior I just saw, I think 
it might be.

I ran a full-import using DIH -- six index shards with 13 million 
records each and a seventh shard (hot shard) with 317000. On a few of 
those large indexes, after DIH reported idle and successful completion, 
I noticed that the index size was still increasing -- Solr was doing one 
last background merge.

In the meantime, my indexing program had noticed that the DIH was done, 
and began indexing backed-up content to the new indexes.  That indexing 
worked flawlessly, even though the indexes were still merging.  I don't 
think there's any way for me to detect the "DIH done, but still merging" 
state ... but I am guessing that I don't have to worry about it.
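
In case it helps anyone reading this later: the detection itself is just
polling the DIH status handler until it reports idle.  A minimal SolrJ
sketch follows -- not my exact code; the URL and core name are made-up
placeholders, HttpSolrServer is the 4.x class name, and note that the
post-import background merge is NOT visible in this status:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class WaitForDih {
        // Returns true once the DataImportHandler reports "idle".
        static boolean dihIdle(SolrServer server) throws Exception {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "status");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/dataimport");  // the DIH request handler
            NamedList<Object> rsp = server.request(req);
            return "idle".equals(rsp.get("status"));
        }

        public static void main(String[] args) throws Exception {
            SolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/s1live");
            while (!dihIdle(server)) {
                Thread.sleep(5000);  // poll every five seconds
            }
            System.out.println("DIH idle, resuming normal indexing");
        }
    }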

Can anyone confirm?  I know that on older Solr versions, if I tried to 
index while optimizing, my program would not work right.

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 11:25 AM, Shawn Heisey wrote:
> To see whether my heap is too small, I connected jconsole remotely to 
> a 3.5.0 server via JMX. The numbers look OK to me, I'm including a 
> link to a jconsole screenshot.  I could probably drop the heap lower, 
> but that might cause some issues with DIH full imports, which we do 
> occasionally when there are major changes to the database.

One bit of info I left out: The JVM has been up for 21 days and 19 
hours.  Without that information, the number of garbage collections and 
the total GC time in the screenshot might look very bad.

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 1:30 PM, Yonik Seeley wrote:
> Expunging a single delete from a segment involves re-writing the 
> entire segment, so it's just as bad as optimize (assuming most 
> segments have a deletion). You might as well get the benefit of the 
> optimize as well. 

I have had the same thought here.  What I would probably do is run the 
expungeDeletes commit on approximately the same interval as I currently 
do an optimize, and then do an optimize on a much longer interval.  If I 
get lucky enough to expunge deletes from only very small segments, I 
come out ahead.  If not, I'm no worse off than I am now.
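
For anyone who wants to try the same approach, the expungeDeletes commit
can be sent from SolrJ roughly like this -- a sketch with a placeholder
URL, not my production code:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class ExpungeCommit {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/s1live");
            UpdateRequest req = new UpdateRequest();
            // Commit, and ask the merge policy to rewrite segments
            // that have enough deleted documents in them.
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            req.setParam("expungeDeletes", "true");
            req.process(server);
        }
    }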

If I were to entirely eliminate optimization, then I would likely be in 
a situation where I've got 20-60 segments all the time, none of which 
would take very long to rewrite.  This is my mergePolicy config:

   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
     <int name="maxMergeAtOnce">35</int>
     <int name="segmentsPerTier">35</int>
     <int name="maxMergeAtOnceExplicit">105</int>
   </mergePolicy>

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Yonik Seeley <yo...@lucidworks.com>.
On Tue, Nov 27, 2012 at 3:21 PM, Shawn Heisey <so...@elyograg.org> wrote:
>  but even way back then, rumblings on the mailing list said "don't optimize for performance reasons."

Count me amongst the dissenters.  Optimize can make a lot of sense,
and that's why it still exists.
People should be careful to not assume they need to optimize to get
good performance, but people assuming that no one should optimize are
making just as big of a mistake IMO.

> When/if a configuration option becomes available so I can do a commit that
> expunges deletes even when there are only a few deleted documents, or if I
> can figure out how to add that option myself, I will be able to eliminate
> full optimization entirely.

Expunging a single delete from a segment involves re-writing the
entire segment, so it's just as bad as optimize (assuming most
segments have a deletion).  You might as well get the benefit of the
optimize as well.

-Yonik
http://lucidworks.com

Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 11:39 AM, Jack Krupansky wrote:
> So, if I understand your scenario correctly, you are doing a lot of 
> deletes, but since they are occurring against "cold" data, there 
> isn't usually much, if any, query traffic for that old/cold data.
>
> In short, it sounds like the reason you are optimizing is to keep the 
> memory footprint from growing in a very memory-limited environment.
>
> It also looks like you have frequent garbage collections.

With 64GB of RAM, I'm not sure I'd classify my situation as 
memory-limited.  It's true that I don't have enough RAM to cache all my 
index data, but over 8GB of each 22GB index is stored fields (.fdt 
files), so I have the important bits.  I'm sure I can increase my heap 
size without drastically affecting performance, but so far I have not 
needed to.  If we start using more Solr functionality like facets, I'm 
sure I will have to increase the heap.

This is a distributed index; every query hits every shard.  A large chunk 
of the data that gets returned comes from the hot shard, but users do 
page down into old results fairly often.  Only data added in the last 3.5 
to 7 days lives in the hot shard.
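
The distributed part is nothing exotic -- the core that receives the
query fans it out with the standard shards parameter, roughly like this
(SolrJ sketch; the host and core names are placeholders for my real
seven-shard list):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DistributedQuery {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                new HttpSolrServer("http://idxa1:8983/solr/broker");
            SolrQuery q = new SolrQuery("user query here");
            // Every shard in this list is searched, results merged.
            q.set("shards",
                  "idxa1:8983/solr/s1live,idxa2:8983/solr/s2live");
            QueryResponse rsp = server.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }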

As far as frequent garbage collections, I would agree with you if I were 
restarting Solr often.  This JVM has nearly 22 days of uptime, so on 
average there is about 5 minutes between each GC:

https://dl.dropbox.com/u/97770508/solr-jconsole-summary.png

When/if a configuration option becomes available so I can do a commit 
that expunges deletes even when there are only a few deleted documents, 
or if I can figure out how to add that option myself, I will be able to 
eliminate full optimization entirely.

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Jack Krupansky <ja...@basetechnology.com>.
So, if I understand your scenario correctly, you are doing a lot of deletes, 
but since they are occurring against "cold" data, there isn't usually 
much, if any, query traffic for that old/cold data.

In short, it sounds like the reason you are optimizing is to keep the memory 
footprint from growing in a very memory-limited environment.

It also looks like you have frequent garbage collections.

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Tuesday, November 27, 2012 1:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4, optimizing while doing other updates?

On 11/27/2012 10:36 AM, Jack Krupansky wrote:
> Okay, if performance isn't the reason for the optimize, what is the reason 
> that you are using it?
>
> 8GB for Java heap seems low for a 22GB index. How much Java heap seems 
> available when the app is running?
>
> Are these three separate Solr instances/JVMs on the same machine?
>
> How many cores for the machine?

First, thank you for taking time to look into how things are going for
me.  I really appreciate it.

I am optimizing purely to eliminate deleted documents.  I will admit
that when we first got going on Solr 1.4.0, performance was a small
concern, but even way back then, rumblings on the mailing list said
"don't optimize for performance reasons."

Each server has one Solr JVM (using the Jetty 6 included with 3.5) with
an 8GB heap; each index shard lives in a Solr core. The server has 64GB of
RAM and two quad-core CPUs, so a total of 8 CPU cores.  Two servers make
up an entire index chain.  One server has three of the 22GB (cold)
shards and the 800MB (hot) shard, and the other server has the other three
22GB shards.

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
17823 ncindex   20   0 80.8g  17g 9.4g S  2.0 28.6   4548:18 java

ncindex@idxa1 ~ $ du -s /index/solr/data/
71606072        /index/solr/data/

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
stepping        : 6
cpu MHz         : 2826.535
cache size      : 6144 KB

To see whether my heap is too small, I connected jconsole remotely to a
3.5.0 server via JMX. The numbers look OK to me, I'm including a link to
a jconsole screenshot.  I could probably drop the heap lower, but that
might cause some issues with DIH full imports, which we do occasionally
when there are major changes to the database.

Jconsole screenshot:

https://dl.dropbox.com/u/97770508/solr-jconsole.png

Thanks,
Shawn 


Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 10:36 AM, Jack Krupansky wrote:
> Okay, if performance isn't the reason for the optimize, what is the 
> reason that you are using it?
>
> 8GB for Java heap seems low for a 22GB index. How much Java heap seems 
> available when the app is running?
>
> Are these three separate Solr instances/JVMs on the same machine?
>
> How many cores for the machine?

First, thank you for taking time to look into how things are going for 
me.  I really appreciate it.

I am optimizing purely to eliminate deleted documents.  I will admit 
that when we first got going on Solr 1.4.0, performance was a small 
concern, but even way back then, rumblings on the mailing list said 
"don't optimize for performance reasons."

Each server has one Solr JVM (using the Jetty 6 included with 3.5) with 
an 8GB heap; each index shard lives in a Solr core. The server has 64GB of 
RAM and two quad-core CPUs, so a total of 8 CPU cores.  Two servers make 
up an entire index chain.  One server has three of the 22GB (cold) 
shards and the 800MB (hot) shard, and the other server has the other three 
22GB shards.

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
17823 ncindex   20   0 80.8g  17g 9.4g S  2.0 28.6   4548:18 java

ncindex@idxa1 ~ $ du -s /index/solr/data/
71606072        /index/solr/data/

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
stepping        : 6
cpu MHz         : 2826.535
cache size      : 6144 KB

To see whether my heap is too small, I connected jconsole remotely to a 
3.5.0 server via JMX. The numbers look OK to me, I'm including a link to 
a jconsole screenshot.  I could probably drop the heap lower, but that 
might cause some issues with DIH full imports, which we do occasionally 
when there are major changes to the database.

Jconsole screenshot:

https://dl.dropbox.com/u/97770508/solr-jconsole.png

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Okay, if performance isn't the reason for the optimize, what is the reason 
that you are using it?

8GB for Java heap seems low for a 22GB index. How much Java heap seems 
available when the app is running?

Are these three separate Solr instances/JVMs on the same machine?

How many cores for the machine?

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Tuesday, November 27, 2012 11:33 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4, optimizing while doing other updates?

On 11/27/2012 8:19 AM, Jack Krupansky wrote:
> From your experience with your application, how big is the delta for query 
> time before and after a typical weekly optimize? 50%? 20%? 2%?

We've never tried to measure it.  If I were chasing better performance,
I would be interested, but that's not the reason for the optimize.  An
optimized index does *feel* faster than a freshly built one (DIH
full-import), but I have not recorded any numbers.  With three 22GB
shards per server (64GB RAM, 8GB for java heap) and one server handling
the tiny shard as well, query speed is somewhat glacial whether there
are 30 segments or 1.

Thanks,
Shawn 


Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 8:19 AM, Jack Krupansky wrote:
> From your experience with your application, how big is the delta for 
> query time before and after a typical weekly optimize? 50%? 20%? 2%?

We've never tried to measure it.  If I were chasing better performance, 
I would be interested, but that's not the reason for the optimize.  An 
optimized index does *feel* faster than a freshly built one (DIH 
full-import), but I have not recorded any numbers.  With three 22GB 
shards per server (64GB RAM, 8GB for java heap) and one server handling 
the tiny shard as well, query speed is somewhat glacial whether there 
are 30 segments or 1.

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Jack Krupansky <ja...@basetechnology.com>.
From your experience with your application, how big is the delta for query 
time before and after a typical weekly optimize? 50%? 20%? 2%?

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Tuesday, November 27, 2012 9:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4, optimizing while doing other updates?

On 11/27/2012 5:46 AM, Erick Erickson wrote:
> To see how much of an issue it is, look at the admin>>statistics page. The
> delta between numDocs and maxDocs is the number of non-expunged deletes in
> your index. That may ease your temptation to, as Walter says, turn that
> knob..

I wrote a status servlet that gives me the number of deleted documents
on all my index shards, along with other useful info.  It gathers stats
mbean info from all my shards into one convenient location.  Here you
can see a screenshot of the status page.  The production systems are
3.5.0, the dev system is a 4.1 snapshot checked out 2012/11/26:

http://dl.dropbox.com/u/97770508/statuspage.png

This is a quiet week for our system ... the shard that will be optimized
tonight currently has 13272 deleted documents. Normally that would be
much higher.  An older version of the status page includes the number of
segments, but I haven't seen a need for that so far.

For the large shards (13 million docs, 22GB in 3.5.0), I never see any
merging from just doing updates/deletes.  It takes about ten minutes to
optimize one of those shards.  Currently, my indexing program postpones
all changes to those shards during the large optimize, only allowing new
document inserts (which all go to the tiny shard) to happen.  With
Solr 4, I think I can eliminate that postponement and not worry.

On the tiny shard, optimizing usually only takes about ten seconds, and
my indexing system is otherwise idle for 50-59 seconds out of every
minute, so doing it once an hour isn't hurting me.  Because it runs so
fast, I do that optimize in the same thread as the updates.

I have looked into the possibility of doing a commit with
expungeDeletes, without an optimize.  It doesn't work for me.  The
percentage of deleted documents in my indexes is almost never high
enough to trigger the expunge, and to my knowledge, Solr currently
doesn't have a config knob to change the percentage.  If I haven't
already filed a jira for such a configuration knob, I will.  I would
honestly like to avoid doing full optimizes, but there is currently no
other way for me to get rid of deletes.

Thanks,
Shawn 


Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 5:46 AM, Erick Erickson wrote:
> To see how much of an issue it is, look at the admin>>statistics page. The
> delta between numDocs and maxDocs is the number of non-expunged deletes in
> your index. That may ease your temptation to, as Walter says, turn that
> knob..

I wrote a status servlet that gives me the number of deleted documents 
on all my index shards, along with other useful info.  It gathers stats 
mbean info from all my shards into one convenient location.  Here you 
can see a screenshot of the status page.  The production systems are 
3.5.0, the dev system is a 4.1 snapshot checked out 2012/11/26:

http://dl.dropbox.com/u/97770508/statuspage.png

This is a quiet week for our system ... the shard that will be optimized 
tonight currently has 13272 deleted documents. Normally that would be 
much higher.  An older version of the status page includes the number of 
segments, but I haven't seen a need for that so far.
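
My servlet reads the stats mbeans, but the same numDocs/maxDoc delta is
available through the Luke request handler.  A bare-bones SolrJ sketch
(placeholder URL, error handling omitted):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;
    import org.apache.solr.common.util.NamedList;

    public class DeletedDocCount {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                new HttpSolrServer("http://idxa1:8983/solr/s1live");
            LukeResponse rsp = new LukeRequest().process(server);
            NamedList<Object> info = rsp.getIndexInfo();
            int numDocs = (Integer) info.get("numDocs");
            int maxDoc = (Integer) info.get("maxDoc");
            // maxDoc still counts deleted-but-unexpunged documents,
            // numDocs does not, so the difference is the deleted count.
            System.out.println("deleted docs: " + (maxDoc - numDocs));
        }
    }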

For the large shards (13 million docs, 22GB in 3.5.0), I never see any 
merging from just doing updates/deletes.  It takes about ten minutes to 
optimize one of those shards.  Currently, my indexing program postpones 
all changes to those shards during the large optimize, only allowing new 
document inserts (which all go to the tiny shard) to happen.  With 
Solr 4, I think I can eliminate that postponement and not worry.

On the tiny shard, optimizing usually only takes about ten seconds, and 
my indexing system is otherwise idle for 50-59 seconds out of every 
minute, so doing it once an hour isn't hurting me.  Because it runs so 
fast, I do that optimize in the same thread as the updates.
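
That inline optimize is just the blocking SolrJ call, something like the
sketch below (placeholder URL).  With waitFlush and waitSearcher both
true, the indexing thread simply pauses for those ten seconds:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class OptimizeTinyShard {
        public static void main(String[] args) throws Exception {
            // Blocks until the optimize finishes -- fine for a
            // roughly ten-second operation on the tiny shard.
            new HttpSolrServer("http://idxa1:8983/solr/s0live")
                .optimize(true, true);
        }
    }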

I have looked into the possibility of doing a commit with 
expungeDeletes, without an optimize.  It doesn't work for me.  The 
percentage of deleted documents in my indexes is almost never high 
enough to trigger the expunge, and to my knowledge, Solr currently 
doesn't have a config knob to change the percentage.  If I haven't 
already filed a jira for such a configuration knob, I will.  I would 
honestly like to avoid doing full optimizes, but there is currently no 
other way for me to get rid of deletes.
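
The percentage itself does live in TieredMergePolicy, as
forceMergeDeletesPctAllowed (default 10.0) in 4.x; in 3.x it was called
expungeDeletesPctAllowed.  If the mergePolicy section's setter injection
reaches it, a config like the following might already do the job -- I
have not verified this, so treat it as a guess:

   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
     <int name="maxMergeAtOnce">35</int>
     <int name="segmentsPerTier">35</int>
     <int name="maxMergeAtOnceExplicit">105</int>
     <!-- speculative: expunge deletes from any segment that has
          deleted docs at all, instead of the 10% default -->
     <double name="forceMergeDeletesPctAllowed">0.0</double>
   </mergePolicy>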

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Erick Erickson <er...@gmail.com>.
Shawn:

To see how much of an issue it is, look at the admin>>statistics page. The
delta between numDocs and maxDocs is the number of non-expunged deletes in
your index. That may ease your temptation to, as Walter says, turn that
knob..

Best
Erick


On Mon, Nov 26, 2012 at 8:18 PM, Walter Underwood <wu...@wunderwood.org> wrote:

> Normal merges expunge deletes. You do not need to force a merge. Once per
> hour is almost certainly way too often.
>
> Before I used Solr, I was on the Ultraseek team for nine years.
> Ultraseek had the same merging strategy, with a force merge option. I've
> worked with many, many customers on this issue.
>
>  wunder
>
> On Nov 26, 2012, at 5:05 PM, Shawn Heisey wrote:
>
> > On 11/26/2012 5:56 PM, Walter Underwood wrote:
> >> You can optimize during updates, but you should not optimize at all,
> especially if you are doing continuous updates. Hands off that knob.
> >
> > I promise I'm not optimizing just because it's got a cool name, or
> because a README/HOWTO said to do it. I optimize my tiny index once an
> hour, and the large indexes once every six days (one of them gets optimized
> every day, using DAY_OF_YEAR % 6).
> >
> > The only reason I do the optimizes is to expunge deletes. The indexer
> program does inserts, reinserts, and deletes once a minute, most of which
> hit the tiny index.  On the large indexes, between 25000 and 500000
> documents get deleted over the course of the six day optimize interval.
> >
> > Thanks,
> > Shawn
>

Re: Solr 4, optimizing while doing other updates?

Posted by Walter Underwood <wu...@wunderwood.org>.
Normal merges expunge deletes. You do not need to force a merge. Once per hour is almost certainly way too often.

Before I used Solr, I was on the Ultraseek team for nine years. Ultraseek had the same merging strategy, with a force merge option. I've worked with many, many customers on this issue.

 wunder

On Nov 26, 2012, at 5:05 PM, Shawn Heisey wrote:

> On 11/26/2012 5:56 PM, Walter Underwood wrote:
>> You can optimize during updates, but you should not optimize at all, especially if you are doing continuous updates. Hands off that knob.
> 
> I promise I'm not optimizing just because it's got a cool name, or because a README/HOWTO said to do it. I optimize my tiny index once an hour, and the large indexes once every six days (one of them gets optimized every day, using DAY_OF_YEAR % 6).
> 
> The only reason I do the optimizes is to expunge deletes. The indexer program does inserts, reinserts, and deletes once a minute, most of which hit the tiny index.  On the large indexes, between 25000 and 500000 documents get deleted over the course of the six day optimize interval.
> 
> Thanks,
> Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/26/2012 5:56 PM, Walter Underwood wrote:
> You can optimize during updates, but you should not optimize at all, especially if you are doing continuous updates. Hands off that knob.

I promise I'm not optimizing just because it's got a cool name, or 
because a README/HOWTO said to do it. I optimize my tiny index once an 
hour, and the large indexes once every six days (one of them gets 
optimized every day, using DAY_OF_YEAR % 6).
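
The day selection is nothing fancier than this (sketch; the shard
numbering is illustrative):

    import java.util.Calendar;

    public class OptimizeSchedule {
        public static void main(String[] args) {
            // Six large shards, numbered 0-5: optimize exactly one
            // per night, rotating through all six every six days.
            int target =
                Calendar.getInstance().get(Calendar.DAY_OF_YEAR) % 6;
            System.out.println("Tonight: optimize large shard " + target);
        }
    }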

The only reason I do the optimizes is to expunge deletes. The indexer 
program does inserts, reinserts, and deletes once a minute, most of 
which hit the tiny index.  On the large indexes, between 25000 and 
500000 documents get deleted over the course of the six day optimize 
interval.

Thanks,
Shawn


Re: Solr 4, optimizing while doing other updates?

Posted by Walter Underwood <wu...@wunderwood.org>.
You can optimize during updates, but you should not optimize at all, especially if you are doing continuous updates. Hands off that knob.

wunder

On Nov 26, 2012, at 4:48 PM, Shawn Heisey wrote:

> For Solr 4.0 and higher, is it possible to optimize the index while other updates are happening?  Based on some behavior I just saw, I think it might be.
> 
> I ran a full-import using DIH -- six index shards with 13 million records each and a seventh shard (hot shard) with 317000. On a few of those large indexes, after DIH reported idle and successful completion, I noticed that the index size was still increasing -- Solr was doing one last background merge.
> 
> In the meantime, my indexing program had noticed that the DIH was done, and began indexing backed-up content to the new indexes.  That indexing worked flawlessly, even though the indexes were still merging.  I don't think there's any way for me to detect the "DIH done, but still merging" state ... but I am guessing that I don't have to worry about it.
> 
> Can anyone confirm?  I know that on older Solr versions, if I tried to index while optimizing, my program would not work right.
> 
> Thanks,
> Shawn