Posted to java-user@lucene.apache.org by Justus Pendleton <jp...@atlassian.com> on 2008/11/03 04:42:52 UTC

Performance of never optimizing

Howdy,

I have a couple of questions regarding some Lucene benchmarking and  
what the results mean[3]. (Skip to the numbered list at the end if you  
don't want to read the lengthy exegesis :)

I'm a developer for JIRA[1]. We are currently trying to get a better  
understanding of Lucene, and our use of it, to cope with the needs of  
our larger customers. These "large" indexes are only a couple hundred  
thousand documents but our problem is compounded by the fact that they  
have a relatively high rate of modification (= delete + insert of a new  
document) and our users expect these modifications to show up in query  
results pretty much instantly.

Our current default behaviour is a merge factor of 4. We perform an  
optimization on the index every 4000 additions. We also perform an  
optimize at midnight. Our fundamental problem is that these  
optimizations are locking the index for unacceptably long periods of  
time, something that we want to resolve for our next major release,  
hopefully without undermining search performance too badly.

In the Lucene javadoc there is a comment, and a link to a mailing list  
discussion[2], that suggests applications such as JIRA should never  
perform optimize but should instead set their merge factor very low.

In an attempt to understand the impact of a) lowering the merge factor  
from 4 to 2 and b) never, ever optimizing on an index (over the course  
of years and millions of additions/updates) I wanted to try to  
benchmark Lucene.

I used the contrib/benchmark framework and wrote a small algorithm  
that adds documents to an index (using the Reuters doc generator),  
does a search, does an optimize, then does another search. All the  
pretty pictures can be seen at:

   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

I have several questions, hopefully they aren't overwhelming in their  
quantity :-/

1. Why does the merge factor of 4 appear to be faster than the merge  
factor of 2?

2. Why does non-optimized searching appear to be faster than optimized  
searching once the index hits ~500,000 documents?

3. There appears to be a fairly sizable performance drop across the  
board around 450,000 documents. Why is that?

4. Searching performance appears to decrease towards a fairly  
pessimistic 20 searches per second (for a relatively simple search).  
Is this really what we should expect long-term from Lucene?

5. Does my benchmark even make sense? I am far from an expert on  
benchmarking so it is possible I'm not measuring what I think I am  
measuring.

Thanks in advance for any insight you can provide. This is an area  
that we very much want to understand better as Lucene is a key part of  
JIRA's success,

Cheers,
Justus
JIRA Developer

[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Justus Pendleton <jp...@atlassian.com>.
On 03/11/2008, at 4:27 PM, Otis Gospodnetic wrote:
> Why are you optimizing?  Trying to make the search faster?  I would  
> try to avoid optimizing during high usage periods.

I assume that the original, long-ago, decision to optimize was made to  
improve searching performance.

> One thing that you might not have tried is the constant re-opening  
> of the IndexReader, which you'll need to do if you want to see index  
> changes instantly.

We do keep track of when the index has been updated and re-open  
IndexReaders so that they see the updates instantly.
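For what it's worth, that reopen-on-update bookkeeping can be sketched like this (plain Java standing in for Lucene's IndexReader; the version counter and all names are hypothetical, not JIRA's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a cached IndexReader that is reopened only when
// the index has actually changed (tracked here by a version number).
class ReaderCache {
    private long openVersion = -1;
    private List<String> snapshot = new ArrayList<>();

    // "Reopen" (re-snapshot) only when the index version has advanced;
    // otherwise keep serving the already-open reader.
    List<String> reader(long currentVersion, List<String> indexDocs) {
        if (currentVersion != openVersion) {
            snapshot = new ArrayList<>(indexDocs);
            openVersion = currentVersion;
        }
        return snapshot;
    }

    public static void main(String[] args) {
        ReaderCache cache = new ReaderCache();
        List<String> index = new ArrayList<>(List.of("issue-1"));
        System.out.println(cache.reader(1, index).size()); // 1
        index.add("issue-2");                              // update without a version bump
        System.out.println(cache.reader(1, index).size()); // still 1 (stale reader)
        System.out.println(cache.reader(2, index).size()); // 2 after the "reopen"
    }
}
```

The point of the sketch is the trade-off in this thread: reopening on every update keeps results fresh but throws away whatever the open reader had warmed up.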

>>
> So you indexed once and then measured search performance?  Or did  
> you measure indexing performance?  I can't quite tell from your email.
> And in one case you optimized before searching and in the other you  
> did not optimize?

Yes, I indexed once and then measured search performance. (The actual  
algorithm used can be seen at http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs)  
For my current purposes I don't care about indexing performance.

>> 1. Why does the merge factor of 4 appear to be faster than the  
>> merge factor of
>> 2?
>
>
> Faster for indexing or searching?  If indexing, then it's because 4  
> means fewer segment merges than 2.  If searching, then I don't know,  
> unless you had indexing and searching happening in parallel, which  
> then means less IO for 4.

For searching. The indexing and searching should not have been happening  
in parallel. However, multiple searches are occurring in parallel.

> Did your index fit in RAM, by the way?

The machine has, I believe, 4 GB of RAM and the benchmark suite  
reports that 700 MB were used, so it does appear to have fit into RAM.

>> 2. Why does non-optimized searching appear to be faster than  
>> optimized searching
>> once the index hits ~500,000 documents?
>
>
> Not sure without seeing the index/machine.

The machine is an 8-core Mac Pro. If you'd like, I can provide the  
indexes online somewhere. Or if you can provide pointers on what to  
look for, I'm more than happy to investigate this myself.

>
> It sounds like you were measuring search performance while at the  
> same time increasing the index size by incrementally adding more docs?

No documents were being added to the index while the searching was  
being performed. I was trying to measure only the search performance.

> 20 reqs/sec sounds very low.  How large is your index, how much RAM,  
> and how about heap size?
> What were your queries like? random?  from log?

The queries were generated by the ReutersQueryMaker. I am not sure  
what the heap sizes at various stages were. (I ran the benchmarks  
over the weekend; they took several days.)

> I'm confused by what exactly you did and measured, but it could just  
> be that I'm tired.

My apologies for not being clearer in my initial email. I appreciate  
the help,

Cheers,
Justus




Re: Performance of never optimizing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Otis Gospodnetic wrote:

>> Our current default behaviour is a merge factor of 4. We perform an  
>> optimization
>> on the index every 4000 additions. We also perform an optimize at  
>> midnight. Our
>
> I wouldn't optimize every 4000 additions - you are killing IO,  
> rewriting the whole index, while trying to provide fast searches,  
> plus you are locking the index for other modifications.


One small clarification: optimize can run in the BG.  It doesn't block
other IndexWriter operations.  EG you can continue adding & deleting
docs.  Optimize() just guarantees that those segments that existed at
the start will be merged together.  Other segments that are flushed
after optimize has started will not be merged.

Of course, optimize is tremendously IO intensive so this may still
block out searches if the performance becomes hideously bad because
the IO system is saturated.

Mike



Re: Performance of never optimizing

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

 
Very quick comments.


----- Original Message ----
> From: Justus Pendleton <jp...@atlassian.com>
> To: java-user@lucene.apache.org
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing
> 
> Howdy,
> 
> I have a couple of questions regarding some Lucene benchmarking and what the 
> results mean[3]. (Skip to the numbered list at the end if you don't want to read 
> the lengthy exegesis :)
> 
> I'm a developer for JIRA[1]. We are currently trying to get a better 
> understanding of Lucene, and our use of it, to cope with the needs of our larger 
> customers. These "large" indexes are only a couple hundred thousand documents 
> but our problem is compounded by the fact that they have a relatively high rate 
> of modification (= delete + insert of a new document) and our users expect these 
> modifications to show up in query results pretty much instantly.


This will be a tough call with large indices - there is no real-time search in Lucene yet.

> Our current default behaviour is a merge factor of 4. We perform an optimization 
> on the index every 4000 additions. We also perform an optimize at midnight. Our 


I wouldn't optimize every 4000 additions - you are killing IO, rewriting the whole index, while trying to provide fast searches, plus you are locking the index for other modifications.

> fundamental problem is that these optimizations are locking the index for 
> unacceptably long periods of time, something that we want to resolve for our 
> next major release, hopefully without undermining search performance too badly.


Why are you optimizing?  Trying to make the search faster?  I would try to avoid optimizing during high usage periods.

> In the Lucene javadoc there is a comment, and a link to a mailing list 
> discussion[2], that suggests applications such as JIRA should never perform 
> optimize but should instead set their merge factor very low.


Right, you can let Lucene merge segments.

> In an attempt to understand the impact of a) lowering the merge factor from 4 to 
> 2 and b) never, ever optimizing on an index (over the course of years and 
> millions of additions/updates) I wanted to try to benchmark Lucene.


One thing that you might not have tried is the constant re-opening of the IndexReader, which you'll need to do if you want to see index changes instantly.

> I used the contrib/benchmark framework and wrote a small algorithm that adds 
> documents to an index (using the Reuters doc generator), does a search, does an 
> optimize, then does another search. All the pretty pictures can be seen at:


So you indexed once and then measured search performance?  Or did you measure indexing performance?  I can't quite tell from your email.
And in one case you optimized before searching and in the other you did not optimize?

>   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
> 
> I have several questions, hopefully they aren't overwhelming in their quantity 
> :-/
> 
> 1. Why does the merge factor of 4 appear to be faster than the merge factor of 
> 2?


Faster for indexing or searching?  If indexing, then it's because 4 means fewer segment merges than 2.  If searching, then I don't know, unless you had indexing and searching happening in parallel, which then means less IO for 4.
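Otis's "fewer segment merges" point can be illustrated with a toy simulation of a logarithmic merge policy (plain Java, no Lucene; it mimics the general log-merge pattern rather than Lucene's exact policy): with merge factor m, every m same-sized segments are merged into one, so a lower merge factor triggers merges far more often.

```java
import java.util.HashMap;
import java.util.Map;

// Toy logarithmic merge policy: each flush creates a level-0 segment;
// whenever mergeFactor segments accumulate at one level, they are
// merged into a single segment one level up.
class MergeSim {
    static int countMerges(int flushes, int mergeFactor) {
        Map<Integer, Integer> segmentsAtLevel = new HashMap<>();
        int merges = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            segmentsAtLevel.merge(level, 1, Integer::sum);
            while (segmentsAtLevel.getOrDefault(level, 0) == mergeFactor) {
                segmentsAtLevel.put(level, 0);                     // the m segments disappear...
                segmentsAtLevel.merge(level + 1, 1, Integer::sum); // ...into one bigger segment
                merges++;
                level++;
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        System.out.println(countMerges(16, 2)); // 15 merges
        System.out.println(countMerges(16, 4)); // 5 merges
    }
}
```

So for the same number of flushes, merge factor 2 spends roughly three times as much work merging as merge factor 4 in this toy model; the flip side is that the higher merge factor generally leaves more segments on disk for searches to visit.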

Did your index fit in RAM, by the way?

> 2. Why does non-optimized searching appear to be faster than optimized searching 
> once the index hits ~500,000 documents?


Not sure without seeing the index/machine.
It sounds like you were measuring search performance while at the same time increasing the index size by incrementally adding more docs?

> 3. There appears to be a fairly sizable performance drop across the board around 
> 450,000 documents. Why is that?

Something to do with Lucene merging index segments around that point?  At this point I'm assuming you were measuring search speed while indexing.


> 4. Searching performance appears to decrease towards a fairly pessimistic 20 
> searches per second (for a relatively simple search). Is this really what we 
> should expect long-term from Lucene?


20 reqs/sec sounds very low.  How large is your index, how much RAM, and how about heap size?
What were your queries like? random?  from log?

> 5. Does my benchmark even make sense? I am far from an expert on benchmarking so 
> it is possible I'm not measuring what I think I am measuring.


I'm confused by what exactly you did and measured, but it could just be that I'm tired.

> Thanks in advance for any insight you can provide. This is an area that we very 
> much want to understand better as Lucene is a key part of JIRA's success,

>
> [1]: http://www.atlassian.com
> [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




RE: Performance of never optimizing

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello Justus, Chris and Otis,

IIRC Ocean [1] by Jason Rutherglen addresses the issue of real-time
search on large data sets. A conceptually comparable implementation was
done for Jackrabbit, where you can see an enlightening picture over here
[2]. In short: 

1) IndexReaders are opened only once and *never* reopened
(ReadOnlyIndexReader)
2) Deletions are persisted in the CommittableIndexReader and reflected
in an in-memory BitSet (which is combined with the ReadOnlyIndexReader
to reflect deletions: note that deletions are thus reflected without
re-opening an index reader)
3) New documents are added to an in memory index
4) Searching is done in the CombinedIndexReader combining (1), (2) and
(3)
5) Index merging works similarly to normal segment merging within one
single Lucene index.

This mechanism helps you with instant reflection of changes without
having to reopen index readers.
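A toy sketch of steps (1)-(4), with plain Java collections standing in for the actual readers (no Lucene or Jackrabbit APIs; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// (1) a read-only archive that is never reopened, (2) an in-memory
// BitSet overlay for deletions, (3) an in-memory index for new docs,
// (4) a combined search over all of them.
class CombinedIndex {
    private final List<String> readOnlyDocs;                   // never reopened
    private final BitSet deleted = new BitSet();               // deletion overlay
    private final List<String> memoryDocs = new ArrayList<>(); // newly added docs

    CombinedIndex(List<String> archive) { this.readOnlyDocs = archive; }

    void add(String doc) { memoryDocs.add(doc); }

    void delete(String doc) {
        int i = readOnlyDocs.indexOf(doc);
        if (i >= 0) deleted.set(i);  // visible instantly, no reader reopen
        memoryDocs.remove(doc);
    }

    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < readOnlyDocs.size(); i++)
            if (!deleted.get(i) && readOnlyDocs.get(i).contains(term))
                hits.add(readOnlyDocs.get(i));
        for (String d : memoryDocs)
            if (d.contains(term)) hits.add(d);
        return hits;
    }

    public static void main(String[] args) {
        CombinedIndex idx = new CombinedIndex(List.of("lucene rocks", "jira issue"));
        idx.add("new lucene doc");
        System.out.println(idx.search("lucene").size()); // 2
        idx.delete("lucene rocks");
        System.out.println(idx.search("lucene").size()); // 1: delete seen without reopening
    }
}
```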

Hope this helps.

Regards Ard

[1]
http://wiki.apache.org/lucene-java/OceanRealtimeSearch?highlight=(GData)
[2] http://jackrabbit.apache.org/index-readers.html




Re: Performance of never optimizing

Posted by Chris Lu <ch...@gmail.com>.
Hi, Justus,

I have run into very similar problems to JIRA's: a high rate of 
modification on a large data volume. It's a pretty common use case 
for Lucene.

The way I dealt with the high rate of modification is to create a 
secondary in-memory index, and only persist documents older than a 
certain age. Searching then needs to combine results from the two 
indexes. It's a bit more complicated when creating the index, but it's 
well worth it to avoid the extra IO-heavy merging and to improve 
response time, especially the ability to search just-added documents 
right away.
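A minimal sketch of that scheme in plain Java (lists standing in for the two Lucene indexes; the age threshold and all names are made up):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// New docs land in an in-memory tier; docs older than a cutoff are
// periodically persisted; every search combines both tiers.
class TieredIndex {
    private final List<String> diskIndex = new ArrayList<>();
    private final Map<String, Long> memoryIndex = new LinkedHashMap<>(); // doc -> added-at millis

    void add(String doc, long nowMillis) { memoryIndex.put(doc, nowMillis); }

    // Persist docs older than maxAgeMillis to the disk tier.
    void persistOlderThan(long maxAgeMillis, long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = memoryIndex.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMillis - e.getValue() > maxAgeMillis) {
                diskIndex.add(e.getKey());
                it.remove();
            }
        }
    }

    // A search must combine hits from both tiers.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String d : diskIndex) if (d.contains(term)) hits.add(d);
        for (String d : memoryIndex.keySet()) if (d.contains(term)) hits.add(d);
        return hits;
    }

    public static void main(String[] args) {
        TieredIndex idx = new TieredIndex();
        idx.add("old lucene doc", 0L);
        idx.add("fresh lucene doc", 900L);
        idx.persistOlderThan(500L, 1000L);               // only the old doc moves to disk
        System.out.println(idx.search("lucene").size()); // 2: both tiers are searched
    }
}
```

Just-added documents are searchable immediately because they sit in the memory tier; only aged-out documents pay the IO cost of persistence.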

BTW: JIRA is great!

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!





Re: Performance of never optimizing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Justus Pendleton wrote:

> On 05/11/2008, at 4:36 AM, Michael McCandless wrote:
>> If possible, you should try to use a larger corpus (eg Wikipedia)  
>> rather than multiply Reuters by N, which creates unnatural term  
>> frequency distribution.
>
> I'll replicate the tests with the wikipedia corpus over the next few  
> days and regenerate the graphs to show the data points in addition  
> to the curves. The data I am using comes from the output on the  
> benchmark framework:
>
>     [java] Operation            round mrg runCnt recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
>     [java] UnoptSearch_100_Par  0     2   1      100         230.4  0.43        29,517,680  44,834,816
>
> I am plotting the "rec/s" which I am (possibly mistakenly)  
> interpreting to mean "searches per second" as I asked for 100  
> searches and it took 0.43 seconds to perform them all.

That is the right interpretation, but you get 6 such numbers for each
of your data points (6 rounds), so I was wondering how you then digest
that to 1 number.  EG discard worst & best 2 outliers and average the
rest?  Or, pick the best one.
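The trimmed-mean digest suggested above might look like this (plain Java; purely illustrative, the round numbers are made up):

```java
import java.util.Arrays;

// Digest several benchmark rounds into one number by dropping the
// best and worst round and averaging the rest (a trimmed mean),
// so a single cold-cache or fluke round doesn't skew the data point.
class RoundDigest {
    static double digest(double[] roundResults) {
        double[] r = roundResults.clone();
        Arrays.sort(r);
        double sum = 0;
        for (int i = 1; i < r.length - 1; i++) sum += r[i]; // skip min and max
        return sum / (r.length - 2);
    }

    public static void main(String[] args) {
        // 6 rounds of rec/s: one cold-cache outlier and one fluke
        double[] rounds = {230.4, 228.0, 231.0, 5.0, 229.0, 400.0};
        System.out.println(digest(rounds)); // ~229.6
    }
}
```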

Also I would not trust any test that took only 0.43 seconds to run.
It's too risky that random measurement costs/overhead are skewing the
results.

>> It's best to use a real query log, if possible, to run the  
>> queries.  If you are expecting your production machines to have  
>> plenty of RAM to hold the index, then you should definitely run the  
>> queries through once, discard it, to get all stuff loaded in RAM  
>> including the OS caching all required data in its IO cache.
>>
>> Not opening/closing a reader per search should change the graphs  
>> quite a bit (for the better) and hopefully change some of the odd  
>> things you are seeing (in the questions below).
>
> I don't believe our large users have enough memory for Lucene  
> indexes to fit in RAM. (Especially given we use quite a bit of RAM  
> for other stuff.) I think we also close readers pretty frequently  
> (whenever any user updates a JIRA issue, which I assume happens  
> nearly constantly when you've got thousands of users). I was  
> trying to mimic our usage as closely as I could to see whether  
> Lucene behaves pathologically poorly given our current architecture.  
> There have been some excellent suggestions about using in-memory  
> indexes for recent updates but changes of that kind are,  
> unfortunately, currently outside of my purview :-(

This is a very important test criterion to decide up front, because you
have to carefully design the test to "be too large for RAM" if that's
the goal.  E.g. searching the same few queries over and over is not
right, since the necessary pages are quickly cached in the OS's IO
cache and you get fabulous results after that.

Are you using the default ReutersQueryMaker to provide queries?  When
you switch to Wikipedia you should also switch QueryMaker, maybe to
FileBasedQueryMaker to load a file that you pre-populate with a rich
"typical" set of queries.  But it's best to get real queries people  
do... phrase
queries, single terms, many terms, etc.

Maybe you could talk to Apache infra about using their Jira instance
(and possibly query logs, but that may be overly optimistic)?  It
should be a fairly large test case?

Also you should fix the test so that the searcher is reopened only as
often as is "typical" for Jira, not once per query, which your current
algo is doing.  I guess you could guesstimate how many searches are
done between updates to issues?

Mike



Re: Performance of never optimizing

Posted by Paul Smith <ps...@aconex.com>.
> I don't believe our large users have enough memory for Lucene  
> indexes to fit in RAM. (Especially given we use quite a bit of RAM  
> for other stuff.) I think we also close readers pretty frequently  
> (whenever any user updates a JIRA issue, which I assume happens  
> nearly constantly when you've got thousands of users). I was  
> trying to mimic our usage as closely as I could to see whether  
> Lucene behaves pathologically poorly given our current architecture.  
> There have been some excellent suggestions about using in-memory  
> indexes for recent updates but changes of that kind are,  
> unfortunately, currently outside of my purview :-(
>
> Given that our current usage may be suboptimal :-/ does anyone have  
> any ideas about what may be causing the anomalies I identified  
> earlier?


We have exactly the same problem JIRA has, only even bigger, I think.  
We have large projects with tens of millions of documents and mail  
items.  Our requirement was a 5-second refresh time (that is, an  
update (add, delete, or modify) can take no longer than 5 seconds  
before a subsequent search can see it).  Worse, we have a large number  
of fields customers need to sort by, so tearing down a 15 GB index with  
a dozen sorting fields every 5 seconds and rebuilding the  
FieldSortedHitQueues is clearly not going to work.. :)

We solved this by having a virtual index made up of an 'archive' and a  
'work' index, and then run a multi-reader over the 2.  All updates  
(adds, updates, deletes) are done as a delete to the Archive index,  
and then an add/update to the work index.  Every week during a lull we  
merge the 2 into a new archive index directory and 'switch' to it  
(blocking updates while we optimize and switch).  This means the work  
sub-index can be refreshed every 5 seconds because it is small and we  
'pin' the archive index in memory by doing... well.. a fairly  
egregious hack, to be honest.  We actually have to do updates to the  
Archive to satisfy the delete, but doing that normally would require a  
total refresh for that delete to be made visible.  We accomplish that  
by allowing the delete to go to disk (via a deleted segment) while  
also applying the deletes in memory so they can be seen.  This way  
the most up-to-date data can be seen in the work index.

This gives the best of both worlds: a really warmed-up large archive  
index and a smaller work index (no more than a week's worth of  
updates) that we can refresh every 5 seconds.  The tear down/warm up  
cycle appears to be fine for us for the work index and we can satisfy  
searches very quickly.
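The update path and the weekly consolidate-and-switch described above might be sketched like this (plain Java collections instead of real Lucene indexes; all names are hypothetical, not Aconex's actual code):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Virtual index = big read-mostly archive (with an in-memory delete
// overlay) + small, frequently refreshed work index. A periodic
// consolidate() merges live archive docs and the work index into a
// fresh archive and switches over to it.
class VirtualIndex {
    private List<String> archive;
    private final BitSet archiveDeletes = new BitSet();
    private final List<String> work = new ArrayList<>();

    VirtualIndex(List<String> initialArchive) { archive = new ArrayList<>(initialArchive); }

    void update(String oldDoc, String newDoc) {
        int i = archive.indexOf(oldDoc);
        if (i >= 0) archiveDeletes.set(i); // delete goes to the archive overlay
        work.remove(oldDoc);
        work.add(newDoc);                  // add goes to the work index
    }

    void consolidate() {                   // the weekly merge + 'switch'
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < archive.size(); i++)
            if (!archiveDeletes.get(i)) merged.add(archive.get(i));
        merged.addAll(work);
        archive = merged;
        archiveDeletes.clear();
        work.clear();
    }

    int liveCount() {
        int live = work.size();
        for (int i = 0; i < archive.size(); i++)
            if (!archiveDeletes.get(i)) live++;
        return live;
    }

    public static void main(String[] args) {
        VirtualIndex idx = new VirtualIndex(List.of("doc-A v1", "doc-B v1"));
        idx.update("doc-A v1", "doc-A v2");
        System.out.println(idx.liveCount()); // 2: update visible without reopening the archive
        idx.consolidate();
        System.out.println(idx.liveCount()); // still 2 after the switch
    }
}
```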

It would be really nice if Lucene could allow deletes to be done  
against a live IndexReader without flushing anything else out.

cheers,

Paul





Re: Performance of never optimizing

Posted by Justus Pendleton <jp...@atlassian.com>.
On 05/11/2008, at 4:36 AM, Michael McCandless wrote:
> If possible, you should try to use a larger corpus (eg Wikipedia)  
> rather than multiply Reuters by N, which creates unnatural term  
> frequency distribution.

I'll replicate the tests with the wikipedia corpus over the next few  
days and regenerate the graphs to show the data points in addition to  
the curves. The data I am using comes from the output on the benchmark  
framework:

      [java] Operation            round mrg runCnt recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
      [java] UnoptSearch_100_Par  0     2   1      100         230.4  0.43        29,517,680  44,834,816

I am plotting the "rec/s" which I am (possibly mistakenly)  
interpreting to mean "searches per second" as I asked for 100 searches  
and it took 0.43 seconds to perform them all.
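That interpretation is just recsPerRun divided by elapsed seconds, which a quick sanity check confirms (plain Java; the 0.434 s figure is my back-calculation, since elapsedSec is rounded to 0.43 in the report):

```java
// rec/s as reported by contrib/benchmark: records (here, searches)
// per run divided by elapsed wall-clock seconds for the run.
class RecRate {
    static double recsPerSec(int recsPerRun, double elapsedSec) {
        return recsPerRun / elapsedSec;
    }

    public static void main(String[] args) {
        // 100 searches in ~0.434 s reproduces the reported 230.4 rec/s
        System.out.println(recsPerSec(100, 0.434));
    }
}
```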

> It's best to use a real query log, if possible, to run the queries.   
> If you are expecting your production machines to have plenty of RAM  
> to hold the index, then you should definitely run the queries  
> through once, discard it, to get all stuff loaded in RAM including  
> the OS caching all required data in its IO cache.
>
> Not opening/closing a reader per search should change the graphs  
> quite a bit (for the better) and hopefully change some of the odd  
> things you are seeing (in the questions below).

I don't believe our large users have enough memory for Lucene  
indexes to fit in RAM. (Especially given we use quite a bit of RAM for  
other stuff.) I think we also close readers pretty frequently  
(whenever any user updates a JIRA issue, which I assume happens  
nearly constantly when you've got thousands of users). I was trying to  
mimic our usage as closely as I could to see whether Lucene behaves  
pathologically poorly given our current architecture. There have been  
some excellent suggestions about using in-memory indexes for recent  
updates but changes of that kind are, unfortunately, currently outside  
of my purview :-(

Given that our current usage may be suboptimal :-/ does anyone have  
any ideas about what may be causing the anomalies I identified earlier?

Cheers,
Justus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Michael McCandless <lu...@mikemccandless.com>.
If possible, you should try to use a larger corpus (eg Wikipedia)  
rather than multiply Reuters by N, which creates unnatural term  
frequency distribution.

The graphs are hard to read because of the spline interpolation.   
Maybe you could overlay X's where there is a real datapoint?

After the 6 rounds at each doc count, how do you then derive the  
number to put on the graph?

It's best to use a real query log, if possible, to run the queries.   
If you are expecting your production machines to have plenty of RAM to  
hold the index, then you should definitely run the queries through  
once, discard it, to get all stuff loaded in RAM including the OS  
caching all required data in its IO cache.

Not opening/closing a reader per search should change the graphs quite  
a bit (for the better) and hopefully change some of the odd things you  
are seeing (in the questions below).

Mike

Justus Pendleton wrote:

> Howdy,
>
> I have a couple of questions regarding some Lucene benchmarking and  
> what the results mean[3]. (Skip to the numbered list at the end if  
> you don't want to read the lengthy exegesis :)
>
> I'm a developer for JIRA[1]. We are currently trying to get a better  
> understanding of Lucene, and our use of it, to cope with the needs  
> of our larger customers. These "large" indexes are only a couple  
> hundred thousand documents but our problem is compounded by the fact  
> that they have a relatively high rate of modification (=delete 
> +insert of new document) and our users expect these modifications to  
> show up in query results pretty much instantly.
>
> Our current default behaviour is a merge factor of 4. We perform an  
> optimization on the index every 4000 additions. We also perform an  
> optimize at midnight. Our fundamental problem is that these  
> optimizations are locking the index for unacceptably long periods of  
> time, something that we want to resolve for our next major release,  
> hopefully without undermining search performance too badly.
>
> In the Lucene javadoc there is a comment, and a link to a mailing  
> list discussion[2], that suggests applications such as JIRA should  
> never perform optimize but should instead set their merge factor  
> very low.
>
> In an attempt to understand the impact of a) lowering the merge  
> factor from 4 to 2 and b) never, ever optimizing on an index (over  
> the course of years and millions of additions/updates) I wanted to  
> try to benchmark Lucene.
>
> I used the contrib/benchmark framework and wrote a small algorithm  
> that adds documents to an index (using the Reuters doc generator),  
> does a search, does an optimize, then does another search. All the  
> pretty pictures can be seen at:
>
>  http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
>
> I have several questions, hopefully they aren't overwhelming in  
> their quantity :-/
>
> 1. Why does the merge factor of 4 appear to be faster than the merge  
> factor of 2?
>
> 2. Why does non-optimized searching appear to be faster than  
> optimized searching once the index hits ~500,000 documents?
>
> 3. There appears to be a fairly sizable performance drop across the  
> board around 450,000 documents. Why is that?
>
> 4. Searching performance appears to decrease towards a fairly  
> pessimistic 20 searches per second (for a relatively simple search).  
> Is this really what we should expect long-term from Lucene?
>
> 5. Does my benchmark even make sense? I am far from an expert on  
> benchmarking so it is possible I'm not measuring what I think I am  
> measuring.
>
> Thanks in advance for any insight you can provide. This is an area  
> that we very much want to understand better as Lucene is a key part  
> of JIRA's success,
>
> Cheers,
> Justus
> JIRA Developer
>
> [1]: http://www.atlassian.com
> [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Tomer Gabel wrote:

> Since you're using an 8-core Mac Pro
> I also assume you have some sort of RAID setup, which means your  
> storage
> subsystem can physically handle more than one concurrent request,  
> which can
> only come into play with multiple segments.

This is an important point: a multi-segment index naturally lets
multiple searches against a single IndexReader exploit IO concurrency.
It's actually a reason against optimizing, strangely
enough.

However, as of Lucene 2.4 there's a new directory impl, NIOFSDirectory,
that on Unix should eliminate that bias (on Windows it will be slower,
due to problems with the Sun JRE's impl of the nio APIs on Windows).
Also, you can and should open read-only IndexReaders, since that also
removes further internal locking.  If you do test these, please report  
back on
what difference they made...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Yonik Seeley <yo...@apache.org>.
On Wed, Nov 5, 2008 at 9:47 AM, Tomer Gabel <to...@tomergabel.com> wrote:
> 1. Higher merge factor => more segments.

Right, and it's also important to note that it's only "on average"
more segments.
The number of segments goes up and down with merging, so at particular
points in time an index with a higher merge factor may have fewer
segments (or even a single segment, equivalent to an optimized index).
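This oscillation is easy to see in a toy model of logarithmic merging (a rough sketch, not Lucene's actual merge policy, and the class name is made up: every flush adds a level-0 segment, and whenever mergeFactor segments share a level they merge into one segment a level higher):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of logarithmic merging: flushes create level-0 segments,
// and mergeFactor same-level segments merge into one a level up.
public class SegmentCountModel {
    // Returns { peak segment count seen, final segment count }.
    static int[] simulate(int flushes, int mergeFactor) {
        List<Integer> levels = new ArrayList<Integer>();
        int peak = 0;
        for (int i = 0; i < flushes; i++) {
            levels.add(0);           // flush a new level-0 segment
            boolean merged = true;
            while (merged) {         // cascade merges upward
                merged = false;
                for (int level = 0; level <= 32; level++) {
                    final int l = level;
                    if (levels.stream().filter(x -> x == l).count() >= mergeFactor) {
                        levels.removeIf(x -> x == l);
                        levels.add(l + 1);
                        merged = true;
                    }
                }
            }
            peak = Math.max(peak, levels.size());
        }
        return new int[] { peak, levels.size() };
    }

    public static void main(String[] args) {
        int[] mf2 = simulate(1024, 2);
        int[] mf4 = simulate(1024, 4);
        // Higher merge factor => more segments at the peaks (10 vs 15
        // here), yet right after a merge cascade either index can
        // collapse to a single segment, as if it had been optimized.
        System.out.println("mf=2: peak=" + mf2[0] + " final=" + mf2[1]);
        System.out.println("mf=4: peak=" + mf4[0] + " final=" + mf4[1]);
    }
}
```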

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Tomer Gabel <to...@tomergabel.com>.

Justus Pendleton-2 wrote:
> 
> 1. Why does the merge factor of 4 appear to be faster than the merge  
> factor of 2?
> 
> 2. Why does non-optimized searching appear to be faster than optimized  
> searching once the index hits ~500,000 documents?
> 
> 3. There appears to be a fairly sizable performance drop across the  
> board around 450,000 documents. Why is that?
> 

Hi Justus,

1. Higher merge factor => more segments. Lucene (which version are you
using, by the way?) only keeps a single file handle per physical file per
index reader; if your benchmark is multi-threaded, more concurrently active
segments would mean more file handles. Since you're using an 8-core Mac Pro
I also assume you have some sort of RAID setup, which means your storage
subsystem can physically handle more than one concurrent request, which can
only come into play with multiple segments.

2. Same explanation as above - an optimized index has only one segment, and
contention on the file handle can actually become a bottleneck past a
certain threshold. A merge factor of 2 leaves you with very few segments
even for a non-optimized index, which is why the performance of a
non-optimized, 2-factor index is very close to that of the optimized index.
The optimal merge factor in this case will probably be a function of the
complexity of your RAID setup (NAS devices can easily utilize dozens of
physical drives, giving a measurable benefit to multiple concurrently active
segments), but I expect your setup won't seriously benefit from an increase
in the merge factor because it probably uses 4 or fewer physical drives.

3. This is trickier; my guess is that until that point most of the
term-frequency data (.frq) is small enough to be kept fully in the disk read
cache, and beyond that point considerably more I/O is actually performed by
the storage subsystem. This can probably be measured with tools available
in the OS of your choice, if you wish to corroborate this theory (I'd
certainly be interested in the results).

Best of luck,
Tomer



--
Tomer Gabel
http://www.tomergabel.com


-- 
View this message in context: http://www.nabble.com/Performance-of-never-optimizing-tp20296914p20343051.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2008-11-03 at 04:42 +0100, Justus Pendleton wrote:
> 1. Why does the merge factor of 4 appear to be faster than the merge
> factor of 2?

Because you alternate between updating the index and searching? With 4
segments, chances are that most of the segment-data will be unchanged
between searches, meaning that part of it will be in the disk-cache.

This is tied to question #4.

> 2. Why does non-optimized searching appear to be faster than optimized
> searching once the index hits ~500,000 documents?

Same reason as above?

> 4. Searching performance appears to decrease towards a fairly
> pessimistic 20 searches per second (for a relatively simple search).
> Is this really what we should expect long-term from Lucene?

Quick guess: You do not perform a proper warm up before measuring.

> 5. Does my benchmark even make sense? I am far from an expert on
> benchmarking so it is possible I'm not measuring what I think I am
> measuring.

You need to provide more details. Maybe a bit of pseudo-code (or real
code, if it's not too big) would help.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Mark Miller <ma...@gmail.com>.
Been a while since I've been in the benchmark stuff, so I am going to 
take some time to look at this when I get a chance, but off the cuff I 
think you are opening and closing the reader for each search. Try using the 
openreader task before the 100 searches and then the closereader task. 
That will ensure you are reusing the same reader for each search. Hope 
to analyze further soon.
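Concretely, the change would look something like this (a sketch only - the
OpenReader/CloseReader task names are from memory, so the exact contrib/benchmark
syntax may differ):

```
{ "Rounds"

    ResetSystemErase
    { CreateIndex >
    { AddDoc > : NUM_DOCS
    { CloseIndex >

    OpenReader
    [ "UnoptSearch" Search > : 100
    CloseReader

    { "Optimize" OpenIndex Optimize CloseIndex }

    OpenReader
    [ "OptSearch" Search > : 100
    CloseReader

    NewRound

} : 6
```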

- Mark

Justus Pendleton wrote:
> On 03/11/2008, at 11:07 PM, Mark Miller wrote:
>
>> Am I missing your benchmark algorithm somewhere? We need it. 
>> Something doesn't make sense.
>
> I thought I had included it at [1] before but apparently not, my 
> apologies for that. I have updated that wiki page. I'll also reproduce 
> it here:
>
> { "Rounds"
>
>     ResetSystemErase
>     { CreateIndex >
>     { AddDoc > : NUM_DOCS
>     { CloseIndex >
>
>     [ "UnoptSearch" Search > : 100
>     { "Optimize" OpenIndex Optimize CloseIndex }
>     [ "OptSearch" Search > : 100
>
>     NewRound
>
> } : 6
>
> NUM_DOCS increases by 5,000 for each iteration.
>
> What constitutes a "proper warm up before measuring"?
>
>>> [1]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
>
>
> Cheers,
> Justus
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2008-11-03 at 23:37 +0100, Justus Pendleton wrote:
> What constitutes a "proper warm up before measuring"?

The simplest way is to do a number of searches before you start
measuring. The first searches are always very slow, compared to later
searches.

If you look at http://wiki.statsbiblioteket.dk/summa/Hardware and scroll
down to the headline "Warming up" you will see a graph of response-times
for our setup (a 37GB index and logged queries). For SSDs, we reach 2/3
of peak performance after 1,000 queries. For conventional hard disks, we
need 15,000 queries. Your mileage will vary.
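The shape of that warm-up curve can be mimicked with a toy cache model (purely illustrative - the class name, cache size, and term distribution below are invented, not the Summa numbers above):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Toy illustration of warm-up: queries drawn from a skewed "term
// popularity" distribution hit an initially empty LRU cache standing
// in for the OS disk cache. The hit rate climbs as the cache fills,
// which is why the first searches are much slower than later ones.
public class WarmupModel {
    static double hitRate(int queries, final int cacheSize, long seed) {
        Map<Integer, Boolean> cache =
            new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > cacheSize;  // evict least recently used
                }
            };
        Random rnd = new Random(seed);
        int hits = 0;
        for (int i = 0; i < queries; i++) {
            // Multiplying two uniform draws skews popularity toward 0.
            int term = (int) (rnd.nextDouble() * rnd.nextDouble() * 10000);
            if (cache.containsKey(term)) {
                hits++;
            }
            cache.put(term, Boolean.TRUE);
        }
        return hits / (double) queries;
    }

    public static void main(String[] args) {
        // Cold cache vs. a run long enough to warm it up.
        System.out.printf("first 100 queries:   %.0f%% hits%n",
                          100 * hitRate(100, 2000, 42));
        System.out.printf("after 10000 queries: %.0f%% hits%n",
                          100 * hitRate(10000, 2000, 42));
    }
}
```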



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Justus Pendleton <jp...@atlassian.com>.
On 03/11/2008, at 11:07 PM, Mark Miller wrote:

> Am I missing your benchmark algorithm somewhere? We need it.  
> Something doesn't make sense.

I thought I had included it at [1] before but apparently not; my  
apologies for that. I have updated that wiki page. I'll also reproduce  
it here:

{ "Rounds"

     ResetSystemErase
     { CreateIndex >
     { AddDoc > : NUM_DOCS
     { CloseIndex >

     [ "UnoptSearch" Search > : 100
     { "Optimize" OpenIndex Optimize CloseIndex }
     [ "OptSearch" Search > : 100

     NewRound

} : 6

NUM_DOCS increases by 5,000 for each iteration.

What constitutes a "proper warm up before measuring"?

>> [1]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs


Cheers,
Justus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Performance of never optimizing

Posted by Mark Miller <ma...@gmail.com>.
Am I missing your benchmark algorithm somewhere? We need it. Something 
doesn't make sense.

- Mark


Justus Pendleton wrote:
> Howdy,
>
> I have a couple of questions regarding some Lucene benchmarking and 
> what the results mean[3]. (Skip to the numbered list at the end if you 
> don't want to read the lengthy exegesis :)
>
> I'm a developer for JIRA[1]. We are currently trying to get a better 
> understanding of Lucene, and our use of it, to cope with the needs of 
> our larger customers. These "large" indexes are only a couple hundred 
> thousand documents but our problem is compounded by the fact that they 
> have a relatively high rate of modification (=delete+insert of new 
> document) and our users expect these modifications to show up in query 
> results pretty much instantly.
>
> Our current default behaviour is a merge factor of 4. We perform an 
> optimization on the index every 4000 additions. We also perform an 
> optimize at midnight. Our fundamental problem is that these 
> optimizations are locking the index for unacceptably long periods of 
> time, something that we want to resolve for our next major release, 
> hopefully without undermining search performance too badly.
>
> In the Lucene javadoc there is a comment, and a link to a mailing list 
> discussion[2], that suggests applications such as JIRA should never 
> perform optimize but should instead set their merge factor very low.
>
> In an attempt to understand the impact of a) lowering the merge factor 
> from 4 to 2 and b) never, ever optimizing on an index (over the course 
> of years and millions of additions/updates) I wanted to try to 
> benchmark Lucene.
>
> I used the contrib/benchmark framework and wrote a small algorithm 
> that adds documents to an index (using the Reuters doc generator), 
> does a search, does an optimize, then does another search. All the 
> pretty pictures can be seen at:
>
>   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
>
> I have several questions, hopefully they aren't overwhelming in their 
> quantity :-/
>
> 1. Why does the merge factor of 4 appear to be faster than the merge 
> factor of 2?
>
> 2. Why does non-optimized searching appear to be faster than optimized 
> searching once the index hits ~500,000 documents?
>
> 3. There appears to be a fairly sizable performance drop across the 
> board around 450,000 documents. Why is that?
>
> 4. Searching performance appears to decrease towards a fairly 
> pessimistic 20 searches per second (for a relatively simple search). 
> Is this really what we should expect long-term from Lucene?
>
> 5. Does my benchmark even make sense? I am far from an expert on 
> benchmarking so it is possible I'm not measuring what I think I am 
> measuring.
>
> Thanks in advance for any insight you can provide. This is an area 
> that we very much want to understand better as Lucene is a key part of 
> JIRA's success,
>
> Cheers,
> Justus
> JIRA Developer
>
> [1]: http://www.atlassian.com
> [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org