Posted to java-user@lucene.apache.org by d m <dm...@gmail.com> on 2007/04/18 16:57:23 UTC

Merge performance

I'd like to share index merge performance data and have a couple
of questions about it...

We (AXS-One, www.axsone.com) build one "master" index per day.
For backup and recovery purposes, we also build many individual
"mini" indexes from the docs added to the master index.

Should one of our master indexes become unusable (for whatever
reason - and I'm glad to say this has not yet happened), we plan to
reconstruct it by merging its mini indexes.

I've done some merge testing so we have an idea of how long it will
take to reconstruct a master index.

For testing purposes, I have created 1,000 mini indexes. Each:
  - contains 1,000 documents
  - is optimized
  - uses the compound file format

The avg doc size across the 1 million docs is: 10.8 KB

My testing has been to merge the 1,000 mini indexes to an empty
destination index. Destination index settings:

  - mergeFactor: 40
  - minMergeDocs: 10,000
  - maxMergeDocs: Integer.MAX_VALUE
  - use compound file format

Those values were obtained from some empirical (but not exhaustive)
merge testing.

In each test run I merge N mini indexes into a single destination
index. Each merge starts with an empty destination index. N increases
by 25 for each data point.

That is, the test first merges 25 minis into an empty index, then 50
minis into an empty index, and so on, up to 1,000 minis into an
empty index.

Mini indexes merged with V2.1 were indexed with V2.1.
Mini indexes merged with V2.0 were indexed with V2.0.

Hardware:
  - 64-bit
  - 4 CPUs: AMD Opteron 280, 2.41 GHz
  - 12.8 GB RAM
  - 1.63 TB disk space / 6 SCSI drives / RAID ??

Software:
  - Lucene V2.1 and V2.0
  - Java 1.6
  - Windows Server 2003, SP 1

I did tests with:
  - Lucene 2.1 using addIndexesNoOptimize() - 2 identical runs
  - Lucene 2.1 using addIndexes() - 1 run
  - Lucene 2.0 using addIndexes() - 1 run

The recorded merge times (in seconds) include a final call to
optimize() on the destination index after returning from
addIndexesNoOptimize() or addIndexes().

I've included the test data below. (If you'd like, I can email an
Excel version of the data with a graph.)

A few things caught my attention (seen easily when graphing "Indexes"
vs "Merge Time (secs)"):

1. The runs (3 & 4) using addIndexes() show a relatively smooth
   increase in merge times (as expected).

2. The 2.1 runs (1 & 2) using addIndexesNoOptimize() show multiple
   spikes in times for a particular merge count - with the next merge
   counts running faster. The pattern of spikes was identical in both
   runs.

   The most notable spike occurs in the addIndexesNoOptimize() merge
   of 900 indexes, which took 44:26 (mm:ss) in one run and 43:37 in
   the other. In both runs, the merges of 925, 950, 975, and 1,000
   indexes took less time than the 900-index merge.

3. Overall, using addIndexes() appears to be faster than
   addIndexesNoOptimize().

4. V2.0 addIndexes() performs better than V2.1 addIndexes(). Look at
   the very last row of data below - it is the merge rate (in
   docs/min) for each test run.

Can someone explain what might be causing the spikes in 2.1 that are
not seen in 2.0?

Any thoughts on 2.0 merging faster than 2.1?

Thanks, david.


Run 1: Lucene 2.1 / addIndexesNoOptimize()
Run 2: Repeat Run 1
Run 3: Lucene 2.1 / addIndexes()
Run 4: Lucene 2.0 / addIndexes()
All runs include a final call to optimize()

      Merge Times (seconds)
Numb  Run   Run   Run   Run
Idxs  1     2     3     4
25    39    59    44    38
50    73    93    83    81
75    113   131   128   120
100   147   169   163   154
125   179   198   193   220
150   222   241   227   231
175   246   261   249   239
200   266   273   269   266
225   297   301   288   283
250   323   325   312   308
275   461   471   343   337
300   393   388   376   364
325   424   423   410   401
350   465   467   445   438
375   498   504   475   466
400   527   528   516   503
425   586   567   608   587
450   677   656   703   675
475   876   880   786   750
500   841   832   872   821
525   920   914   937   924
550   1213  1206  1038  995
575   1094  1065  1137  1069
600   1250  1207  1275  1189
625   1367  1337  1385  1315
650   1473  1433  1454  1396
675   1600  1575  1499  1468
700   1570  1552  1563  1516
725   1605  1587  1602  1581
750   1852  1808  1687  1627
775   1761  1719  1732  1668
800   1829  1821  1876  1753
825   2167  2138  1882  1832
850   2045  2042  2057  1887
875   2169  2207  2101  2025
900   2666  2617  2138  2014
925   2174  2218  2206  2057
950   2390  2391  2193  2121
975   2304  2322  2247  2149
1000  2321  2322  2321  2227

20500 43423 43248 41820 40095   <- Totals
      28326 28441 29412 30667   <- Merge rate: Docs per minute
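As a sanity check on the merge-rate row: each run merged 20,500 mini
indexes of 1,000 docs each in total, so the docs/min figure follows
from the totals. A quick stdlib sketch (class and method names are
just for illustration):

```java
public class MergeRateCheck {
    /** Docs per minute, rounded, given total docs and total elapsed seconds. */
    static long docsPerMinute(long totalDocs, long totalSecs) {
        return Math.round(totalDocs * 60.0 / totalSecs);
    }

    public static void main(String[] args) {
        long totalDocs = 20_500L * 1_000L;  // 20,500 mini indexes x 1,000 docs each
        // Run 1: 20.5M docs in 43,423 seconds
        System.out.println(docsPerMinute(totalDocs, 43423));  // prints 28326
    }
}
```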

Re: Merge performance

Posted by "Michael D. Curtin" <mi...@curtin.com>.
david m wrote:

> A couple of reasons that led to the merge approach:
> 
> - Source documents are written to archive media and retrieval is
>  relatively slow. Add to that our processing pipeline (including
>  text extraction)... Retrieving and merging minis is faster than
>  re-processing and re-indexing from sources.
> 
> - In addition to index recovery, mini indexes may be combined into
>  custom indexes based on policy.
> 
>  From a compliance viewpoint the mini indexes contain logically
>  related documents. For example: based on a retention policy,
>  documents of type x are to be kept for y years.
> 
>  One example for constructing a custom index would be for legal
>  discovery.

I see -- it sounds like the "minis" are there for several 
application-specific reasons besides backup and recovery.  Your scheme 
sounds like it might be a clever leveraging of everything you did to 
meet all those other requirements.

For the Lucene projects I've been on, the aggregate size of the source 
data was about the same as the resulting indexes.  In your case I'd 
guess that the aggregate size of the minis is somewhat larger than the 
final index, due to duplication of terms.  Anyhow, in my projects, 
recovery is much faster from a backup of the (final) index than from a 
backup of upstream data followed by reprocessing.  It sounds like you've 
already measured the relevant parameters, though, so maybe my projects' 
data sets have very different characteristics.

Good luck on your project!

--MDC

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Merge performance

Posted by david m <dm...@gmail.com>.
Michael,

Our application includes indexing and archiving documents to meet
compliance requirements.

A couple of reasons that led to the merge approach:

- Source documents are written to archive media and retrieval is
  relatively slow. Add to that our processing pipeline (including
  text extraction)... Retrieving and merging minis is faster than
  re-processing and re-indexing from sources.

- In addition to index recovery, mini indexes may be combined into
  custom indexes based on policy.

  From a compliance viewpoint the mini indexes contain logically
  related documents. For example: based on a retention policy,
  documents of type x are to be kept for y years.

  One example for constructing a custom index would be for legal
  discovery.

Thanks, david.



Re: Merge performance

Posted by "Michael D. Curtin" <mi...@curtin.com>.
d m wrote:

> I'd like to share index merge performance data and have a couple
> of questions about it...
> 
> We (AXS-One, www.axsone.com) build one "master" index per day.
> For backup and recovery purposes, we also build many individual
> "mini" indexes from the docs added to the master index.
> 
> Should one of our master indexes become unusable (for whatever
> reason - and I'm glad to say this has not yet happened), we plan to
> reconstruct it by merging its mini indexes.

The possible merge bug notwithstanding, let's take a step back in 
abstraction:  are you sure the relatively-complex iterative merge 
process you've described buys you anything over a simple 
backup-the-whole-index approach?  Or a 
backup-the-source-data-and-reindex approach?

Merging is I/O intensive, and the scheme you've outlined is re-reading 
and re-writing all the index data several times anyway -- it might not 
be saving you much over a full reindex.  Since the scenario you're 
trying to protect against is a very rare occurrence (so far at least), 
would it be better to spend your development time on improving the 
application than on devising (and debugging, and testing, ...) a 
complicated backup and recovery scheme?

--MDC



Re: Merge performance

Posted by david m <dm...@gmail.com>.
Erick & Steven,

I looked at 845, but I'm a bit confused:

Are you suggesting that 845 is the cause of the spikes seen in test
Runs 1 & 2 - and that, under the covers, 2.1's addIndexesNoOptimize()
relies on calls to ramSizeInBytes() to trigger new segment creation
before hitting the 10,000 value I've set as maxBufferedDocs?

wrt the 845 reply (by Michael McCandless):

  This will be an O(N^2) slowdown. EG if based on RAM you are flushing
  every 100 docs, then at 1000 docs you will merge to 1 segment. Then
  at 1900 docs, you merge to 1 segment again. At 2800, 3700, 4600, ...
  (every 900 docs) you keep merging to 1 segment. Your indexing
  process will get very slow because every 900 documents the entire
  index is effectively being optimized.

I thought that with a mergeFactor of 10 and maxBufferedDocs of 100 the
behavior was:

  Lucene creates a new segment for each 100 documents. When there are
  10 segments (each with 100 docs) on disk all 10 are merged into a
  single segment. That single segment contains 1,000 documents. This
  merging repeats until there are 10 segments each with 1,000
  documents. At that time, the 10 segments (of 1,000 document each)
  are merged into a single segment. That segment contains 10,000
  documents. And so on...

No?
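For reference, that expected pattern can be sketched as a tiny stdlib
simulation (my own idealized model of the logarithmic merge behavior
described above, not actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

public class LogMergeSim {
    // Idealized logarithmic merge: flush a segment every maxBufferedDocs docs;
    // whenever mergeFactor same-sized segments accumulate, merge them into one,
    // cascading upward (100s -> 1,000 -> 10,000 -> ...).
    static List<Integer> segmentsAfter(int totalDocs, int maxBufferedDocs, int mergeFactor) {
        List<Integer> segs = new ArrayList<>();
        for (int d = maxBufferedDocs; d <= totalDocs; d += maxBufferedDocs) {
            segs.add(maxBufferedDocs);
            while (true) {
                int n = segs.size();
                if (n < mergeFactor) break;
                int size = segs.get(n - 1);
                boolean sameLevel = true;
                for (int i = n - mergeFactor; i < n; i++)
                    if (segs.get(i) != size) { sameLevel = false; break; }
                if (!sameLevel) break;
                for (int i = 0; i < mergeFactor; i++) segs.remove(segs.size() - 1);
                segs.add(size * mergeFactor);   // merge the trailing run into one segment
            }
        }
        return segs;
    }

    public static void main(String[] args) {
        // After exactly 10,000 docs the ten 1,000-doc segments collapse into one.
        System.out.println(segmentsAfter(10_000, 100, 10));  // prints [10000]
        System.out.println(segmentsAfter(12_300, 100, 10));  // prints [10000, 1000, 1000, 100, 100, 100]
    }
}
```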

Thanks, david.




RE: Merge performance

Posted by Steven Parkes <st...@esseff.org>.
Yup, 845 is relevant, as is 847. I haven't had time to digest all that
David wrote yet, but I'm starting. It's particularly relevant because
before I get to the point of making 847 committable, I need a way of
testing merge performance (the factoring in 847 proposes to simplify the
API slightly, so the merge algorithm would be slightly modified). The
stuff David has written gives me some ideas for benchmark tests so that
we'll be able to test multiple merge policies.



Re: Merge performance

Posted by Erick Erickson <er...@gmail.com>.
This *may* be relevant, I haven't needed to investigate
it yet...

 http://issues.apache.org/jira/browse/LUCENE-845

Also, see the thread titled
"MergeFactor and MaxBufferedDocs value should ...?" for an
interesting discussion of how to optimize indexing, although
I'm not sure the notion of using IndexWriter.ramSizeInBytes
is of too much use when merging indexes....

Erick
