Posted to java-user@lucene.apache.org by d m <dm...@gmail.com> on 2007/04/18 16:57:23 UTC
Merge performance
I'd like to share index merge performance data and have a couple
of questions about it...
We (AXS-One, www.axsone.com) build one "master" index per day.
For backup and recovery purposes, we also build many individual
"mini" indexes from the docs added to the master index.
Should one of our master indexes become unusable (for whatever
reason - and I'm glad to say this has not yet happened), we plan to
reconstruct it by merging its mini indexes.
I've done some merge testing so we have an idea of how long it will
take to reconstruct a master index.
For testing purposes, I have created 1,000 mini indexes. Each:
- contains 1,000 documents
- is optimized
- uses the compound file format
The avg doc size across the 1 million docs is: 10.8 KB
My testing has been to merge the 1,000 mini indexes to an empty
destination index. Destination index settings:
- mergeFactor: 40
- minMergeDocs: 10,000
- maxMergeDocs: Integer.MAX_VALUE
- use compound file format
Those values were obtained from some empirical (but not exhaustive)
merge testing.
In each test run I merge N mini indexes into a single destination
index. Each merge starts with an empty destination index. N increases
by 25 for each data point.
This means our test merges 25 minis to an empty index. Then merges 50
minis to an empty index. Etc... until it merges 1,000 minis to an
empty index.
Mini indexes merged with V2.1 were indexed with V2.1.
Mini indexes merged with V2.0 were indexed with V2.0.
Hardware:
- 64-bit
- 4 CPUs: AMD Opteron 280, 2.41 GHz
- 12.8 GB RAM
- 1.63 TB disk space / 6 SCSI drives / RAID ??
Software:
- Lucene V2.1 and V2.0
- Java 1.6
- Windows Server 2003, SP 1
I did tests with:
- Lucene 2.1 using addIndexesNoOptimize() - 2 identical runs
- Lucene 2.1 using addIndexes() - 1 run
- Lucene 2.0 using addIndexes() - 1 run
The recorded merge times (in seconds) include a final call to
optimize() the destination index after returning from
addIndexesNoOptimize() or addIndexes().
I've included the test data below. (If you'd like, I can email an
Excel version of the data with a graph.)
A few things caught my attention (seen easily when graphing "Indexes"
vs "Merge Time (secs)"):
1. The runs (3 & 4) using addIndexes() show a relatively smooth
increase in merge times (as expected).
2. The 2.1 runs (1 & 2) using addIndexesNoOptimize() show multiple
spikes in times for a particular merge count - with the next merge
counts running faster. The pattern of spikes was identical in both
runs.
The most notable spike occurs in the addIndexesNoOptimize() merge
of 900 indexes, which took 44:26 (mm:ss) in one run and 43:37 in the
other. In both runs the merges of 925, 950, 975, and 1000
indexes took less time than the 900 merge.
3. Overall, using addIndexes() appears to be faster than
addIndexesNoOptimize().
4. V2.0 addIndexes() performs better than V2.1 addIndexes(). Look at
the very last row of data below - it is the merge rate (in
docs/min) for each test run.
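One way to make the spike pattern in observation 2 concrete is to flag every data point whose merge time exceeds that of the next, larger merge. The following is a minimal Python sketch, not part of the original test harness; the helper name `drops` is hypothetical, and the times are copied from the table in this post.

```python
# Merge times in seconds, copied from the table in this post.
# Index counts run from 25 to 1000 in steps of 25 (40 data points).
COUNTS = list(range(25, 1001, 25))

RUNS = {
    # Run 1: Lucene 2.1 / addIndexesNoOptimize()
    1: [39, 73, 113, 147, 179, 222, 246, 266, 297, 323,
        461, 393, 424, 465, 498, 527, 586, 677, 876, 841,
        920, 1213, 1094, 1250, 1367, 1473, 1600, 1570, 1605, 1852,
        1761, 1829, 2167, 2045, 2169, 2666, 2174, 2390, 2304, 2321],
    # Run 2: repeat of Run 1
    2: [59, 93, 131, 169, 198, 241, 261, 273, 301, 325,
        471, 388, 423, 467, 504, 528, 567, 656, 880, 832,
        914, 1206, 1065, 1207, 1337, 1433, 1575, 1552, 1587, 1808,
        1719, 1821, 2138, 2042, 2207, 2617, 2218, 2391, 2322, 2322],
    # Run 3: Lucene 2.1 / addIndexes()
    3: [44, 83, 128, 163, 193, 227, 249, 269, 288, 312,
        343, 376, 410, 445, 475, 516, 608, 703, 786, 872,
        937, 1038, 1137, 1275, 1385, 1454, 1499, 1563, 1602, 1687,
        1732, 1876, 1882, 2057, 2101, 2138, 2206, 2193, 2247, 2321],
    # Run 4: Lucene 2.0 / addIndexes()
    4: [38, 81, 120, 154, 220, 231, 239, 266, 283, 308,
        337, 364, 401, 438, 466, 503, 587, 675, 750, 821,
        924, 995, 1069, 1189, 1315, 1396, 1468, 1516, 1581, 1627,
        1668, 1753, 1832, 1887, 2025, 2014, 2057, 2121, 2149, 2227],
}


def drops(times):
    """Return the index counts whose merge time exceeds the next data point's."""
    return [COUNTS[i] for i in range(len(times) - 1) if times[i] > times[i + 1]]


for run, times in RUNS.items():
    print(f"Run {run}: slower than the next merge at {drops(times)}")
```

On the posted data this flags the same eight index counts in Runs 1 and 2 (900 among them), and only a single small dip in each of Runs 3 and 4.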
Can someone explain what might be happening to cause the spikes in 2.1,
not seen in 2.0?
Any thoughts on 2.0 merging faster than 2.1?
Thanks, david.
Run 1: Lucene 2.1 / addIndexesNoOptimize()
Run 2: Repeat Run 1
Run 3: Lucene 2.1 / addIndexes()
Run 4: Lucene 2.0 / addIndexes()
All runs include a final call to optimize()
Merge Times (seconds)

Indexes   Run 1   Run 2   Run 3   Run 4
     25      39      59      44      38
     50      73      93      83      81
     75     113     131     128     120
    100     147     169     163     154
    125     179     198     193     220
    150     222     241     227     231
    175     246     261     249     239
    200     266     273     269     266
    225     297     301     288     283
    250     323     325     312     308
    275     461     471     343     337
    300     393     388     376     364
    325     424     423     410     401
    350     465     467     445     438
    375     498     504     475     466
    400     527     528     516     503
    425     586     567     608     587
    450     677     656     703     675
    475     876     880     786     750
    500     841     832     872     821
    525     920     914     937     924
    550    1213    1206    1038     995
    575    1094    1065    1137    1069
    600    1250    1207    1275    1189
    625    1367    1337    1385    1315
    650    1473    1433    1454    1396
    675    1600    1575    1499    1468
    700    1570    1552    1563    1516
    725    1605    1587    1602    1581
    750    1852    1808    1687    1627
    775    1761    1719    1732    1668
    800    1829    1821    1876    1753
    825    2167    2138    1882    1832
    850    2045    2042    2057    1887
    875    2169    2207    2101    2025
    900    2666    2617    2138    2014
    925    2174    2218    2206    2057
    950    2390    2391    2193    2121
    975    2304    2322    2247    2149
   1000    2321    2322    2321    2227

  20500   43423   43248   41820   40095  <- Totals
          28326   28441   29412   30667  <- Merge rate: Docs per minute
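For readers who want to check the last row, here is a short Python sketch of how the merge rate can be derived from the totals. It assumes the rate is simply total documents merged across all 40 data points divided by total minutes; the post does not spell out the formula, and the names below are hypothetical.

```python
# Totals copied from the "Totals" row of the table (seconds per run).
TOTAL_SECONDS = {1: 43423, 2: 43248, 3: 41820, 4: 40095}
DOCS_PER_MINI = 1000  # each mini index holds 1,000 documents

# Mini indexes merged across all data points: 25 + 50 + ... + 1000.
total_minis = sum(range(25, 1001, 25))
total_docs = total_minis * DOCS_PER_MINI


def docs_per_minute(total_seconds):
    """Documents merged per minute over a whole run (assumed formula)."""
    return total_docs / (total_seconds / 60.0)


for run, secs in TOTAL_SECONDS.items():
    print(f"Run {run}: {docs_per_minute(secs):.0f} docs/min")
```

Under that assumption the sketch reproduces the posted rates for Runs 1 to 3. For Run 4 it gives roughly 30,677 rather than the posted 30,667, which may simply be a transcription slip in the post.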
Re: Merge performance
Posted by "Michael D. Curtin" <mi...@curtin.com>.
david m wrote:
> A couple of reasons that lead to the merge approach:
>
> - Source documents are written to archive media and retrieval is
> relatively slow. Add to that our processing pipeline (including
> text extraction)... Retrieving and merging minis is faster than
> re-processing and re-indexing from sources.
>
> - In addition to index recovery, mini indexes may be combined into
> custom indexes based on policy.
>
> From a compliance viewpoint the mini indexes contain logically
> related documents. For example: based on a retention policy,
> documents of type x are to be kept for y years.
>
> One example for constructing a custom index would be for legal
> discovery.
I see -- it sounds like the "minis" are there for several,
application-specific reasons besides backup and recovery. Your scheme
sounds like it might be a clever leveraging of everything you did to
meet all those other requirements.
For the Lucene projects I've been on, the aggregate size of the source
data was about the same as the resulting indexes. In your case I'd
guess that the aggregate size of the minis is somewhat larger than the
final index, due to duplication of terms. Anyhow, in my projects,
recovery is much faster from a backup of the (final) index than from a
backup of upstream data followed by reprocessing. It sounds like you've
already measured the relevant parameters, though, so maybe my projects'
data sets have very different characteristics.
Good luck on your project!
--MDC
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Merge performance
Posted by david m <dm...@gmail.com>.
Michael,
Our application includes indexing and archiving documents to meet
compliance requirements.
A couple of reasons that lead to the merge approach:
- Source documents are written to archive media and retrieval is
relatively slow. Add to that our processing pipeline (including
text extraction)... Retrieving and merging minis is faster than
re-processing and re-indexing from sources.
- In addition to index recovery, mini indexes may be combined into
custom indexes based on policy.
From a compliance viewpoint the mini indexes contain logically
related documents. For example: based on a retention policy,
documents of type x are to be kept for y years.
One example for constructing a custom index would be for legal
discovery.
Thanks, david.
Re: Merge performance
Posted by "Michael D. Curtin" <mi...@curtin.com>.
d m wrote:
> I'd like to share index merge performance data and have a couple
> of questions about it...
>
> We (AXS-One, www.axsone.com) build one "master" index per day.
> For backup and recovery purposes, we also build many individual
> "mini" indexes from the docs added to the master index.
>
> Should one of our master indexes become unusable (for whatever
> reason - and I'm glad to say this has not yet happened), we plan to
> reconstruct it by merging its mini indexes.
The possible merge bug notwithstanding, let's take a step back in
abstraction: are you sure the relatively-complex iterative merge
process you've described buys you anything over a simple
backup-the-whole-index approach? Or a
backup-the-source-data-and-reindex approach?
Merging is I/O intensive, and the scheme you've outlined is re-reading
and re-writing all the index data several times anyway -- it might not
be saving you much over a full reindex. Since the scenario you're
trying to protect against is a very rare occurrence (so far at least),
would it be better to spend your development time on improving the
application than devising (and debugging, and testing, ...) a
complicated backup and recovery scheme?
--MDC
Re: Merge performance
Posted by david m <dm...@gmail.com>.
Erick & Steven,
I looked at 845, but I'm a bit confused:
Are you suggesting that 845 is the cause for the spikes seen in test
Runs 1 & 2 - and that in 2.1 addIndexesNoOptimize() is, under the
covers, relying on calls to ramSizeInBytes() to trigger new segment
creation before hitting the 10,000 value I've set as
maxBufferedDocs?
wrt the 845 reply (by Michael McCandless):
> This will be an O(N^2) slowdown. EG if based on RAM you are flushing
> every 100 docs, then at 1000 docs you will merge to 1 segment. Then
> at 1900 docs, you merge to 1 segment again. At 2800, 3700, 4600, ...
> (every 900 docs) you keep merging to 1 segment. Your indexing
> process will get very slow because every 900 documents the entire
> index is effectively being optimized.
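To see why the quoted scenario grows quadratically, here is a rough Python sketch that only counts documents rewritten if the whole index is merged to one segment at 1,000 docs and then every 900 docs after that, as the quote describes. It is a back-of-the-envelope cost model, not Lucene's actual merge code, and the function names are hypothetical.

```python
def full_merge_points(total_docs, first=1000, step=900):
    """Doc counts at which the whole index is merged down to one segment."""
    return list(range(first, total_docs + 1, step))


def docs_rewritten(total_docs):
    """Total docs copied, assuming each full merge rewrites every doc indexed so far."""
    return sum(full_merge_points(total_docs))


for n in (10_000, 20_000, 40_000):
    print(f"{n} docs indexed -> {docs_rewritten(n)} docs rewritten by merges")
```

In this model, doubling the number of indexed documents roughly quadruples the merge work, which is the O(N^2) behaviour the quote warns about.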
I thought that with a mergeFactor of 10 and maxBufferedDocs of 100 the
behavior was:
Lucene creates a new segment for each 100 documents. When there are
10 segments (each with 100 docs) on disk all 10 are merged into a
single segment. That single segment contains 1,000 documents. This
merging repeats until there are 10 segments each with 1,000
documents. At that time, the 10 segments (of 1,000 documents each)
are merged into a single segment. That segment contains 10,000
documents. And so on...
No?
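The behaviour described above can be sketched as a small simulation. This is a simplified Python model of the scheme as David describes it (flush a segment every maxBufferedDocs documents, merge whenever mergeFactor equal-sized segments accumulate); it is not Lucene's implementation, and the function name `simulate` is hypothetical.

```python
def simulate(total_docs, merge_factor=10, max_buffered_docs=100):
    """Return (segment sizes, docs rewritten by merges) after indexing total_docs."""
    segments = []   # segment sizes in docs, oldest first
    rewritten = 0   # docs copied by merges so far
    for _ in range(total_docs // max_buffered_docs):
        segments.append(max_buffered_docs)  # flush a new segment to disk
        # Cascade: while the newest merge_factor segments all have the same
        # size, merge them into one segment merge_factor times larger.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
            rewritten += merged
    return segments, rewritten


for n in (1_000, 2_500, 10_000):
    sizes, cost = simulate(n)
    print(f"{n} docs -> segments {sizes}, {cost} docs rewritten")
```

In this model each document is rewritten once per merge level, so the total merge work grows roughly as N log N rather than N^2.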
Thanks, david.
RE: Merge performance
Posted by Steven Parkes <st...@esseff.org>.
Yup, 845 is relevant, as is 847. I haven't had time to digest all that
David wrote yet, but I'm starting. It's particularly relevant because
before I get to the point of making 847 committable, I need a way of
testing merge performance (the factoring in 847 proposes to simplify the
API slightly, so the merge algorithm would be slightly modified). The
stuff David has written gives me some ideas for benchmark tests so that
we'll be able to test multiple merge policies.
-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Wednesday, April 18, 2007 9:17 AM
To: java-user@lucene.apache.org
Subject: Re: Merge performance
This *may* be relevant, I haven't needed to investigate
it yet...
http://issues.apache.org/jira/browse/LUCENE-845
Also, see the thread titled
"MergeFactor and MaxBufferedDocs value should ...?" for an
interesting discussion of how to optimize indexing, although
I'm not sure the notion of using IndexWriter.ramSizeInBytes
is of too much use when merging indexes....
Erick
Re: Merge performance
Posted by Erick Erickson <er...@gmail.com>.
This *may* be relevant, I haven't needed to investigate
it yet...
http://issues.apache.org/jira/browse/LUCENE-845
Also, see the thread titled
"MergeFactor and MaxBufferedDocs value should ...?" for an
interesting discussion of how to optimize indexing, although
I'm not sure the notion of using IndexWriter.ramSizeInBytes
is of too much use when merging indexes....
Erick