Posted to java-user@lucene.apache.org by Sean Tong <st...@jamasoftware.com> on 2011/12/12 08:54:21 UTC
Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Hi,
We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have been running the benchmark tests that come with Lucene. To my surprise, I found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the Wikipedia data.
Attached is the algorithm for the tests. The tests used the default Lucene settings for flush memory size and merge factor, and 512 MB of memory was used for the tasks. The test machine is a 64-bit Windows 7 machine with an Intel Core i7.
The command:
%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
Here are the test results:
Lucene 2.4.1
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000   1,609.1      124.29   89,218,496  241,631,232
[java] MAddDocs_200000      1  16.00   10       1      200000   1,746.4      114.52  102,365,864  241,762,304
[java] MAddDocs_200000      2  16.00   10       1      200000   1,566.8      127.65   69,428,144  174,194,688
Lucene 2.9.4
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
[java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
[java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600
Lucene 3.5.0
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    676.48      295.65   70,917,592  129,695,744
[java] MAddDocs_200000      1  16.00   10       1      200000    626.13      319.42   50,329,552   94,240,768
[java] MAddDocs_200000      2  16.00   10       1      200000    687.68      290.83   57,732,640   92,864,512
Indexing with 2.4.1 is roughly 2.3x to 2.8x as fast as with 3.5.0, depending on the round. Did I miss any settings or configurations?
Thanks,
Sean
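The speed ratio can be checked directly from the rec/s column of the reports above. The following is only a minimal sketch in Java: the class and method names are made up for illustration, and the numbers are simply copied from the three MAddDocs tables. Averaged over the three rounds, 2.4.1 comes out at roughly 2.5x the throughput of 3.5.0.

```java
// Sketch: compare mean indexing throughput between benchmark runs.
// The rec/s values are copied from the MAddDocs reports in this thread;
// the class and method names are illustrative only.
class ThroughputRatio {

    // Arithmetic mean of the per-round rec/s values.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // How many times faster the baseline is than the candidate, on average.
    static double ratio(double[] baseline, double[] candidate) {
        return mean(baseline) / mean(candidate);
    }

    public static void main(String[] args) {
        double[] lucene241 = {1609.1, 1746.4, 1566.8};
        double[] lucene294 = {1046.49, 1165.35, 1245.86};
        double[] lucene350 = {676.48, 626.13, 687.68};

        System.out.printf("2.4.1 vs 3.5.0: %.2fx%n", ratio(lucene241, lucene350));
        System.out.printf("2.9.4 vs 3.5.0: %.2fx%n", ratio(lucene294, lucene350));
        System.out.printf("2.4.1 vs 2.9.4: %.2fx%n", ratio(lucene241, lucene294));
    }
}
```

Comparing means over all rounds is a design choice; comparing round by round gives a range instead of a single number.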
RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Sean Tong <st...@jamasoftware.com>.
Simon,
I checked the indexes with Luke, and you were right that the benchmarks may not be comparable: the indexes have different numbers of fields and different index functionality. You can find summaries of the index statistics for 2.4.1, 2.9.4, and 3.5.0 below.
I also ran the benchmarks on the standard Reuters data (20,000 documents) with the default settings (merge factor 10, flush memory 16m). There the 2.4.1 and 3.5.0 results were similar, even though those indexes also had different numbers of fields.
In your experience, do you think the 3.5.0 indexing performance is at least as good as 2.4.1 or 2.9.4? Do you have any recommendations on indexing configurations/settings? In my experiments, I found that large flush memory settings (e.g. 64m or 128m) help indexing performance for the Wikipedia data in 3.5.0, but not so much in 2.4.1.
Thanks,
Sean
*****
Here are the data for the Wikipedia indexes:
3.5.0
Number of fields: 7
Number of documents: 200,000
Number of terms: 4,849,195
Has deletions?/Optimized? No/No
Index format: -11 (Lucene 3.1)
Index functionality: lock-less, single norms, shared doc store, checksum, del count, omitTf, user data, diagnostics, hasVectors
TermInfos index divisor: N/A
Directory implementation: org.apache.lucene.store.MMapDirectory
Fields
Name Term Count %
body 3,391,277 69.93%
docdate 1,160 0.02%
docdatenum 872,060 17.98%
docid 200,000 4.12%
docname 200,000 4.12%
doctimesecn 82,231 1.7%
doctitle 102,467 2.11%
2.9.4
Number of fields: 5
Number of documents: 200,000
Number of terms: 3,760,747
Has deletions?/Optimized? No/No
Index format: -9 (Lucene 2.9)
Index functionality: lock-less, single norms, shared doc store, checksum, del count, omitTf, user data, diagnostics
TermInfos index divisor: N/A
Directory implementation: org.apache.lucene.store.MMapDirectory
Fields:
body 3,391,277 90.18%
docdate 1,160 0.03%
docid 200,000 5.32%
docname 65,843 1.75%
doctitle 102,467 2.72%
2.4.1
Number of fields: 4
Number of documents: 200,000
Number of terms: 3,694,904
Has deletions?/Optimized? No/No
Index format: -7 (Lucene 2.4)
Index functionality: lock-less, single norms, shared doc store, checksum, del count, omitTf
Directory implementation: org.apache.lucene.store.MMapDirectory
Fields
body 3,391,277 91.78%
docdate 1,160 0.03%
docid 200,000 5.41%
doctitle 102,467 2.77%
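The Luke numbers above can be cross-checked with a little arithmetic: the per-field term counts should add up to the reported total, and the fields that exist only in the 3.5.0 index should account for the difference in term counts between 3.5.0 and 2.4.1. A minimal sketch, assuming the counts are transcribed correctly from the listings (the class and helper names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: sanity-check the per-field term counts reported by Luke.
// Field names and counts are copied from the listings in this thread.
class LukeTermStats {

    // Per-field term counts for the 3.5.0 index (7 fields).
    static Map<String, Long> lucene350() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("body", 3391277L);
        m.put("docdate", 1160L);
        m.put("docdatenum", 872060L);
        m.put("docid", 200000L);
        m.put("docname", 200000L);
        m.put("doctimesecn", 82231L);
        m.put("doctitle", 102467L);
        return m;
    }

    // Per-field term counts for the 2.4.1 index (4 fields).
    static Map<String, Long> lucene241() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("body", 3391277L);
        m.put("docdate", 1160L);
        m.put("docid", 200000L);
        m.put("doctitle", 102467L);
        return m;
    }

    // Sum of the term counts over all fields.
    static long total(Map<String, Long> fieldTerms) {
        long sum = 0;
        for (long count : fieldTerms.values()) {
            sum += count;
        }
        return sum;
    }

    // Share of one field in the total, in percent.
    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    // Terms that live in fields present in `newer` but absent from `older`.
    static long termsInExtraFields(Map<String, Long> newer, Map<String, Long> older) {
        long sum = 0;
        for (Map.Entry<String, Long> e : newer.entrySet()) {
            if (!older.containsKey(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long t350 = total(lucene350());
        long t241 = total(lucene241());
        System.out.printf("3.5.0 total terms: %,d%n", t350);
        System.out.printf("2.4.1 total terms: %,d%n", t241);
        System.out.printf("difference:        %,d%n", t350 - t241);
        System.out.printf("extra-field terms: %,d%n", termsInExtraFields(lucene350(), lucene241()));
        System.out.printf("body share, 3.5.0: %.2f%%%n", percent(3391277L, t350));
    }
}
```

If the difference and the extra-field terms agree, the shared fields hold the same number of terms in both indexes, so the additional terms in 3.5.0 come entirely from the fields that 2.4.1 does not index.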
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Robert Muir <rc...@gmail.com>.
On Tue, Dec 13, 2011 at 4:45 PM, Sean Tong <st...@jamasoftware.com> wrote:
> Hi,
>
> I modified the DocMaker in 3.5 so that it indexes the same 4 fields as 2.4.1 does. Luke now shows very similar statistics for the index. Indexing was slightly faster than with 7 fields, but still not comparable with the 2.4.1 performance:
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000    767.18      260.70  113,206,984  144,637,952
> [java] MAddDocs_200000      1  16.00   10       1      200000    801.61      249.50  117,778,992  144,637,952
> [java] MAddDocs_200000      2  16.00   10       1      200000    734.39      272.33  121,479,568  126,287,872
>
> Maybe there are some other settings that make the benchmarks not comparable.
I think for benchmarking, this CloseIndex task should default to
doWait=false... maybe try passing that parameter to it.
--
lucidimagination.com
RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Sean Tong <st...@jamasoftware.com>.
Hi,
I modified the DocMaker in 3.5 so that it indexes the same 4 fields as 2.4.1 does. Luke now shows very similar statistics for the index. Indexing was slightly faster than with 7 fields, but still not comparable with the 2.4.1 performance:
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    767.18      260.70  113,206,984  144,637,952
[java] MAddDocs_200000      1  16.00   10       1      200000    801.61      249.50  117,778,992  144,637,952
[java] MAddDocs_200000      2  16.00   10       1      200000    734.39      272.33  121,479,568  126,287,872
Maybe there are some other settings that make the benchmarks not comparable.
Thanks,
Sean
3.5.0 Index Stats with modified DocMaker:
Number of fields: 4
Number of documents: 200,000
Number of terms: 3,694,904
Has deletions?/Optimized? No/No
Index format: -11 (Lucene 3.1)
Index functionality: lock-less, single norms, shared doc store, checksum, del count, omitTf, user data, diagnostics, hasVectors
Directory implementation: org.apache.lucene.store.MMapDirectory
Fields
body 3,391,277 91.78%
docdate 1,160 0.03%
docid 200,000 5.41%
doctitle 102,467 2.77%
Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Simon Willnauer <si...@googlemail.com>.
hey,
so what I wonder in general is whether the benchmarks are comparable. What
I mean is that the benchmark code has changed a lot since 2.4, so there
might be additional fields and/or different settings for what to
index and how.
could you check with Luke whether the index has the same fields and whether
the settings are the same or similar, and report back? I also wonder if
it maybe now uses update instead of add, i.e. buffers and applies
deletes etc.
simon
RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Sean Tong <st...@jamasoftware.com>.
Thanks Simon for your response.
I just re-ran the 3.5 benchmark with the ClassicAnalyzer. Here are the results:
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    715.76      279.42   48,828,144  128,057,344
[java] MAddDocs_200000      1  16.00   10       1      200000    679.04      294.53   68,321,424   85,721,088
[java] MAddDocs_200000      2  16.00   10       1      200000    761.95      262.49   63,139,256   91,881,472
The performance is slightly better than the one using StandardAnalyzer, but this is still much worse than the performance with 2.4.1.
Sean
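When comparing many runs it can help to pull the numbers out of the report rows programmatically. The sketch below is an assumption about how one might do that, not part of the benchmark package: it splits a row on whitespace, drops the "[java]" prefix from Ant and the "-" filler tokens that the raw report may print on alternate rows, and then checks that the rec/s column agrees with recsPerRun divided by elapsedSec. The class name and the column positions are taken from the header line shown above and are otherwise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: extract the numeric columns from one MAddDocs report row.
// Column order assumed from the header in the report:
// Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
class ReportRow {

    // Split on whitespace, dropping the Ant "[java]" prefix and "-" fillers.
    static List<String> tokens(String row) {
        List<String> out = new ArrayList<>();
        for (String t : row.trim().split("\\s+")) {
            if (!t.isEmpty() && !t.equals("-") && !t.equals("[java]")) {
                out.add(t);
            }
        }
        return out;
    }

    // Parse a number that may contain thousands separators, e.g. "1,609.1".
    static double number(String s) {
        return Double.parseDouble(s.replace(",", ""));
    }

    static double recsPerRun(String row) {
        return number(tokens(row).get(5));
    }

    static double recsPerSec(String row) {
        return number(tokens(row).get(6));
    }

    static double elapsedSec(String row) {
        return number(tokens(row).get(7));
    }

    // Throughput implied by the record count and the elapsed time.
    static double impliedRate(String row) {
        return recsPerRun(row) / elapsedSec(row);
    }

    public static void main(String[] args) {
        String row = "[java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - "
                + "679.04 - - 294.53 - 68,321,424 - 85,721,088";
        System.out.printf("reported rec/s: %.2f%n", recsPerSec(row));
        System.out.printf("implied rec/s:  %.2f%n", impliedRate(row));
    }
}
```

The same parser works on rows without fillers, since it only removes tokens that are exactly "-".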
On Mon, Dec 12, 2011 at 7:08 PM, Sean Tong <st...@jamasoftware.com> wrote:
> It looks like the attachment with the algorithm was missing from my last email, so I have pasted the text here. Thanks in advance for any help.
>
> #Start of the wikipedia-default.alg file
>
> merge.factor=mrg:10:10:10
> max.field.length=2147483647
> #max.buffered=buf:10:10:100:100
> ram.flush.mb=flush:16:16:16
>
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> log.step=5000
>
> docs.file=temp/enwiki-20070527-pages-articles.xml
>
> content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
>
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=false
> # -------------------------------------------------------------------------------------------
>
> { "Rounds"
>
>     ResetSystemErase
>
>     { "Populate"
>         CreateIndex
>         { "MAddDocs" AddDoc > : 200000
>         CloseIndex
>     }
>
>     NewRound
>
> } : 3
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
> #End of wikipedia-default.alg file
>
> Thanks,
>
> Sean
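Properties such as merge.factor=mrg:10:10:10 and ram.flush.mb=flush:16:16:16 in the algorithm file use the benchmark's per-round syntax: as far as I understand it, the first token is the short column label that appears in the report (mrg, flush) and the remaining tokens are the values used in successive rounds. The sketch below only illustrates that reading in plain Java. It is not the benchmark's own Config class, the class name is hypothetical, and wrapping around when there are more rounds than values is an assumption.

```java
import java.util.Arrays;

// Sketch: interpret a per-round property value such as "flush:16:64:128".
// Simplification: any value containing ':' is treated as per-round.
class RoundProperty {
    final String label;     // report column label, or null for a plain value
    final String[] values;  // one value per round

    RoundProperty(String label, String[] values) {
        this.label = label;
        this.values = values;
    }

    // "mrg:10:10:10" -> label "mrg", values [10, 10, 10]
    // "true"         -> label null,  values [true]
    static RoundProperty parse(String raw) {
        String[] parts = raw.split(":");
        if (parts.length < 2) {
            return new RoundProperty(null, new String[] {raw});
        }
        return new RoundProperty(parts[0], Arrays.copyOfRange(parts, 1, parts.length));
    }

    // Value for a zero-based round number; wraps around (assumed behaviour).
    String valueForRound(int round) {
        return values[round % values.length];
    }

    public static void main(String[] args) {
        RoundProperty flush = RoundProperty.parse("flush:16:64:128");
        for (int round = 0; round < 3; round++) {
            System.out.println(flush.label + " in round " + round + " = " + flush.valueForRound(round));
        }
    }
}
```

Read this way, a single run of the algorithm could compare several flush sizes, for example ram.flush.mb=flush:16:64:128, with one value per round.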
>
>
> From: Sean Tong [mailto:stong@jamasoftware.com]
> Sent: Sunday, December 11, 2011 11:54 PM
> To: java-user@lucene.apache.org
> Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>
> Hi,
>
> We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have been running benchmark tests that come with Lucence. To my surprise, I found that the indexing in 3.5.0 is significant slower than 2.4.1 for the Wikipedia data.
>
> Attached is the algorithm for the tests. The tests used default Lucence settings for flush memory size and merge factor. 512M memory was used for the tasks. The test machine is a 64-bit Windows 7 machine with Intel Core i7.
>
> The command:
> %ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
>
> Here are the test results:
>
> Lucece 2.4.1
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>
> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>
> [java] MAddDocs_200000 0 16.00 10 1 200000 1,609.1 124.29 89,218,496 241,631,232
>
> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 1,746.4 - - 114.52 - 102,365,864 - 241,762,304
>
> [java] MAddDocs_200000 2 16.00 10 1 200000 1,566.8 127.65 69,428,144 174,194,688
>
>
> Lucene 2.9.4
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>
> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>
> [java] MAddDocs_200000 0 16.00 10 1 200000 1,046.49 191.12 82,676,152 139,657,216
>
> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - 1,165.35 - - 171.62 - 119,364,128 - 156,762,112
>
> [java] MAddDocs_200000 2 16.00 10 1 200000 1,245.86 160.53 50,361,760 137,625,600
>
> Lucene 3.5.0
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>
> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>
> [java] MAddDocs_200000 0 16.00 10 1 200000 676.48 295.65 70,917,592 129,695,744
>
> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 626.13 - - 319.42 - 50,329,552 - 94,240,768
>
> [java] MAddDocs_200000 2 16.00 10 1 200000 687.68 290.83 57,732,640 92,864,512
>
>
> Indexing with 2.4.1 is about 2.3x as fast as with 3.5.0. Did I miss any settings or configurations?
>
> Thanks,
>
> Sean
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Simon Willnauer <si...@googlemail.com>.
Hey,
can you try using the ClassicAnalyzer instead of the StandardAnalyzer in
3.5? In 3.5 the StandardAnalyzer is a different implementation than in
2.9 and 2.4. Alternatively, rerun the 2.4 benchmarks with a
WhitespaceAnalyzer, just for the comparison.
Simon
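In terms of the .alg file from this thread, the first suggestion would be a one-line change. This is only a sketch: it assumes the benchmark's analyzer property accepts any Analyzer class name on the classpath, and that the class is org.apache.lucene.analysis.standard.ClassicAnalyzer (added in 3.1, so this line cannot be used for the 2.4.1 or 2.9.4 runs). The WhitespaceAnalyzer line is the alternative comparison and is left commented out.

```
# 3.5.0 run only: ClassicAnalyzer is the tokenizer grammar that
# StandardAnalyzer used before 3.1
analyzer=org.apache.lucene.analysis.standard.ClassicAnalyzer

# Alternative: take the analyzer mostly out of the picture on every version
#analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
```

If the 3.5.0 numbers move close to the 2.9.4 ones with this change, the analyzer accounts for that part of the gap; if not, the difference lies elsewhere.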
RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Posted by Sean Tong <st...@jamasoftware.com>.
It looks like the attachment with the algorithm was missing from my last email, so I have pasted the text here. Thanks in advance for any help.
#Start of the wikipedia-default.alg file

merge.factor=mrg:10:10:10
max.field.length=2147483647
#max.buffered=buf:10:10:100:100
ram.flush.mb=flush:16:16:16

compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
log.step=5000

docs.file=temp/enwiki-20070527-pages-articles.xml

content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource

query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------------------

{ "Rounds"

    ResetSystemErase

    { "Populate"
        CreateIndex
        { "MAddDocs" AddDoc > : 200000
        CloseIndex
    }

    NewRound

} : 3

RepSumByName
RepSumByPrefRound MAddDocs

#End of wikipedia-default.alg file
Thanks,
Sean
From: Sean Tong [mailto:stong@jamasoftware.com]
Sent: Sunday, December 11, 2011 11:54 PM
To: java-user@lucene.apache.org
Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
Hi,
We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have been running the benchmark tests that come with Lucene. To my surprise, I found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the Wikipedia data.
Attached is the algorithm for the tests. The tests used the default Lucene settings for flush memory size and merge factor. 512M of memory was used for the tasks. The test machine is a 64-bit Windows 7 machine with an Intel Core i7.
The command:
%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
Here are the test results:
Lucene 2.4.1
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000   1,609.1      124.29   89,218,496  241,631,232
[java] MAddDocs_200000      1  16.00   10       1      200000   1,746.4      114.52  102,365,864  241,762,304
[java] MAddDocs_200000      2  16.00   10       1      200000   1,566.8      127.65   69,428,144  174,194,688
Lucene 2.9.4
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
[java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
[java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600
Lucene 3.5.0
[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    676.48      295.65   70,917,592  129,695,744
[java] MAddDocs_200000      1  16.00   10       1      200000    626.13      319.42   50,329,552   94,240,768
[java] MAddDocs_200000      2  16.00   10       1      200000    687.68      290.83   57,732,640   92,864,512
Indexing with 2.4.1 is about 2.3x as fast as with 3.5.0. Did I miss any settings or configurations?
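A quick way to double-check that ratio is to average the rec/s column over the three rounds of each report. The snippet below is only a sketch in Python and is not part of the benchmark; the names REC_PER_SEC, mean and speedup are made up for illustration, and the numbers are copied from the MAddDocs_200000 rows above.

```python
# rec/s per round, copied from the MAddDocs_200000 rows of each report
REC_PER_SEC = {
    "2.4.1": [1609.1, 1746.4, 1566.8],
    "2.9.4": [1046.49, 1165.35, 1245.86],
    "3.5.0": [676.48, 626.13, 687.68],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def speedup(baseline, candidate):
    """How many times faster `baseline` indexes than `candidate`, on average."""
    return mean(REC_PER_SEC[baseline]) / mean(REC_PER_SEC[candidate])


if __name__ == "__main__":
    for version, rates in REC_PER_SEC.items():
        print(f"{version}: {mean(rates):.1f} rec/s on average")
    print(f"2.4.1 vs 3.5.0: {speedup('2.4.1', '3.5.0'):.2f}x")
    print(f"2.4.1 vs 2.9.4: {speedup('2.4.1', '2.9.4'):.2f}x")
```

With these numbers the averaged ratio between 2.4.1 and 3.5.0 comes out near 2.5x, and the per-round ratios range from roughly 2.3x to 2.8x, so the 2.3x figure is at the low end of what the runs show.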
Thanks,
Sean