Posted to java-user@lucene.apache.org by "Zhang, Lisheng" <Li...@BroadVision.com> on 2013/08/01 01:17:41 UTC
RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Hi Mike,
I retested and results are the same:
1/ I did not use sort (so FieldCache should not enter picture?)
2/ I created indexed data from scratch separately for 361 and 43
based on same text (text files), and I ran test from command
line separately against each index folder, so seems a pretty
fair test.
3/ For each test I created the searcher from scratch (to measure creation
time). I did not include JVM start time in either case. The
tests ran on the same box.
From the indexed data it seems that 43 generated a lot more data in the
folder; below I listed the (ls -ltr) result (I always pass in LUCENE_43
version, so the lucene 42 codec should be used, why lucene41?).
Thanks very much for the help, Lisheng
///////////////////
36:
total 1228
-rw-r--r-- 1 root root 68 Jul 31 15:50 _0.fdx
-rw-r--r-- 1 root root 44 Jul 31 15:50 _0.fdt
-rw-r--r-- 1 root root 132 Jul 31 15:50 _0.tvx
-rw-r--r-- 1 root root 260207 Jul 31 15:50 _0.tvf
-rw-r--r-- 1 root root 20 Jul 31 15:50 _0.tvd
-rw-r--r-- 1 root root 345803 Jul 31 15:50 _0.tis
-rw-r--r-- 1 root root 4899 Jul 31 15:50 _0.tii
-rw-r--r-- 1 root root 539098 Jul 31 15:50 _0.prx
-rw-r--r-- 1 root root 12 Jul 31 15:50 _0.nrm
-rw-r--r-- 1 root root 61703 Jul 31 15:50 _0.frq
-rw-r--r-- 1 root root 29 Jul 31 15:50 _0.fnm
-rw-r--r-- 1 root root 252 Jul 31 15:50 segments_1
-rw-r--r-- 1 root root 20 Jul 31 15:50 segments.gen
43:
total 1828
-rw-r--r-- 1 root root 45 Jul 31 15:51 _0.fdx
-rw-r--r-- 1 root root 66 Jul 31 15:51 _0.fdt
-rw-r--r-- 1 root root 60 Jul 31 15:51 _0.tvx
-rw-r--r-- 1 root root 176845 Jul 31 15:51 _0.tvd
-rw-r--r-- 1 root root 10980 Jul 31 15:51 _0_Lucene41_0.tip
-rw-r--r-- 1 root root 401339 Jul 31 15:51 _0_Lucene41_0.tim
-rw-r--r-- 1 root root 1007621 Jul 31 15:51 _0_Lucene41_0.pos
-rw-r--r-- 1 root root 216711 Jul 31 15:51 _0_Lucene41_0.pay
-rw-r--r-- 1 root root 12048 Jul 31 15:51 _0_Lucene41_0.doc
-rw-r--r-- 1 root root 46 Jul 31 15:51 _0.nvm
-rw-r--r-- 1 root root 34 Jul 31 15:51 _0.nvd
-rw-r--r-- 1 root root 205 Jul 31 15:51 _0.fnm
-rw-r--r-- 1 root root 395 Jul 31 15:51 _0.si
-rw-r--r-- 1 root root 69 Jul 31 15:51 segments_1
-rw-r--r-- 1 root root 20 Jul 31 15:51 segments.gen
///////////////////
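To put a number on "a lot more data", here is a minimal sketch that adds up the byte sizes from the two listings above. The class and method names are made up for illustration; only the sizes come from the listings.

```java
public class IndexSizeComparison {
    // Byte sizes copied from the 3.6 listing above (13 files).
    static final long[] SIZES_36 = {
        68, 44, 132, 260207, 20, 345803, 4899, 539098, 12, 61703, 29, 252, 20
    };
    // Byte sizes copied from the 4.3 listing above (15 files).
    static final long[] SIZES_43 = {
        45, 66, 60, 176845, 10980, 401339, 1007621, 216711, 12048, 46, 34, 205, 395, 69, 20
    };

    /** Sums a list of file sizes in bytes. */
    static long total(long[] sizes) {
        long sum = 0;
        for (long s : sizes) {
            sum += s;
        }
        return sum;
    }

    /** Ratio of the second index's total size to the first's. */
    static double ratio(long[] first, long[] second) {
        return (double) total(second) / total(first);
    }

    public static void main(String[] args) {
        System.out.println("3.6 total bytes: " + total(SIZES_36));
        System.out.println("4.3 total bytes: " + total(SIZES_43));
        System.out.printf("4.3 / 3.6 ratio: %.3f%n", ratio(SIZES_36, SIZES_43));
    }
}
```

In these listings the largest single difference is in the position data: _0.prx is 539098 bytes in 3.6, while _0_Lucene41_0.pos plus _0_Lucene41_0.pay come to 1224332 bytes in 4.3.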
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Wednesday, July 31, 2013 11:31 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
On Tue, Jul 30, 2013 at 6:13 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
> Hi Mike,
>
> I did more tests with realistic text from different languages (typical
> text for 8 different languages, English one is novel "Animal Farm").
>
> What I found seems to be:
>
> ## Indexing:
> 36 and 43 comparable (your previous comment is very correct).
>
> ## Search:
> 43 seems to be slower (30%); checking details, it seems it is all due to
> initial searcher creation and first search (warming), as if 43 did much
> more in warming?
Hmm, I'm not sure off hand why searcher warming would be slower in 4.3.
Are you relying on FieldCache (e.g. sorting by a field instead of by
relevance)? Switching to doc values should make warming much faster.
Are you sure the test was fair? Ie, in both cases the index was
either hot or cold?
For your 4.3 test you fully reindexed right? Ie, searched against a
4.3 (not 3.6) index?
Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by Simon Willnauer <si...@gmail.com>.
one thing I wonder is if you could just publish your benchmark code?
simon
On Thu, Aug 1, 2013 at 7:45 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
> <Li...@broadvision.com> wrote:
>>
>> Hi Mike,
>>
>> I retested and results are the same:
>>
>> 1/ I did not use sort (so FieldCache should not enter picture?)
>
> No grouping or joining either (they will use FieldCache, if it's not
> against a doc values field).
>
> What sort of queries are you running?
>
>> 2/ I created indexed data from scratch separately for 361 and 43
>> based on same text (text files), and I ran test from command
>> line separately against each index folder, so seems a pretty
>> fair test.
>
> OK.
>
>> 3/ Each test I created searcher from scratch (to measure creation
>> time). I did not include JVM start time in each case. The
>> tests are in same box.
>
> OK.
>
>> From indexed data it seems that 43 generated a lot more data in
>> folder, below I listed (ls -ltr) result
>
> This is very odd: the 4.3 index is quite a bit larger than the 3.x
> index. Are you certain the two indexed the same content in the same
> way? Which analyzer are you using? Maybe run CheckIndex against each
> index and post the output?
>
>> (always pass in LUCENE_43
>> version, so lucene 42 codec should be used, why lucene41?).
>
> This is fine: the Lucene42 codec uses Lucene41PostingsFormat.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi Mike,
Thanks very much for your insightful comments, I will try to test more.
Best regards, Lisheng
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Friday, August 09, 2013 9:46 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Hi, sorry, I don't have enough time to drill deeper here (run your
benchmark), but some quick ideas:
Only 8 documents is really a tiny index; try testing on many more documents?
Also, I would run more rounds than just 2; better to run 10s of rounds
and watch for the time per round to "stabilize" as hotspot finishes
compiling the hot spots...
It's curious that your CheckIndex output is so similar yet the index
sizes are so different; I wonder if you make a larger index if that
still holds. It could be the block compression in 4.x is less space
efficient when there are a tiny number of documents.
Mike McCandless
http://blog.mikemccandless.com
Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by Michael McCandless <lu...@mikemccandless.com>.
Hi, sorry, I don't have enough time to drill deeper here (run your
benchmark), but some quick ideas:
Only 8 documents is really a tiny index; try testing on many more documents?
Also, I would run more rounds than just 2; better to run 10s of rounds
and watch for the time per round to "stabilize" as hotspot finishes
compiling the hot spots...
It's curious that your CheckIndex output is so similar yet the index
sizes are so different; I wonder if you make a larger index if that
still holds. It could be the block compression in 4.x is less space
efficient when there are a tiny number of documents.
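As a rough illustration of the "run tens of rounds and watch the time per round stabilize" suggestion above, here is a minimal timing sketch. It is not the TestReal36/TestReal43 code from this thread; the class name, method names and the placeholder workload are all assumptions, and a real test would run the actual query inside the Runnable.

```java
public class RoundTimer {
    /** Runs the task once per round and records the elapsed nanoseconds of each round. */
    static long[] timeRounds(Runnable task, int rounds) {
        long[] nanos = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Mean time per round, ignoring the first warmupRounds rounds. */
    static double meanAfterWarmup(long[] nanos, int warmupRounds) {
        if (warmupRounds < 0 || warmupRounds >= nanos.length) {
            throw new IllegalArgumentException("warmupRounds must be in [0, rounds)");
        }
        long sum = 0;
        for (int i = warmupRounds; i < nanos.length; i++) {
            sum += nanos[i];
        }
        return (double) sum / (nanos.length - warmupRounds);
    }

    public static void main(String[] args) {
        // Placeholder workload (hypothetical); replace with the search being measured.
        Runnable placeholderSearch = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i % 7);
            }
            if (sb.length() != 100_000) {
                throw new IllegalStateException("unexpected length");
            }
        };

        long[] nanos = timeRounds(placeholderSearch, 30);
        for (int i = 0; i < nanos.length; i++) {
            System.out.println("round " + i + ": " + nanos[i] + " ns");
        }
        System.out.println("mean after 10 warm-up rounds: " + meanAfterWarmup(nanos, 10) + " ns");
    }
}
```

Printing every round, rather than only an average, makes it easier to see when the per-round time stops dropping; the cut-off of 10 warm-up rounds here is arbitrary.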
Mike McCandless
http://blog.mikemccandless.com
On Fri, Aug 9, 2013 at 11:55 AM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
> Hi Mike,
>
> Any more comments on this issue?
>
> Thanks and best regards, Lisheng
RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi Mike,
Any more comments on this issue?
Thanks and best regards, Lisheng
-----Original Message-----
From: Zhang, Lisheng [mailto:Lisheng.Zhang@broadvision.com]
Sent: Friday, August 02, 2013 7:55 AM
To: java-user@lucene.apache.org
Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
I should have mentioned the commands I used to test:
1/ Index:
java TestReal36 index -fileFolder /home/cvsupport/lzhang/files -optimize false
java TestReal43 index -fileFolder /home/cvsupport/lzhang/files -optimize false
The fileFolder contains 8 *.txt files with UTF-8 encoding. I also tried with
different parameters -luceneDir (by default lucene chose MMap).
2/ Search:
java TestReal36 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
java TestReal43 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
I also tried with different parameters -luceneDir
Thanks and best regards, Lisheng
RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
I should have mentioned the commands I used to test:
1/ Index:
java TestReal36 index -fileFolder /home/cvsupport/lzhang/files -optimize false
java TestReal43 index -fileFolder /home/cvsupport/lzhang/files -optimize false
The fileFolder contains 8 *.txt files with UTF-8 encoding. I also tried with
different parameters -luceneDir (by default lucene chose MMap).
2/ Search:
java TestReal36 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
java TestReal43 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
I also tried with different parameters -luceneDir
Thanks and best regards, Lisheng
-----Original Message-----
From: Zhang, Lisheng
Sent: Thursday, August 01, 2013 11:16 AM
To: 'java-user@lucene.apache.org'
Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Hi Mike,
First I really appreciate your help (for non commercial product)!!
1/ I attached the source code of my test (you can see I used StandardAnalyzer). Also, from the CheckIndex report below,
the unique terms are identical (token counts are slightly different). The stored field is just the ID (1 - 8
for each document). The indexed files are 8 typical files for 8 different languages (the English one is
"Animal Farm" by George Orwell). I do not mind sending the text files in case you are interested.
The query I issued is a trivial one (it did not even use a filter; for example, querying "boxer" to get "Animal Farm").
2/ CheckIndex output:
/// 361:
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36
NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
Opening index @ /home/cvsupport/lzhang/index36
Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
1 of 1: name=_0 docCount=8
compound=false
hasProx=true
numFiles=11
size (MB)=1.156
diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: fields..............OK [2 fields]
test: field norms.........OK [1 fields]
test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
test: stored fields.......OK [8 total field count; avg 1 fields per doc]
test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
No problems were detected with this index.
/// 430
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43
NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
Opening index @ /home/cvsupport/lzhang/index43
Segments file=segments_1 numSegments=1 version=4.3 format=
1 of 1: name=_0 docCount=8
codec=Lucene42
compound=false
numFiles=13
size (MB)=1.742
diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: fields..............OK [2 fields]
test: field norms.........OK [1 fields]
test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
test: stored fields.......OK [8 total field count; avg 1 fields per doc]
test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
test: docvalues...........OK [0 total doc count; 0 docvalues fields]
No problems were detected with this index.
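To compare the two reports above without reading them by eye, here is a small sketch that pulls the counts out of the "test: terms, freq, prox" line of each report. The class name and regular expression are my own; they only assume the line format shown in the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CheckIndexLine {
    // Matches the bracketed counts, e.g. "[38611 terms; 42557 terms/docs pairs; 335024 tokens]".
    private static final Pattern POSTINGS =
        Pattern.compile("\\[(\\d+) terms; (\\d+) terms/docs pairs; (\\d+) tokens\\]");

    /** Returns {terms, termDocPairs, tokens} parsed from a postings line. */
    static long[] parse(String line) {
        Matcher m = POSTINGS.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a postings line: " + line);
        }
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }

    public static void main(String[] args) {
        // Lines copied from the two CheckIndex reports above.
        long[] v36 = parse("test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]");
        long[] v43 = parse("test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]");

        System.out.println("terms equal: " + (v36[0] == v43[0]));
        System.out.println("terms/docs pairs equal: " + (v36[1] == v43[1]));
        System.out.println("token difference (3.6 - 4.3): " + (v36[2] - v43[2]));
    }
}
```

For these two reports the terms and terms/docs pairs match and the token counts differ by 8, so the postings content is nearly identical even though the on-disk sizes are not.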
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Thursday, August 01, 2013 10:45 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
>
> Hi Mike,
>
> I retested and results are the same:
>
> 1/ I did not use sort (so FieldCache should not enter picture?)
No grouping or joining either (they will use FieldCache, if it's not
against a doc values field).
What sort of queries are you running?
> 2/ I created indexed data from scratch separately for 361 and 43
> based on same text (text files), and I ran test from command
> line separately against each index folder, so seems a pretty
> fair test.
OK.
> 3/ Each test I created searcher from scratch (to measure creation
> time). I did not include JVM start time in each case. The
> tests are in same box.
OK.
> From indexed data it seems that 43 generated a lot more data in
> folder, below I listed (ls -ltr) result
This is very odd: the 4.3 index is quite a bit larger than the 3.x
index. Are you certain the two indexed the same content in the same
way? Which analyzer are you using? Maybe run CheckIndex against each
index and post the output?
> (always pass in LUCENE_43
> version, so lucene 42 codec should be used, why lucene41?).
This is fine: the Lucene42 codec uses Lucene41PostingsFormat.
Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi Mike,
First, I really appreciate your help (especially for a non-commercial product)!
1/ I attached the source code of my test (you can see I used StandardAnalyzer). In the CheckIndex reports below
the unique term counts are identical (38611 terms); the token counts differ slightly (335024 for 3.6.1 vs.
335016 for 4.3). The only stored field is the ID (1 - 8, one per document). The indexed content comes from 8
typical files in 8 different languages (the English one is "Animal Farm" by George Orwell). I do not mind
sending the text files if you are interested.
The query I issued is trivial (no filter; for example, querying "boxer" to match "Animal Farm").
2/ CheckIndex output:
/// 361:
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36
NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
Opening index @ /home/cvsupport/lzhang/index36
Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
1 of 1: name=_0 docCount=8
compound=false
hasProx=true
numFiles=11
size (MB)=1.156
diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: fields..............OK [2 fields]
test: field norms.........OK [1 fields]
test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
test: stored fields.......OK [8 total field count; avg 1 fields per doc]
test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
No problems were detected with this index.
/// 430
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43
NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
Opening index @ /home/cvsupport/lzhang/index43
Segments file=segments_1 numSegments=1 version=4.3 format=
1 of 1: name=_0 docCount=8
codec=Lucene42
compound=false
numFiles=13
size (MB)=1.742
diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: fields..............OK [2 fields]
test: field norms.........OK [1 fields]
test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
test: stored fields.......OK [8 total field count; avg 1 fields per doc]
test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
test: docvalues...........OK [0 total doc count; 0 docvalues fields]
No problems were detected with this index.
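
The comparison between the two CheckIndex reports above can be sketched in a few lines of Java. This is only an illustration, not part of the attached test code: the class name CheckIndexCompare and its helper methods are hypothetical, and the input strings are copied from the "terms, freq, prox" and "size (MB)" lines of the two reports.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the thread): compares two CheckIndex reports.
final class CheckIndexCompare {
    // Matches the bracketed counts on the "terms, freq, prox" line of a report.
    private static final Pattern COUNTS = Pattern.compile(
        "\\[(\\d+) terms; (\\d+) terms/docs pairs; (\\d+) tokens\\]");

    /** Returns {terms, termDocPairs, tokens} parsed from one report line. */
    static long[] parseCounts(String line) {
        Matcher m = COUNTS.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a terms/freq/prox line: " + line);
        }
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }

    /** Ratio of two segment sizes, e.g. the "size (MB)" values of two reports. */
    static double sizeRatio(double newerMb, double olderMb) {
        return newerMb / olderMb;
    }

    public static void main(String[] args) {
        // Lines copied from the 3.6.1 and 4.3 reports above.
        long[] v36 = parseCounts(
            "test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]");
        long[] v43 = parseCounts(
            "test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]");

        System.out.println("same unique terms:    " + (v36[0] == v43[0]));
        System.out.println("same term/doc pairs:  " + (v36[1] == v43[1]));
        System.out.println("token difference:     " + (v36[2] - v43[2]));
        System.out.printf(Locale.ROOT, "size ratio 4.3 / 3.6: %.2f%n",
            sizeRatio(1.742, 1.156));
    }
}
```

With the numbers from the reports, the term and term/doc pair counts match, the token counts differ by 8, and the 4.3 segment is roughly one and a half times the size of the 3.6.1 segment.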
Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
>
> Hi Mike,
>
> I retested and results are the same:
>
> 1/ I did not use sort (so FieldCache should not enter picture?)
No grouping or joining either (they will use FieldCache, if it's not
against a doc values field).
What sort of queries are you running?
> 2/ I created indexed data from scratch separately for 361 and 43
> based on same text (text files), and I ran test from command
> line separately against each index folder, so seems a pretty
> fair test.
OK.
> 3/ Each test I created searcher from scratch (to measure creation
> time). I did not include JVM start time in each case. The
> tests are in same box.
OK.
> From indexed data it seems that 43 generated a lot more data in
> folder, below I listed (ls -ltr) result
This is very odd: the 4.3 index is quite a bit larger than the 3.x
index. Are you certain the two indexed the same content in the same
way? Which analyzer are you using? Maybe run CheckIndex against each
index and post the output?
> (always pass in LUCENE_43
> version, so lucen 42 codec should be used, why lucene41?).
This is fine: the Lucene42 codec uses Lucene41PostingsFormat.
Mike McCandless
http://blog.mikemccandless.com
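
For point 3/ above (measuring searcher creation time without counting JVM startup), a minimal timing sketch might look like the following. It is an assumption about how such a test could be structured, not the code used in this thread: OpenTimer, timeRuns, median and the openSearcher placeholder are all hypothetical names, and the placeholder would have to be replaced by code that actually opens the index and builds a searcher for the Lucene version under test.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical timing harness (not from the thread).
final class OpenTimer {
    /** Runs the action `runs` times; returns each run's wall-clock time in nanoseconds. */
    static <T> long[] timeRuns(Supplier<T> action, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            T result = action.get();
            elapsed[i] = System.nanoTime() - start;
            if (result == null) {
                throw new IllegalStateException("action returned null");
            }
        }
        return elapsed;
    }

    /** Median of the values (mean of the two middle values for an even count). */
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL placeholder: replace with code that opens the index
        // directory and creates a searcher.
        Supplier<Object> openSearcher = Object::new;

        long[] times = timeRuns(openSearcher, 11);
        long[] warm = Arrays.copyOfRange(times, 1, times.length);

        // The first run is reported separately because it typically also pays
        // for class loading and cold caches.
        System.out.println("first run (ns):       " + times[0]);
        System.out.println("median of rest (ns):  " + median(warm));
    }
}
```

Timing inside the process with System.nanoTime() keeps JVM startup out of the measurement, and reporting the first run separately from the later ones makes it easier to see whether a difference between versions is a one-time cost or a steady-state one.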