Posted to java-user@lucene.apache.org by "Zhang, Lisheng" <Li...@BroadVision.com> on 2013/08/01 01:17:41 UTC

RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Hi Mike,

I retested and the results are the same:

1/ I did not use sorting (so FieldCache should not enter the picture?)
2/ I created the indexes from scratch separately for 3.6.1 and 4.3,
   from the same text files, and I ran the test from the command
   line separately against each index folder, so it seems a pretty
   fair test.
3/ In each test I created the searcher from scratch (to measure creation
   time). I did not include JVM start time in either case. The
   tests ran on the same box.

From the indexed data it seems that 4.3 generated a lot more data in
the folder; below I list the (ls -ltr) results (I always pass in the
LUCENE_43 version, so the Lucene42 codec should be used; why Lucene41?).

Thanks very much for your help, Lisheng


///////////////////
36:
total 1228
-rw-r--r-- 1 root root     68 Jul 31 15:50 _0.fdx
-rw-r--r-- 1 root root     44 Jul 31 15:50 _0.fdt
-rw-r--r-- 1 root root    132 Jul 31 15:50 _0.tvx
-rw-r--r-- 1 root root 260207 Jul 31 15:50 _0.tvf
-rw-r--r-- 1 root root     20 Jul 31 15:50 _0.tvd
-rw-r--r-- 1 root root 345803 Jul 31 15:50 _0.tis
-rw-r--r-- 1 root root   4899 Jul 31 15:50 _0.tii
-rw-r--r-- 1 root root 539098 Jul 31 15:50 _0.prx
-rw-r--r-- 1 root root     12 Jul 31 15:50 _0.nrm
-rw-r--r-- 1 root root  61703 Jul 31 15:50 _0.frq
-rw-r--r-- 1 root root     29 Jul 31 15:50 _0.fnm
-rw-r--r-- 1 root root    252 Jul 31 15:50 segments_1
-rw-r--r-- 1 root root     20 Jul 31 15:50 segments.gen

43:
total 1828
-rw-r--r-- 1 root root      45 Jul 31 15:51 _0.fdx
-rw-r--r-- 1 root root      66 Jul 31 15:51 _0.fdt
-rw-r--r-- 1 root root      60 Jul 31 15:51 _0.tvx
-rw-r--r-- 1 root root  176845 Jul 31 15:51 _0.tvd
-rw-r--r-- 1 root root   10980 Jul 31 15:51 _0_Lucene41_0.tip
-rw-r--r-- 1 root root  401339 Jul 31 15:51 _0_Lucene41_0.tim
-rw-r--r-- 1 root root 1007621 Jul 31 15:51 _0_Lucene41_0.pos
-rw-r--r-- 1 root root  216711 Jul 31 15:51 _0_Lucene41_0.pay
-rw-r--r-- 1 root root   12048 Jul 31 15:51 _0_Lucene41_0.doc
-rw-r--r-- 1 root root      46 Jul 31 15:51 _0.nvm
-rw-r--r-- 1 root root      34 Jul 31 15:51 _0.nvd
-rw-r--r-- 1 root root     205 Jul 31 15:51 _0.fnm
-rw-r--r-- 1 root root     395 Jul 31 15:51 _0.si
-rw-r--r-- 1 root root      69 Jul 31 15:51 segments_1
-rw-r--r-- 1 root root      20 Jul 31 15:51 segments.gen
///////////////////
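
To put a number on "a lot more data", the two listings above can be summed directly. This is a minimal sketch, not part of the original benchmark: the class and method names (IndexSizeSummary, totalBytes) are made up for illustration, and the only input is the ls -ltr output copied from this message.

```java
final class IndexSizeSummary {
    // Lines copied from the `ls -ltr` listing of the 3.6 index folder.
    static final String[] LS_36 = {
        "-rw-r--r-- 1 root root     68 Jul 31 15:50 _0.fdx",
        "-rw-r--r-- 1 root root     44 Jul 31 15:50 _0.fdt",
        "-rw-r--r-- 1 root root    132 Jul 31 15:50 _0.tvx",
        "-rw-r--r-- 1 root root 260207 Jul 31 15:50 _0.tvf",
        "-rw-r--r-- 1 root root     20 Jul 31 15:50 _0.tvd",
        "-rw-r--r-- 1 root root 345803 Jul 31 15:50 _0.tis",
        "-rw-r--r-- 1 root root   4899 Jul 31 15:50 _0.tii",
        "-rw-r--r-- 1 root root 539098 Jul 31 15:50 _0.prx",
        "-rw-r--r-- 1 root root     12 Jul 31 15:50 _0.nrm",
        "-rw-r--r-- 1 root root  61703 Jul 31 15:50 _0.frq",
        "-rw-r--r-- 1 root root     29 Jul 31 15:50 _0.fnm",
        "-rw-r--r-- 1 root root    252 Jul 31 15:50 segments_1",
        "-rw-r--r-- 1 root root     20 Jul 31 15:50 segments.gen",
    };

    // Lines copied from the `ls -ltr` listing of the 4.3 index folder.
    static final String[] LS_43 = {
        "-rw-r--r-- 1 root root      45 Jul 31 15:51 _0.fdx",
        "-rw-r--r-- 1 root root      66 Jul 31 15:51 _0.fdt",
        "-rw-r--r-- 1 root root      60 Jul 31 15:51 _0.tvx",
        "-rw-r--r-- 1 root root  176845 Jul 31 15:51 _0.tvd",
        "-rw-r--r-- 1 root root   10980 Jul 31 15:51 _0_Lucene41_0.tip",
        "-rw-r--r-- 1 root root  401339 Jul 31 15:51 _0_Lucene41_0.tim",
        "-rw-r--r-- 1 root root 1007621 Jul 31 15:51 _0_Lucene41_0.pos",
        "-rw-r--r-- 1 root root  216711 Jul 31 15:51 _0_Lucene41_0.pay",
        "-rw-r--r-- 1 root root   12048 Jul 31 15:51 _0_Lucene41_0.doc",
        "-rw-r--r-- 1 root root      46 Jul 31 15:51 _0.nvm",
        "-rw-r--r-- 1 root root      34 Jul 31 15:51 _0.nvd",
        "-rw-r--r-- 1 root root     205 Jul 31 15:51 _0.fnm",
        "-rw-r--r-- 1 root root     395 Jul 31 15:51 _0.si",
        "-rw-r--r-- 1 root root      69 Jul 31 15:51 segments_1",
        "-rw-r--r-- 1 root root      20 Jul 31 15:51 segments.gen",
    };

    /** Sums the size column (5th whitespace-separated field) of `ls -l` lines. */
    static long totalBytes(String[] lsLines) {
        long total = 0;
        for (String line : lsLines) {
            String[] fields = line.trim().split("\\s+");
            total += Long.parseLong(fields[4]);
        }
        return total;
    }

    public static void main(String[] args) {
        long bytes36 = totalBytes(LS_36);
        long bytes43 = totalBytes(LS_43);
        double mb = 1024.0 * 1024.0;
        System.out.printf("3.6 index: %d bytes (%.3f MB)%n", bytes36, bytes36 / mb);
        System.out.printf("4.3 index: %d bytes (%.3f MB)%n", bytes43, bytes43 / mb);
        System.out.printf("4.3 / 3.6 size ratio: %.2f%n", (double) bytes43 / bytes36);
    }
}
```

Note that the "total" line printed by ls counts disk blocks, not bytes, so summing the size column is the more direct comparison. Most of the difference sits in the positions data (_0.prx in 3.6 versus _0_Lucene41_0.pos plus _0_Lucene41_0.pay in 4.3).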


-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Wednesday, July 31, 2013 11:31 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene
3.6?


On Tue, Jul 30, 2013 at 6:13 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
> Hi Mike,
>
> I did more tests with realistic text from different languages (typical
> text for 8 different languages; the English one is the novel "Animal Farm").
>
> What I found seems to be:
>
> ## Indexing:
> 3.6 and 4.3 are comparable (your previous comment is very correct).
>
> ## Search:
> 4.3 seems to be slower (30%); checking the details, it seems to be all due to
> initial searcher creation and the first search (warming), as if 4.3 did much
> more in warming?

Hmm, I'm not sure off hand why searcher warming would be slower in 4.3.

Are you relying on FieldCache (e.g. sorting by a field instead of by
relevance)?  Switching to doc values should make warming much faster.

Are you sure the test was fair?  Ie, in both cases the index was
either hot or cold?

For your 4.3 test you fully reindexed right?  Ie, searched against a
4.3 (not 3.6) index?

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by Simon Willnauer <si...@gmail.com>.

one thing I wonder is if you could just publish your benchmark code?

simon

On Thu, Aug 1, 2013 at 7:45 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
> <Li...@broadvision.com> wrote:
>>
>> Hi Mike,
>>
>> I retested and results are the same:
>>
>> 1/ I did not use sort (so FieldCache should not enter picture?)
>
> No grouping or joining either (they will use FieldCache, if it's not
> against a doc values field).
>
> What sort of queries are you running?
>
>> 2/ I created indexed data from scratch separately for 361 and 43
>>    based on same text (text files), and I ran test from command
>>    line separately against each index folder, so seems a pretty
>>    fair test.
>
> OK.
>
>> 3/ Each test I created searcher from scratch (to measure creation
>>    time). I did not include JVM start time in each case. The
>>    tests are in same box.
>
> OK.
>
>> From indexed data it seems that 43 generated a lot more data in
>> folder, below I listed (ls -ltr) result
>
> This is very odd: the 4.3 index is quite a bit larger than the 3.x
> index.  Are you certain the two indexed the same content in the same
> way?  Which analyzer are you using?  Maybe run CheckIndex against each
> index and post the output?
>
>> (always pass in LUCENE_43
>> version, so lucene 42 codec should be used, why lucene41?).
>
> This is fine: the Lucene42 codec uses Lucene41PostingsFormat.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.

Hi Mike,

Thanks very much for your insightful comments, I will try to test more.

Best regards, Lisheng

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Friday, August 09, 2013 9:46 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene
3.6?


Hi, sorry, I don't have enough time to drill deeper here (run your
benchmark), but some quick ideas:

Only 8 documents is really a tiny index; try testing on many more documents?

Also, I would run more rounds than just 2; better to run 10s of rounds
and watch for the time per round to "stabilize" as hotspot finishes
compiling the hot spots...

It's curious that your CheckIndex output is so similar yet the index
sizes are so different; I wonder if you make a larger index if that
still holds.  It could be the block compression in 4.x is less space
efficient when there are a tiny number of documents.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 9, 2013 at 11:55 AM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
> Hi Mike,
>
> Any more comments on this issue?
>
> Thanks and best regards, Lisheng
>
> -----Original Message-----
> From: Zhang, Lisheng [mailto:Lisheng.Zhang@broadvision.com]
> Sent: Friday, August 02, 2013 7:55 AM
> To: java-user@lucene.apache.org
> Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene
> 3.6?
>
>
> I should have mentioned the commands I used to test:
>
> 1/ Index:
> java TestReal36 index -fileFolder /home/cvsupport/lzhang/files -optimize false
> java TestReal43 index -fileFolder /home/cvsupport/lzhang/files -optimize false
>
> The fileFolder contains 8 *.txt files with UTF-8 encoding. I also tried with
> different parameters -luceneDir (by default lucene chose MMap).
>
> 2/ Search:
> java TestReal36 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
> java TestReal43 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
>
> I also tried with different parameters -luceneDir
>
> Thanks and best regards, Lisheng
>
> -----Original Message-----
> From: Zhang, Lisheng
> Sent: Thursday, August 01, 2013 11:16 AM
> To: 'java-user@lucene.apache.org'
> Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene
> 3.6?
>
>
> Hi Mike,
>
> First, I really appreciate your help (for a non-commercial product)!!
>
> 1/ I attached the source code of my test (you can see I used StandardAnalyzer). Also, from the CheckIndex report below,
>    the unique terms are identical (token counts are slightly different). The stored field is just the ID (1 - 8,
>    one for each document). The indexed files are 8 typical files for 8 different languages (the English one is
>    "Animal Farm" by George Orwell). I do not mind sending the text files in case you are interested.
>
>    The query I issued is a trivial one (it did not even use a filter; e.g. querying "boxer" to get "Animal Farm").
>
> 2/ CheckIndex output:
>
> /// 361:
> root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36
>
> NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @ /home/cvsupport/lzhang/index36
>
> Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
>   1 of 1: name=_0 docCount=8
>     compound=false
>     hasProx=true
>     numFiles=11
>     size (MB)=1.156
>     diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [2 fields]
>     test: field norms.........OK [1 fields]
>     test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
>     test: stored fields.......OK [8 total field count; avg 1 fields per doc]
>     test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
>
> No problems were detected with this index.
>
> /// 430
> root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43
>
> NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @ /home/cvsupport/lzhang/index43
>
> Segments file=segments_1 numSegments=1 version=4.3 format=
>   1 of 1: name=_0 docCount=8
>     codec=Lucene42
>     compound=false
>     numFiles=13
>     size (MB)=1.742
>     diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [2 fields]
>     test: field norms.........OK [1 fields]
>     test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
>     test: stored fields.......OK [8 total field count; avg 1 fields per doc]
>     test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
>     test: docvalues...........OK [0 total doc count; 0 docvalues fields]
>
> No problems were detected with this index.

Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi, sorry, I don't have enough time to drill deeper here (run your
benchmark), but some quick ideas:

Only 8 documents is really a tiny index; try testing on many more documents?

Also, I would run more rounds than just 2; better to run 10s of rounds
and watch for the time per round to "stabilize" as hotspot finishes
compiling the hot spots...
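
The "many rounds" advice can be sketched with plain JDK code. This is only an illustration under stated assumptions: RoundTimer, timeRounds and medianOfLast are hypothetical names, and the workload is a placeholder loop standing in for one search round of the real benchmark (TestReal36 / TestReal43), whose source is not shown in this thread.

```java
import java.util.Arrays;

final class RoundTimer {
    // Written by the placeholder workload so the JIT cannot discard the loop.
    static volatile long sink;

    /** Runs the task `rounds` times and returns the elapsed nanoseconds of each round. */
    static long[] timeRounds(Runnable task, int rounds) {
        long[] nanos = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /**
     * Median of the last `window` rounds (upper median for an even window).
     * Later rounds are less affected by JIT warm-up than the first ones.
     */
    static long medianOfLast(long[] nanos, int window) {
        if (nanos.length == 0 || window <= 0) {
            throw new IllegalArgumentException("need at least one round and a positive window");
        }
        int n = Math.min(window, nanos.length);
        long[] tail = Arrays.copyOfRange(nanos, nanos.length - n, nanos.length);
        Arrays.sort(tail);
        return tail[n / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; in a real test this would execute one search round.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i % 7;
            }
            sink = sum;
        };

        long[] nanos = timeRounds(task, 50);
        for (int i = 0; i < nanos.length; i++) {
            System.out.printf("round %2d: %d ns%n", i, nanos[i]);
        }
        System.out.printf("median of last 10 rounds: %d ns%n", medianOfLast(nanos, 10));
    }
}
```

Printing every round makes it easy to see where the per-round time levels off; comparing the two Lucene versions on the stabilized rounds, rather than on the first two, separates warm-up cost from steady-state search cost.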

It's curious that your CheckIndex output is so similar yet the index
sizes are so different; I wonder if you make a larger index if that
still holds.  It could be the block compression in 4.x is less space
efficient when there are a tiny number of documents.
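
One way to relate the nearly identical CheckIndex token counts to the different file sizes is to compute bytes per token for the positions files. The sketch below is an assumption-laden illustration, not a diagnosis: BytesPerToken is a hypothetical name, the token counts come from the CheckIndex reports posted in this thread, and the file sizes come from the ls -ltr listings posted in this thread.

```java
final class BytesPerToken {
    /** Average on-disk bytes per indexed token for one file. */
    static double bytesPerToken(long fileBytes, long tokenCount) {
        if (tokenCount <= 0) {
            throw new IllegalArgumentException("tokenCount must be positive");
        }
        return (double) fileBytes / tokenCount;
    }

    public static void main(String[] args) {
        // 3.6: positions file _0.prx, 335024 tokens reported by CheckIndex.
        double prx36 = bytesPerToken(539_098L, 335_024L);
        // 4.3: positions file _0_Lucene41_0.pos, 335016 tokens reported by CheckIndex.
        double pos43 = bytesPerToken(1_007_621L, 335_016L);

        System.out.printf("3.6 positions: %.3f bytes/token%n", prx36);
        System.out.printf("4.3 positions: %.3f bytes/token%n", pos43);
    }
}
```

With the posted numbers this works out to roughly 1.6 bytes per token for 3.6 and roughly 3.0 for 4.3 (not counting the separate .pay file). Repeating the calculation on a much larger index would show whether the gap persists or is specific to an index with only 8 documents.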

Mike McCandless

http://blog.mikemccandless.com



RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.

Hi Mike,

Any more comments on this issue?

Thanks and best regards, Lisheng


RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.

I should have mentioned the commands I used to test:

1/ Index:
java TestReal36 index -fileFolder /home/cvsupport/lzhang/files -optimize false
java TestReal43 index -fileFolder /home/cvsupport/lzhang/files -optimize false

The fileFolder contains 8 *.txt files with UTF-8 encoding. I also tried with
different parameters -luceneDir (by default lucene chose MMap).

2/ Search:
java TestReal36 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
java TestReal43 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true

I also tried with different parameters -luceneDir 

Thanks and best regards, Lisheng

-----Original Message-----
From: Zhang, Lisheng 
Sent: Thursday, August 01, 2013 11:16 AM
To: 'java-user@lucene.apache.org'
Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene
3.6?


Hi Mike,

First I really appreciate your help (for non commercial product)!!

1/ I attached source code of my testing (you see I used StandardAnalyzer), also from CheckIndex report below
   the unique terms are identical (token counts are slightly different). The stored field is just ID (1 - 8 
   for each document). The indexed files are from 8 typical files for 8 different languages (English one is
   "Animal Farm" by George Orwell). Sure I donot mind sending the text files in case you are interested?

   The query I issued is a trivial one (did not even use filter, like querying "boxer" to get "Animal Farm")

2/ CheckIndex output:

/// 361:
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index36

Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
  1 of 1: name=_0 docCount=8
    compound=false
    hasProx=true
    numFiles=11
    size (MB)=1.156
    diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]

No problems were detected with this index.

/// 430
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index43

Segments file=segments_1 numSegments=1 version=4.3 format=
  1 of 1: name=_0 docCount=8
    codec=Lucene42
    compound=false
    numFiles=13
    size (MB)=1.742
    diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 0 docvalues fields]

No problems were detected with this index.



-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Thursday, August 01, 2013 10:45 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene
3.6?


On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
>
> Hi Mike,
>
> I retested and results are the same:
>
> 1/ I did not use sort (so FieldCache should not enter picture?)

No grouping or joining either (they will use FieldCache, if it's not
against a doc values field).

What sort of queries are you running?

> 2/ I created indexed data from scratch separately for 361 and 43
>    based on same text (text files), and I ran test from command
>    line separately against each index folder, so seems a pretty
>    fair test.

OK.

> 3/ For each test I created the searcher from scratch (to measure creation
>    time). I did not include JVM start time in either case. The
>    tests ran on the same box.

OK.

> From indexed data it seems that 43 generated a lot more data in
> folder, below I listed (ls -ltr) result

This is very odd: the 4.3 index is quite a bit larger than the 3.x
index.  Are you certain the two indexed the same content in the same
way?  Which analyzer are you using?  Maybe run CheckIndex against each
index and post the output?

> (always pass in LUCENE_43
> version, so the Lucene42 codec should be used; why Lucene41?).

This is fine: the Lucene42 codec uses Lucene41PostingsFormat.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.

Hi Mike,

First, I really appreciate your help (for a non-commercial product)!

1/ I attached the source code of my test (you can see I used StandardAnalyzer). The CheckIndex reports below
   show that the unique term counts are identical (the token counts differ slightly). The stored field is just
   an ID (1 - 8, one per document). The indexed content comes from 8 typical files in 8 different languages
   (the English one is "Animal Farm" by George Orwell). I do not mind sending the text files if you are interested.

   The query I issued is a trivial one (no filter; for example, querying "boxer" to get "Animal Farm").

2/ CheckIndex output:

/// 361:
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index36

Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
  1 of 1: name=_0 docCount=8
    compound=false
    hasProx=true
    numFiles=11
    size (MB)=1.156
    diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]

No problems were detected with this index.

/// 430
root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index43

Segments file=segments_1 numSegments=1 version=4.3 format=
  1 of 1: name=_0 docCount=8
    codec=Lucene42
    compound=false
    numFiles=13
    size (MB)=1.742
    diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 0 docvalues fields]

No problems were detected with this index.
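
For reference, the two reports above can be compared numerically. The following is a minimal sketch that uses only the figures printed by CheckIndex ("size (MB)" and the token counts); the class and method names are illustrative and are not part of Lucene.

```java
import java.util.Locale;

public class CheckIndexCompare {
    // Relative growth of 'after' over 'before', as a percentage.
    static double percentLarger(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the two CheckIndex reports in this message.
        double size36 = 1.156;   // size (MB), 3.6.1 index
        double size43 = 1.742;   // size (MB), 4.3 index
        long tokens36 = 335024;  // tokens, 3.6.1 index
        long tokens43 = 335016;  // tokens, 4.3 index

        System.out.printf(Locale.ROOT, "4.3 index is %.1f%% larger%n",
                percentLarger(size36, size43));
        System.out.println("token difference: " + (tokens36 - tokens43));
    }
}
```

With these inputs the 4.3 index comes out roughly 51% larger on disk, while the token counts differ by only 8 and the term and term/doc counts are identical.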




Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
>
> Hi Mike,
>
> I retested and results are the same:
>
> 1/ I did not use sort (so FieldCache should not enter picture?)

No grouping or joining either (they will use FieldCache, if it's not
against a doc values field).

What sort of queries are you running?

> 2/ I created indexed data from scratch separately for 361 and 43
>    based on same text (text files), and I ran test from command
>    line separately against each index folder, so seems a pretty
>    fair test.

OK.

> 3/ For each test I created the searcher from scratch (to measure creation
>    time). I did not include JVM start time in either case. The
>    tests ran on the same box.

OK.

> From indexed data it seems that 43 generated a lot more data in
> folder, below I listed (ls -ltr) result

This is very odd: the 4.3 index is quite a bit larger than the 3.x
index.  Are you certain the two indexed the same content in the same
way?  Which analyzer are you using?  Maybe run CheckIndex against each
index and post the output?

> (always pass in LUCENE_43
> version, so the Lucene42 codec should be used; why Lucene41?).

This is fine: the Lucene42 codec uses Lucene41PostingsFormat.

Mike McCandless

http://blog.mikemccandless.com
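
To see which file types account for the size difference discussed above, one option is to total the bytes per file extension in each index folder and compare the two listings side by side. The following is a minimal sketch using only the JDK, assuming the index directories are passed as command-line arguments; `IndexSizeByExtension`, `extensionOf` and `sizeByExtension` are hypothetical names, not Lucene APIs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class IndexSizeByExtension {
    // Text after the last dot, or the whole name if there is no dot
    // (for example "segments_1").
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 ? fileName.substring(dot + 1) : fileName;
    }

    // Sums the sizes of the regular files directly inside indexDir, keyed by extension.
    static Map<String, Long> sizeByExtension(Path indexDir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String ext = extensionOf(file.getFileName().toString());
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Usage (paths are examples): java IndexSizeByExtension index36 index43
        for (String dir : args) {
            System.out.println(dir);
            for (Map.Entry<String, Long> e : sizeByExtension(Paths.get(dir)).entrySet()) {
                System.out.println("  " + e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

Grouping by extension keeps the comparison readable even though the 4.x file names carry the postings format in the name (for example `_0_Lucene41_0.tim`).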
