Posted to java-user@lucene.apache.org by Shlomit Rosen <SH...@il.ibm.com> on 2014/06/17 10:37:06 UTC

Search degradation on Windows when upgrading from lucene 3.6 to lucene 4.7.2

Hi, 

We are in the process of upgrading from Lucene 3.6.0 to Lucene 4.7.2, 
and our tests show a significant search performance degradation on the 
Windows platform.

While trying to figure this out, we noticed a couple of points, listed below. 
Any suggestions/thoughts will be greatly appreciated. 

Thanks!

1) Running search on an optimized collection.

Our first run on a Windows machine showed the following results: 
        Lucene 3.6:             115 queries / sec
        Lucene 4.7.2:            74 queries / sec

Looking at the collections themselves, we got the following 
characterization: 


Lucene 3.6
General Index Information:
==========================
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 116
Total size of files in FOLDER: 81558862032 bytes (75.96 GB)

Commit Point Information:
=========================
Version: 1399567203042
Timestamp: 1399593668185
Generation: 6018
Segments file name: segments_4n6
Number of segments: 32
Committed size: 81216915273 bytes (75.64 GB)
Number of files in COMMIT POINT: 89
Total size of files in COMMIT POINT: 81216923390 bytes (75.64 GB)


Lucene 4.7.2: 
General Index Information:
==========================
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 301
Total size of files in FOLDER: 71019073768 bytes (66.14 GB)

Commit Point Information:
=========================
Generation: 4518
Segments file name: segments_3hi
Number of segments: 38
Committed size: 70635339707 bytes (65.78 GB)
Number of files in COMMIT POINT: 115
Total size of files in COMMIT POINT: 70635341223 bytes (65.78 GB)
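
As a small aid for reading the commit point listings above: the suffix of 
the segments file name is the commit generation written in base 36, so 
"Generation" and "Segments file name" carry the same information. The 
sketch below is plain Java and does not use Lucene; the class and method 
names are made up for illustration.

```java
// Illustrative helper (hypothetical names): maps a commit generation to
// the segments file name and back, using base 36 (Character.MAX_RADIX).
final class SegmentsFileName {
    private static final String PREFIX = "segments_";

    // Generation number -> file name, e.g. 6018 -> segments_4n6
    static String forGeneration(long generation) {
        return PREFIX + Long.toString(generation, Character.MAX_RADIX);
    }

    // File name -> generation number, e.g. segments_3hi -> 4518
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a segments file: " + fileName);
        }
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(forGeneration(6018L));          // Lucene 3.6 listing
        System.out.println(generationOf("segments_3hi"));  // Lucene 4.7.2 listing
    }
}
```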

We saw that the collection created by Lucene 4.7.2 was about 10 GB smaller, 
but it had more segments (38 versus 32). 
We thought that the additional segments might account for the search 
degradation, so we decided to run an optimize on the 4.7.2 index before 
rerunning the search test. 

After the optimize, the index was more compact: 

Lucene 4.7.2
General Index Information:
==========================
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 38
Total size of files in FOLDER: 70488334388 bytes (65.65 GB)

Commit Point Information:
=========================
Generation: 4519
Segments file name: segments_3hj
Number of segments: 12
Committed size: 70488333864 bytes (65.65 GB)
Number of files in COMMIT POINT: 37
Total size of files in COMMIT POINT: 70488334368 bytes (65.65 GB)

As expected, the search results were much better: 
        Lucene 4.7.2:           118 queries / sec


We thought that this might be a good direction, so our next step was to 
produce a more compact index during the indexing session itself, without 
running a full optimize at the end. 
To do that, we raised maxMergeMB from 4 GB to 6 GB. The collection was 
indeed more compact (14 segments): 


Win64 4.7.2 merge=6000 commitPoints:
General Index Information:
==========================
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 213
Total size of files in FOLDER: 83038952682 bytes (77.34 GB)

Commit Point Information:
=========================
Generation: 4406
Segments file name: segments_3ee
Number of segments: 14
Committed size: 70324985193 bytes (65.50 GB)
Number of files in COMMIT POINT: 91
Total size of files in COMMIT POINT: 70324985781 bytes (65.50 GB)

But the search results were not good at all: 
        Lucene 4.7.2:           72 queries / sec
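
To put the four measurements in this section side by side, here is a 
minimal sketch in plain Java that expresses each 4.7.2 result as a 
percentage change relative to the Lucene 3.6 baseline of 115 queries / sec. 
The class name and labels are chosen for illustration only; the numbers are 
the ones reported above.

```java
import java.util.Locale;

// Illustrative helper (hypothetical name): relative throughput change.
final class ThroughputDelta {
    // Percent change of measured vs. baseline; negative means slower.
    static double percentChange(double baselineQps, double measuredQps) {
        if (baselineQps <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (measuredQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        double baseline = 115; // Lucene 3.6, queries / sec
        String[] labels = {
            "4.7.2, 38 segments",
            "4.7.2, after optimize",
            "4.7.2, maxMergeMB = 6 GB"
        };
        double[] qps = {74, 118, 72};
        for (int i = 0; i < qps.length; i++) {
            System.out.printf(Locale.ROOT, "%-26s %+6.1f%%%n",
                    labels[i], percentChange(baseline, qps[i]));
        }
    }
}
```

In other words, the unoptimized 4.7.2 index is roughly 36% slower than 3.6, 
the optimized one is about 3% faster, and the index built with the larger 
maxMergeMB is roughly 37% slower.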

Does this make sense? 
We think of "optimize" as mainly decreasing the number of segments in 
the collection and removing deletions. 
In this scenario we had no deletions, and the number of segments did in 
fact decrease substantially (from 38 to 14). 
So why is this not reflected in search performance? Is there any other 
effect or hidden operation of "optimize" that we are missing here? 

(Note that we are using LogByteSizeMergePolicy. We know that 
TieredMergePolicy is supposed to be better in this respect, but it is 
important to us to keep the order of the documents the same between 
commit points.)


2) Search Directory
On Lucene 3.6, we did comprehensive testing and saw that the best search 
performance was reached when using MMapDirectory 
(for indexing we are using SimpleFSDirectory). 
We tried different directories again with Lucene 4.7.2, and while the 
differences were not big, it seems that MMapDirectory is no longer the 
best option: 

Lucene 4.7.2 with MMap:         72 queries / sec
Lucene 4.7.2 with SimpleFS:     84 queries / sec
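
For anyone who wants to reproduce this kind of comparison, below is a 
minimal sketch of a queries / sec measurement in plain Java. It is not our 
actual test harness: the class name, the method name and the Runnable that 
stands in for a single search call are all assumptions made for 
illustration. The warm-up runs are excluded from the timing, which matters 
in particular for a memory-mapped index, where the first accesses are 
dominated by page faults.

```java
import java.util.concurrent.TimeUnit;

// Illustrative single-threaded throughput meter (hypothetical name).
final class QpsMeter {
    // Runs oneQuery warmupRuns times untimed, then measuredRuns times
    // timed, and returns the measured throughput in queries per second.
    static double measureQps(Runnable oneQuery, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            oneQuery.run(); // warm caches, not timed
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            oneQuery.run();
        }
        // Guard against a zero elapsed time for trivially fast queries.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return measuredRuns * (double) TimeUnit.SECONDS.toNanos(1) / elapsedNanos;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would execute a search here.
        Runnable fakeQuery = new Runnable() {
            public void run() {
                Math.sqrt(System.nanoTime());
            }
        };
        System.out.println(measureQps(fakeQuery, 100, 1000) + " queries / sec");
    }
}
```

The same Runnable can be run once per directory implementation, with 
everything else (index, query set, JVM settings) kept identical.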

Were there any changes around the MMap directory that might account for 
this difference? 
If so, do you think that those changes might account for the overall 
performance degradation we are seeing? 

3) Java 6 / Java 7
We are currently running on Java 6 (that is also the reason we stopped at 
Lucene 4.7.2 and did not move to 4.8). 
Is there a reason to believe that the degradation might be connected to 
this? 


Thanks again in advance!