Posted to dev@lucene.apache.org by Nadav Har'El <ny...@math.technion.ac.il> on 2008/06/25 09:27:19 UTC

The 2GB segment size limit

Hi,

Recently an index I've been building passed the 2 GB mark, and after I
optimize()ed it into one segment over 2 GB, it stopped working.

Apparently, this is a known problem (on 32-bit JVMs), and it is mentioned in the
FAQ (http://wiki.apache.org/lucene-java/LuceneFAQ) under the question "Is there
a way to limit the size of an index".

My first problem is that it looks to me like this FAQ entry is passing
outdated advice. My second problem is that we document a bug instead
of fixing it.

The first thing the FAQ recommends is IndexWriter.setMaxMergeDocs().
This solution has two serious problems: first, normally one doesn't know how
many documents one can index before reaching 2 GB, and second, a call to
optimize() appears to ignore this setting and merge everything again - no good!
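
For reference, the FAQ's first suggestion boils down to something like this
(just a sketch; the path is made up, and 7,000,000 is the FAQ's example value,
not something I can justify for my own data):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
                                       new StandardAnalyzer(), true);
  // Segments with more documents than this are not merged any further.
  // But how many of *my* documents fit in 2 GB?  And optimize() seems to
  // merge everything regardless.
  writer.setMaxMergeDocs(7000000);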

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and,
in my opinion, should be unnecessary (since we already have the concept of
segments, why would we need separate indices?).
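
Concretely, that suggestion amounts to something like the sketch below: several
physical indices, each kept under 2 GB by the application, glued together at
search time. The paths are made up.

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MultiSearcher;
  import org.apache.lucene.search.Searchable;

  // One sub-index per directory; the application must keep each one small
  // enough, and must decide which part every new document goes into.
  Searchable[] parts = new Searchable[] {
      new IndexSearcher("/indexes/part1"),
      new IndexSearcher("/indexes/part2"),
  };
  MultiSearcher searcher = new MultiSearcher(parts);

That's a lot of bookkeeping for something the segment mechanism was supposed
to give us for free.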

The third option, labeled the "optimal solution", is to write a new
FSDirectory implementation that represents files over 2 GB as several
files, split at the 2 GB mark. Has anyone ever actually implemented this?

Does anyone have any experience with the 2 GB problem? Is one of these
recommendations *really* the recommended solution? What about the new
LogByteSizeMergePolicy and its setMaxMergeMB setting - wouldn't it be better
to use that? Does anybody know whether optimize() also obeys this limit?
If not, shouldn't it?
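
Something like the following is what I have in mind (a sketch, assuming I'm
reading the current javadocs correctly; 1024 here is the same value I suggest
as a default below):

  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.LogByteSizeMergePolicy;

  LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
  // Segments larger than this (in MB) are supposed to never be merged
  // any further - but does optimize() actually respect it?
  mergePolicy.setMaxMergeMB(1024.0);
  writer.setMergePolicy(mergePolicy);   // writer is an existing IndexWriter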

In short, I'd like to understand the "best practices" of solving the 2 GB
problem, and improve the FAQ in this regard.

Moreover, instead of documenting around the problem, I wonder whether we
should perhaps make the default behavior more correct. In other words, imagine
that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023,
to be on the safe side?). Then segments larger than 1 GB would never be
merged with anything else. Some users (with multi-gigabyte indices on a
64-bit CPU) may not like this default, but they can change it - at least with
this default, Lucene's behavior would be correct on all CPUs and JVMs.

I have one last question that I wonder if anyone can answer before I start
digging into the code. We use merges not just for merging segments, but also
as an opportunity to clean deleted documents out of segments. If some segment
is bigger than the maximum and is never merged again, does this also mean
that its deleted documents will never get cleaned up? This can be a serious
problem for huge dynamic indices (e.g., imagine a crawl of the Web or of some
large intranet).

Nowadays, indices over 2 GB are less rare than they used to be, and 32-bit JVMs
are still quite common, so I think this is a problem we should solve properly.

Thanks,
Nadav.

-- 
Nadav Har'El                        |    Wednesday, Jun 25 2008, 22 Sivan 5768
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps
http://nadav.harel.org.il           |minutes and wastes hours.

Re: The 2GB segment size limit

Posted by Michael McCandless <lu...@mikemccandless.com>.
Nadav Har'El wrote:

> Recently an index I've been building passed the 2 GB mark, and after I
> optimize()ed it into one segment over 2 GB, it stopped working.

Nadav, which platform did you hit this on?  I think I've created a > 2 GB
index on 32-bit WinXP just fine.  How many platforms are really affected
by this?

> Apparently, this is a known problem (on 32-bit JVMs), and it is mentioned
> in the FAQ (http://wiki.apache.org/lucene-java/LuceneFAQ) under the
> question "Is there a way to limit the size of an index".
>
> My first problem is that it looks to me like this FAQ entry is passing
> outdated advice. My second problem is that we document a bug instead of
> fixing it.
>
> The first thing the FAQ recommends is IndexWriter.setMaxMergeDocs().
> This solution has two serious problems: first, normally one doesn't know
> how many documents one can index before reaching 2 GB, and second, a call
> to optimize() appears to ignore this setting and merge everything again -
> no good!

And a 3rd problem is that the limit applies to the input segments (to the
merge), not the output segment.  So the example given of setting maxMergeDocs
to 7M is very likely too high, because if you merge 10 segments, each < 7M
docs, you'll easily get a resulting segment > 2 GB.
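
(To put rough numbers on it: with the default mergeFactor of 10, ten inputs
just under the cap can merge into one output roughly ten times the cap, so to
stay under 2 GB the per-input limit really needs to be on the order of 2 GB
divided by the merge factor - ie around 200 MB for maxMergeMB, or a
correspondingly smaller maxMergeDocs.)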

> The second solution the FAQ recommends (using MultiSearcher) is unwieldy
> and, in my opinion, should be unnecessary (since we already have the
> concept of segments, why would we need separate indices?).
>
> The third option, labeled the "optimal solution", is to write a new
> FSDirectory implementation that represents files over 2 GB as several
> files, split at the 2 GB mark. Has anyone ever actually implemented this?

I agree these two workarounds sound quite challenging to do in practice...

> Does anyone have any experience with the 2 GB problem? Is one of these
> recommendations *really* the recommended solution? What about the new
> LogByteSizeMergePolicy and its setMaxMergeMB setting - wouldn't it be
> better to use that? Does anybody know whether optimize() also obeys this
> limit? If not, shouldn't it?

optimize() doesn't obey it, and the same problem (input vs output) applies
to maxMergeMB as well.

To make optimize() obey these limits, you would have to write your own
MergePolicy.
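
Very roughly, something like this sketch (untested, and treat the exact method
signatures as approximate):

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Iterator;
  import java.util.Set;

  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.index.SegmentInfo;
  import org.apache.lucene.index.SegmentInfos;

  // Sketch: drop segments that are already over a byte cap from the set
  // that optimize() wants merged, then defer to the normal logic.
  public class SizeCappedMergePolicy extends LogByteSizeMergePolicy {

    private final long maxSegmentBytes;

    public SizeCappedMergePolicy(long maxSegmentBytes) {
      this.maxSegmentBytes = maxSegmentBytes;
    }

    public MergeSpecification findMergesForOptimize(SegmentInfos infos,
        IndexWriter writer, int maxNumSegments, Set segmentsToOptimize)
        throws IOException {
      Set small = new HashSet();
      Iterator it = segmentsToOptimize.iterator();
      while (it.hasNext()) {
        SegmentInfo info = (SegmentInfo) it.next();
        // size() is inherited from LogByteSizeMergePolicy and is in bytes.
        if (size(info) <= maxSegmentBytes) {
          small.add(info);
        }
      }
      return super.findMergesForOptimize(infos, writer, maxNumSegments, small);
    }
  }

You'd then pass an instance to IndexWriter.setMergePolicy before calling
optimize().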

> In short, I'd like to understand the "best practices" of solving the 2 GB
> problem, and improve the FAQ in this regard.
>
> Moreover, instead of documenting around the problem, I wonder whether we
> should perhaps make the default behavior more correct. In other words,
> imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024
> (or 1023, to be on the safe side?). Then segments larger than 1 GB would
> never be merged with anything else. Some users (with multi-gigabyte
> indices on a 64-bit CPU) may not like this default, but they can change
> it - at least with this default, Lucene's behavior would be correct on
> all CPUs and JVMs.

I think we should understand how widespread this really is in our userbase.
If it's a minority being affected by it, I think the current defaults are
correct (and it's this minority that should configure Lucene not to produce
too large a segment).

> I have one last question that I wonder if anyone can answer before I start
> digging into the code. We use merges not just for merging segments, but
> also as an opportunity to clean deleted documents out of segments. If some
> segment is bigger than the maximum and is never merged again, does this
> also mean that its deleted documents will never get cleaned up? This can
> be a serious problem for huge dynamic indices (e.g., imagine a crawl of
> the Web or of some large intranet).

Right, the deletes will not be cleaned up.  But you can use expungeDeletes()?
Or, make a MergePolicy that favors merges that would clean up deletes.
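
Ie, run something like this periodically (sketch; I don't recall offhand
whether expungeDeletes() is in a released version yet or only on trunk):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
                                       new StandardAnalyzer(), false);
  // Merges just the segments that contain deletions, reclaiming the space
  // of deleted docs without collapsing the whole index into one segment
  // the way optimize() does.
  writer.expungeDeletes();
  writer.close();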

Mike
