Posted to dev@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2013/06/03 18:11:46 UTC
Documentation for Solr/Lucene 4.x, termIndexInterval and limitations
of Lucene File format
Hello,
The current documentation for the Lucene 4.3 file formats says:
"When referring to term numbers, Lucene's current implementation uses a Java
int to hold the term index, which means the maximum number of unique terms
in any single index segment is ~2.1 billion times the term index interval
(default 128) = ~274 billion. This is technically not a limitation of the
index file format, just of Lucene's current implementation."
(
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#Limitations
)
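For reference, the arithmetic behind the quoted figure can be sketched as below. This is not Lucene code: the class and method names are made up for illustration, and the only assumption is the one the quoted text states, namely that a Java int bounds the term index and the bound is multiplied by the term index interval.

```java
// Illustrative only: reproduces the "~2.1 billion times 128" arithmetic
// from the quoted documentation. Names here are hypothetical.
final class TermLimitSketch {

    // Largest term count implied by an int-valued term index combined
    // with the given term index interval. Uses long to avoid overflow.
    static long maxTerms(int termIndexInterval) {
        return (long) Integer.MAX_VALUE * termIndexInterval;
    }

    public static void main(String[] args) {
        // Default interval from the quoted documentation is 128.
        System.out.println(maxTerms(128)); // prints 274877906816
    }
}
```

The result, 274,877,906,816, is the "~274 billion" in the documentation.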
I believe that termIndexInterval is not used by the default codec
(http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29);
instead, the terms index is now stored in an FST, so the limit above
does not apply to the default codec.
What is the current limit?
I suspect it may be related to the maximum number of nodes in the FST, but
I don't know what that is or how it would translate to number of unique
terms, since prefix sharing among terms probably affects the number of
nodes in the FST.
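To illustrate why prefix sharing matters for the node count, here is a minimal sketch that uses a plain prefix trie. This is a deliberate simplification and not Lucene's FST (which is a different, more compact structure); the class and method names are hypothetical. It only shows that the number of nodes depends on how much the terms overlap, not just on how many terms there are.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a plain prefix trie, NOT Lucene's FST.
final class PrefixSharingSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
    }

    // Inserts every term into a fresh trie and returns how many nodes
    // (excluding the root) had to be created.
    static int countNodes(String... terms) {
        Node root = new Node();
        int nodes = 0;
        for (String term : terms) {
            Node cur = root;
            for (int i = 0; i < term.length(); i++) {
                char c = term.charAt(i);
                Node next = cur.children.get(c);
                if (next == null) {
                    // No existing path for this prefix: allocate a node.
                    next = new Node();
                    cur.children.put(c, next);
                    nodes++;
                }
                cur = next;
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Three terms sharing the prefix "index": 20 characters, 10 nodes.
        System.out.println(countNodes("index", "indexer", "indexing")); // prints 10
        // Three terms with no shared prefix: 15 characters, 15 nodes.
        System.out.println(countNodes("alpha", "bravo", "delta")); // prints 15
    }
}
```

Both calls insert three terms, yet the node counts differ, which is why a node limit does not map directly onto a term limit.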
Tom.
Re: Documentation for Solr/Lucene 4.x, termIndexInterval and
limitations of Lucene File format
Posted by Robert Muir <rc...@gmail.com>.
On Wed, Jun 5, 2013 at 4:21 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:
>
> Nice :) That's good news (that nothing blew up!). Thanks for sharing.
>
With such an old JVM and such a large index, I'd say it's a stroke of pure
luck that nothing blew up.
Re: Documentation for Solr/Lucene 4.x, termIndexInterval and
limitations of Lucene File format
Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jun 5, 2013 at 2:47 PM, Tom Burton-West <tb...@umich.edu> wrote:
> 13 Billion unique terms. (CheckIndex output appended below)
Nice :) That's good news (that nothing blew up!). Thanks for sharing.
Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Documentation for Solr/Lucene 4.x, termIndexInterval and
limitations of Lucene File format
Posted by Tom Burton-West <tb...@umich.edu>.
Hi Mike,
13 Billion unique terms. (CheckIndex output appended below)
Tom
------
test: terms, freq, prox...OK [13,068,302,002 terms; 187,284,275,343
terms/docs pairs; 786,014,075,745 tokens]
Segments file=segments_6 numSegments=2 version=4.0.0.2 format=
userData={commitTimeMSec=1357596564850}
1 of 2: name=_uhj docCount=866984
codec=Lucene40
compound=false
numFiles=10
size (MB)=2,048,537.68
diagnostics = {os=Linux, os.version=2.6.18-308.24.1.el5, mergeFactor=8,
source=merge, lucene.version=4.0.0 1394950 - rmuir - 2012-10-06 03:00:40,
os.arch=amd64, mergeMaxNumSegments=1, java.version=1.6.0_16,
java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [92 fields]
test: field norms.........OK [46 fields]
test: terms, freq, prox...OK [13068302002 terms; 187284275343
terms/docs pairs; 786014075745 tokens]
test: stored fields.......OK [34172522 total field count; avg 39.415
fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq
vector fields per doc]
test: DocValues........OK [0 total doc Count; Num DocValues Fields 0
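To put the term count reported above in context, the following sketch parses the figure from the CheckIndex summary line and compares it with the range of a Java int. It is plain Java rather than Lucene code, and the class and method names are hypothetical.

```java
// Illustrative only: compares the reported term count with the int range.
final class TermCountSketch {

    // Parses a count formatted with thousands separators, e.g. "13,068,302,002".
    static long parseCount(String formatted) {
        return Long.parseLong(formatted.replace(",", ""));
    }

    // True if the count could be held in a Java int.
    static boolean fitsInInt(long count) {
        return count <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long terms = parseCount("13,068,302,002");
        System.out.println(fitsInInt(terms)); // prints false
        // Roughly how many times larger than Integer.MAX_VALUE the count is.
        System.out.println(terms / (double) Integer.MAX_VALUE);
    }
}
```

The count is about six times Integer.MAX_VALUE, so it does not fit in a Java int, while still being below the ~274 billion figure quoted from the documentation.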
On Tue, Jun 4, 2013 at 1:00 PM, Tom Burton-West <tb...@umich.edu> wrote:
> Thanks Mike.
>
> I'm running CheckIndex on the 2TB index right now. Hopefully it will
> finish running by tomorrow. I'll send you a copy of the output.
>
> Tom
>
>
> On Mon, Jun 3, 2013 at 9:04 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Hi Tom,
>>
>> On Mon, Jun 3, 2013 at 12:11 PM, Tom Burton-West <tb...@umich.edu>
>> wrote:
>>
>> > What is the current limit?
>>
>> I *think* (but would be nice to hear back how many terms you were able
>> to index into one segment ;) ) there is no hard limit to the max
>> number of terms, now that FSTs can handle more than 2.1 B
>> bytes/nodes/arcs.
>>
>> I'll update those javadocs, thanks!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
Re: Documentation for Solr/Lucene 4.x, termIndexInterval and
limitations of Lucene File format
Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike.
I'm running CheckIndex on the 2TB index right now. Hopefully it will
finish running by tomorrow. I'll send you a copy of the output.
Tom
On Mon, Jun 3, 2013 at 9:04 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:
> Hi Tom,
>
> On Mon, Jun 3, 2013 at 12:11 PM, Tom Burton-West <tb...@umich.edu>
> wrote:
>
> > What is the current limit?
>
> I *think* (but would be nice to hear back how many terms you were able
> to index into one segment ;) ) there is no hard limit to the max
> number of terms, now that FSTs can handle more than 2.1 B
> bytes/nodes/arcs.
>
> I'll update those javadocs, thanks!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
Re: Documentation for Solr/Lucene 4.x, termIndexInterval and
limitations of Lucene File format
Posted by Michael McCandless <lu...@mikemccandless.com>.
Hi Tom,
On Mon, Jun 3, 2013 at 12:11 PM, Tom Burton-West <tb...@umich.edu> wrote:
> What is the current limit?
I *think* (but would be nice to hear back how many terms you were able
to index into one segment ;) ) there is no hard limit to the max
number of terms, now that FSTs can handle more than 2.1 B
bytes/nodes/arcs.
I'll update those javadocs, thanks!
Mike McCandless
http://blog.mikemccandless.com