Posted to dev@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2013/06/03 18:11:46 UTC

Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

Hello,

The current documentation for Lucene 4.3 file formats says

When referring to term numbers, Lucene's current implementation uses a Java
int to hold the term index, which means the maximum number of unique terms
in any single index segment is ~2.1 billion times the term index interval
(default 128) = ~274 billion. This is technically not a limitation of the
index file format, just of Lucene's current implementation.

(http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#Limitations)
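
As a quick check of the arithmetic in the quoted passage, here is a minimal Java sketch. It assumes the "~2.1 billion" in the docs means Integer.MAX_VALUE; the class and method names are made up for illustration and are not part of Lucene.

```java
final class TermLimitSketch {
    // Assumption: the documented "~2.1 billion" is Integer.MAX_VALUE (2^31 - 1).
    static long maxTermsPerSegment(int termIndexInterval) {
        // Widen to long before multiplying so the product does not overflow int.
        return (long) Integer.MAX_VALUE * termIndexInterval;
    }

    public static void main(String[] args) {
        // Default termIndexInterval quoted in the docs is 128.
        System.out.println(maxTermsPerSegment(128)); // 274877906816
    }
}
```

That is where the "~274 billion" figure comes from: 2,147,483,647 * 128.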

I believe that termIndexInterval is not used by the default codec
(http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29)
and that the terms index is instead stored in an FST.

So the above limit does not apply to the default codec.
What is the current limit?

I suspect it may be related to the maximum number of nodes in the FST, but
I don't know what that is or how it would translate to number of unique
terms, since prefix sharing among terms probably affects the number of
nodes in the FST.
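
To illustrate why the node count does not map directly onto the term count, here is a small Java sketch that uses a plain trie as a stand-in. This is not Lucene's FST (an FST also shares suffixes and stores outputs); the class names and sample terms are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixSharingSketch {
    // One trie node; children are keyed by the next character of a term.
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    // Insert every term, reusing existing nodes for shared prefixes.
    static Node build(List<String> terms) {
        Node root = new Node();
        for (String term : terms) {
            Node cur = root;
            for (char c : term.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.isTerm = true;
        }
        return root;
    }

    // Count all nodes in the trie, including the root.
    static long countNodes(Node node) {
        long n = 1;
        for (Node child : node.children.values()) {
            n += countNodes(child);
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("index", "indexer", "indexing", "inside");
        // 26 characters in total, but shared prefixes leave 15 nodes (root included).
        System.out.println(countNodes(build(terms)));
    }
}
```

The more prefixes the terms share, the fewer nodes each additional term adds, so a node limit would not translate into a fixed term limit.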

Tom.

Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Jun 5, 2013 at 4:21 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Nice :)  That's good news (that nothing blew up!).  Thanks for sharing.
>

With such an old JVM and such a large index, I'd say it's a stroke of pure
luck that nothing blew up.

Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jun 5, 2013 at 2:47 PM, Tom Burton-West <tb...@umich.edu> wrote:

> 13 Billion unique terms.  (CheckIndex output appended below)

Nice :)  That's good news (that nothing blew up!).  Thanks for sharing.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Mike,

13 Billion unique terms.  (CheckIndex output appended below)

Tom
------

 test: terms, freq, prox...OK [13,068,302,002 terms; 187,284,275,343
terms/docs pairs; 786,014,075,745 tokens]

Segments file=segments_6 numSegments=2 version=4.0.0.2 format=
userData={commitTimeMSec=1357596564850}
  1 of 2: name=_uhj docCount=866984
    codec=Lucene40
    compound=false
    numFiles=10
    size (MB)=2,048,537.68
    diagnostics = {os=Linux, os.version=2.6.18-308.24.1.el5, mergeFactor=8, source=merge, lucene.version=4.0.0 1394950 - rmuir - 2012-10-06 03:00:40, os.arch=amd64, mergeMaxNumSegments=1, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [92 fields]
    test: field norms.........OK [46 fields]
    test: terms, freq, prox...OK [13068302002 terms; 187284275343 terms/docs pairs; 786014075745 tokens]
    test: stored fields.......OK [34172522 total field count; avg 39.415 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    test: DocValues........OK [0 total doc Count; Num DocValues Fields 0
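
To put the CheckIndex figures in context, here is a small Java sketch that plugs in the numbers from the output above. The class and helper names are made up for illustration; nothing here calls Lucene.

```java
final class TermCountSketch {
    // True if the value can be held in a Java int without overflow.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    // Plain floating-point ratio of two long counts.
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long terms = 13_068_302_002L;   // "terms" from the CheckIndex output
        long pairs = 187_284_275_343L;  // "terms/docs pairs" from the same line

        System.out.println("term count fits in int: " + fitsInInt(terms));
        System.out.printf("terms / Integer.MAX_VALUE = %.2f%n", ratio(terms, Integer.MAX_VALUE));
        System.out.printf("average docs per term = %.2f%n", ratio(pairs, terms));
    }
}
```

The term count is roughly six times Integer.MAX_VALUE, so this single segment holds far more unique terms than a plain int counter could number.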




Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Mike.

I'm running CheckIndex on the 2TB index right now.    Hopefully it will
finish running by tomorrow.  I'll send you a copy of the output.

Tom



Re: Documentation for Solr/Lucene 4.x, termIndexInterval and limitations of Lucene File format

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hi Tom,

On Mon, Jun 3, 2013 at 12:11 PM, Tom Burton-West <tb...@umich.edu> wrote:

> What is the current limit?

I *think* (but would be nice to hear back how many terms you were able
to index into one segment ;) ) there is no hard limit to the max
number of terms, now that FSTs can handle more than 2.1 B
bytes/nodes/arcs.

I'll update those javadocs, thanks!

Mike McCandless

http://blog.mikemccandless.com
