Posted to java-user@lucene.apache.org by Tim Johnson <ti...@saic.com> on 2005/08/10 21:29:44 UTC

Indexing terms limit

I'm currently attempting to index the distinct list of terms found in a
Lucene index using the TermEnum.  I'm creating a document with each list
and indexing the document of terms.  It appears there's a limit of
10,000 distinct terms within a given document.  Can this be overcome??


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing terms limit

Posted by Chris Lamprecht <cl...@gmail.com>.
See IndexWriter.setMaxFieldLength(), I think it's what you want:

from javadocs:

public void setMaxFieldLength(int maxFieldLength)

The maximum number of terms that will be indexed for a single field in
a document. This limits the amount of memory required for indexing, so
that collections with very large files will not crash the indexing
process by running out of memory.
Note that this effectively truncates large documents, excluding from
the index terms that occur further in the document. If you know your
source documents are large, be sure to set this value high enough to
accommodate the expected size. If you set it to Integer.MAX_VALUE, then
the only limit is your memory, but you should anticipate an
OutOfMemoryError.

By default, no more than 10,000 terms will be indexed for a field.
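
To make the truncation concrete, here is a toy model in plain Java. It
is not Lucene code: the class FieldLengthModel and its methods are
hypothetical names invented for illustration, and the sketch assumes
only what the javadoc above states, namely that terms past the first
maxFieldLength in a field are left out of the index. In Lucene itself
the relevant call is the setMaxFieldLength(int) method quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: mimics the documented truncation behaviour,
// not Lucene's actual implementation.
final class FieldLengthModel {

    /** Distinct terms that survive when only the first maxFieldLength tokens are kept. */
    static Set<String> indexedTerms(List<String> tokens, int maxFieldLength) {
        Set<String> kept = new LinkedHashSet<>();
        int limit = Math.min(tokens.size(), maxFieldLength);
        for (int i = 0; i < limit; i++) {
            kept.add(tokens.get(i));
        }
        return kept;
    }

    /** Builds a field of n distinct terms: term0, term1, ... */
    static List<String> distinctTerms(int n) {
        List<String> tokens = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            tokens.add("term" + i);
        }
        return tokens;
    }
}
```

Under this model, a document holding 25,000 distinct terms keeps only
the first 10,000 at the default limit, and keeps all of them when the
limit is raised to Integer.MAX_VALUE (memory permitting).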

On 8/10/05, Tim Johnson <ti...@saic.com> wrote:
> I'm currently attempting to index the distinct list of terms found in a
> Lucene index using the TermEnum.  I'm creating a document with each list
> and indexing the document of terms.  It appears there's a limit of
> 10,000 distinct terms within a given document.  Can this be overcome??


RE: Indexing terms limit

Posted by Tim Johnson <ti...@saic.com>.
I posted before about searching multiple indexes to get the total
number of docs that match a query within a given category.  The best
performance I found was to have a separate index for each category,
iterate over the categories, and call hits.length() on each to get the
total hits.


Each query had to iterate over about 500 categories and, as you can
imagine, not every search has hits within every category.  So some of
the iteration can be avoided by determining which categories have terms
that match a search and iterating over only those indexes to get a
total hit count.

I've indexed the list of terms within each index, and I search that
index first to get a distinct list of categories where the terms match,
thus reducing (in theory) the number of indexes I need to iterate over.
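
The pre-filtering idea described above can be sketched in plain Java
with an in-memory map standing in for the "index of terms".  This is
only an illustration under stated assumptions: the class
CategoryPrefilter and its methods are hypothetical, no Lucene calls are
involved, and the query is assumed to require every term (AND
semantics); for OR queries the sets would be unioned instead of
intersected.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the per-category term index described above.
final class CategoryPrefilter {

    // term -> names of the categories whose index contains that term
    private final Map<String, Set<String>> categoriesByTerm = new HashMap<>();

    /** Registers the distinct terms found in one category's index. */
    void addCategoryTerms(String category, Collection<String> terms) {
        for (String term : terms) {
            categoriesByTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(category);
        }
    }

    /** Categories that contain every query term (assumed AND semantics). */
    Set<String> candidates(Collection<String> queryTerms) {
        Set<String> result = null;
        for (String term : queryTerms) {
            Set<String> cats = categoriesByTerm.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(cats);   // copy so the map is never mutated
            } else {
                result.retainAll(cats);         // intersect with the next term's categories
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Only the categories returned by candidates(...) would then be searched
for a hit count.  Note that a category containing all the terms can
still have zero hits for the full query, so this narrows the set of
indexes to search rather than replacing the search.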

If you have any suggestions on getting at the total hits faster, please
feel free to offer them.

Thanks
Tim

-----Original Message-----
From: hossman@hal.rescomp.berkeley.edu
[mailto:hossman@hal.rescomp.berkeley.edu] On Behalf Of Chris Hostetter
Sent: Wednesday, August 10, 2005 4:41 PM
To: java-user@lucene.apache.org
Subject: Re: Indexing terms limit


: I'm currently attempting to index the distinct list of terms found in a
: Lucene index using the TermEnum.  I'm creating a document with each list
: and indexing the document of terms.  It appears there's a limit of
: 10,000 distinct terms within a given document.  Can this be overcome??

Out of curiosity: why are you doing this?



-Hoss




Re: Indexing terms limit

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm currently attempting to index the distinct list of terms found in a
: Lucene index using the TermEnum.  I'm creating a document with each list
: and indexing the document of terms.  It appears there's a limit of
: 10,000 distinct terms within a given document.  Can this be overcome??

Out of curiosity: why are you doing this?



-Hoss

