You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Ma...@materna.de on 2004/07/08 19:08:01 UTC

Understanding TooManyClauses-Exception and Query-RAM-size

Hi,

a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
smoothly, but we are experiencing some problems with that new constant limit


	maxClauseCount=1024

which leeds to Exceptions of type 

	org.apache.lucene.search.BooleanQuery$TooManyClauses 

when certain RangeQueries are executed (in fact, we get this Excpetion when
we execute certain Wildcard queries, too). Although we are working with a
fairly small index with about 35.000 documents, we encounter this Exception
when we search for the property "modificationDate". For example

	modificationDate:[000000 TO 0dwc970kw] 

It is my understanding that Lucene iterates over all modificationDate-Fields
inside the index and expands that RangeQuery into some kind of an internal
OR-Query using every modificationDate entry that fits the given range. 

Obviously, we will have many different modificationDates which means that we
will easily hit the limit of 1024 clauses. I know that we can increase the
limit by setting the JVM-Parameter

	-Dorg.apache.lucene.maxClauseCount=100000

or choose an even higher number that exceeds the number of indexed
documents, but then again this limit was introduced for a reason which is to
prevent OutOfMemory-Exceptions in case the internal OR-Query gets too large.


I am wondering how people with millions of indexed documents handle this
situation. Do you take care not to construct those type of "large"
RangeQuery by keeping the query's upper and lower boundaries close together?
Do you keep increasing the maxClauseCount limit as your index grows? 

A couple of days ago Doug Cutting gave a formula which estimates the
RAM-size of a given Query which was 


1 byte * Number of searchable fields in your index * Number of docs in 
your index

plus

1k bytes * number of terms in query

plus

1k bytes * number of phrase terms in query

According to Luke, our index consists of about 500 fields 35.000 documents.
The first line gives 

1 * 500 * 35.000 = 17.500.000 Bytes = 17.5 MB

when we assume that the RangeQuery includes virtually all of our documents
the second line gives

1 k * 35.000 = 35 MB 

we don't expect too many phrase terms so we have a total size of about 52.5
MB per query. 

Once again, our index is pretty small, still we get that pretty big
query-RAM-sizes. If we increase the number of indexed documents by a factor
of 10 we will have 520 MB which in my opinion is getting out of hand. 

Am I missing something? How do other people handle this situation? 


Thanks,

Martin Stein











-- 
Dipl.-Math. Martin Stein
Software Engineer 
CC Portale und Content Management
Business Unit Information
_________________________________________________
MATERNA GmbH Information & Communications
Vosskuhle 37 * 44141 Dortmund * Germany
phone.: +49 231 5599 - 8614 * fax: -8975  
<http://www.materna.de>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Understanding TooManyClauses-Exception and Query-RAM-size

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Martin.Stein@materna.de wrote:

>Hi,
>
>a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
>smoothly, but we are experiencing some problems with that new constant limit
>
>
>	maxClauseCount=1024
>
>which leeds to Exceptions of type 
>
>	org.apache.lucene.search.BooleanQuery$TooManyClauses 
>
>when certain RangeQueries are executed (in fact, we get this Excpetion when
>we execute certain Wildcard queries, too). Although we are working with a
>fairly small index with about 35.000 documents, we encounter this Exception
>when we search for the property "modificationDate". For example
>
>	modificationDate:[000000 TO 0dwc970kw] 
>
>  
>
We talked about this the other day.

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Find out what type of precision you need and use that.  If you only need 
days or hours or minutes then use that.   Millis is just too small. 

We're only using days and have queries for just the last 7 days as max 
so this really works out well...

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org