Posted to dev@lucene.apache.org by Chuck Williams <ch...@manawiz.com> on 2006/11/09 19:24:17 UTC

Dynamically varying maxBufferedDocs

Hi All,

Does anybody have experience dynamically varying maxBufferedDocs?  In my
app, I can never truncate docs and so work with maxFieldLength set to
Integer.MAX_VALUE.  Some documents are large, over 100 MBytes.  Most
documents are tiny.  So any fixed value of maxBufferedDocs small enough
to avoid OOMs is too small for good ongoing performance.

It appears to me that the merging code will work fine if the initial
segment sizes vary.  E.g., a simple solution is to make
IndexWriter.flushRamSegments() public and manage this externally (for
which I already have all the needed apparatus, including size
information, the necessary thread synchronization, etc.).
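
As a rough illustration of this kind of external management, here is a toy model in plain Java (no actual Lucene classes; the flushRamSegments() below is only a stand-in for the real method, and the per-document size is supplied by the application, since Lucene has no general document-sizing hook):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of size-driven flushing: documents of varying sizes are
// buffered, and a flush is forced once the buffered bytes reach a cap,
// regardless of how many documents are buffered.
class SizeBoundedBuffer {
    private final long maxBufferedBytes;
    private final List<Long> bufferedSizes = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    SizeBoundedBuffer(long maxBufferedBytes) {
        this.maxBufferedBytes = maxBufferedBytes;
    }

    // Stand-in for IndexWriter.addDocument(); the caller supplies the
    // document's estimated size in bytes.
    synchronized void addDocument(long docSizeInBytes) {
        bufferedSizes.add(docSizeInBytes);
        bufferedBytes += docSizeInBytes;
        if (bufferedBytes >= maxBufferedBytes) {
            flushRamSegments();  // stand-in for the real (made-public) method
        }
    }

    // Stand-in for IndexWriter.flushRamSegments(): empties the buffer.
    synchronized void flushRamSegments() {
        bufferedSizes.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    synchronized int getFlushCount() { return flushCount; }
    synchronized long getBufferedBytes() { return bufferedBytes; }
}
```

With a 100-byte cap, two 30-byte documents leave the buffer intact, while a third 50-byte document pushes the total to 110 and triggers a flush.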

A better solution might be to build a size-management option into the
maxBufferedDocs mechanism in lucene, but at least for my purposes, that
doesn't appear necessary as a first step.

My main concern is that the mergeFactor escalation merging logic will
somehow behave poorly in the presence of dynamically varying initial
segment sizes.

I'm going to try this now, but am wondering if anybody has tried things
along these lines and might offer useful suggestions or admonitions.

Thanks for any advice,

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Dynamically varying maxBufferedDocs

Posted by Chuck Williams <ch...@manawiz.com>.

Chuck Williams wrote on 11/09/2006 08:55 AM:
> Yonik Seeley wrote on 11/09/2006 08:50 AM:
>   
>> For best behavior, you probably want to be using the current
>> (svn-trunk) version of Lucene with the new merge policy.  It ensures
>> there are mergeFactor segments with size <= maxBufferedDocs before
>> triggering a merge.  This makes for faster indexing in the presence of
>> deleted docs or partially full segments.
>>
>>     
>
> I've got quite a few local patches unfortunately.  It will take a while
> to sync up.  If I don't already have this new logic, can I pick it up by
> just merging with the latest IndexWriter or are the changes more extensive?
>   
I must already have the new merge logic as the only diff between my
IndexWriter and latest svn is the change just made to make
flushRamSegments public.

Yonik, thanks for your help.  This should work well!

Chuck




Re: Dynamically varying maxBufferedDocs

Posted by Chuck Williams <ch...@manawiz.com>.
Yonik Seeley wrote on 11/09/2006 08:50 AM:
> For best behavior, you probably want to be using the current
> (svn-trunk) version of Lucene with the new merge policy.  It ensures
> there are mergeFactor segments with size <= maxBufferedDocs before
> triggering a merge.  This makes for faster indexing in the presence of
> deleted docs or partially full segments.
>

I've got quite a few local patches unfortunately.  It will take a while
to sync up.  If I don't already have this new logic, can I pick it up by
just merging with the latest IndexWriter or are the changes more extensive?

Thanks again,

Chuck




Re: Dynamically varying maxBufferedDocs

Posted by Chuck Williams <ch...@manawiz.com>.
Michael Busch wrote on 11/09/2006 09:56 AM:
>
>> This sounds good.  Michael, I'd love to see your patch,
>>
>> Chuck
>
> Ok, I'll probably need a few days before I can submit it (have to code
> unit tests and check if it compiles with the current head), because
> I'm quite busy with other stuff right now. But you will get it soon :-)

I've just written my patch and will submit it too once it is fully
tested.  I took this approach:

   1. Add sizeInBytes() to RAMDirectory
   2. Make flushRamSegments() plus new numRamDocs() and ramSizeInBytes()
      public in IndexWriter


This does not provide the facility in IndexWriter, but it does provide a
nice API to manage this externally.  I didn't do it in IndexWriter for
two reasons:

   1. I use ParallelWriter, which has to manage this differently
   2. There is no general mechanism in Lucene to size documents.  I
      have an interface for my readers in reader-valued fields to
      support this.


In general, the application knows things that Lucene doesn't, and those
help to manage the size bounds.
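
For what it's worth, the sizing hook in point 2 might look something like this sketch (Sizeable and StringFieldValue are hypothetical names for illustration; Lucene provides no such interface):

```java
// Hypothetical application-side interface for estimating a document's
// in-RAM indexing cost before handing it to the writer. Lucene has no
// built-in equivalent, so the application supplies the estimate.
interface Sizeable {
    long sizeInBytes();
}

// Example: a field value backed by a string, whose buffered cost is
// roughly approximated from its character count (2 bytes per char,
// as a crude UTF-16 estimate).
class StringFieldValue implements Sizeable {
    private final String text;

    StringFieldValue(String text) {
        this.text = text;
    }

    public long sizeInBytes() {
        return 2L * text.length();
    }
}
```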

Chuck




Re: Dynamically varying maxBufferedDocs

Posted by Michael Busch <bu...@gmail.com>.
> This sounds good.  Michael, I'd love to see your patch,
>
> Chuck

Ok, I'll probably need a few days before I can submit it (have to code 
unit tests and check if it compiles with the current head), because I'm 
quite busy with other stuff right now. But you will get it soon :-)



Re: Dynamically varying maxBufferedDocs

Posted by Chuck Williams <ch...@manawiz.com>.
This sounds good.  Michael, I'd love to see your patch,

Chuck


Michael Busch wrote on 11/09/2006 09:13 AM:
> I had the same problem with large documents causing memory problems. I
> solved this problem by introducing a new setting in IndexWriter
> setMaxBufferSize(long). Now a merge is either triggered when
> bufferedDocs==maxBufferedDocs *or* the size of the bufferedDocs >=
> maxBufferSize. I made these changes based on the new merge policy
> Yonik mentioned, so if anyone is interested I could open a Jira issue
> and submit a patch.
>
> - Michael
>
>
> Yonik Seeley wrote:
>> On 11/9/06, Chuck Williams <ch...@manawiz.com> wrote:
>>> Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs,
>>> just am making flushRamSegments() public and calling it externally
>>> (properly synchronized), earlier than it would otherwise be called from
>>> ongoing addDocument-driven merging.
>>>
>>> Sounds like this should work.
>>
>> Yep.
>> For best behavior, you probably want to be using the current
>> (svn-trunk) version of Lucene with the new merge policy.  It ensures
>> there are mergeFactor segments with size <= maxBufferedDocs before
>> triggering a merge.  This makes for faster indexing in the presence of
>> deleted docs or partially full segments.
>>
>> -Yonik
>> http://incubator.apache.org/solr Solr, the open-source Lucene search
>> server
>>
>>
>>
>
>
>




Re: Dynamically varying maxBufferedDocs

Posted by Michael Busch <bu...@gmail.com>.
I had the same problem with large documents causing memory problems. I 
solved this problem by introducing a new setting in IndexWriter 
setMaxBufferSize(long). Now a merge is either triggered when 
bufferedDocs==maxBufferedDocs *or* the size of the bufferedDocs >= 
maxBufferSize. I made these changes based on the new merge policy Yonik 
mentioned, so if anyone is interested I could open a Jira issue and 
submit a patch.
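
The dual trigger described here reduces to a simple predicate; a sketch (the name setMaxBufferSize(long) is from the description above, the rest is illustrative, not the actual patch):

```java
// Illustrative flush trigger combining the existing document-count
// bound (maxBufferedDocs) with a new byte-size bound (maxBufferSize):
// a flush/merge fires as soon as either bound is reached.
class FlushPolicy {
    private final int maxBufferedDocs;
    private final long maxBufferSize;

    FlushPolicy(int maxBufferedDocs, long maxBufferSize) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.maxBufferSize = maxBufferSize;
    }

    boolean shouldFlush(int bufferedDocs, long bufferedBytes) {
        return bufferedDocs >= maxBufferedDocs
            || bufferedBytes >= maxBufferSize;
    }
}
```

A few tiny documents never hit either bound; one huge document hits the byte bound long before the count bound, which is exactly the OOM case being avoided.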

- Michael


Yonik Seeley wrote:
> On 11/9/06, Chuck Williams <ch...@manawiz.com> wrote:
>> Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs,
>> just am making flushRamSegments() public and calling it externally
>> (properly synchronized), earlier than it would otherwise be called from
>> ongoing addDocument-driven merging.
>>
>> Sounds like this should work.
>
> Yep.
> For best behavior, you probably want to be using the current
> (svn-trunk) version of Lucene with the new merge policy.  It ensures
> there are mergeFactor segments with size <= maxBufferedDocs before
> triggering a merge.  This makes for faster indexing in the presence of
> deleted docs or partially full segments.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search 
> server
>
>
>




Re: Dynamically varying maxBufferedDocs

Posted by Yonik Seeley <yo...@apache.org>.
On 11/9/06, Chuck Williams <ch...@manawiz.com> wrote:
> Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs,
> just am making flushRamSegments() public and calling it externally
> (properly synchronized), earlier than it would otherwise be called from
> ongoing addDocument-driven merging.
>
> Sounds like this should work.

Yep.
For best behavior, you probably want to be using the current
(svn-trunk) version of Lucene with the new merge policy.  It ensures
there are mergeFactor segments with size <= maxBufferedDocs before
triggering a merge.  This makes for faster indexing in the presence of
deleted docs or partially full segments.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: Dynamically varying maxBufferedDocs

Posted by Chuck Williams <ch...@manawiz.com>.
Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs,
just am making flushRamSegments() public and calling it externally
(properly synchronized), earlier than it would otherwise be called from
ongoing addDocument-driven merging.

Sounds like this should work.

Chuck


Yonik Seeley wrote on 11/09/2006 08:37 AM:
> On 11/9/06, Chuck Williams <ch...@manawiz.com> wrote:
>> My main concern is that the mergeFactor escalation merging logic will
>> somehow behave poorly in the presence of dynamically varying initial
>> segment sizes.
>
> Things will work as expected with varying segment sizes, but *not*
> varying maxBufferedDocuments.  The "level" of a segment is defined by
> maxBufferedDocuments.
>
> If there were a solution to flush early w/o maxBufferedDocuments
> changing, things would work fine.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
> server
>
>




Re: Dynamically varying maxBufferedDocs

Posted by Yonik Seeley <yo...@apache.org>.
On 11/9/06, Chuck Williams <ch...@manawiz.com> wrote:
> My main concern is that the mergeFactor escalation merging logic will
> somehow behave poorly in the presence of dynamically varying initial
> segment sizes.

Things will work as expected with varying segment sizes, but *not*
varying maxBufferedDocuments.  The "level" of a segment is defined by
maxBufferedDocuments.

If there were a solution to flush early w/o maxBufferedDocuments
changing, things would work fine.
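
As a rough conceptual model of that leveling (an illustration of the idea, not Lucene's actual code), a segment's level can be thought of as the number of times maxBufferedDocs must be multiplied by mergeFactor before it covers the segment's doc count:

```java
// Conceptual model: a freshly flushed segment of up to maxBufferedDocs
// docs sits at level 0, and each merge of mergeFactor segments yields a
// segment one level up. If maxBufferedDocs itself changes between
// flushes, segments stop mapping cleanly onto levels.
class SegmentLevels {
    static int levelOf(int docCount, int maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long cap = maxBufferedDocs;
        while (docCount > cap) {
            cap *= mergeFactor;
            level++;
        }
        return level;
    }
}
```

In this model a small, early-flushed segment still lands at level 0 alongside full segments, which is why varying sizes are fine while varying maxBufferedDocs is not.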

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
