Posted to solr-user@lucene.apache.org by Shalin Shekhar Mangar <sh...@gmail.com> on 2009/09/17 22:30:08 UTC

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

On Fri, Sep 18, 2009 at 1:06 AM, Jibo John <ji...@mac.com> wrote:

> Hello,
>
> Came across a Lucene patch (
> http://issues.apache.org/jira/browse/LUCENE-1634) that would consider the
> number of deleted documents as a criterion when deciding which segments to
> merge.
>
> Since we expect to have very frequent deletes, we hope this would help
> reclaim the space consumed by the deleted documents in a much more efficient
> way.
>
> Currently, we can specify a mergepolicy in solrconfig.xml like this:
>
>
>
>  <!--<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>-->
>
>
> However, by default, calibrateSizeByDeletes = false in LogMergePolicy.
>
> I was wondering if there is a way I can modify calibrateSizeByDeletes just
> by configuration?
>

Alas, no. The only option that I see for you is to sub-class
LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
constructor. However, please open a Jira issue so that we don't forget about
it.
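
The subclass route amounts to one call in the constructor (e.g. calling
setCalibrateSizeByDeletes(true) from a LogByteSizeMergePolicy subclass). What
the flag changes can be sketched in isolation; the class and numbers below
are illustrative only, not Lucene's code:

```java
// Toy model (not Lucene's implementation) of what calibrateSizeByDeletes
// does: when enabled, the merge policy sizes a segment by its live docs,
// so a large segment full of deletes looks small and gets merged sooner.
public class CalibratedSizeDemo {
    static int sizeInDocs(int maxDoc, int delCount, boolean calibrateSizeByDeletes) {
        return calibrateSizeByDeletes ? maxDoc - delCount : maxDoc;
    }

    public static void main(String[] args) {
        // A 1M-doc segment in which 90% of the docs are deleted:
        System.out.println(sizeInDocs(1000000, 900000, false)); // 1000000
        System.out.println(sizeInDocs(1000000, 900000, true));  // 100000
    }
}
```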

Also, you might be interested in expungeDeletes, which has been added as a
request parameter for commits. Calling commit with expungeDeletes=true will
remove all deleted documents from the index, but unlike an optimize, it won't
always reduce the index to a single segment.
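
For example (the host, port, and core path here are assumptions for
illustration, not from this thread):

```shell
# Request-parameter form of a commit that expunges deletes:
curl 'http://localhost:8983/solr/update?commit=true&expungeDeletes=true'
```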

-- 
Regards,
Shalin Shekhar Mangar.

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jibo John <ji...@mac.com>.
On Sep 17, 2009, at 1:30 PM, Shalin Shekhar Mangar wrote:

> On Fri, Sep 18, 2009 at 1:06 AM, Jibo John <ji...@mac.com> wrote:
>
>> Hello,
>>
>> Came across a lucene patch (
>> http://issues.apache.org/jira/browse/LUCENE-1634) that would  
>> consider the
>> number of deleted documents as the criteria when deciding which  
>> segments to
>> merge.
>>
>> Since we expect to have very frequent deletes, we hope this would  
>> help
>> reclaim the space consumed by the deleted documents in a much more  
>> efficient
>> way.
>>
>> Currently, we can specify a mergepolicy in solrconfig.xml like this:
>>
>>
>>
>> <!--<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</ 
>> mergePolicy>-->
>>
>>
>> However, by default, calibrateSizeByDeletes = false in  
>> LogMergePolicy.
>>
>> I was wondering if there is a way I can modify  
>> calibrateSizeByDeletes just
>> by configuration ?
>>
>
> Alas, no. The only option that I see for you is to sub-class
> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
> constructor. However, please open a Jira issue so that we don't forget about
> it.

Created a Jira issue: https://issues.apache.org/jira/browse/SOLR-1444

>
> Also, you might be interested in expungeDeletes which has been added  
> as a
> request parameter for commits. Calling commit with  
> expungeDeletes=true will
> remove all deleted documents from the index but unlike an optimize  
> it won't
> always reduce the index to a single segment.

Thanks for this information. Will explore this.



>
> -- 
> Regards,
> Shalin Shekhar Mangar.


Thanks,
-Jibo

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Ted Dunning <te...@gmail.com>.
Actually, I strongly disagree.  If you optimize for this case, you are
pessimizing for the real world.

It would be much better to fit a realistic life cycle or just record a trace
of profile updates (no need for content, just an abstract id for each
profile that got updated).

On Mon, Sep 21, 2009 at 6:30 PM, John Wang <jo...@gmail.com> wrote:

>      We do see people updating their profiles, and the assumption is that
> every member is likely to update their profile (a bit aggressive, I'd
> agree, but nevertheless a safe upper bound)
>



-- 
Ted Dunning, CTO
DeepDyve

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by John Wang <jo...@gmail.com>.
Hi Ted:

     In our case it is profile updates. Each profile -> 1 document keyed on
member id.

     We do see people updating their profiles, and the assumption is that
every member is likely to update their profile (a bit aggressive, I'd
agree, but nevertheless a safe upper bound)

     In our scenario, there are 2 types of realtime updates:

1) every document can be updated (within a shard)
2) add-only, e.g. tweets etc.

     In our test, we aimed at 1)

-John

On Tue, Sep 22, 2009 at 8:28 AM, Ted Dunning <te...@gmail.com> wrote:

> John,
>
> I think that inherent in your test is a uniform distribution of updates.
>
> This seems unrealistic to me, not least because any distribution of updates
> caused by a population of objects interacting with each other should be
> translation invariant in time which is something a uniform distribution just
> cannot be.
>
> The only plausible way I can see to cause uniform distribution of updates
> is a global update to many entries.  Such a global update problem usually
> indicates that the object set should be factored into objects and
> properties.  Then what was a global update becomes an update to a single
> property.  The cost of fetching an object with all updated properties is a
> secondary retrieval to elaborate the state implied by the properties.  This
> can literally be done in a single additional Lucene query since all property
> keys will be available from the object fetch.  Moreover, you generally have
> far fewer unique properties than you have objects so the property fetch is
> blindingly fast.
>
> My own experience is that natural update rates almost invariably decay over
> time and that the peak rate of updates varies dramatically between objects.
> Both of these factors mean that most of the objects being updated should be
> predominantly objects that were updated recently.  Rather shortly, this kind
> of distribution should result in the rate of updates per item being much
> lower for the larger segments.
>
> Can you say more about what motivates your test model and where I am wrong
> about your situation?
>
>
> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:
>
>> Jason:
>>
>>    Before jumping to any conclusions, let me describe the test setup. It
>> is rather different from the Lucene benchmark, as we are testing high
>> update rates in a realtime environment:
>>
>>    We took a public corpus (Medline), indexed it to approximately 3 million
>> docs, and updated all the docs over and over again for a 10-hour duration.
>>
>>    The only difference in the code used was the MergePolicy setting
>> applied.
>>
>>    Taking the variable of HW/OS out of the equation, let's ignore the
>> absolute numbers and compare the relative numbers between the two runs.
>>
>>    The spike is due to the merging of a large segment as updates
>> accumulate. The graph/perf numbers fit our hypothesis that the default
>> MergePolicy chooses to merge small segments before large ones and does not
>> handle segments with a high number of deletes well.
>>
>>     Merging is BOTH IO and CPU intensive, especially for large segments.
>>
>>     I think the wiki explains it pretty well.
>>
>>     What you are saying about the IO cache w.r.t. merging is true. Every
>> time new files are created, the old files in the IO cache are invalidated.
>> As the experiment shows, this is detrimental to query performance when
>> large segments are being merged.
>>
>>     "As we move to a sharded model of indexes, large merges will
>> naturally not occur." Our test is on a 3 million document index, not very
>> large for a single shard. Some katta people have run it on a much, much
>> larger index per shard. Saying large merges will not occur on indexes of
>> this size is, IMHO, unfounded.
>>
>> -John
>>
>> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
>> jason.rutherglen@gmail.com> wrote:
>>
>>> John,
>>>
>>> It would be great if Lucene's benchmark were used so everyone
>>> could execute the test in their own environment and verify. It's
>>> not clear the settings or code used to generate the results so
>>> it's difficult to draw any reliable conclusions.
>>>
>>> The steep spike shows greater evidence for the IO cache being
>>> cleared during large merges resulting in search performance
>>> degradation. See:
>>> http://www.lucidimagination.com/search/?q=madvise
>>>
>>> Merging is IO intensive, less CPU intensive, if the
>>> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>>> then the CPU could be maxed out. Using a single thread on
>>> synchronous spinning magnetic media seems more logical. Queries
>>> are usually the inverse, CPU intensive, not IO intensive when
>>> the index is in the IO cache. After merging a large segment (or
>>> during), queries would start hitting disk, and the results
>>> clearly show that. The queries are suddenly more time consuming
>>> as they seek on disk at a time when IO activity is at its peak
>>> from merging large segments. Using madvise would prevent usable
>>> indexes from being swapped to disk during a merge, query
>>> performance would continue unabated.
>>>
>>> As we move to a sharded model of indexes, large merges will
>>> naturally not occur. Shards will reach a specified size and new
>>> documents will be sent to new shards.
>>>
>>> -J
>>>
>>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com> wrote:
>>> > The current default Lucene MergePolicy does not handle frequent updates
>>> > well.
>>> >
>>> > We have done some performance analysis with that and a custom merge
>>> policy:
>>> >
>>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>>> >
>>> > -John
>>> >
>>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>>> > jason.rutherglen@gmail.com> wrote:
>>> >
>>> >> I opened SOLR-1447 for this
>>> >>
>>> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>>> >> > We can use a simple reflection based implementation to simplify
>>> >> > reading too many parameters.
>>> >> >
>>> >> > What I wish to emphasize is that Solr should be agnostic of xml
>>> >> > altogether. It should only be aware of specific Objects and
>>> >> > interfaces. If users wish to plug in something else in some other
>>> >> > way, it should be fine.
>>> >> >
>>> >> >
>>> >> >  There is a huge learning curve in the current
>>> >> > solrconfig.xml. Let us not make people throw that away.
>>> >> >
>>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>>> >> > <ja...@gmail.com> wrote:
>>> >> >> Over the weekend I may write a patch to allow simple reflection
>>> based
>>> >> >> injection from within solrconfig.
>>> >> >>
>>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>>> >> >> <yo...@lucidimagination.com> wrote:
>>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>>> >> >>> <sh...@gmail.com> wrote:
>>> >> >>>>> I was wondering if there is a way I can modify
>>> calibrateSizeByDeletes
>>> >> just
>>> >> >>>>> by configuration ?
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Alas, no. The only option that I see for you is to sub-class
>>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
>>> the
>>> >> >>>> constructor. However, please open a Jira issue and so we don't
>>> forget
>>> >> about
>>> >> >>>> it.
>>> >> >>>
>>> >> >>> It's the continuing stuff like this that makes me feel like we
>>> should
>>> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
>>> we're
>>> >> >>> going to get there.
>>> >> >>>
>>> >> >>> -Yonik
>>> >> >>> http://www.lucidimagination.com
>>> >> >>>
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > -----------------------------------------------------
>>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>>> >> >
>>> >>
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Ted Dunning <te...@gmail.com>.
John,

I think that inherent in your test is a uniform distribution of updates.

This seems unrealistic to me, not least because any distribution of updates
caused by a population of objects interacting with each other should be
translation invariant in time which is something a uniform distribution just
cannot be.

The only plausible way I can see to cause uniform distribution of updates is
a global update to many entries.  Such a global update problem usually
indicates that the object set should be factored into objects and
properties.  Then what was a global update becomes an update to a single
property.  The cost of fetching an object with all updated properties is a
secondary retrieval to elaborate the state implied by the properties.  This
can literally be done in a single additional Lucene query since all property
keys will be available from the object fetch.  Moreover, you generally have
far fewer unique properties than you have objects so the property fetch is
blindingly fast.

My own experience is that natural update rates almost invariably decay over
time and that the peak rate of updates varies dramatically between objects.
Both of these factors mean that most of the objects being updated should be
predominantly objects that were updated recently.  Rather shortly, this kind
of distribution should result in the rate of updates per item being much
lower for the larger segments.

Can you say more about what motivates your test model and where I am wrong
about your situation?
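
A toy calculation of the decay argument (the half-life and ages below are
made-up numbers, purely illustrative): if each document's update rate decays
with age, the old documents that dominate large segments accumulate deletes
orders of magnitude more slowly than fresh ones, whereas a uniform test
updates every document at the same rate.

```java
// Toy model of the decay argument: expected updates per doc per day, for a
// per-doc update rate that halves every 7 days (an assumed half-life).
public class UpdateDecayDemo {
    static double dailyRate(double ageDays) {
        return Math.pow(0.5, ageDays / 7.0);
    }

    public static void main(String[] args) {
        double fresh = dailyRate(1);  // docs in a young, small segment
        double old = dailyRate(90);   // docs in a large, long-merged segment
        System.out.printf("fresh=%.3f old=%.6f%n", fresh, old);
        // Under a uniform model both would be 1.0, so old segments would be
        // churned just as hard as new ones.
    }
}
```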

On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:

> Jason:
>
>    Before jumping to any conclusions, let me describe the test setup. It
> is rather different from the Lucene benchmark, as we are testing high
> update rates in a realtime environment:
>
>    We took a public corpus (Medline), indexed it to approximately 3 million
> docs, and updated all the docs over and over again for a 10-hour duration.
>
>    The only difference in the code used was the MergePolicy setting
> applied.
>
>    Taking the variable of HW/OS out of the equation, let's ignore the
> absolute numbers and compare the relative numbers between the two runs.
>
>    The spike is due to the merging of a large segment as updates
> accumulate. The graph/perf numbers fit our hypothesis that the default
> MergePolicy chooses to merge small segments before large ones and does not
> handle segments with a high number of deletes well.
>
>     Merging is BOTH IO and CPU intensive, especially for large segments.
>
>     I think the wiki explains it pretty well.
>
>     What you are saying about the IO cache w.r.t. merging is true. Every
> time new files are created, the old files in the IO cache are invalidated.
> As the experiment shows, this is detrimental to query performance when
> large segments are being merged.
>
>     "As we move to a sharded model of indexes, large merges will
> naturally not occur." Our test is on a 3 million document index, not very
> large for a single shard. Some katta people have run it on a much, much
> larger index per shard. Saying large merges will not occur on indexes of
> this size is, IMHO, unfounded.
>
> -John
>
> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> John,
>>
>> It would be great if Lucene's benchmark were used so everyone
>> could execute the test in their own environment and verify. It's
>> not clear the settings or code used to generate the results so
>> it's difficult to draw any reliable conclusions.
>>
>> The steep spike shows greater evidence for the IO cache being
>> cleared during large merges resulting in search performance
>> degradation. See:
>> http://www.lucidimagination.com/search/?q=madvise
>>
>> Merging is IO intensive, less CPU intensive, if the
>> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> then the CPU could be maxed out. Using a single thread on
>> synchronous spinning magnetic media seems more logical. Queries
>> are usually the inverse, CPU intensive, not IO intensive when
>> the index is in the IO cache. After merging a large segment (or
>> during), queries would start hitting disk, and the results
>> clearly show that. The queries are suddenly more time consuming
>> as they seek on disk at a time when IO activity is at its peak
>> from merging large segments. Using madvise would prevent usable
>> indexes from being swapped to disk during a merge, query
>> performance would continue unabated.
>>
>> As we move to a sharded model of indexes, large merges will
>> naturally not occur. Shards will reach a specified size and new
>> documents will be sent to new shards.
>>
>> -J
>>
>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com> wrote:
>> > The current default Lucene MergePolicy does not handle frequent updates
>> > well.
>> >
>> > We have done some performance analysis with that and a custom merge
>> policy:
>> >
>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >
>> > -John
>> >
>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> > jason.rutherglen@gmail.com> wrote:
>> >
>> >> I opened SOLR-1447 for this
>> >>
>> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>> >> > We can use a simple reflection based implementation to simplify
>> >> > reading too many parameters.
>> >> >
>> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> > altogether. It should only be aware of specific Objects and
>> >> > interfaces. If users wish to plug in something else in some other
>> >> > way, it should be fine.
>> >> >
>> >> >
>> >> >  There is a huge learning curve in the current
>> >> > solrconfig.xml. Let us not make people throw that away.
>> >> >
>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> > <ja...@gmail.com> wrote:
>> >> >> Over the weekend I may write a patch to allow simple reflection
>> based
>> >> >> injection from within solrconfig.
>> >> >>
>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> <yo...@lucidimagination.com> wrote:
>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >>> <sh...@gmail.com> wrote:
>> >> >>>>> I was wondering if there is a way I can modify
>> calibrateSizeByDeletes
>> >> just
>> >> >>>>> by configuration ?
>> >> >>>>>
>> >> >>>>
>> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
>> the
>> >> >>>> constructor. However, please open a Jira issue and so we don't
>> forget
>> >> about
>> >> >>>> it.
>> >> >>>
>> >> >>> It's the continuing stuff like this that makes me feel like we
>> should
>> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
>> we're
>> >> >>> going to get there.
>> >> >>>
>> >> >>> -Yonik
>> >> >>> http://www.lucidimagination.com
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > -----------------------------------------------------
>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >
>> >>
>> >
>>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Earwin Burrfoot <ea...@gmail.com>.
On Tue, Sep 22, 2009 at 19:08, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Tue, Sep 22, 2009 at 10:48 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
>> merged segment is warmed before it's "put into production" (returned
>> by getReader)?
>
> I'm still not sure I see the reason for complicating the IndexWriter
> with warming... can't this be done just as efficiently (if not more
> efficiently) in user/application space?
+1


-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
Right, it allows warming without holding up the opening of new readers.
I'll update the realtime wiki with this.

Thanks Mike.

On Tue, Sep 22, 2009 at 8:53 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> It's not only that the newly merged segments are quickly searchable
> (you could do that with warming outside of IW).
>
> It's more importantly so that you can continue to add/delete docs,
> flush the segment, open a new NRT reader, and search those changes,
> without waiting for the warming to complete.  You could do many such
> updates all while a large merged segment is being warmed in the BG.
>
> It decouples merging (which results in no change to the search
> results) from the add/deletes (which do result in changes to the
> search results), so that the warming due to a large merge won't hold
> up the stream of updates.
>
> I think for any serious NRT app, it's a must.  (Either that or avoid
> ever doing large merges entirely).
>
> Mike
>
> On Tue, Sep 22, 2009 at 11:44 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>> Adding segment warming to IW is the only way to ensure newly
>> merged segments are quickly searchable without the impact
>> brought up by John W regarding queries on new segments being
>> slow when they load field caches.
>>
>> On Tue, Sep 22, 2009 at 8:37 AM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>> On Tue, Sep 22, 2009 at 11:08 AM, Yonik Seeley
>>> <yo...@lucidimagination.com> wrote:
>>>
>>>> I'm still not sure I see the reason for complicating the IndexWriter
>>>> with warming... can't this be done just as efficiently (if not more
>>>> efficiently) in user/application space?
>>>
>>> It will be less efficient when you warm outside of IndexWriter, i.e.,
>>> you will necessarily delay the app's net turnaround time on being able
>>> to search newly added/deleted docs.
>>>
>>> The whole point of putting optional warming into IndexWriter was so
>>> the segment could be warmed *before* the merge commits the change to
>>> the writer's SegmentInfos.  Any newly opened near-real-time readers
>>> continue to search the old (merged away) segments, until the warming
>>> completes.
>>>
>>> This way the warming of merged segments is independent of making any
>>> newly flushed segments searchable (as long as you use CMS, or any
>>> merge scheduler that uses separate threads for merging).  New segments
>>> can be flushed and then become searchable (with getReader()) even
>>> while the warming is happening.
>>>
>>> So... if your merge policy allows large merges, setting a warmer in
>>> the IndexWriter is crucial for minimizing turnaround time.  But, even
>>> once you do that, merging is still IO & CPU intensive, plus IO caches
>>> are unnecessarily flushed (since we can't easily madvise/posix_fadvise
>>> from java), and we have no IO scheduler control to have merging run at
>>> a very low priority, etc., so while the merge & warming are taking
>>> place, search performance will be impacted.
>>>
>>> Mike
>>>


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 22, 2009 at 2:01 PM, Tim Smith <ts...@attivio.com> wrote:

> is there a proposed API for doing this warming yet?

It's already committed and available in 2.9 (see
IndexWriter.setMergedSegmentWarmer).
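
A sketch of wiring it up (this fragment assumes a surrounding class with an
open IndexWriter named `writer`, and the "memberId" field cache load is a
hypothetical example of app-specific warm-up work):

```java
// Runs before a merged segment is published to near-real-time readers,
// so searches never hit a cold, freshly merged segment.
writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
  public void warm(IndexReader reader) throws IOException {
    // Prime whatever the app needs hot, e.g. a field cache:
    FieldCache.DEFAULT.getInts(reader, "memberId");
  }
});
```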

> for my use cases, it would be really nice for applications to be able to
> associate a custom "IndexCache" object with an index reader, then this
> pluggable "AutoWarmer" would be in charge of initializing this cache for a
> segment reader. I have a number of caches outside the realm of regular field
> caches that i associate with a segment, currently doing this after getting
> the IndexReader by iterating over its segments, and getting a cache object
> shared across all instances of the same logical segment. it would be nice if
> i could just have my "cache" object subclass a lucene IndexCache class and
> drop it right into this auto warming infrastructure (would greatly simplify
> things).
>
> then, once the index reader has been closed, it would call close on any
> attached IndexCache objects in order to free up memory/objects. (so i don't
> have to maintain reference counts anymore)

Lucene doesn't expose this today; I think you have to track the
association externally.  But we could consider adding something like
this...

> Seems this could also greatly simplify the current field caching mechanisms,
> as the field caches could be associated with an IndexReader directly using
> the attached "IndexCache" object, instead of using static weak reference
> hash maps. (could then add methods like getFieldCache() to the IndexReader)

One challenge here is that there can easily be multiple SegmentReaders
"out there" for a single index segment, e.g. if you reopen a reader
after new deletes are flushed to a previous segment.  In this case,
you'll have different SegmentReaders, but they intentionally share the
same entry in FieldCache since their "core" is shared.

Mike



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
Yeah it's all package private, I think it should be protected.

One would use OneMerge.info to then obtain the newly merged SR
via IW.getReader(). There's no reason not to include the newly
merged SR in OneMerge; there just wasn't a need when 1516 was written.

On Tue, Sep 22, 2009 at 12:00 PM, Tim Smith <ts...@attivio.com> wrote:
> Jason Rutherglen wrote:
>
> For that you can subclass IW.mergeSuccess.
>
>
>
> looks like that's package private :(
> also doesn't look like it has the merged output SegmentReader which could be
> used for cache loading/cache key (since it may not have been opened yet, but
> with NRT it should be available?)
> OneMerge looks heavily package private as well
>
>  -- Tim
>
> On Tue, Sep 22, 2009 at 11:43 AM, Tim Smith <ts...@attivio.com> wrote:
>
>
> Jason Rutherglen wrote:
>
> I have a working version of Simple FieldCache Merging LUCENE-1785 that
> should go in real soon.
>
>
>
> Will this contain a callback mechanism i can register with to know what
> segments are being merged?
> that way i can merge my own caches as well at the application layer, perhaps
> exposed through something like IndexReaderWarmer.warmMerge(IndexReader[]
> input, IndexReader output)
>
> On Tue, Sep 22, 2009 at 11:14 AM, Mark Miller <ma...@gmail.com> wrote:
>
>
> 1. see IndexWriter and the method/class that Mike pointed out earlier
> for the warming.
>
> 2. See Lucene-831 - I think we will get some form of that in someday.
>
> Tim Smith wrote:
>
>
> This sounds pretty interesting
>
> is there a proposed API for doing this warming yet?
> Is there a ticket tracking this?
>
> for my use cases, it would be really nice for applications to be able
> to associate a custom "IndexCache" object with an index reader, then
> this pluggable "AutoWarmer" would be in charge of initializing this
> cache for a segment reader. I have a number of caches outside the
> realm of regular field caches that i associate with a segment,
> currently doing this after getting the IndexReader by iterating over
> its segments, and getting a cache object shared across all instances
> of the same logical segment. it would be nice if i could just have my
> "cache" object subclass a lucene IndexCache class and drop it right
> into this auto warming infrastructure (would greatly simplify things).
>
> then, once the index reader has been closed, it would call close on
> any attached IndexCache objects in order to free up memory/objects.
> (so i don't have to maintain reference counts anymore)
>
> Seems this could also greatly simplify the current field caching
> mechanisms, as the field caches could be associated with an
> IndexReader directly using the attached "IndexCache" object, instead
> of using static weak reference hash maps. (could then add methods like
> getFieldCache() to the IndexReader)
>
>  -- Tim Smith
>
> Michael McCandless wrote:
>
>
> Well described, that's exactly it!  I like the concrete example :)
>
> Thanks Yonik.
>
> Mike
>
> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>
>
>
> OK Mike, thanks for your patience - I understand now :-)
>
> Here's an example that helped me understand - hopefully it will add to
> others understanding more than it confuses ;-)
>
> IW.getReader() => segments={A, B}
>  // something causes a merge of A,B into AB to start
> addDoc(doc1)
>  // doc1 goes into segment C
> IW.getReader() => segments={A, B, C}
>  // merge isn't done yet, so getReader() still returns A,B instead of
> AB, but doc1 is still searchable!
>
> OK, in this scenario, there's no advantage to warming in the IW vs the app.
> Let's start over with a little different timing:
>
> segments={A,B}
>  // something causes a merge of A,B into AB to start
> addDoc(doc1)
>  // doc1 goes into segment C
>  // merging of A,B into AB finishes
> IW.getReader() => segments={AB, C}
>
> Oh, no... with warming at the app level, we need to warm the huge AB
> segment before doc1 is visible.  We could continue using the old
> reader while the warming is ongoing, so no user requests will
> experience long queries, but doc1 isn't in the old segment.
>
> With warming in the IW (basically warming becomes part of the same
> operation as merging), then getReader() would return segments={A,B,C}
> and doc1 would still be instantly searchable.
>
> The only way to duplicate this functionality at the app layer would be
> to recognize that there is a new segment, try and figure out what old
> segments were merged to create this new segment, and create a reader
> that's a mix of old and new to avoid unwarmed segments - not nice.
>
> -Yonik
> http://www.lucidimagination.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
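The two timelines in Yonik's example can be condensed into a toy sketch (plain Java; lists of segment names stand in for readers, and every class and method name here is illustrative, not real Lucene API):

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the second timeline above: the merge A,B -> AB has finished
// and doc1 has already landed in the new small segment C.
public class WarmingTiming {

    // The true current view of the index after the merge commits.
    static List<String> currentSegments() {
        return Arrays.asList("AB", "C");
    }

    // App-level warming: the app cannot expose AB until it has warmed it,
    // so it keeps serving the pre-merge reader - which does not contain C,
    // so doc1 stays invisible until the (long) warm of AB completes.
    static List<String> appLevelView(boolean abWarmed) {
        return abWarmed ? currentSegments() : Arrays.asList("A", "B");
    }

    // Warming inside IndexWriter: AB is warmed as part of the merge itself,
    // so getReader() can always return the freshest safe view. Before the
    // merge commits, that view is {A, B, C}; after it commits, {AB, C}.
    static List<String> iwLevelView(boolean mergeCommitted) {
        return mergeCommitted ? currentSegments()
                              : Arrays.asList("A", "B", "C");
    }

    public static void main(String[] args) {
        System.out.println(appLevelView(false)); // [A, B]     doc1 invisible
        System.out.println(iwLevelView(false));  // [A, B, C]  doc1 searchable
        System.out.println(iwLevelView(true));   // [AB, C]
    }
}
```

The point of the IW-level variant is that an unwarmed {AB} view never has to be chosen: the writer hands out either the pre-merge {A, B, C} view or the already-warmed {AB, C} view.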


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Tim Smith <ts...@attivio.com>.
Jason Rutherglen wrote:
> For that you can subclass IW.mergeSuccess.
>
>   
Looks like that's package-private :(
It also doesn't look like it has the merged output SegmentReader, which
could be used for cache loading or as a cache key (since it may not have
been opened yet - though with NRT it should be available?).
OneMerge looks heavily package-private as well.

 -- Tim

> On Tue, Sep 22, 2009 at 11:43 AM, Tim Smith <ts...@attivio.com> wrote:
>   
>> Jason Rutherglen wrote:
>>
>> I have a working version of Simple FieldCache Merging LUCENE-1785 that
>> should go in real soon.
>>
>>
>>
>> Will this contain a callback mechanism i can register with to know what
>> segments are being merged?
>> that way i can merge my own caches as well at the application layer, perhaps
>> exposed through something like IndexReaderWarmer.warmMerge(IndexReader[]
>> input, IndexReader output)
>>
>> On Tue, Sep 22, 2009 at 11:14 AM, Mark Miller <ma...@gmail.com> wrote:
>>
>>
>> 1. see IndexWriter and the method/class that Mike pointed out earlier
>> for the warming.
>>
>> 2. See Lucene-831 - I think we will get some form of that in someday.
>>
>> Tim Smith wrote:
>>
>>
>> This sounds pretty interesting
>>
>> is there a proposed API for doing this warming yet?
>> Is there a ticket tracking this?
>>
>> for my use cases, it would be really nice for applications to be able
>> to associate a custom "IndexCache" object with an index reader, then
>> this pluggable "AutoWarmer" would be in charge of initializing this
>> cache for a segment reader. I have a number of caches outside the
>> realm of regular field caches that i associate with a segment,
>> currently doing this after getting the IndexReader by iterating over
>> its segments, and getting a cache object shared across all instances
>> of the same logical segment. it would be nice if i could just have my
>> "cache" object subclass a lucene IndexCache class and drop it right
>> into this auto warming infrastructure (would greatly simplify things).
>>
>> then, once the index reader has been closed, it would call close on
>> any attached IndexCache objects in order to free up memory/objects.
>> (so i don't have to maintain reference counts anymore)
>>
>> Seems this could also greatly simplify the current field caching
>> mechanisms, as the field caches could be associated with an
>> IndexReader directly using the attached "IndexCache" object, instead
>> of using static weak reference hash maps. (could then add methods like
>> getFieldCache() to the IndexReader)
>>
>>  -- Tim Smith
>>
>> Michael McCandless wrote:
>>
>>
>> Well described, that's exactly it!  I like the concrete example :)
>>
>> Thanks Yonik.
>>
>> Mike
>>
>> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>
>>
>>
>> OK Mike, thanks for your patience - I understand now :-)
>>
>> Here's an example that helped me understand - hopefully it will add to
>> others understanding more than it confuses ;-)
>>
>> IW.getReader() => segments={A, B}
>>  // something causes a merge of A,B into AB to start
>> addDoc(doc1)
>>  // doc1 goes into segment C
>> IW.getReader() => segments={A, B, C}
>>  // merge isn't done yet, so getReader() still returns A,B instead of
>> AB, but doc1 is still searchable!
>>
>> OK, in this scenario, there's no advantage to warming in the IW vs the app.
>> Let's start over with a little different timing:
>>
>> segments={A,B}
>>  // something causes a merge of A,B into AB to start
>> addDoc(doc1)
>>  // doc1 goes into segment C
>>  // merging of A,B into AB finishes
>> IW.getReader() => segments={AB, C}
>>
>> Oh, no... with warming at the app level, we need to warm the huge AB
>> segment before doc1 is visible.  We could continue using the old
>> reader while the warming is ongoing, so no user requests will
>> experience long queries, but doc1 isn't in the old segment.
>>
>> With warming in the IW (basically warming becomes part of the same
>> operation as merging), then getReader() would return segments={A,B,C}
>> and doc1 would still be instantly searchable.
>>
>> The only way to duplicate this functionality at the app layer would be
>> to recognize that there is a new segment, try and figure out what old
>> segments were merged to create this new segment, and create a reader
>> that's a mix of old and new to avoid unwarmed segments - not nice.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
For that you can subclass IW.mergeSuccess.

On Tue, Sep 22, 2009 at 11:43 AM, Tim Smith <ts...@attivio.com> wrote:
> Jason Rutherglen wrote:
>
> I have a working version of Simple FieldCache Merging LUCENE-1785 that
> should go in real soon.
>
>
>
> Will this contain a callback mechanism i can register with to know what
> segments are being merged?
> that way i can merge my own caches as well at the application layer, perhaps
> exposed through something like IndexReaderWarmer.warmMerge(IndexReader[]
> input, IndexReader output)
>
> On Tue, Sep 22, 2009 at 11:14 AM, Mark Miller <ma...@gmail.com> wrote:
>
>
> 1. see IndexWriter and the method/class that Mike pointed out earlier
> for the warming.
>
> 2. See Lucene-831 - I think we will get some form of that in someday.
>
> Tim Smith wrote:
>
>
> This sounds pretty interesting
>
> is there a proposed API for doing this warming yet?
> Is there a ticket tracking this?
>
> for my use cases, it would be really nice for applications to be able
> to associate a custom "IndexCache" object with an index reader, then
> this pluggable "AutoWarmer" would be in charge of initializing this
> cache for a segment reader. I have a number of caches outside the
> realm of regular field caches that i associate with a segment,
> currently doing this after getting the IndexReader by iterating over
> its segments, and getting a cache object shared across all instances
> of the same logical segment. it would be nice if i could just have my
> "cache" object subclass a lucene IndexCache class and drop it right
> into this auto warming infrastructure (would greatly simplify things).
>
> then, once the index reader has been closed, it would call close on
> any attached IndexCache objects in order to free up memory/objects.
> (so i don't have to maintain reference counts anymore)
>
> Seems this could also greatly simplify the current field caching
> mechanisms, as the field caches could be associated with an
> IndexReader directly using the attached "IndexCache" object, instead
> of using static weak reference hash maps. (could then add methods like
> getFieldCache() to the IndexReader)
>
>  -- Tim Smith
>
> Michael McCandless wrote:
>
>
> Well described, that's exactly it!  I like the concrete example :)
>
> Thanks Yonik.
>
> Mike
>
> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>
>
>
> OK Mike, thanks for your patience - I understand now :-)
>
> Here's an example that helped me understand - hopefully it will add to
> others understanding more than it confuses ;-)
>
> IW.getReader() => segments={A, B}
>  // something causes a merge of A,B into AB to start
> addDoc(doc1)
>  // doc1 goes into segment C
> IW.getReader() => segments={A, B, C}
>  // merge isn't done yet, so getReader() still returns A,B instead of
> AB, but doc1 is still searchable!
>
> OK, in this scenario, there's no advantage to warming in the IW vs the app.
> Let's start over with a little different timing:
>
> segments={A,B}
>  // something causes a merge of A,B into AB to start
> addDoc(doc1)
>  // doc1 goes into segment C
>  // merging of A,B into AB finishes
> IW.getReader() => segments={AB, C}
>
> Oh, no... with warming at the app level, we need to warm the huge AB
> segment before doc1 is visible.  We could continue using the old
> reader while the warming is ongoing, so no user requests will
> experience long queries, but doc1 isn't in the old segment.
>
> With warming in the IW (basically warming becomes part of the same
> operation as merging), then getReader() would return segments={A,B,C}
> and doc1 would still be instantly searchable.
>
> The only way to duplicate this functionality at the app layer would be
> to recognize that there is a new segment, try and figure out what old
> segments were merged to create this new segment, and create a reader
> that's a mix of old and new to avoid unwarmed segments - not nice.
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Tim Smith <ts...@attivio.com>.
Jason Rutherglen wrote:
> I have a working version of Simple FieldCache Merging LUCENE-1785 that
> should go in real soon.
>
>   
Will this contain a callback mechanism I can register with to know which
segments are being merged? That way I can merge my own caches as well at
the application layer, perhaps exposed through something like
IndexReaderWarmer.warmMerge(IndexReader[] input, IndexReader output).

> On Tue, Sep 22, 2009 at 11:14 AM, Mark Miller <ma...@gmail.com> wrote:
>   
>> 1. see IndexWriter and the method/class that Mike pointed out earlier
>> for the warming.
>>
>> 2. See Lucene-831 - I think we will get some form of that in someday.
>>
>> Tim Smith wrote:
>>     
>>> This sounds pretty interesting
>>>
>>> is there a proposed API for doing this warming yet?
>>> Is there a ticket tracking this?
>>>
>>> for my use cases, it would be really nice for applications to be able
>>> to associate a custom "IndexCache" object with an index reader, then
>>> this pluggable "AutoWarmer" would be in charge of initializing this
>>> cache for a segment reader. I have a number of caches outside the
>>> realm of regular field caches that i associate with a segment,
>>> currently doing this after getting the IndexReader by iterating over
>>> its segments, and getting a cache object shared across all instances
>>> of the same logical segment. it would be nice if i could just have my
>>> "cache" object subclass a lucene IndexCache class and drop it right
>>> into this auto warming infrastructure (would greatly simplify things).
>>>
>>> then, once the index reader has been closed, it would call close on
>>> any attached IndexCache objects in order to free up memory/objects.
>>> (so i don't have to maintain reference counts anymore)
>>>
>>> Seems this could also greatly simplify the current field caching
>>> mechanisms, as the field caches could be associated with an
>>> IndexReader directly using the attached "IndexCache" object, instead
>>> of using static weak reference hash maps. (could then add methods like
>>> getFieldCache() to the IndexReader)
>>>
>>>  -- Tim Smith
>>>
>>> Michael McCandless wrote:
>>>       
>>>> Well described, that's exactly it!  I like the concrete example :)
>>>>
>>>> Thanks Yonik.
>>>>
>>>> Mike
>>>>
>>>> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
>>>> <yo...@lucidimagination.com> wrote:
>>>>
>>>>         
>>>>> OK Mike, thanks for your patience - I understand now :-)
>>>>>
>>>>> Here's an example that helped me understand - hopefully it will add to
>>>>> others understanding more than it confuses ;-)
>>>>>
>>>>> IW.getReader() => segments={A, B}
>>>>>  // something causes a merge of A,B into AB to start
>>>>> addDoc(doc1)
>>>>>  // doc1 goes into segment C
>>>>> IW.getReader() => segments={A, B, C}
>>>>>  // merge isn't done yet, so getReader() still returns A,B instead of
>>>>> AB, but doc1 is still searchable!
>>>>>
>>>>> OK, in this scenario, there's no advantage to warming in the IW vs the app.
>>>>> Let's start over with a little different timing:
>>>>>
>>>>> segments={A,B}
>>>>>  // something causes a merge of A,B into AB to start
>>>>> addDoc(doc1)
>>>>>  // doc1 goes into segment C
>>>>>  // merging of A,B into AB finishes
>>>>> IW.getReader() => segments={AB, C}
>>>>>
>>>>> Oh, no... with warming at the app level, we need to warm the huge AB
>>>>> segment before doc1 is visible.  We could continue using the old
>>>>> reader while the warming is ongoing, so no user requests will
>>>>> experience long queries, but doc1 isn't in the old segment.
>>>>>
>>>>> With warming in the IW (basically warming becomes part of the same
>>>>> operation as merging), then getReader() would return segments={A,B,C}
>>>>> and doc1 would still be instantly searchable.
>>>>>
>>>>> The only way to duplicate this functionality at the app layer would be
>>>>> to recognize that there is a new segment, try and figure out what old
>>>>> segments were merged to create this new segment, and create a reader
>>>>> that's a mix of old and new to avoid unwarmed segments - not nice.
>>>>>
>>>>> -Yonik
>>>>> http://www.lucidimagination.com
>>>>>
>   
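The callback shape Tim is asking for might look like the following sketch (plain Java, hypothetical names; the IndexReaderWarmer hook Mike pointed to passes only the merged reader, not the merge inputs, so Strings stand in for readers here):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical extension of the merge-warming hook: the app receives both
// the source segments and the merged result, so it can combine its own
// per-segment caches before the merged segment becomes visible.
interface MergeWarmer {
    void warmMerge(List<String> inputs, String output);
}

class CacheCombiningWarmer implements MergeWarmer {
    final StringBuilder log = new StringBuilder();

    @Override public void warmMerge(List<String> inputs, String output) {
        // In a real implementation, this is where the caches attached to the
        // input segments would be merged into one cache for the new segment.
        log.append("combine caches of ").append(inputs)
           .append(" into ").append(output);
    }
}

class MergeWarmerDemo {
    public static void main(String[] args) {
        CacheCombiningWarmer w = new CacheCombiningWarmer();
        w.warmMerge(Arrays.asList("A", "B"), "AB");
        System.out.println(w.log); // combine caches of [A, B] into AB
    }
}
```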


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
I have a working version of simple FieldCache merging (LUCENE-1785) that
should go in real soon.

On Tue, Sep 22, 2009 at 11:14 AM, Mark Miller <ma...@gmail.com> wrote:
> 1. see IndexWriter and the method/class that Mike pointed out earlier
> for the warming.
>
> 2. See Lucene-831 - I think we will get some form of that in someday.
>
> Tim Smith wrote:
>> This sounds pretty interesting
>>
>> is there a proposed API for doing this warming yet?
>> Is there a ticket tracking this?
>>
>> for my use cases, it would be really nice for applications to be able
>> to associate a custom "IndexCache" object with an index reader, then
>> this pluggable "AutoWarmer" would be in charge of initializing this
>> cache for a segment reader. I have a number of caches outside the
>> realm of regular field caches that i associate with a segment,
>> currently doing this after getting the IndexReader by iterating over
>> its segments, and getting a cache object shared across all instances
>> of the same logical segment. it would be nice if i could just have my
>> "cache" object subclass a lucene IndexCache class and drop it right
>> into this auto warming infrastructure (would greatly simplify things).
>>
>> then, once the index reader has been closed, it would call close on
>> any attached IndexCache objects in order to free up memory/objects.
>> (so i don't have to maintain reference counts anymore)
>>
>> Seems this could also greatly simplify the current field caching
>> mechanisms, as the field caches could be associated with an
>> IndexReader directly using the attached "IndexCache" object, instead
>> of using static weak reference hash maps. (could then add methods like
>> getFieldCache() to the IndexReader)
>>
>>  -- Tim Smith
>>
>> Michael McCandless wrote:
>>> Well described, that's exactly it!  I like the concrete example :)
>>>
>>> Thanks Yonik.
>>>
>>> Mike
>>>
>>> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
>>> <yo...@lucidimagination.com> wrote:
>>>
>>>> OK Mike, thanks for your patience - I understand now :-)
>>>>
>>>> Here's an example that helped me understand - hopefully it will add to
>>>> others understanding more than it confuses ;-)
>>>>
>>>> IW.getReader() => segments={A, B}
>>>>  // something causes a merge of A,B into AB to start
>>>> addDoc(doc1)
>>>>  // doc1 goes into segment C
>>>> IW.getReader() => segments={A, B, C}
>>>>  // merge isn't done yet, so getReader() still returns A,B instead of
>>>> AB, but doc1 is still searchable!
>>>>
>>>> OK, in this scenario, there's no advantage to warming in the IW vs the app.
>>>> Let's start over with a little different timing:
>>>>
>>>> segments={A,B}
>>>>  // something causes a merge of A,B into AB to start
>>>> addDoc(doc1)
>>>>  // doc1 goes into segment C
>>>>  // merging of A,B into AB finishes
>>>> IW.getReader() => segments={AB, C}
>>>>
>>>> Oh, no... with warming at the app level, we need to warm the huge AB
>>>> segment before doc1 is visible.  We could continue using the old
>>>> reader while the warming is ongoing, so no user requests will
>>>> experience long queries, but doc1 isn't in the old segment.
>>>>
>>>> With warming in the IW (basically warming becomes part of the same
>>>> operation as merging), then getReader() would return segments={A,B,C}
>>>> and doc1 would still be instantly searchable.
>>>>
>>>> The only way to duplicate this functionality at the app layer would be
>>>> to recognize that there is a new segment, try and figure out what old
>>>> segments were merged to create this new segment, and create a reader
>>>> that's a mix of old and new to avoid unwarmed segments - not nice.
>>>>
>>>> -Yonik
>>>> http://www.lucidimagination.com
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Mark Miller <ma...@gmail.com>.
1. See IndexWriter and the method/class that Mike pointed out earlier
for the warming.

2. See LUCENE-831 - I think we will get some form of that in, someday.

Tim Smith wrote:
> This sounds pretty interesting
>
> is there a proposed API for doing this warming yet?
> Is there a ticket tracking this?
>
> for my use cases, it would be really nice for applications to be able
> to associate a custom "IndexCache" object with an index reader, then
> this pluggable "AutoWarmer" would be in charge of initializing this
> cache for a segment reader. I have a number of caches outside the
> realm of regular field caches that i associate with a segment,
> currently doing this after getting the IndexReader by iterating over
> its segments, and getting a cache object shared across all instances
> of the same logical segment. it would be nice if i could just have my
> "cache" object subclass a lucene IndexCache class and drop it right
> into this auto warming infrastructure (would greatly simplify things).
>
> then, once the index reader has been closed, it would call close on
> any attached IndexCache objects in order to free up memory/objects.
> (so i don't have to maintain reference counts anymore)
>
> Seems this could also greatly simplify the current field caching
> mechanisms, as the field caches could be associated with an
> IndexReader directly using the attached "IndexCache" object, instead
> of using static weak reference hash maps. (could then add methods like
> getFieldCache() to the IndexReader)
>
>  -- Tim Smith
>
> Michael McCandless wrote:
>> Well described, that's exactly it!  I like the concrete example :)
>>
>> Thanks Yonik.
>>
>> Mike
>>
>> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>   
>>> OK Mike, thanks for your patience - I understand now :-)
>>>
>>> Here's an example that helped me understand - hopefully it will add to
>>> others understanding more than it confuses ;-)
>>>
>>> IW.getReader() => segments={A, B}
>>>  // something causes a merge of A,B into AB to start
>>> addDoc(doc1)
>>>  // doc1 goes into segment C
>>> IW.getReader() => segments={A, B, C}
>>>  // merge isn't done yet, so getReader() still returns A,B instead of
>>> AB, but doc1 is still searchable!
>>>
>>> OK, in this scenario, there's no advantage to warming in the IW vs the app.
>>> Let's start over with a little different timing:
>>>
>>> segments={A,B}
>>>  // something causes a merge of A,B into AB to start
>>> addDoc(doc1)
>>>  // doc1 goes into segment C
>>>  // merging of A,B into AB finishes
>>> IW.getReader() => segments={AB, C}
>>>
>>> Oh, no... with warming at the app level, we need to warm the huge AB
>>> segment before doc1 is visible.  We could continue using the old
>>> reader while the warming is ongoing, so no user requests will
>>> experience long queries, but doc1 isn't in the old segment.
>>>
>>> With warming in the IW (basically warming becomes part of the same
>>> operation as merging), then getReader() would return segments={A,B,C}
>>> and doc1 would still be instantly searchable.
>>>
>>> The only way to duplicate this functionality at the app layer would be
>>> to recognize that there is a new segment, try and figure out what old
>>> segments were merged to create this new segment, and create a reader
>>> that's a mix of old and new to avoid unwarmed segments - not nice.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Tim Smith <ts...@attivio.com>.
This sounds pretty interesting.

Is there a proposed API for doing this warming yet?
Is there a ticket tracking this?

For my use cases, it would be really nice for applications to be able to
associate a custom "IndexCache" object with an index reader; this
pluggable "AutoWarmer" would then be in charge of initializing that cache
for a segment reader. I have a number of caches outside the realm of
regular field caches that I associate with a segment, currently doing
this after getting the IndexReader by iterating over its segments and
getting a cache object shared across all instances of the same logical
segment. It would be nice if I could just have my "cache" object subclass
a Lucene IndexCache class and drop it right into this auto-warming
infrastructure (that would greatly simplify things).

Then, once the index reader has been closed, it would call close() on any
attached IndexCache objects in order to free up memory/objects (so I
don't have to maintain reference counts anymore).

This could also greatly simplify the current field caching mechanisms, as
the field caches could be associated with an IndexReader directly through
the attached "IndexCache" object, instead of through static weak-reference
hash maps. (We could then add methods like getFieldCache() to the
IndexReader.)

 -- Tim Smith

Michael McCandless wrote:
> Well described, that's exactly it!  I like the concrete example :)
>
> Thanks Yonik.
>
> Mike
>
> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>   
>> OK Mike, thanks for your patience - I understand now :-)
>>
>> Here's an example that helped me understand - hopefully it will add to
>> others understanding more than it confuses ;-)
>>
>> IW.getReader() => segments={A, B}
>>  // something causes a merge of A,B into AB to start
>> addDoc(doc1)
>>  // doc1 goes into segment C
>> IW.getReader() => segments={A, B, C}
>>  // merge isn't done yet, so getReader() still returns A,B instead of
>> AB, but doc1 is still searchable!
>>
>> OK, in this scenario, there's no advantage to warming in the IW vs the app.
>> Let's start over with a little different timing:
>>
>> segments={A,B}
>>  // something causes a merge of A,B into AB to start
>> addDoc(doc1)
>>  // doc1 goes into segment C
>>  // merging of A,B into AB finishes
>> IW.getReader() => segments={AB, C}
>>
>> Oh, no... with warming at the app level, we need to warm the huge AB
>> segment before doc1 is visible.  We could continue using the old
>> reader while the warming is ongoing, so no user requests will
>> experience long queries, but doc1 isn't in the old segment.
>>
>> With warming in the IW (basically warming becomes part of the same
>> operation as merging), then getReader() would return segments={A,B,C}
>> and doc1 would still be instantly searchable.
>>
>> The only way to duplicate this functionality at the app layer would be
>> to recognize that there is a new segment, try and figure out what old
>> segments were merged to create this new segment, and create a reader
>> that's a mix of old and new to avoid unwarmed segments - not nice.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>   
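A minimal sketch of what Tim describes, in plain Java (IndexCache and the whole attachment mechanism are hypothetical - as of this thread, Lucene's FieldCache is still keyed off readers via static weak maps):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-segment cache attachment: the reader owns its caches
// and closes them when it is closed, so the application no longer needs
// its own reference counting to reclaim cache memory.
interface IndexCache {
    void warm(String segmentName); // stand-in for warm(SegmentReader)
    void close();                  // free whatever the cache holds
}

class ReaderWithCaches {
    private final Map<String, IndexCache> caches = new LinkedHashMap<>();

    void attachCache(String key, IndexCache cache) { caches.put(key, cache); }

    IndexCache getCache(String key) { return caches.get(key); }

    // Closing the reader closes every attached cache.
    void close() {
        for (IndexCache c : caches.values()) c.close();
        caches.clear();
    }
}
```

Something like a getFieldCache() accessor would then just be getCache("field") over this same attachment map.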


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
Ya know, it turned out to be embarrassingly simple - I think I just
had a mental block from thinking about how Solr's warming worked for
so long.

Actually, it was so simple (and yet I still got it wrong at first
glance) that it reminded me of this:
http://www.marilynvossavant.com/forum/viewtopic.php?t=64
fun stuff ;-)

-Yonik
http://www.lucidimagination.com

On Tue, Sep 22, 2009 at 1:42 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Well described, that's exactly it!  I like the concrete example :)
>
> Thanks Yonik.
>
> Mike
>
> On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> OK Mike, thanks for your patience - I understand now :-)
>>
>> Here's an example that helped me understand - hopefully it will add to
>> others understanding more than it confuses ;-)
>>
>> IW.getReader() => segments={A, B}
>>  // something causes a merge of A,B into AB to start
>> addDoc(doc1)
>>  // doc1 goes into segment C
>> IW.getReader() => segments={A, B, C}
>>  // merge isn't done yet, so getReader() still returns A,B instead of
>> AB, but doc1 is still searchable!
>>
>> OK, in this scenario, there's no advantage to warming in the IW vs the app.
>> Let's start over with a little different timing:
>>
>> segments={A,B}
>>  // something causes a merge of A,B into AB to start
>> addDoc(doc1)
>>  // doc1 goes into segment C
>>  // merging of A,B into AB finishes
>> IW.getReader() => segments={AB, C}
>>
>> Oh, no... with warming at the app level, we need to warm the huge AB
>> segment before doc1 is visible.  We could continue using the old
>> reader while the warming is ongoing, so no user requests will
>> experience long queries, but doc1 isn't in the old segment.
>>
>> With warming in the IW (basically warming becomes part of the same
>> operation as merging), then getReader() would return segments={A,B,C}
>> and doc1 would still be instantly searchable.
>>
>> The only way to duplicate this functionality at the app layer would be
>> to recognize that there is a new segment, try and figure out what old
>> segments were merged to create this new segment, and create a reader
>> that's a mix of old and new to avoid unwarmed segments - not nice.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Well described, that's exactly it!  I like the concrete example :)

Thanks Yonik.

Mike

On Tue, Sep 22, 2009 at 1:38 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> OK Mike, thanks for your patience - I understand now :-)
>
> Here's an example that helped me understand - hopefully it will add to
> others understanding more than it confuses ;-)
>
> IW.getReader() => segments={A, B}
>  // something causes a merge of A,B into AB to start
> addDoc(doc1)
>  // doc1 goes into segment C
> IW.getReader() => segments={A, B, C}
>  // merge isn't done yet, so getReader() still returns A,B instead of
> AB, but doc1 is still searchable!
>
> OK, in this scenario, there's no advantage to warming in the IW vs the app.
> Let's start over with a little different timing:
>
> segments={A,B}
>  // something causes a merge of A,B into AB to start
> addDoc(doc1)
>  // doc1 goes into segment C
>  // merging of A,B into AB finishes
> IW.getReader() => segments={AB, C}
>
> Oh, no... with warming at the app level, we need to warm the huge AB
> segment before doc1 is visible.  We could continue using the old
> reader while the warming is ongoing, so no user requests will
> experience long queries, but doc1 isn't in the old segment.
>
> With warming in the IW (basically warming becomes part of the same
> operation as merging), then getReader() would return segments={A,B,C}
> and doc1 would still be instantly searchable.
>
> The only way to duplicate this functionality at the app layer would be
> to recognize that there is a new segment, try and figure out what old
> segments were merged to create this new segment, and create a reader
> that's a mix of old and new to avoid unwarmed segments - not nice.
>
> -Yonik
> http://www.lucidimagination.com
>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
OK Mike, thanks for your patience - I understand now :-)

Here's an example that helped me understand - hopefully it will add to
others understanding more than it confuses ;-)

IW.getReader() => segments={A, B}
  // something causes a merge of A,B into AB to start
addDoc(doc1)
  // doc1 goes into segment C
IW.getReader() => segments={A, B, C}
  // merge isn't done yet, so getReader() still returns A,B instead of
AB, but doc1 is still searchable!

OK, in this scenario, there's no advantage to warming in the IW vs the app.
Let's start over with a little different timing:

segments={A,B}
  // something causes a merge of A,B into AB to start
addDoc(doc1)
  // doc1 goes into segment C
  // merging of A,B into AB finishes
IW.getReader() => segments={AB, C}

Oh, no... with warming at the app level, we need to warm the huge AB
segment before doc1 is visible.  We could continue using the old
reader while the warming is ongoing, so no user requests will
experience long queries, but doc1 isn't in the old segment.

With warming in the IW (basically warming becomes part of the same
operation as merging), then getReader() would return segments={A,B,C}
and doc1 would still be instantly searchable.

The only way to duplicate this functionality at the app layer would be
to recognize that there is a new segment, try and figure out what old
segments were merged to create this new segment, and create a reader
that's a mix of old and new to avoid unwarmed segments - not nice.

-Yonik
http://www.lucidimagination.com
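The timeline above maps onto the 2.9-era NRT API roughly like this (a sketch only; `dir`, `analyzer`, and `doc1` are assumed to exist, and which segments each reader sees depends on merge timing exactly as described):

```java
IndexWriter writer = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);

IndexReader r1 = writer.getReader();  // sees segments {A, B}
// ... a background merge of A+B into AB may be running here ...
writer.addDocument(doc1);             // doc1 is flushed into new segment C
IndexReader r2 = writer.getReader();  // doc1 is searchable right away;
                                      // this reader sees {A, B, C} or {AB, C}
                                      // depending on whether the merge finished
```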



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
It's not only that the newly merged segments are quickly searchable
(you could do that with warming outside of IW).

It's more importantly so that you can continue to add/delete docs,
flush the segment, open a new NRT reader, and search those changes,
without waiting for the warming to complete.  You could do many such
updates all while a large merged segment is being warmed in the BG.

It decouples merging (which results in no change to the search
results) from the add/deletes (which do result in changes to the
search results), so that the warming due to a large merge won't hold
up the stream of updates.

I think for any serious NRT app, it's a must.  (Either that or avoid
ever doing large merges entirely).

Mike

On Tue, Sep 22, 2009 at 11:44 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
> Adding segment warming to IW is the only way to ensure newly
> merged segments are quickly searchable without the impact
> brought up by John W regarding queries on new segments being
> slow when they load field caches.
>
> On Tue, Sep 22, 2009 at 8:37 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> On Tue, Sep 22, 2009 at 11:08 AM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>
>>> I'm still not sure I see the reason for complicating the IndexWriter
>>> with warming... can't this be done just as efficiently (if not more
>>> efficiently) in user/application space?
>>
>> It will be less efficient when you warm outside of IndexWriter, ie,
>> you will necessarily delay the app's net turnaround time on being able
>> to search newly added/deleted docs.
>>
>> The whole point of putting optional warming into IndexWriter was so
>> the segment could be warmed *before* the merge commits the change to
>> the writer's SegmentInfos.  Any newly opened near-real-time readers
>> continue to search the old (merged away) segments, until the warming
>> completes.
>>
>> This way the warming of merged segments is independent of making any
>> newly flushed segments searchable (as long as you use CMS, or any
>> merge scheduler that uses separate threads for merging).  New segments
>> can be flushed and then become searchable (with getReader()) even
>> while the warming is happening.
>>
>> So... if your merge policy allows large merges, setting a warmer in
>> the IndexWriter is crucial for minimizing turnaround time.  But, even
>> once you do that, merging is still IO & CPU intensive, plus IO caches
>> are unnecessarily flushed (since we can't easily madvise/posix_fadvise
>> from java), and we have no IO scheduler control to have merging run at
>> very low priority, etc., so while the merge & warming are taking
>> place, search performance will be impacted.
>>
>> Mike
>>
>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
Adding segment warming to IW is the only way to ensure newly
merged segments are quickly searchable without the impact
brought up by John W regarding queries on new segments being
slow when they load field caches.

On Tue, Sep 22, 2009 at 8:37 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Tue, Sep 22, 2009 at 11:08 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>
>> I'm still not sure I see the reason for complicating the IndexWriter
>> with warming... can't this be done just as efficiently (if not more
>> efficiently) in user/application space?
>
> It will be less efficient when you warm outside of IndexWriter, ie,
> you will necessarily delay the app's net turnaround time on being able
> to search newly added/deleted docs.
>
> The whole point of putting optional warming into IndexWriter was so
> the segment could be warmed *before* the merge commits the change to
> the writer's SegmentInfos.  Any newly opened near-real-time readers
> continue to search the old (merged away) segments, until the warming
> completes.
>
> This way the warming of merged segments is independent of making any
> newly flushed segments searchable (as long as you use CMS, or any
> merge scheduler that uses separate threads for merging).  New segments
> can be flushed and then become searchable (with getReader()) even
> while the warming is happening.
>
> So... if your merge policy allows large merges, setting a warmer in
> the IndexWriter is crucial for minimizing turnaround time.  But, even
> once you do that, merging is still IO & CPU intensive, plus IO caches
> are unnecessarily flushed (since we can't easily madvise/posix_fadvise
> from java), and we have no IO scheduler control to have merging run at
> very low priority, etc., so while the merge & warming are taking
> place, search performance will be impacted.
>
> Mike
>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Sep 22, 2009 at 11:37 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> The whole point of putting optional warming into IndexWriter was so
> the segment could be warmed *before* the merge commits the change to
> the writer's SegmentInfos.

But... doesn't this add the same amount of latency in a different
place (moves it from after the commit to before the commit)?

Seems like the total latency from when the doc is added to when it's
searchable is the same in both cases?

> Any newly opened near-real-timer readers
> continue to search the old (merged away) segments, until the warming
> completes.

If the application is doing warming, it can use the same approach...
don't immediately expose the result of IW.getReader - warm it first
and have requests go against the old one in the meantime.

I like IW.getReader()... it adds functionality that one couldn't do at
the application layer.
I'm still missing what adding warming does that can't easily be done
at the application layer.  It can also result in warming of segments
that will never be used (because they will be merged again before
getReader() is called).

-Yonik
http://www.lucidimagination.com



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 22, 2009 at 11:08 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:

> I'm still not sure I see the reason for complicating the IndexWriter
> with warming... can't this be done just as efficiently (if not more
> efficiently) in user/application space?

It will be less efficient when you warm outside of IndexWriter, ie,
you will necessarily delay the app's net turnaround time on being able
to search newly added/deleted docs.

The whole point of putting optional warming into IndexWriter was so
the segment could be warmed *before* the merge commits the change to
the writer's SegmentInfos.  Any newly opened near-real-time readers
continue to search the old (merged away) segments, until the warming
completes.

This way the warming of merged segments is independent of making any
newly flushed segments searchable (as long as you use CMS, or any
merge scheduler that uses separate threads for merging).  New segments
can be flushed and then become searchable (with getReader()) even
while the warming is happening.

So... if your merge policy allows large merges, setting a warmer in
the IndexWriter is crucial for minimizing turnaround time.  But, even
once you do that, merging is still IO & CPU intensive, plus IO caches
are unnecessarily flushed (since we can't easily madvise/posix_fadvise
from java), and we have no IO scheduler control to have merging run at
very low priority, etc., so while the merge & warming are taking
place, search performance will be impacted.
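A sketch of hooking in such a warmer (assumes the 2.9 API; the "price" field name is made up for illustration):

```java
writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
  public void warm(IndexReader reader) throws IOException {
    // Touch whatever the app needs hot before the merged segment goes live,
    // e.g. pre-load a FieldCache entry for a hypothetical "price" field:
    FieldCache.DEFAULT.getInts(reader, "price");
  }
});
```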

Mike



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Sep 22, 2009 at 10:48 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
> merged segment is warmed before it's "put into production" (returned
> by getReader)?

I'm still not sure I see the reason for complicating the IndexWriter
with warming... can't this be done just as efficiently (if not more
efficiently) in user/application space?

-Yonik
http://www.lucidimagination.com



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
> which one is better

Better for what? What use case are you thinking of?

The merge reasons were covered well in the previous thread.
Another gain is the carry over of deletes in RAM.

I'm getting the feeling the Realtime wiki needs a lot of work.
http://wiki.apache.org/lucene-java/NearRealtimeSearch

On Tue, Sep 22, 2009 at 11:47 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Slight divergence from the topic...
>
> On Sep 22, 2009, at 10:48 AM, Michael McCandless wrote:
>
>> John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
>> merged segment is warmed before it's "put into production" (returned
>> by getReader)?
>
> One of the pieces I still am missing from all of this is why isn't
> IW.getReader() now just the preferred way of getting an IndexReader for all
> applications other than those that are completely batch oriented?  Why
> bother with IndexReader.reopen()?  IW.getReader() is marked as Expert right
> now, which says to me there are some tradeoffs or that one needs to be
> really careful using it, but I don't see the downside other than what
> appears to be some extra resources consumed and the fact that it is brand
> new code, or at least the downside is not documented.
>
> And yet, at the first SF Meetup, I recall having a discussion with Michael
> B. about this approach versus IR.reopen() that left me wondering which one
> is better, since, Lucene has, in fact, always been about incremental updates
> (since there are commercial systems out there that require complete
> re-indexing) and that getting IR.reopen to perform is just a matter of
> tuning one's application in regards to reads and writes vs. having to do all
> this work in the IndexWriter that now tightly couples the IndexReader to the
> IndexWriter.  Hopefully Michael can refresh my memory on the conversation,
> as I may be remembering incorrectly.
>
> -Grant
>



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 22, 2009 at 3:53 PM, Grant Ingersoll <gs...@apache.org> wrote:

>> But, the returned reader is read-only, so you can't use it to change
>> norms, do deletes, etc.
>
> Yeah, but an IW can do deletes, and if this IR is coupled to it
> anyway...

True, but IW's deletes are still buffered, and you can't delete by doc
ID with IW.

>> But Directory is too low... we could probably get by with a class that
>> holds the SegmentReader cache (currently IndexWriter.ReaderPool), and
>> the "current" segmentInfos.  IW would interact with this class to get
>> the readers it needs, for applying deletes, merging, as well as
>> posting newly flushed but not yet committed segments, and IR would
>> then pull from this class to get the latest segments in the index and
>> to checkout the readers.
>
> Not sure why Directory, a public well-known class, is considered too low (I
> thought you would say too high!) versus inner classes that assume an
> IndexWriter. The reason I chose Directory is because it is the common thing
> already shared between the two and it is already a public, well-known class
> that requires no extra understanding by users.  It's a first class citizen.
>  By reusing it, apps can be agnostic about where it came from, versus having
> to wire in all this new stuff to handle ReaderPools, etc. versus simply
> reusing the directory stuff.

Sorry, by "low" I meant logically Directory is a lightweight file
access API -- it's at the low level of Lucene's stack.  I'm not sure
we should overload it by storing SegmentInfos, caching SegmentReaders
in it, etc.

Mike



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 22, 2009, at 3:44 PM, Michael McCandless wrote:

> On Tue, Sep 22, 2009 at 2:53 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>> One of the pieces I still am missing from all of this is why isn't
>> IW.getReader() now just the preferred way of getting an IndexReader
>> for all applications other than those that are completely batch
>> oriented?
>>
>> Why bother with IndexReader.reopen()?
>
> I agree, most apps should simply use getReader, as long as they're
> running in the same JVM as the IndexWriter, and, they are holding the
> IW open anyway.
>
> But, the returned reader is read-only, so you can't use it to change
> norms, do deletes, etc.

Yeah, but an IW can do deletes, and if this IR is coupled to it  
anyway...

>
> The API really shouldn't be marked expert.  I'll go remove that...
>
>> Lucene has, in fact, always been about incremental updates (since
>> there are commercial systems out there that require complete
>> re-indexing)
>
> True, for writing.  But for reading, reopening a reader was very
> costly before 2.9 because FieldCache entries had to be fully recomputed.
> So, switching to per-segment search/collect in 2.9 was the biggest
> step to reducing NRT reopen latency.
>
>> and that getting IR.reopen to perform is just a matter of tuning
>> one's application in regards to reads and writes vs. having to do
>> all this work in the IndexWriter that now tightly couples the
>> IndexReader to the IndexWriter.
>
> The integration with IndexWriter allows a reader to access segments
> that haven't yet been committed to the index.  This saves fsync()'ing
> the written files, saves writing a new segments_N file, saves flushing
> deletes to disk and then reloading them (we just share the BitVector
> directly in RAM now).  On many OS/filesystems fsync is surprisingly
> costly.
>
> LUCENE-1313, the next step for NRT, further reduces NRT reopen latency
> by allowing the small segments to remain in RAM, so when reopening
> your NRT reader after smallish add/deletes no IO is incurred.
>
> Beyond LUCENE-1313 we've discussed making IndexWriter's RAM buffer
> directly searchable, so you don't pay the cost of flushing a new
> segment when an NRT reader is reopened.
>
> Really we only need to further improve the approach here if the
> existing performance proves inadequate... in my limited testing the
> performance was excellent.
>
> Though, our inability to prioritize IO and control the OS's IO cache,
> from java, are likely far bigger impacts on our NRT performance at
> this point, than further improvements in our impl.  I'd love to see a
> Directory impl that "emulates" IO prioritization by making merging IO
> wait whenever search IO is live.  I think we need a JNI extension that
> taps into madvise/posix_fadvise, when possible.
>
>> FWIW, I still don't like the coupling of the two.  I think it would
>> be better if IW allowed you to get a Directory (or some other
>> appropriate representation) representing the in memory segment that
>> can then easily be added to an existing Searcher/Reader.  This would
>> at least decouple the two and instead use the common data structure
>> they both already share, i.e. the Directory.  Whether this is doable
>> or not, I am not sure.
>
> I agree the coupling is overkill.
>
> But Directory is too low... we could probably get by with a class that
> holds the SegmentReader cache (currently IndexWriter.ReaderPool), and
> the "current" segmentInfos.  IW would interact with this class to get
> the readers it needs, for applying deletes, merging, as well as
> posting newly flushed but not yet committed segments, and IR would
> then pull from this class to get the latest segments in the index and
> to checkout the readers.

Not sure why Directory, a public well-known class, is considered too  
low (I thought you would say too high!) versus inner classes that  
assume an IndexWriter. The reason I chose Directory is because it is  
the common thing already shared between the two and it is already a  
public, well-known class that requires no extra understanding by  
users.  It's a first class citizen.  By reusing it, apps can be  
agnostic about where it came from, versus having to wire in all this  
new stuff to handle ReaderPools, etc. versus simply reusing the  
directory stuff.

-Grant



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Sep 22, 2009 at 2:53 PM, Grant Ingersoll <gs...@apache.org> wrote:
> One of the pieces I still am missing from all of this is why isn't
> IW.getReader() now just the preferred way of getting an IndexReader
> for all applications other than those that are completely batch
> oriented?
>
> Why bother with IndexReader.reopen()?

I agree, most apps should simply use getReader, as long as they're
running in the same JVM as the IndexWriter, and, they are holding the
IW open anyway.

But, the returned reader is read-only, so you can't use it to change
norms, do deletes, etc.
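In code, the read-only NRT cycle looks roughly like this (a sketch; the writer setup is assumed):

```java
IndexReader r = writer.getReader();  // read-only, sees uncommitted changes
// ... add/delete more documents via the writer ...
IndexReader r2 = r.reopen();         // cheap: only changed segments reopen
if (r2 != r) {
  r.close();
  r = r2;
}
// r.deleteDocument(...) would throw, since the NRT reader is read-only;
// deletes instead go through writer.deleteDocuments(...)
```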

The API really shouldn't be marked expert.  I'll go remove that...

> Lucene has, in fact, always been about incremental updates (since
> there are commercial systems out there that require complete
> re-indexing)

True, for writing.  But for reading, reopening a reader was very
costly before 2.9 because FieldCache entries had to be fully recomputed.
So, switching to per-segment search/collect in 2.9 was the biggest
step to reducing NRT reopen latency.

> and that getting IR.reopen to perform is just a matter of tuning
> one's application in regards to reads and writes vs. having to do
> all this work in the IndexWriter that now tightly couples the
> IndexReader to the IndexWriter.

The integration with IndexWriter allows a reader to access segments
that haven't yet been committed to the index.  This saves fsync()'ing
the written files, saves writing a new segments_N file, saves flushing
deletes to disk and then reloading them (we just share the BitVector
directly in RAM now).  On many OS/filesystems fsync is surprisingly
costly.

LUCENE-1313, the next step for NRT, further reduces NRT reopen latency
by allowing the small segments to remain in RAM, so when reopening
your NRT reader after smallish add/deletes no IO is incurred.

Beyond LUCENE-1313 we've discussed making IndexWriter's RAM buffer
directly searchable, so you don't pay the cost of flushing a new
segment when an NRT reader is reopened.

Really we only need to further improve the approach here if the
existing performance proves inadequate... in my limited testing the
performance was excellent.

Though, our inability to prioritize IO and control the OS's IO cache,
from java, are likely far bigger impacts on our NRT performance at
this point, than further improvements in our impl.  I'd love to see a
Directory impl that "emulates" IO prioritization by making merging IO
wait whenever search IO is live.  I think we need a JNI extension that
taps into madvise/posix_fadvise, when possible.

> FWIW, I still don't like the coupling of the two.  I think it would
> be better if IW allowed you to get a Directory (or some other
> appropriate representation) representing the in memory segment that
> can then easily be added to an existing Searcher/Reader.  This would
> at least decouple the two and instead use the common data structure
> they both already share, i.e. the Directory.  Whether this is doable
> or not, I am not sure.

I agree the coupling is overkill.

But Directory is too low... we could probably get by with a class that
holds the SegmentReader cache (currently IndexWriter.ReaderPool), and
the "current" segmentInfos.  IW would interact with this class to get
the readers it needs, for applying deletes, merging, as well as
posting newly flushed but not yet committed segments, and IR would
then pull from this class to get the latest segments in the index and
to checkout the readers.

Such a shared "per-segment state" class could also be the basis for
app-specific custom caches to update themselves when new segments are
created, old ones are merged, etc.  Probably this class should break
out SR's core separately.  Hmm.

Mike



Re: IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 22, 2009, at 2:47 PM, Grant Ingersoll wrote:
>
>
> And yet, at the first SF Meetup, I recall having a discussion with  
> Michael B. about this approach versus IR.reopen() that left me  
> wondering which one is better, since, Lucene has, in fact, always  
> been about incremental updates (since there are commercial systems  
> out there that require complete re-indexing) and that getting  
> IR.reopen to perform is just a matter of tuning one's application in  
> regards to reads and writes vs. having to do all this work in the  
> IndexWriter that now tightly couples the IndexReader to the  
> IndexWriter.

FWIW, I still don't like the coupling of the two.  I think it would be  
better if IW allowed you to get a Directory (or some other appropriate  
representation) representing the in memory segment that can then  
easily be added to an existing Searcher/Reader.  This would at least  
decouple the two and instead use the common data structure they both  
already share, i.e. the Directory.  Whether this is doable or not, I  
am not sure.

IndexWriter.getReader() was Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Grant Ingersoll <gs...@apache.org>.
Slight divergence from the topic...

On Sep 22, 2009, at 10:48 AM, Michael McCandless wrote:

> John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
> merged segment is warmed before it's "put into production" (returned
> by getReader)?

One of the pieces I still am missing from all of this is why isn't  
IW.getReader() now just the preferred way of getting an IndexReader for  
all applications other than those that are completely batch oriented?   
Why bother with IndexReader.reopen()?  IW.getReader() is marked as  
Expert right now, which says to me there are some tradeoffs or that  
one needs to be really careful using it, but I don't see the downside  
other than what appears to be some extra resources consumed and the  
fact that it is brand new code, or at least the downside is not  
documented.

And yet, at the first SF Meetup, I recall having a discussion with  
Michael B. about this approach versus IR.reopen() that left me  
wondering which one is better, since, Lucene has, in fact, always been  
about incremental updates (since there are commercial systems out  
there that require complete re-indexing) and that getting IR.reopen to  
perform is just a matter of tuning one's application in regards to  
reads and writes vs. having to do all this work in the IndexWriter  
that now tightly couples the IndexReader to the IndexWriter.   
Hopefully Michael can refresh my memory on the conversation, as I may  
be remembering incorrectly.

-Grant



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Michael McCandless <lu...@mikemccandless.com>.
John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
merged segment is warmed before it's "put into production" (returned
by getReader)?

Mike

On Mon, Sep 21, 2009 at 9:35 PM, John Wang <jo...@gmail.com> wrote:
> Jason:
>
>     You are missing the point.
>
>     The idea is to avoid merging of large segments. The point of this
> MergePolicy is to balance segment merges across the index. The aim is not to
> have 1 large segment; it is to have n segments with balanced sizes.
>
>     When the large segment is out of the IO cache, replacing it is very
> costly. What we have done is to split the cost over time by having more
> frequent but faster merges.
>
>     I am not suggesting Lucene's default mergePolicy isn't good, it is just
> not suitable for our case where there are high updates introducing tons of
> deletes. The fact that the api is nice enough to allow MergePolicies to be
> plugged in is a good thing.
>
>     Please DO read the wiki.
>
> -John
>
> On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>
>> I'm not sure I communicated the idea properly. If CMS is set to
>> 1 thread, no matter how intensive the CPU for a merge, it's
>> limited to 1 core of what is in many cases a 4 or 8 core server.
>> That leaves the other 3 or 7 cores for queries, which if slow,
>> indicates that it isn't the merging that's slowing down queries,
>> but the dumping of the queried segments from the system IO cache.
>>
>> This holds true regardless of the merge policy used. So while a
>> new merge policy sounds great, unless the system IO cache
>> problem is solved, there will always be a lingering problem in
>> regards to large merges with a regularly updated index. Avoiding
>> large merges probably isn't the answer. And
>> LogByteSizeMergePolicy somewhat allows managing the size of the
>> segments merged already. I would personally prefer being able to
>> merge segments up to a given estimated size, which requires
>> LUCENE-1076 to do well.
>>
>> > is rather different from Lucene benchmark as we are testing
>> high updates in a realtime environment
>>
>> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
>> place to start.
>>
>> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:
>> > Jason:
>> >
>> >    Before jumping into any conclusions, let me describe the test setup.
>> > It
>> > is rather different from Lucene benchmark as we are testing high updates
>> > in
>> > a realtime environment:
>> >
>> >    We took a public corpus: medline, indexed to approximately 3 million
>> > docs. And update all the docs over and over again for a 10 hour
>> > duration.
>> >
>> >    The only differences in the code used were the different MergePolicy
>> > settings
>> > applied.
>> >
>> >    Taking the variable of HW/OS out of the equation, let's ignore the
>> > absolute numbers and compare the relative numbers between the two runs.
>> >
>> >    The spike is due to merging of a large segment when we accumulate.
>> > The
>> > graph/perf numbers fit our hypothesis that the default MergePolicy
>> > chooses
>> > to merge small segments before large ones and does not handle segments
>> > with
>> > a high number of deletes well.
>> >
>> >     Merging is BOTH IO and CPU intensive. Especially large ones.
>> >
>> >     I think the wiki explains it pretty well.
>> >
>> >     What you are saying is true of the IO cache w.r.t. merges. Every time
>> > new
>> > files are created, old files in the IO cache are invalidated. As the experiment
>> > shows, this is detrimental to query performance when large segments are
>> > being
>> > merged.
>> >
>> >     "As we move to a sharded model of indexes, large merges will
>> > naturally not occur." Our test is on a 3 million document index, not
>> > very
>> > large for a single shard. Some katta people have run it on a much much
>> > larger index per shard. Saying large merges will not occur on indexes of
>> > this size IMHO is unfounded.
>> >
>> > -John
>> >
>> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
>> > <ja...@gmail.com> wrote:
>> >>
>> >> John,
>> >>
>> >> It would be great if Lucene's benchmark were used so everyone
>> >> could execute the test in their own environment and verify. It's
>> >> not clear the settings or code used to generate the results so
>> >> it's difficult to draw any reliable conclusions.
>> >>
>> >> The steep spike shows greater evidence for the IO cache being
>> >> cleared during large merges resulting in search performance
>> >> degradation. See:
>> >> http://www.lucidimagination.com/search/?q=madvise
>> >>
>> >> Merging is IO intensive, less CPU intensive. If the
>> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> >> then the CPU could be maxed out. Using a single thread on
>> >> synchronous spinning magnetic media seems more logical. Queries
>> >> are usually the inverse, CPU intensive, not IO intensive when
>> >> the index is in the IO cache. After merging a large segment (or
>> >> during), queries would start hitting disk, and the results
>> >> clearly show that. The queries are suddenly more time consuming
>> >> as they seek on disk at a time when IO activity is at its peak
>> >> from merging large segments. Using madvise would prevent usable
>> >> indexes from being swapped to disk during a merge, query
>> >> performance would continue unabated.
>> >>
>> >> As we move to a sharded model of indexes, large merges will
>> >> naturally not occur. Shards will reach a specified size and new
>> >> documents will be sent to new shards.
>> >>
>> >> -J
>> >>
>> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com>
>> >> wrote:
>> >> > The current default Lucene MergePolicy does not handle frequent
>> >> > updates
>> >> > well.
>> >> >
>> >> > We have done some performance analysis with that and a custom merge
>> >> > policy:
>> >> >
>> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >> >
>> >> > -John
>> >> >
>> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> >> > jason.rutherglen@gmail.com> wrote:
>> >> >
>> >> >> I opened SOLR-1447 for this
>> >> >>
>> >> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>> >> >> > We can use a simple reflection based implementation to simplify
>> >> >> > reading too many parameters.
>> >> >> >
>> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> >> > altogether. It should only be aware of specific Objects and
>> >> >> > interfaces. If users wish to plug in something else in some other
>> >> >> > way,
>> >> >> > it should be fine.
>> >> >> >
>> >> >> >
>> >> >> >  There is a huge learning curve involved in the current
>> >> >> > solrconfig.xml. Let us not make people throw that away.
>> >> >> >
>> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> >> > <ja...@gmail.com> wrote:
>> >> >> >> Over the weekend I may write a patch to allow simple reflection
>> >> >> >> based
>> >> >> >> injection from within solrconfig.
>> >> >> >>
>> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> >> <yo...@lucidimagination.com> wrote:
>> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >> >>> <sh...@gmail.com> wrote:
>> >> >> >>>>> I was wondering if there is a way I can modify
>> >> >> >>>>> calibrateSizeByDeletes
>> >> >> just
>> >> >> >>>>> by configuration ?
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
>> >> >> >>>> in
>> >> >> >>>> the
>> >> >> >>>> constructor. However, please open a Jira issue so we don't
>> >> >> >>>> forget
>> >> >> about
>> >> >> >>>> it.
>> >> >> >>>
>> >> >> >>> It's the continuing stuff like this that makes me feel like we
>> >> >> >>> should
>> >> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
>> >> >> >>> we're
>> >> >> >>> going to get there.
>> >> >> >>>
>> >> >> >>> -Yonik
>> >> >> >>> http://www.lucidimagination.com
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > -----------------------------------------------------
>> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >>
>> >
>> >
>>
>>
>
>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 21, 2009, at 9:35 PM, John Wang wrote:

> Jason:
>
>     You are missing the point.
>
>     The idea is to avoid merging of large segments. The point of  
> this MergePolicy is to balance segment merges across the index. The  
> aim is not to have 1 large segment, it is to have n segments with  
> balanced sizes.
>
>     When the large segment is out of the IO cache, replacing it is  
> very costly. What we have done is to split the cost over time by  
> having more frequent but faster merges.
>


Yeah, I have seen this in action several times as well.  See also some  
discussion at:  http://www.lucidimagination.com/search/document/bd53b0431f7eada5/concurrentmergescheduler_and_mergepolicy_question 
.


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by John Wang <jo...@gmail.com>.
Jason:

I am not sure what "parameters" you are referring to either. Are you
responding to the right email?

Anyhoot, I used the defaults for both MergePolicies.

LogMergePolicy.setCalibrateSizeByDeletes was a contribution by us from ZMP
for normalizing segment size using deleted doc counts. So it was part of ZMP.

The idea with ZMP is to have a set of balanced-sized segments instead of 1
large segment. (as I have been repeatedly describing on this email thread)

To get this balance, we represent every point before the merge as a state
modeled in a Viterbi algorithm with a cost function for each type of merge;
this is used to select the desired segments to merge.

I hate to hijack a Lucene thread to discuss Zoie, feel free to post
questions on the Zoie group for details.

-John
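
As a rough illustration of what "normalizing segment size using deleted doc counts" means, here is a sketch of the arithmetic (my own reading of the idea, not the Lucene or Zoie source): a segment's byte size is discounted by its deleted-document fraction, so delete-heavy segments look smaller to the merge policy and become merge candidates sooner, which is how the wasted space gets reclaimed.

```java
// Sketch of size-by-deletes calibration: discount a segment's size by
// its fraction of deleted documents. A half-deleted 1000-byte segment
// is then treated like a 500-byte one when selecting merges.
class CalibratedSize {
    static long calibrated(long sizeInBytes, int maxDoc, int delDocs) {
        if (maxDoc <= 0) {
            return sizeInBytes;   // empty segment: nothing to discount
        }
        double liveFraction = 1.0 - ((double) delDocs / maxDoc);
        return (long) (sizeInBytes * liveFraction);
    }
}
```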

On Wed, Sep 23, 2009 at 1:56 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> John,
>
> I have a few questions in order to better understand, as the
> wiki does not reflect the entirety of what you're trying to
> describe.
>
> > But it is required to set up several parameters carefully to
> get desired behavior.
>
> Which parameters are you referring to?
>
> What were the ZMP parameters used for the test?
>
> What was the number of CMS threads?
>
> It would be helpful to see a time based table of the data used
> to generate the chart at the bottom with the segment infos at
> regular intervals.
>
> What is the difference between how ZMP and
> LogMergePolicy.setCalibrateSizeByDeletes handles deletes?
>
> Are the queries using Zoie or Lucene's index searcher?
>
> Can you explain why the Viterbi algorithm was used and how it
> works in this context?
>
> -J
>
> On Mon, Sep 21, 2009 at 6:35 PM, John Wang <jo...@gmail.com> wrote:
> > Jason:
> >
> >     You are missing the point.
> >
> >     The idea is to avoid merging of large segments. The point of this
> > MergePolicy is to balance segment merges across the index. The aim is not
> to
> > have 1 large segment, it is to have n segments with balanced sizes.
> >
> >     When the large segment is out of the IO cache, replacing it is very
> > costly. What we have done is to split the cost over time by having more
> > frequent but faster merges.
> >
> >     I am not suggesting Lucene's default mergePolicy isn't good, it is
> just
> > not suitable for our case where there are high updates introducing tons
> of
> > deletes. The fact that the API is nice enough to allow MergePolicies to
> be
> > plugged in is a good thing.
> >
> >     Please DO read the wiki.
> >
> > -John
> >
> > On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> > <ja...@gmail.com> wrote:
> >>
> >> I'm not sure I communicated the idea properly. If CMS is set to
> >> 1 thread, no matter how intensive the CPU for a merge, it's
> >> limited to 1 core of what is in many cases a 4 or 8 core server.
> >> That leaves the other 3 or 7 cores for queries, which if slow,
> >> indicates that it isn't the merging that's slowing down queries,
> >> but the dumping of the queried segments from the system IO cache.
> >>
> >> This holds true regardless of the merge policy used. So while a
> >> new merge policy sounds great, unless the system IO cache
> >> problem is solved, there will always be a lingering problem in
> >> regards to large merges with a regularly updated index. Avoiding
> >> large merges probably isn't the answer. And
> >> LogByteSizeMergePolicy somewhat allows managing the size of the
> >> segments merged already. I would personally prefer being able to
> >> merge segments up to a given estimated size, which requires
> >> LUCENE-1076 to do well.
> >>
> >> > is rather different from Lucene benchmark as we are testing
> >> high updates in a realtime environment
> >>
> >> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
> >> place to start.
> >>
> >> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:
> >> > Jason:
> >> >
> >> >    Before jumping into any conclusions, let me describe the test
> setup.
> >> > It
> >> > is rather different from Lucene benchmark as we are testing high
> updates
> >> > in
> >> > a realtime environment:
> >> >
> >> >    We took a public corpus: medline, indexed to approximately 3
> million
> >> > docs. And update all the docs over and over again for a 10 hour
> >> > duration.
> >> >
> >> >    The only differences in the code used were the different MergePolicy
> >> > settings
> >> > applied.
> >> >
> >> >    Taking the variable of HW/OS out of the equation, let's ignore
> the
> >> > absolute numbers and compare the relative numbers between the two
> runs.
> >> >
> >> >    The spike is due to merging of a large segment when we accumulate.
> >> > The
> >> > graph/perf numbers fit our hypothesis that the default MergePolicy
> >> > chooses
> >> > to merge small segments before large ones and does not handle segments
> >> > with
> >> > a high number of deletes well.
> >> >
> >> >     Merging is BOTH IO and CPU intensive. Especially large ones.
> >> >
> >> >     I think the wiki explains it pretty well.
> >> >
> >> >     What you are saying is true of the IO cache w.r.t. merges. Every time
> >> > new
> >> > files are created, old files in the IO cache are invalidated. As the
> experiment
> >> > shows, this is detrimental to query performance when large segments are
> >> > being
> >> > merged.
> >> >
> >> >     "As we move to a sharded model of indexes, large merges will
> >> > naturally not occur." Our test is on a 3 million document index, not
> >> > very
> >> > large for a single shard. Some katta people have run it on a much much
> >> > larger index per shard. Saying large merges will not occur on indexes
> of
> >> > this size IMHO is unfounded.
> >> >
> >> > -John
> >> >
> >> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> >> > <ja...@gmail.com> wrote:
> >> >>
> >> >> John,
> >> >>
> >> >> It would be great if Lucene's benchmark were used so everyone
> >> >> could execute the test in their own environment and verify. It's
> >> >> not clear the settings or code used to generate the results so
> >> >> it's difficult to draw any reliable conclusions.
> >> >>
> >> >> The steep spike shows greater evidence for the IO cache being
> >> >> cleared during large merges resulting in search performance
> >> >> degradation. See:
> >> >> http://www.lucidimagination.com/search/?q=madvise
> >> >>
> >> >> Merging is IO intensive, less CPU intensive. If the
> >> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> >> >> then the CPU could be maxed out. Using a single thread on
> >> >> synchronous spinning magnetic media seems more logical. Queries
> >> >> are usually the inverse, CPU intensive, not IO intensive when
> >> >> the index is in the IO cache. After merging a large segment (or
> >> >> during), queries would start hitting disk, and the results
> >> >> clearly show that. The queries are suddenly more time consuming
> >> >> as they seek on disk at a time when IO activity is at its peak
> >> >> from merging large segments. Using madvise would prevent usable
> >> >> indexes from being swapped to disk during a merge, query
> >> >> performance would continue unabated.
> >> >>
> >> >> As we move to a sharded model of indexes, large merges will
> >> >> naturally not occur. Shards will reach a specified size and new
> >> >> documents will be sent to new shards.
> >> >>
> >> >> -J
> >> >>
> >> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com>
> >> >> wrote:
> >> >> > The current default Lucene MergePolicy does not handle frequent
> >> >> > updates
> >> >> > well.
> >> >> >
> >> >> > We have done some performance analysis with that and a custom merge
> >> >> > policy:
> >> >> >
> >> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >> >> >
> >> >> > -John
> >> >> >
> >> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> >> >> > jason.rutherglen@gmail.com> wrote:
> >> >> >
> >> >> >> I opened SOLR-1447 for this
> >> >> >>
> >> >> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> >> >> >> > We can use a simple reflection based implementation to simplify
> >> >> >> > reading too many parameters.
> >> >> >> >
> >> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> >> >> > altogether. It should only be aware of specific Objects and
> >> >> >> > interfaces. If users wish to plug in something else in some other
> >> >> >> > way,
> >> >> >> > it should be fine.
> >> >> >> >
> >> >> >> >
> >> >> >> >  There is a huge learning curve involved in the current
> >> >> >> > solrconfig.xml. Let us not make people throw that away.
> >> >> >> >
> >> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> >> >> > <ja...@gmail.com> wrote:
> >> >> >> >> Over the weekend I may write a patch to allow simple reflection
> >> >> >> >> based
> >> >> >> >> injection from within solrconfig.
> >> >> >> >>
> >> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> >> >> <yo...@lucidimagination.com> wrote:
> >> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >> >> >>> <sh...@gmail.com> wrote:
> >> >> >> >>>>> I was wondering if there is a way I can modify
> >> >> >> >>>>> calibrateSizeByDeletes
> >> >> >> just
> >> >> >> >>>>> by configuration ?
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
> >> >> >> >>>> in
> >> >> >> >>>> the
> >> >> >> >>>> constructor. However, please open a Jira issue so we
> don't
> >> >> >> >>>> forget
> >> >> >> about
> >> >> >> >>>> it.
> >> >> >> >>>
> >> >> >> >>> It's the continuing stuff like this that makes me feel like we
> >> >> >> >>> should
> >> >> >> >>> be Spring (or equivalent) based someday... I'm just not sure
> how
> >> >> >> >>> we're
> >> >> >> >>> going to get there.
> >> >> >> >>>
> >> >> >> >>> -Yonik
> >> >> >> >>> http://www.lucidimagination.com
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > -----------------------------------------------------
> >> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >
> >
>
>
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
John,

I have a few questions in order to better understand, as the
wiki does not reflect the entirety of what you're trying to
describe.

> But it is required to set up several parameters carefully to
get desired behavior.

Which parameters are you referring to?

What were the ZMP parameters used for the test?

What was the number of CMS threads?

It would be helpful to see a time based table of the data used
to generate the chart at the bottom with the segment infos at
regular intervals.

What is the difference between how ZMP and
LogMergePolicy.setCalibrateSizeByDeletes handles deletes?

Are the queries using Zoie or Lucene's index searcher?

Can you explain why the Viterbi algorithm was used and how it
works in this context?

-J

On Mon, Sep 21, 2009 at 6:35 PM, John Wang <jo...@gmail.com> wrote:
> Jason:
>
>     You are missing the point.
>
>     The idea is to avoid merging of large segments. The point of this
> MergePolicy is to balance segment merges across the index. The aim is not to
> have 1 large segment, it is to have n segments with balanced sizes.
>
>     When the large segment is out of the IO cache, replacing it is very
> costly. What we have done is to split the cost over time by having more
> frequent but faster merges.
>
>     I am not suggesting Lucene's default mergePolicy isn't good, it is just
> not suitable for our case where there are high updates introducing tons of
> deletes. The fact that the API is nice enough to allow MergePolicies to be
> plugged in is a good thing.
>
>     Please DO read the wiki.
>
> -John
>
> On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>
>> I'm not sure I communicated the idea properly. If CMS is set to
>> 1 thread, no matter how intensive the CPU for a merge, it's
>> limited to 1 core of what is in many cases a 4 or 8 core server.
>> That leaves the other 3 or 7 cores for queries, which if slow,
>> indicates that it isn't the merging that's slowing down queries,
>> but the dumping of the queried segments from the system IO cache.
>>
>> This holds true regardless of the merge policy used. So while a
>> new merge policy sounds great, unless the system IO cache
>> problem is solved, there will always be a lingering problem in
>> regards to large merges with a regularly updated index. Avoiding
>> large merges probably isn't the answer. And
>> LogByteSizeMergePolicy somewhat allows managing the size of the
>> segments merged already. I would personally prefer being able to
>> merge segments up to a given estimated size, which requires
>> LUCENE-1076 to do well.
>>
>> > is rather different from Lucene benchmark as we are testing
>> high updates in a realtime environment
>>
>> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
>> place to start.
>>
>> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:
>> > Jason:
>> >
>> >    Before jumping into any conclusions, let me describe the test setup.
>> > It
>> > is rather different from Lucene benchmark as we are testing high updates
>> > in
>> > a realtime environment:
>> >
>> >    We took a public corpus: medline, indexed to approximately 3 million
>> > docs. And update all the docs over and over again for a 10 hour
>> > duration.
>> >
>> >    The only differences in the code used were the different MergePolicy
>> > settings
>> > applied.
>> >
>> >    Taking the variable of HW/OS out of the equation, let's ignore the
>> > absolute numbers and compare the relative numbers between the two runs.
>> >
>> >    The spike is due to merging of a large segment when we accumulate.
>> > The
>> > graph/perf numbers fit our hypothesis that the default MergePolicy
>> > chooses
>> > to merge small segments before large ones and does not handle segments
>> > with
>> > a high number of deletes well.
>> >
>> >     Merging is BOTH IO and CPU intensive. Especially large ones.
>> >
>> >     I think the wiki explains it pretty well.
>> >
>> >     What you are saying is true of the IO cache w.r.t. merges. Every time
>> > new
>> > files are created, old files in the IO cache are invalidated. As the experiment
>> > shows, this is detrimental to query performance when large segments are
>> > being
>> > merged.
>> >
>> >     "As we move to a sharded model of indexes, large merges will
>> > naturally not occur." Our test is on a 3 million document index, not
>> > very
>> > large for a single shard. Some katta people have run it on a much much
>> > larger index per shard. Saying large merges will not occur on indexes of
>> > this size IMHO is unfounded.
>> >
>> > -John
>> >
>> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
>> > <ja...@gmail.com> wrote:
>> >>
>> >> John,
>> >>
>> >> It would be great if Lucene's benchmark were used so everyone
>> >> could execute the test in their own environment and verify. It's
>> >> not clear the settings or code used to generate the results so
>> >> it's difficult to draw any reliable conclusions.
>> >>
>> >> The steep spike shows greater evidence for the IO cache being
>> >> cleared during large merges resulting in search performance
>> >> degradation. See:
>> >> http://www.lucidimagination.com/search/?q=madvise
>> >>
>> >> Merging is IO intensive, less CPU intensive. If the
>> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> >> then the CPU could be maxed out. Using a single thread on
>> >> synchronous spinning magnetic media seems more logical. Queries
>> >> are usually the inverse, CPU intensive, not IO intensive when
>> >> the index is in the IO cache. After merging a large segment (or
>> >> during), queries would start hitting disk, and the results
>> >> clearly show that. The queries are suddenly more time consuming
>> >> as they seek on disk at a time when IO activity is at its peak
>> >> from merging large segments. Using madvise would prevent usable
>> >> indexes from being swapped to disk during a merge, query
>> >> performance would continue unabated.
>> >>
>> >> As we move to a sharded model of indexes, large merges will
>> >> naturally not occur. Shards will reach a specified size and new
>> >> documents will be sent to new shards.
>> >>
>> >> -J
>> >>
>> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com>
>> >> wrote:
>> >> > The current default Lucene MergePolicy does not handle frequent
>> >> > updates
>> >> > well.
>> >> >
>> >> > We have done some performance analysis with that and a custom merge
>> >> > policy:
>> >> >
>> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >> >
>> >> > -John
>> >> >
>> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> >> > jason.rutherglen@gmail.com> wrote:
>> >> >
>> >> >> I opened SOLR-1447 for this
>> >> >>
>> >> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>> >> >> > We can use a simple reflection based implementation to simplify
>> >> >> > reading too many parameters.
>> >> >> >
>> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> >> > altogether. It should only be aware of specific Objects and
>> >> >> > interfaces. If users wish to plug in something else in some other
>> >> >> > way,
>> >> >> > it should be fine.
>> >> >> >
>> >> >> >
>> >> >> >  There is a huge learning curve involved in the current
>> >> >> > solrconfig.xml. Let us not make people throw that away.
>> >> >> >
>> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> >> > <ja...@gmail.com> wrote:
>> >> >> >> Over the weekend I may write a patch to allow simple reflection
>> >> >> >> based
>> >> >> >> injection from within solrconfig.
>> >> >> >>
>> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> >> <yo...@lucidimagination.com> wrote:
>> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >> >>> <sh...@gmail.com> wrote:
>> >> >> >>>>> I was wondering if there is a way I can modify
>> >> >> >>>>> calibrateSizeByDeletes
>> >> >> just
>> >> >> >>>>> by configuration ?
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
>> >> >> >>>> in
>> >> >> >>>> the
>> >> >> >>>> constructor. However, please open a Jira issue so we don't
>> >> >> >>>> forget
>> >> >> about
>> >> >> >>>> it.
>> >> >> >>>
>> >> >> >>> It's the continuing stuff like this that makes me feel like we
>> >> >> >>> should
>> >> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
>> >> >> >>> we're
>> >> >> >>> going to get there.
>> >> >> >>>
>> >> >> >>> -Yonik
>> >> >> >>> http://www.lucidimagination.com
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > -----------------------------------------------------
>> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >>
>> >
>> >
>>
>>
>
>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by John Wang <jo...@gmail.com>.
Jason:

    You are missing the point.

    The idea is to avoid merging of large segments. The point of this
MergePolicy is to balance segment merges across the index. The aim is not to
have 1 large segment, it is to have n segments with balanced sizes.

    When the large segment is out of the IO cache, replacing it is very
costly. What we have done is to split the cost over time by having more
frequent but faster merges.

    I am not suggesting Lucene's default mergePolicy isn't good, it is just
not suitable for our case where there are high updates introducing tons of
deletes. The fact that the API is nice enough to allow MergePolicies to be
plugged in is a good thing.

    Please DO read the wiki.

-John
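
The "n segments with balanced sizes" goal can be illustrated with a toy selection rule. This is my sketch only, not the actual ZoieMergePolicy (which, per the wiki, uses a Viterbi-style cost search): always merge the adjacent pair of segments whose combined size is smallest, so no single huge merge is ever scheduled and sizes stay roughly even over time.

```java
import java.util.List;

// Toy sketch of balance-oriented merge selection (not the real
// ZoieMergePolicy): pick the adjacent pair of segments whose combined
// size is smallest, trading rare huge merges for cheap frequent ones.
class BalancedPick {
    static int cheapestAdjacentMerge(List<Long> segmentSizes) {
        int best = -1;
        long bestCost = Long.MAX_VALUE;
        for (int i = 0; i + 1 < segmentSizes.size(); i++) {
            long cost = segmentSizes.get(i) + segmentSizes.get(i + 1);
            if (cost < bestCost) {
                bestCost = cost;
                best = i;             // left index of the chosen pair
            }
        }
        return best;                  // -1 if fewer than two segments exist
    }
}
```

For sizes [100, 10, 20, 500] this picks the 10+20 pair, never touching the 500-unit segment; the real policy's cost function is richer but the spirit is the same.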

On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> I'm not sure I communicated the idea properly. If CMS is set to
> 1 thread, no matter how intensive the CPU for a merge, it's
> limited to 1 core of what is in many cases a 4 or 8 core server.
> That leaves the other 3 or 7 cores for queries, which if slow,
> indicates that it isn't the merging that's slowing down queries,
> but the dumping of the queried segments from the system IO cache.
>
> This holds true regardless of the merge policy used. So while a
> new merge policy sounds great, unless the system IO cache
> problem is solved, there will always be a lingering problem in
> regards to large merges with a regularly updated index. Avoiding
> large merges probably isn't the answer. And
> LogByteSizeMergePolicy somewhat allows managing the size of the
> segments merged already. I would personally prefer being able to
> merge segments up to a given estimated size, which requires
> LUCENE-1076 to do well.
>
> > is rather different from Lucene benchmark as we are testing
> high updates in a realtime environment
>
> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
> place to start.
>
> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:
> > Jason:
> >
> >    Before jumping into any conclusions, let me describe the test setup.
> It
> > is rather different from Lucene benchmark as we are testing high updates
> in
> > a realtime environment:
> >
> >    We took a public corpus: medline, indexed to approximately 3 million
> > docs. And update all the docs over and over again for a 10 hour duration.
> >
> >    The only differences in the code used were the different MergePolicy settings
> > applied.
> >
> >    Taking the variable of HW/OS out of the equation, let's ignore the
> > absolute numbers and compare the relative numbers between the two runs.
> >
> >    The spike is due to merging of a large segment when we accumulate. The
> > graph/perf numbers fit our hypothesis that the default MergePolicy
> chooses
> > to merge small segments before large ones and does not handle segmens
> with
> > high number of deletes well.
> >
> >     Merging is BOTH IO and CPU intensive. Especially large ones.
> >
> >     I think the wiki explains it pretty well.
> >
> >     What are you saying is true with IO cache w.r.t. merge. Everytime new
> > files are created, old files in IO cache is invalided. As the experiment
> > shows, this is detrimental to query performance when large segmens are
> being
> > merged.
> >
> >     "As we move to a sharded model of indexes, large merges will
> > naturally not occur." Our test is on a 3 million document index, not very
> > large for a single shard. Some katta people have run it on a much much
> > larger index per shard. Saying large merges will not occur on indexes of
> > this size IMHO is unfounded.
> >
> > -John
> >
> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> > <ja...@gmail.com> wrote:
> >>
> >> John,
> >>
> >> It would be great if Lucene's benchmark were used so everyone
> >> could execute the test in their own environment and verify. It's
> >> not clear the settings or code used to generate the results so
> >> it's difficult to draw any reliable conclusions.
> >>
> >> The steep spike shows greater evidence for the IO cache being
> >> cleared during large merges resulting in search performance
> >> degradation. See:
> >> http://www.lucidimagination.com/search/?q=madvise
> >>
> >> Merging is IO intensive, less CPU intensive, if the
> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> >> then the CPU could be maxed out. Using a single thread on
> >> synchronous spinning magnetic media seems more logical. Queries
> >> are usually the inverse, CPU intensive, not IO intensive when
> >> the index is in the IO cache. After merging a large segment (or
> >> during), queries would start hitting disk, and the results
> >> clearly show that. The queries are suddenly more time consuming
> >> as they seek on disk at a time when IO activity is at it's peak
> >> from merging large segments. Using madvise would prevent usable
> >> indexes from being swapped to disk during a merge, query
> >> performance would continue unabated.
> >>
> >> As we move to a sharded model of indexes, large merges will
> >> naturally not occur. Shards will reach a specified size and new
> >> documents will be sent to new shards.
> >>
> >> -J
> >>
> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com>
> wrote:
> >> > The current default Lucene MergePolicy does not handle frequent
> updates
> >> > well.
> >> >
> >> > We have done some performance analysis with that and a custom merge
> >> > policy:
> >> >
> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >> >
> >> > -John
> >> >
> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> >> > jason.rutherglen@gmail.com> wrote:
> >> >
> >> >> I opened SOLR-1447 for this
> >> >>
> >> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> >> >> > We can use a simple reflection based implementation to simplify
> >> >> > reading too many parameters.
> >> >> >
> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> >> > altogether. It should only be aware of specific Objects and
> >> >> > interfaces. If users wish to plugin something else in some other
> way
> >> >> > ,
> >> >> > it should be fine
> >> >> >
> >> >> >
> >> >> >  There is a huge learning involved in learning the current
> >> >> > solrconfig.xml . Let us not make people throw away that .
> >> >> >
> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> >> > <ja...@gmail.com> wrote:
> >> >> >> Over the weekend I may write a patch to allow simple reflection
> >> >> >> based
> >> >> >> injection from within solrconfig.
> >> >> >>
> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> >> <yo...@lucidimagination.com> wrote:
> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >> >>> <sh...@gmail.com> wrote:
> >> >> >>>>> I was wondering if there is a way I can modify
> >> >> >>>>> calibrateSizeByDeletes
> >> >> just
> >> >> >>>>> by configuration ?
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
> >> >> >>>> the
> >> >> >>>> constructor. However, please open a Jira issue and so we don't
> >> >> >>>> forget
> >> >> about
> >> >> >>>> it.
> >> >> >>>
> >> >> >>> It's the continuing stuff like this that makes me feel like we
> >> >> >>> should
> >> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
> >> >> >>> we're
> >> >> >>> going to get there.
> >> >> >>>
> >> >> >>> -Yonik
> >> >> >>> http://www.lucidimagination.com
> >> >> >>>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > -----------------------------------------------------
> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >> >
> >> >>
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
I'm not sure I communicated the idea properly. If CMS is set to
1 thread, no matter how intensive the CPU for a merge, it's
limited to 1 core of what is in many cases a 4 or 8 core server.
That leaves the other 3 or 7 cores for queries; if queries are still
slow, that indicates it isn't the merging that's slowing them down,
but the dumping of the queried segments from the system IO cache.

This holds true regardless of the merge policy used. So while a
new merge policy sounds great, unless the system IO cache
problem is solved, there will always be a lingering problem with
regard to large merges with a regularly updated index. Avoiding
large merges probably isn't the answer. And
LogByteSizeMergePolicy somewhat allows managing the size of the
segments merged already. I would personally prefer being able to
merge segments up to a given estimated size, which requires
LUCENE-1076 to do well.

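As a rough illustration, merging "up to a given estimated size" could be
as simple as greedily grouping adjacent segments under a byte cap (plain
Java, no Lucene types; pickMerges and maxMergeBytes are made-up names
for this sketch, not a real Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class SizeCappedMergeSketch {

    // Greedily group adjacent segment sizes (in bytes) into merge
    // candidates whose combined size stays under maxMergeBytes.
    // Segments already at or over the cap are never merged.
    static List<List<Long>> pickMerges(long[] segmentBytes, long maxMergeBytes) {
        List<List<Long>> merges = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : segmentBytes) {
            if (size >= maxMergeBytes) {
                // too big already: close off the current group and skip it
                flush(merges, current);
                current = new ArrayList<>();
                currentBytes = 0;
                continue;
            }
            if (currentBytes + size > maxMergeBytes) {
                flush(merges, current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        flush(merges, current);
        return merges;
    }

    // A merge only makes sense with at least two segments.
    private static void flush(List<List<Long>> merges, List<Long> current) {
        if (current.size() >= 2) {
            merges.add(new ArrayList<>(current));
        }
    }
}
```

The real thing would also need per-segment size estimates that account
for deletions, which is where LUCENE-1076 comes in.
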
> is rather different from Lucene benchmark as we are testing
high updates in a realtime environment

Lucene's benchmark allows this. NearRealtimeReaderTask is a good
place to start.

On Mon, Sep 21, 2009 at 4:50 PM, John Wang <jo...@gmail.com> wrote:
> Jason:
>
>    Before jumping into any conclusions, let me describe the test setup. It
> is rather different from Lucene benchmark as we are testing high updates in
> a realtime environment:
>
>    We took a public corpus: medline, indexed to approximately 3 million
> docs. And update all the docs over and over again for a 10 hour duration.
>
>    Only differences in code used where the different MergePolicy settings
> were applied.
>
>    Taking the variable of HW/OS out of the equation, let's igonored the
> absolute numbers and compare the relative numbers between the two runs.
>
>    The spike is due to merging of a large segment when we accumulate. The
> graph/perf numbers fit our hypothesis that the default MergePolicy chooses
> to merge small segments before large ones and does not handle segmens with
> high number of deletes well.
>
>     Merging is BOTH IO and CPU intensive. Especially large ones.
>
>     I think the wiki explains it pretty well.
>
>     What are you saying is true with IO cache w.r.t. merge. Everytime new
> files are created, old files in IO cache is invalided. As the experiment
> shows, this is detrimental to query performance when large segmens are being
> merged.
>
>     "As we move to a sharded model of indexes, large merges will
> naturally not occur." Our test is on a 3 million document index, not very
> large for a single shard. Some katta people have run it on a much much
> larger index per shard. Saying large merges will not occur on indexes of
> this size IMHO is unfounded.
>
> -John
>
> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>
>> John,
>>
>> It would be great if Lucene's benchmark were used so everyone
>> could execute the test in their own environment and verify. It's
>> not clear the settings or code used to generate the results so
>> it's difficult to draw any reliable conclusions.
>>
>> The steep spike shows greater evidence for the IO cache being
>> cleared during large merges resulting in search performance
>> degradation. See:
>> http://www.lucidimagination.com/search/?q=madvise
>>
>> Merging is IO intensive, less CPU intensive, if the
>> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> then the CPU could be maxed out. Using a single thread on
>> synchronous spinning magnetic media seems more logical. Queries
>> are usually the inverse, CPU intensive, not IO intensive when
>> the index is in the IO cache. After merging a large segment (or
>> during), queries would start hitting disk, and the results
>> clearly show that. The queries are suddenly more time consuming
>> as they seek on disk at a time when IO activity is at it's peak
>> from merging large segments. Using madvise would prevent usable
>> indexes from being swapped to disk during a merge, query
>> performance would continue unabated.
>>
>> As we move to a sharded model of indexes, large merges will
>> naturally not occur. Shards will reach a specified size and new
>> documents will be sent to new shards.
>>
>> -J
>>
>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com> wrote:
>> > The current default Lucene MergePolicy does not handle frequent updates
>> > well.
>> >
>> > We have done some performance analysis with that and a custom merge
>> > policy:
>> >
>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >
>> > -John
>> >
>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> > jason.rutherglen@gmail.com> wrote:
>> >
>> >> I opened SOLR-1447 for this
>> >>
>> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>> >> > We can use a simple reflection based implementation to simplify
>> >> > reading too many parameters.
>> >> >
>> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> > altogether. It should only be aware of specific Objects and
>> >> > interfaces. If users wish to plugin something else in some other way
>> >> > ,
>> >> > it should be fine
>> >> >
>> >> >
>> >> >  There is a huge learning involved in learning the current
>> >> > solrconfig.xml . Let us not make people throw away that .
>> >> >
>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> > <ja...@gmail.com> wrote:
>> >> >> Over the weekend I may write a patch to allow simple reflection
>> >> >> based
>> >> >> injection from within solrconfig.
>> >> >>
>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> <yo...@lucidimagination.com> wrote:
>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >>> <sh...@gmail.com> wrote:
>> >> >>>>> I was wondering if there is a way I can modify
>> >> >>>>> calibrateSizeByDeletes
>> >> just
>> >> >>>>> by configuration ?
>> >> >>>>>
>> >> >>>>
>> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
>> >> >>>> the
>> >> >>>> constructor. However, please open a Jira issue and so we don't
>> >> >>>> forget
>> >> about
>> >> >>>> it.
>> >> >>>
>> >> >>> It's the continuing stuff like this that makes me feel like we
>> >> >>> should
>> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
>> >> >>> we're
>> >> >>> going to get there.
>> >> >>>
>> >> >>> -Yonik
>> >> >>> http://www.lucidimagination.com
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > -----------------------------------------------------
>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by John Wang <jo...@gmail.com>.
Jason:

   Before jumping into any conclusions, let me describe the test setup. It
is rather different from Lucene benchmark as we are testing high updates in
a realtime environment:

   We took a public corpus: medline, indexed to approximately 3 million
docs. And update all the docs over and over again for a 10 hour duration.

   The only differences in the code used were the different MergePolicy
settings that were applied.

   Taking the variable of HW/OS out of the equation, let's ignore the
absolute numbers and compare the relative numbers between the two runs.

   The spike is due to merging of a large segment when we accumulate. The
graph/perf numbers fit our hypothesis that the default MergePolicy chooses
to merge small segments before large ones and does not handle segments with
a high number of deletes well.

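The calibrateSizeByDeletes option discussed in this thread addresses
exactly that: a segment's size is discounted by its proportion of
deleted documents, so a large, delete-heavy segment looks small again
and becomes a merge candidate. A sketch of the arithmetic (not Lucene's
actual code):

```java
public class CalibratedSizeSketch {

    // Estimate a segment's "live" size by discounting the proportion of
    // deleted documents -- roughly the idea behind LogMergePolicy's
    // calibrateSizeByDeletes option.
    static long calibratedSize(long sizeBytes, int maxDoc, int numDeletedDocs) {
        if (maxDoc <= 0) {
            return sizeBytes;
        }
        double liveRatio = 1.0 - ((double) numDeletedDocs / maxDoc);
        return (long) (sizeBytes * liveRatio);
    }
}
```

With half its documents deleted, a 1 GB segment is treated as a 512 MB
one for merge selection, instead of being passed over in favor of
smaller segments.
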
    Merging is BOTH IO and CPU intensive. Especially large ones.

    I think the wiki explains it pretty well.

    What you are saying is true of the IO cache w.r.t. merging. Every time new
files are created, the old files in the IO cache are invalidated. As the
experiment shows, this is detrimental to query performance when large
segments are being merged.

    "As we move to a sharded model of indexes, large merges will
naturally not occur." Our test is on a 3 million document index, not very
large for a single shard. Some katta people have run it on a much much
larger index per shard. Saying large merges will not occur on indexes of
this size IMHO is unfounded.

-John

On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> John,
>
> It would be great if Lucene's benchmark were used so everyone
> could execute the test in their own environment and verify. It's
> not clear the settings or code used to generate the results so
> it's difficult to draw any reliable conclusions.
>
> The steep spike shows greater evidence for the IO cache being
> cleared during large merges resulting in search performance
> degradation. See:
> http://www.lucidimagination.com/search/?q=madvise
>
> Merging is IO intensive, less CPU intensive, if the
> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> then the CPU could be maxed out. Using a single thread on
> synchronous spinning magnetic media seems more logical. Queries
> are usually the inverse, CPU intensive, not IO intensive when
> the index is in the IO cache. After merging a large segment (or
> during), queries would start hitting disk, and the results
> clearly show that. The queries are suddenly more time consuming
> as they seek on disk at a time when IO activity is at it's peak
> from merging large segments. Using madvise would prevent usable
> indexes from being swapped to disk during a merge, query
> performance would continue unabated.
>
> As we move to a sharded model of indexes, large merges will
> naturally not occur. Shards will reach a specified size and new
> documents will be sent to new shards.
>
> -J
>
> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com> wrote:
> > The current default Lucene MergePolicy does not handle frequent updates
> > well.
> >
> > We have done some performance analysis with that and a custom merge
> policy:
> >
> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >
> > -John
> >
> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> > jason.rutherglen@gmail.com> wrote:
> >
> >> I opened SOLR-1447 for this
> >>
> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> >> > We can use a simple reflection based implementation to simplify
> >> > reading too many parameters.
> >> >
> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> > altogether. It should only be aware of specific Objects and
> >> > interfaces. If users wish to plugin something else in some other way ,
> >> > it should be fine
> >> >
> >> >
> >> >  There is a huge learning involved in learning the current
> >> > solrconfig.xml . Let us not make people throw away that .
> >> >
> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> > <ja...@gmail.com> wrote:
> >> >> Over the weekend I may write a patch to allow simple reflection based
> >> >> injection from within solrconfig.
> >> >>
> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> <yo...@lucidimagination.com> wrote:
> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >>> <sh...@gmail.com> wrote:
> >> >>>>> I was wondering if there is a way I can modify
> calibrateSizeByDeletes
> >> just
> >> >>>>> by configuration ?
> >> >>>>>
> >> >>>>
> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
> the
> >> >>>> constructor. However, please open a Jira issue and so we don't
> forget
> >> about
> >> >>>> it.
> >> >>>
> >> >>> It's the continuing stuff like this that makes me feel like we
> should
> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
> we're
> >> >>> going to get there.
> >> >>>
> >> >>> -Yonik
> >> >>> http://www.lucidimagination.com
> >> >>>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > -----------------------------------------------------
> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
John,

It would be great if Lucene's benchmark were used so everyone
could execute the test in their own environment and verify. It's
not clear what settings or code were used to generate the results, so
it's difficult to draw any reliable conclusions.

The steep spike shows greater evidence for the IO cache being
cleared during large merges resulting in search performance
degradation. See:
http://www.lucidimagination.com/search/?q=madvise

Merging is IO intensive and less CPU intensive; if the
ConcurrentMergeScheduler is used, which defaults to 3 threads,
the CPU could be maxed out. Using a single thread on
synchronous spinning magnetic media seems more logical. Queries
are usually the inverse, CPU intensive, not IO intensive when
the index is in the IO cache. After merging a large segment (or
during), queries would start hitting disk, and the results
clearly show that. The queries are suddenly more time consuming
as they seek on disk at a time when IO activity is at its peak
from merging large segments. Using madvise would prevent usable
indexes from being swapped to disk during a merge, query
performance would continue unabated.

As we move to a sharded model of indexes, large merges will
naturally not occur. Shards will reach a specified size and new
documents will be sent to new shards.

-J

On Sun, Sep 20, 2009 at 11:12 PM, John Wang <jo...@gmail.com> wrote:
> The current default Lucene MergePolicy does not handle frequent updates
> well.
>
> We have done some performance analysis with that and a custom merge policy:
>
> http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>
> -John
>
> On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> I opened SOLR-1447 for this
>>
>> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>> > We can use a simple reflection based implementation to simplify
>> > reading too many parameters.
>> >
>> > What I wish to emphasize is that Solr should be agnostic of xml
>> > altogether. It should only be aware of specific Objects and
>> > interfaces. If users wish to plugin something else in some other way ,
>> > it should be fine
>> >
>> >
>> >  There is a huge learning involved in learning the current
>> > solrconfig.xml . Let us not make people throw away that .
>> >
>> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> > <ja...@gmail.com> wrote:
>> >> Over the weekend I may write a patch to allow simple reflection based
>> >> injection from within solrconfig.
>> >>
>> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> <yo...@lucidimagination.com> wrote:
>> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >>> <sh...@gmail.com> wrote:
>> >>>>> I was wondering if there is a way I can modify calibrateSizeByDeletes
>> just
>> >>>>> by configuration ?
>> >>>>>
>> >>>>
>> >>>> Alas, no. The only option that I see for you is to sub-class
>> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
>> >>>> constructor. However, please open a Jira issue and so we don't forget
>> about
>> >>>> it.
>> >>>
>> >>> It's the continuing stuff like this that makes me feel like we should
>> >>> be Spring (or equivalent) based someday... I'm just not sure how we're
>> >>> going to get there.
>> >>>
>> >>> -Yonik
>> >>> http://www.lucidimagination.com
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > -----------------------------------------------------
>> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >
>>
>



Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by John Wang <jo...@gmail.com>.
The current default Lucene MergePolicy does not handle frequent updates
well.

We have done some performance analysis with that and a custom merge policy:

http://code.google.com/p/zoie/wiki/ZoieMergePolicy

-John

On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> I opened SOLR-1447 for this
>
> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> > We can use a simple reflection based implementation to simplify
> > reading too many parameters.
> >
> > What I wish to emphasize is that Solr should be agnostic of xml
> > altogether. It should only be aware of specific Objects and
> > interfaces. If users wish to plugin something else in some other way ,
> > it should be fine
> >
> >
> >  There is a huge learning involved in learning the current
> > solrconfig.xml . Let us not make people throw away that .
> >
> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> > <ja...@gmail.com> wrote:
> >> Over the weekend I may write a patch to allow simple reflection based
> >> injection from within solrconfig.
> >>
> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> <yo...@lucidimagination.com> wrote:
> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >>> <sh...@gmail.com> wrote:
> >>>>> I was wondering if there is a way I can modify calibrateSizeByDeletes
> just
> >>>>> by configuration ?
> >>>>>
> >>>>
> >>>> Alas, no. The only option that I see for you is to sub-class
> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
> >>>> constructor. However, please open a Jira issue and so we don't forget
> about
> >>>> it.
> >>>
> >>> It's the continuing stuff like this that makes me feel like we should
> >>> be Spring (or equivalent) based someday... I'm just not sure how we're
> >>> going to get there.
> >>>
> >>> -Yonik
> >>> http://www.lucidimagination.com
> >>>
> >>
> >
> >
> >
> > --
> > -----------------------------------------------------
> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
I opened SOLR-1447 for this

2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> We can use a simple reflection based implementation to simplify
> reading too many parameters.
>
> What I wish to emphasize is that Solr should be agnostic of xml
> altogether. It should only be aware of specific Objects and
> interfaces. If users wish to plugin something else in some other way ,
> it should be fine
>
>
>  There is a huge learning involved in learning the current
> solrconfig.xml . Let us not make people throw away that .
>
> On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>> Over the weekend I may write a patch to allow simple reflection based
>> injection from within solrconfig.
>>
>> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>>> <sh...@gmail.com> wrote:
>>>>> I was wondering if there is a way I can modify calibrateSizeByDeletes just
>>>>> by configuration ?
>>>>>
>>>>
>>>> Alas, no. The only option that I see for you is to sub-class
>>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
>>>> constructor. However, please open a Jira issue and so we don't forget about
>>>> it.
>>>
>>> It's the continuing stuff like this that makes me feel like we should
>>> be Spring (or equivalent) based someday... I'm just not sure how we're
>>> going to get there.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
We can use a simple reflection based implementation to simplify
reading too many parameters.

What I wish to emphasize is that Solr should be agnostic of xml
altogether. It should only be aware of specific Objects and
interfaces. If users wish to plug in something else in some other way,
it should be fine.


 There is a huge learning curve involved in the current
solrconfig.xml. Let us not make people throw that away.

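A "simple reflection based implementation" could be as small as mapping
each config parameter to a one-argument setter (a sketch only; DemoPolicy
is a made-up stand-in, not a Solr or Lucene class):

```java
import java.lang.reflect.Method;
import java.util.Map;

public class ReflectionConfigSketch {

    // Apply string parameters to an arbitrary bean by locating a matching
    // public one-argument setter and converting the value to its type.
    // Only boolean, int and String are handled in this sketch.
    static void apply(Object target, Map<String, String> params) {
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                String setter = "set"
                        + Character.toUpperCase(e.getKey().charAt(0))
                        + e.getKey().substring(1);
                for (Method m : target.getClass().getMethods()) {
                    if (m.getName().equals(setter)
                            && m.getParameterTypes().length == 1) {
                        Class<?> t = m.getParameterTypes()[0];
                        Object value;
                        if (t == boolean.class) {
                            value = Boolean.parseBoolean(e.getValue());
                        } else if (t == int.class) {
                            value = Integer.parseInt(e.getValue());
                        } else {
                            value = e.getValue();
                        }
                        m.invoke(target, value);
                        break;
                    }
                }
            }
        } catch (ReflectiveOperationException ex) {
            throw new RuntimeException(ex);
        }
    }

    // Hypothetical policy bean standing in for a real merge policy.
    public static class DemoPolicy {
        private boolean calibrateSizeByDeletes;
        private int mergeFactor = 10;
        public void setCalibrateSizeByDeletes(boolean b) { calibrateSizeByDeletes = b; }
        public void setMergeFactor(int m) { mergeFactor = m; }
        public boolean isCalibrateSizeByDeletes() { return calibrateSizeByDeletes; }
        public int getMergeFactor() { return mergeFactor; }
    }
}
```

With something like that in place, parameters read from solrconfig.xml
could be applied to any plugin without Solr knowing its concrete type.
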
On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
> Over the weekend I may write a patch to allow simple reflection based
> injection from within solrconfig.
>
> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> <sh...@gmail.com> wrote:
>>>> I was wondering if there is a way I can modify calibrateSizeByDeletes just
>>>> by configuration ?
>>>>
>>>
>>> Alas, no. The only option that I see for you is to sub-class
>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
>>> constructor. However, please open a Jira issue and so we don't forget about
>>> it.
>>
>> It's the continuing stuff like this that makes me feel like we should
>> be Spring (or equivalent) based someday... I'm just not sure how we're
>> going to get there.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Jason Rutherglen <ja...@gmail.com>.
Over the weekend I may write a patch to allow simple reflection based
injection from within solrconfig.

On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> <sh...@gmail.com> wrote:
>>> I was wondering if there is a way I can modify calibrateSizeByDeletes just
>>> by configuration ?
>>>
>>
>> Alas, no. The only option that I see for you is to sub-class
>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
>> constructor. However, please open a Jira issue and so we don't forget about
>> it.
>
> It's the continuing stuff like this that makes me feel like we should
> be Spring (or equivalent) based someday... I'm just not sure how we're
> going to get there.
>
> -Yonik
> http://www.lucidimagination.com
>
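
The subclass-and-constructor workaround Shalin describes in the quoted
reply boils down to this pattern (the base class here is a mocked
stand-in for Lucene's LogByteSizeMergePolicy, for illustration only):

```java
// Mocked stand-in for org.apache.lucene.index.LogByteSizeMergePolicy,
// just enough to show the pattern; not the real Lucene class.
class LogByteSizeMergePolicyStandIn {
    protected boolean calibrateSizeByDeletes = false; // Lucene's default
    public boolean getCalibrateSizeByDeletes() { return calibrateSizeByDeletes; }
}

// Subclass that flips the flag in its no-arg constructor, so Solr can
// instantiate it from nothing but a class name in solrconfig.xml.
public class DeletesAwareMergePolicy extends LogByteSizeMergePolicyStandIn {
    public DeletesAwareMergePolicy() {
        calibrateSizeByDeletes = true;
    }
}
```
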

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
With SOLR-1447 you should be able to use ZoieMergePolicy as well.

On Mon, Sep 21, 2009 at 11:43 AM, John Wang <jo...@gmail.com> wrote:
> Yonik:
>
> It would be great if Solr could be configured through some sort of
> dependency injection framework like Spring! A big +1 from me!
>
> -John
>
> On Fri, Sep 18, 2009 at 11:10 PM, Yonik Seeley
> <yo...@lucidimagination.com>wrote:
>
>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> <sh...@gmail.com> wrote:
>> >> I was wondering if there is a way I can modify calibrateSizeByDeletes
>> >> just
>> >> by configuration ?
>> >>
>> >
>> > Alas, no. The only option that I see for you is to sub-class
>> > LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
>> > constructor. However, please open a Jira issue so we don't forget
>> > about
>> > it.
>>
>> It's the continuing stuff like this that makes me feel like we should
>> be Spring (or equivalent) based someday... I'm just not sure how we're
>> going to get there.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by John Wang <jo...@gmail.com>.
Yonik:

        It would be great if Solr could be configured through some sort of
dependency injection framework like Spring! A big +1 from me!

-John

On Fri, Sep 18, 2009 at 11:10 PM, Yonik Seeley
<yo...@lucidimagination.com>wrote:

> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> <sh...@gmail.com> wrote:
> >> I was wondering if there is a way I can modify calibrateSizeByDeletes
> >> just
> >> by configuration ?
> >>
> >
> > Alas, no. The only option that I see for you is to sub-class
> > LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
> > constructor. However, please open a Jira issue so we don't forget
> > about
> > it.
>
> It's the continuing stuff like this that makes me feel like we should
> be Spring (or equivalent) based someday... I'm just not sure how we're
> going to get there.
>
> -Yonik
> http://www.lucidimagination.com
>

Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
>> I was wondering if there is a way I can modify calibrateSizeByDeletes just
>> by configuration ?
>>
>
> Alas, no. The only option that I see for you is to sub-class
> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the
> constructor. However, please open a Jira issue so we don't forget about
> it.

It's the continuing stuff like this that makes me feel like we should
be Spring (or equivalent) based someday... I'm just not sure how we're
going to get there.

-Yonik
http://www.lucidimagination.com
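
Shalin's suggested workaround, sketched against the Lucene 2.9-era API (untested; the package name is a placeholder, and on some Lucene versions the superclass constructor takes an IndexWriter rather than no arguments):

```java
package com.example; // placeholder package

import org.apache.lucene.index.LogByteSizeMergePolicy;

// Minimal subclass that turns on calibrateSizeByDeletes (LUCENE-1634)
// in its constructor, so deleted documents count toward merge selection.
public class DeleteCalibratedMergePolicy extends LogByteSizeMergePolicy {
    public DeleteCalibratedMergePolicy() {
        setCalibrateSizeByDeletes(true);
    }
}
```

It could then be named with the existing solrconfig element:
<mergePolicy>com.example.DeleteCalibratedMergePolicy</mergePolicy>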