Posted to solr-user@lucene.apache.org by Stephen Boesch <ja...@gmail.com> on 2011/01/06 18:37:04 UTC

Including Small Amounts of New Data in Searches (MultiSearcher ?)

Solr/Lucene newbie here.

We would like searches against a Solr/Lucene index to see newly added data
immediately.  I stress a "small" amount of new data, given that any
significant amount would incur excessive latency.

Looking around, I'm wondering whether the right direction is a MultiSearcher
sitting on top of our standard directory-based IndexReader plus a custom
Searchable that handles the newest documents, combining the two sets of
results.

Is that a reasonable way to go, and are there examples of similar
implementations?

thanks!

stephenb
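
To make the "combine the two results" idea in the question concrete, here is a minimal sketch in plain Java. It is a toy model only: `CombinedSearchSketch`, `Hit`, and `mergeTopN` are hypothetical names, not Lucene or Solr API, and it assumes each searcher has already returned its hits sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: hypothetical names, not Lucene or Solr API. */
final class CombinedSearchSketch {

    /** One search hit: a document id and its relevance score. */
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Merges two hit lists, each already sorted by descending score, and
     * returns at most n hits in descending score order. On equal scores
     * the hit from the main index is taken first.
     */
    static List<Hit> mergeTopN(List<Hit> mainHits, List<Hit> recentHits, int n) {
        List<Hit> merged = new ArrayList<>();
        int i = 0; // next unread position in mainHits
        int j = 0; // next unread position in recentHits
        while (merged.size() < n && (i < mainHits.size() || j < recentHits.size())) {
            boolean takeMain;
            if (i >= mainHits.size()) {
                takeMain = false;
            } else if (j >= recentHits.size()) {
                takeMain = true;
            } else {
                takeMain = mainHits.get(i).score >= recentHits.get(j).score;
            }
            merged.add(takeMain ? mainHits.get(i++) : recentHits.get(j++));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hits from the large on-disk index and from the small "newest documents" index.
        List<Hit> mainHits = Arrays.asList(new Hit("old-1", 0.9f), new Hit("old-2", 0.4f));
        List<Hit> recentHits = Arrays.asList(new Hit("new-1", 0.7f));
        for (Hit hit : mergeTopN(mainHits, recentHits, 2)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

One caveat with this kind of merge: scores computed against two separate indexes are generally not directly comparable, because term statistics such as document frequency differ between them, so a real implementation would need to share or normalize those statistics.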

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Jason Rutherglen <ja...@gmail.com>.

> most of the Solr sites I know of
> have much larger indexes than ram and expect everything to work
> smoothly

Hmm... In that case, throttling the merges would probably help most,
though, yes, that's not available today.  In lieu of that, I'd run
large merges during off-peak hours or, better yet, use Solr's
replication, e.g., merge on the master, where no queries are being
served.  That might throw off the NRT interval, though.
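
As a rough illustration of what throttling a merge could mean, the sketch below computes how long a merge thread would pause to stay under a target write rate. It is a toy model in plain Java; `MergeThrottleSketch`, `pauseNanos`, and the rate figures are assumptions for illustration, not an existing Lucene or Solr API.

```java
/** Toy model of write-rate throttling for a merge; hypothetical, not Lucene API. */
final class MergeThrottleSketch {

    /**
     * Returns how many nanoseconds the merge thread should pause so that
     * writing bytesWritten bytes takes at least
     * bytesWritten / maxBytesPerSec seconds in total.
     */
    static long pauseNanos(long bytesWritten, long elapsedNanos, long maxBytesPerSec) {
        if (maxBytesPerSec <= 0) {
            throw new IllegalArgumentException("maxBytesPerSec must be positive");
        }
        // Time the write should have taken at the target rate.
        long targetNanos = (long) (bytesWritten / (double) maxBytesPerSec * 1_000_000_000L);
        // If the write was already slower than the target rate, do not pause.
        return Math.max(0L, targetNanos - elapsedNanos);
    }

    public static void main(String[] args) {
        // 5 MB written in 100 ms against an assumed cap of 10 MB/s.
        long pause = pauseNanos(5_000_000L, 100_000_000L, 10_000_000L);
        System.out.println("pause for " + pause / 1_000_000L + " ms");
    }
}
```

A merge loop built on this would call `pauseNanos` after each chunk it writes and sleep for the returned time, trading a longer merge for less IO contention with queries.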

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Lance Norskog <go...@gmail.com>.

OK. I was talking about the tools that are available now; much better
things are coming in the NRT work. I don't know how merges work now with
regard to multitasking and thread contention. Most of the Solr sites I
know of have indexes much larger than RAM and expect everything to work
smoothly.

Lance

-- 
Lance Norskog
goksron@gmail.com

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Jason Rutherglen <ja...@gmail.com>.

> The older MergePolicies followed a strategy which is quite disruptive in an NRT environment.

Can you elaborate as to why (maybe we need to put this in a wiki)?
If large merges are running in their own thread, they should not
disrupt queries, e.g., there won't be CPU contention.  The IO contention
can be disruptive, depending on the size and type of hardware.  However,
in the ideal case of the index fitting into RAM/IO cache, a large
merge should not affect queries (or indexing).

I think what's being developed that is useful for keeping merges from
disrupting NRT is DirectIOLinuxDirectory:
https://issues.apache.org/jira/browse/LUCENE-2500  It's also useful
for the non-NRT use case, because any time IO cache pages are evicted,
queries will slow down (unless the index is too large to fit in RAM
anyway).

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Stephen Boesch <ja...@gmail.com>.

Thanks, Lance, for mentioning the MergePolicies, and specifically the one
contributed by LinkedIn.

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Lance Norskog <go...@gmail.com>.

There are always slowdowns when merging new segments during indexing.
A MergePolicy decides when to merge segments.  The older MergePolicies
followed a strategy which is quite disruptive in an NRT environment.
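
One way to picture why a merge schedule can be disruptive is a simplified log-style model: whenever a fixed number of equally sized segments accumulate, they are merged, and that merge can cascade into a much larger one. The sketch below is a toy simulation in plain Java under that assumption; `MergeCascadeSketch` and `simulate` are hypothetical names, and this is not Lucene's actual MergePolicy code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a log-style merge schedule; not Lucene's MergePolicy code. */
final class MergeCascadeSketch {

    /**
     * Simulates `flushes` flushes of one-document segments. Whenever the
     * newest `mergeFactor` segments all have the same size, they are merged
     * into one segment, which may in turn trigger a merge one level up.
     * Returns the size (in documents) of every merge, in the order performed.
     */
    static List<Long> simulate(int mergeFactor, int flushes) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Long> segments = new ArrayList<>();
        List<Long> mergeSizes = new ArrayList<>();
        for (int f = 0; f < flushes; f++) {
            segments.add(1L); // a newly flushed one-document segment
            while (tailIsUniform(segments, mergeFactor)) {
                long merged = 0;
                for (int k = 0; k < mergeFactor; k++) {
                    merged += segments.remove(segments.size() - 1);
                }
                segments.add(merged);
                mergeSizes.add(merged);
            }
        }
        return mergeSizes;
    }

    /** True if the newest mergeFactor segments exist and all have the same size. */
    static boolean tailIsUniform(List<Long> segments, int mergeFactor) {
        int size = segments.size();
        if (size < mergeFactor) {
            return false;
        }
        long newest = segments.get(size - 1);
        for (int k = size - mergeFactor; k < size; k++) {
            if (segments.get(k) != newest) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // With mergeFactor 3, the ninth flush triggers a small merge and then a cascade.
        System.out.println(simulate(3, 9));
    }
}
```

In this model most flushes cost nothing or a small merge, but every so often a single flush triggers a merge of the entire index, which is the kind of spike that hurts when new documents are expected to be searchable right away.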

There is a new feature in 3.x and trunk called
'BalancedSegmentMergePolicy'. This new MergePolicy is designed for the
near-real-time use case. It was contributed by LinkedIn. You may find
it works well enough for your case.

Lance

-- 
Lance Norskog
goksron@gmail.com

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Stephen Boesch <ja...@gmail.com>.

Thanks Yonik,
  Using a stable release of Solr, what would you suggest, given
MultiSearcher's demise and that the other work is still ongoing?

Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch <ja...@gmail.com> wrote:
> Solr/lucene newbie here ..
>
> We would like searches against a solr/lucene index to immediately be able to
> view data that was added.  I stress "small" amount of new data given that
> any significant amount would require excessive  latency.

There has been significant ongoing work in lucene-core for NRT (near real time).
We need to overhaul Solr's DirectUpdateHandler2 to take advantage of
all this work.
Mark Miller took a first crack at it (sharing a single IndexWriter,
letting Lucene handle the concurrency issues, etc.),
but if there's a JIRA issue, I'm having trouble finding it.

> Looking around, i'm wondering if the direction would be a MultiSearcher
> living on top of our standard directory-based IndexReader as well as a
> custom Searchable that handles the newest documents - and then combines the
> two results?

If you look at trunk, MultiSearcher has already gone away.

-Yonik
http://www.lucidimagination.com