Posted to dev@lucene.apache.org by Yonik Seeley <yo...@apache.org> on 2007/02/02 19:36:35 UTC

TermInfosReader lazy term index reading

What was the use-case behind loading the term index lazily?
I'm having a hard time figuring out what one would do with an
IndexReader that doesn't involve a term lookup somehow.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: TermInfosReader lazy term index reading

Posted by robert engels <re...@ix.netcom.com>.
You only need to load it for segments that are read, instead of  
paying the init price on all segments that may never be used.

On Feb 2, 2007, at 12:36 PM, Yonik Seeley wrote:

> What was the use-case behind loading the term index lazily?
> I'm having a hard time figuring out what one would do with an
> IndexReader that doesn't involve a term lookup somehow.
>
> -Yonik


Re: TermInfosReader lazy term index reading

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> I think synchronization can become expensive under heavy contention,
> regardless of how lightweight the code inside.

I'm skeptical of this.  It's possible, but I've never seen it.

Doug


Re: TermInfosReader lazy term index reading

Posted by Yonik Seeley <yo...@apache.org>.
FYI, I filed a Solr bug for this issue:
https://issues.apache.org/jira/browse/SOLR-138

-Yonik


Re: TermInfosReader lazy term index reading

Posted by robert engels <re...@ix.netcom.com>.
I think that is much more involved... I don't think there is an easy  
way to move a query between threads/pools once it has started unless  
you restart the entire query.

You might be able to dynamically lower the thread priority however  
when you detect a long query, so that smaller (faster) queries would  
have priority.
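
A minimal sketch of the priority idea, assuming the search loop can call a check periodically. The class name, method name, and time budget are hypothetical and not part of Lucene. Thread priority is only a hint to the scheduler, so the effect depends on the JVM and OS.

```java
/** Hypothetical helper: lower the current thread's priority once a query runs long. */
class PriorityDemotion {
    /**
     * Call periodically from inside a long-running search loop.
     * Returns true once the query has exceeded its time budget.
     */
    static boolean demoteIfSlow(long startNanos, long budgetNanos) {
        if (System.nanoTime() - startNanos > budgetNanos) {
            Thread current = Thread.currentThread();
            if (current.getPriority() > Thread.MIN_PRIORITY) {
                // Only a scheduling hint; some platforms largely ignore it.
                current.setPriority(Thread.MIN_PRIORITY);
            }
            return true;
        }
        return false;
    }
}
```

With pooled threads, the caller would need to restore the original priority when the query finishes, otherwise the next query on that thread starts out demoted.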


On Feb 2, 2007, at 4:44 PM, Doron Cohen wrote:

> robert engels <re...@ix.netcom.com> wrote on 02/02/2007 14:08:46:
>
>> You might be able to quantify the search request ahead of time (# of
>> terms, # of high frequency terms, etc.) and assign the request to the
>> appropriate pool (quick, normal, lengthy).
>>
>> Then you can assign an appropriate # of threads to each pool.
>
> Or, to avoid pre-computation, requests can first be assigned to a
> 'faster' queue, assuming they are short, and only later, if a
> request turns out to be longer, it can be dynamically moved to a
> 'slower' queue, maybe less prioritized. (Similar I think to OS
> job scheduling.) (Can have more than 2 queues.)
>
> I wonder if there's danger that queueing queries would increase the
> avg time-to-complete, even if the total time is reduced?
>
>>
>> Most people understand that complex queries might take longer to
>> execute.
>>
>>
>> On Feb 2, 2007, at 4:01 PM, Yonik Seeley wrote:
>>
>>> On 2/2/07, robert engels <re...@ix.netcom.com> wrote:
>>>> For a process that is mostly CPU bound (which is the case with Lucene
>>>> if the index is in the OS cache), having so many "active" threads
>>>> will actually hurt performance due to the context switching and
>>>> synchronization.
>>>
>>> Sure... it certainly wasn't by design to have that many threads all
>>> trying to do something.
>>>
>>>> Better to use a request queue / thread pool. (I
>>>> think I read somewhere that a good rule of thumb is 2x the number of
>>>> processors).
>>>
>>> You might hit a scenario where a couple of threads are doing long
>>> running queries, and that could lock out other queries that might
>>> otherwise execute quickly.  But overall, it's not a bad idea.
>>>
>>>> If most of the searches are IO bound having so many disparate
>>>> requests will hurt performance as well since the disk heads will be
>>>> seeking all over the place and losing any locality of data that
>>>> Lucene provides (postings, sequential term reads, etc.).
>>>
>>> We're not hitting disk... plenty of RAM.
>>>
>>> -Yonik


Re: TermInfosReader lazy term index reading

Posted by Doron Cohen <DO...@il.ibm.com>.
robert engels <re...@ix.netcom.com> wrote on 02/02/2007 14:08:46:

> You might be able to quantify the search request ahead of time (# of
> terms, # of high frequency terms, etc.) and assign the request to the
> appropriate pool (quick, normal, lengthy).
>
> Then you can assign an appropriate # of threads to each pool.

Or, to avoid pre-computation, requests can first be assigned to a
'faster' queue, assuming they are short, and only later, if a
request turns out to be longer, it can be dynamically moved to a
'slower' queue, maybe less prioritized. (Similar I think to OS
job scheduling.) (Can have more than 2 queues.)

I wonder if there's danger that queueing queries would increase the
avg time-to-complete, even if the total time is reduced?

>
> Most people understand that complex queries might take longer to
> execute.
>
>
> On Feb 2, 2007, at 4:01 PM, Yonik Seeley wrote:
>
> > On 2/2/07, robert engels <re...@ix.netcom.com> wrote:
> >> For a process that is mostly CPU bound (which is the case with Lucene
> >> if the index is in the OS cache), having so many "active" threads
> >> will actually hurt performance due to the context switching and
> >> synchronization.
> >
> > Sure... it certainly wasn't by design to have that many threads all
> > trying to do something.
> >
> >> Better to use a request queue / thread pool. (I
> >> think I read somewhere that a good rule of thumb is 2x the number of
> >> processors).
> >
> > You might hit a scenario where a couple of threads are doing long
> > running queries, and that could lock out other queries that might
> > otherwise execute quickly.  But overall, it's not a bad idea.
> >
> >> If most of the searches are IO bound having so many disparate
> >> requests will hurt performance as well since the disk heads will be
> >> seeking all over the place and losing any locality of data that
> >> Lucene provides (postings, sequential term reads, etc.).
> >
> > We're not hitting disk... plenty of RAM.
> >
> > -Yonik


Re: TermInfosReader lazy term index reading

Posted by robert engels <re...@ix.netcom.com>.
You might be able to quantify the search request ahead of time (# of  
terms, # of high frequency terms, etc.) and assign the request to the  
appropriate pool (quick, normal, lengthy).

Then you can assign an appropriate # of threads to each pool.
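
A rough sketch of that routing, assuming the caller can count terms and high-frequency terms before running the query. The class, the lane names, and the thresholds are all hypothetical and would need tuning against real queries.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical router: one fixed-size pool per estimated query cost. */
class QueryRouter {
    enum Lane { QUICK, NORMAL, LENGTHY }

    private final Map<Lane, ExecutorService> pools = new EnumMap<>(Lane.class);

    QueryRouter(int quickThreads, int normalThreads, int lengthyThreads) {
        pools.put(Lane.QUICK, Executors.newFixedThreadPool(quickThreads));
        pools.put(Lane.NORMAL, Executors.newFixedThreadPool(normalThreads));
        pools.put(Lane.LENGTHY, Executors.newFixedThreadPool(lengthyThreads));
    }

    /** Illustrative thresholds only; they are assumptions, not measured values. */
    static Lane classify(int termCount, int highFrequencyTerms) {
        if (termCount > 20 || highFrequencyTerms >= 3) {
            return Lane.LENGTHY;
        }
        if (termCount > 5 || highFrequencyTerms >= 1) {
            return Lane.NORMAL;
        }
        return Lane.QUICK;
    }

    /** Queues the query on the pool that matches its estimated cost. */
    Future<?> submit(int termCount, int highFrequencyTerms, Runnable query) {
        return pools.get(classify(termCount, highFrequencyTerms)).submit(query);
    }

    void shutdown() {
        for (ExecutorService pool : pools.values()) {
            pool.shutdown();
        }
    }
}
```

Giving the lengthy lane fewer threads keeps expensive queries from occupying every worker.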

Most people understand that complex queries might take longer to  
execute.


On Feb 2, 2007, at 4:01 PM, Yonik Seeley wrote:

> On 2/2/07, robert engels <re...@ix.netcom.com> wrote:
>> For a process that is mostly CPU bound (which is the case with Lucene
>> if the index is in the OS cache), having so many "active" threads
>> will actually hurt performance due to the context switching and
>> synchronization.
>
> Sure... it certainly wasn't by design to have that many threads all
> trying to do something.
>
>> Better to use a request queue / thread pool. (I
>> think I read somewhere that a good rule of thumb is 2x the number of
>> processors).
>
> You might hit a scenario where a couple of threads are doing long
> running queries, and that could lock out other queries that might
> otherwise execute quickly.  But overall, it's not a bad idea.
>
>> If most of the searches are IO bound having so many disparate
>> requests will hurt performance as well since the disk heads will be
>> seeking all over the place and losing any locality of data that
>> Lucene provides (postings, sequential term reads, etc.).
>
> We're not hitting disk... plenty of RAM.
>
> -Yonik


Re: TermInfosReader lazy term index reading

Posted by Yonik Seeley <yo...@apache.org>.
On 2/2/07, robert engels <re...@ix.netcom.com> wrote:
> For a process that is mostly CPU bound (which is the case with Lucene
> if the index is in the OS cache), having so many "active" threads
> will actually hurt performance due to the context switching and
> synchronization.

Sure... it certainly wasn't by design to have that many threads all
trying to do something.

> Better to use a request queue / thread pool. (I
> think I read somewhere that a good rule of thumb is 2x the number of
> processors).

You might hit a scenario where a couple of threads are doing long
running queries, and that could lock out other queries that might
otherwise execute quickly.  But overall, it's not a bad idea.

> If most of the searches are IO bound having so many disparate
> requests will hurt performance as well since the disk heads will be
> seeking all over the place and losing any locality of data that
> Lucene provides (postings, sequential term reads, etc.).

We're not hitting disk... plenty of RAM.

-Yonik


Re: TermInfosReader lazy term index reading

Posted by robert engels <re...@ix.netcom.com>.
FYI,

For a process that is mostly CPU bound (which is the case with Lucene  
if the index is in the OS cache), having so many "active" threads  
will actually hurt performance due to the context switching and  
synchronization. Better to use a request queue / thread pool. (I  
think I read somewhere that a good rule of thumb is 2x the number of  
processors).
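
A minimal sketch of a request queue in front of a fixed-size pool, using java.util.concurrent. The SearchPool class and its method names are hypothetical. The 2x sizing is just the rule of thumb mentioned above, a starting point to measure against rather than a recommendation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical wrapper: a bounded number of search threads with a request queue. */
class SearchPool {
    private final ExecutorService executor;

    SearchPool(int threads) {
        // Fixed worker count; extra requests wait in the executor's internal queue.
        this.executor = Executors.newFixedThreadPool(threads);
    }

    /** Rule-of-thumb size: twice the number of available processors. */
    static int defaultPoolSize() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    <T> Future<T> submit(Callable<T> search) {
        return executor.submit(search);
    }

    /** Waits for a result and converts checked exceptions to unchecked ones. */
    static <T> T await(Future<T> pending) {
        try {
            return pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search failed", e.getCause());
        }
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

A caller would create one pool with `new SearchPool(SearchPool.defaultPoolSize())` and submit each search as a Callable, so the number of concurrently running searches stays bounded no matter how many requests arrive.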

If most of the searches are IO bound having so many disparate  
requests will hurt performance as well since the disk heads will be  
seeking all over the place and losing any locality of data that  
Lucene provides (postings, sequential term reads, etc.).

There are some excellent academic papers I just came across on
high-performance parallel disk based sorting and many of the
techniques/concerns apply to Lucene.

Robert


On Feb 2, 2007, at 3:38 PM, Yonik Seeley wrote:

> On 2/2/07, Doug Cutting <cu...@apache.org> wrote:
>> Yonik Seeley wrote:
>> > I ran across a situation where a great number of threads were blocked on
>> > ensureIndexIsRead(), even after it had already been loaded.
>>
>> That sounds bizarre.  A sync block that tests a field for non-null
>> shouldn't tie things up much, I wouldn't think.
>
> There were hundreds of threads all blocked on the same lock.
> I think synchronization can become expensive under heavy contention,
> regardless of how lightweight the code inside.
>
> It's obviously not the root cause of the problem... the query
> structure was very expensive (a range query covering most documents
> that didn't get pulled out into a Filter), but it still could be an
> area of improvement.
>
> I'm going to try and see if I can duplicate it, then see what effect
> removing the synchronization has.
>
>>   Are you sure that one
>> of the threads wasn't actually reading the index?
>
> Yep.  We've seen the same thing with older versions of Lucene when
> multiple threads tried to sort on the same field and there was massive
> contention from everyone trying to generate the same entry.
>
>> Or perhaps some other
>> method also synchronizes on the same object?
>
> Good question... I only checked TermInfosReader itself.
>
> -Yonik


Re: TermInfosReader lazy term index reading

Posted by Yonik Seeley <yo...@apache.org>.
On 2/2/07, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > I ran across a situation where a great number of threads were blocked on
> > ensureIndexIsRead(), even after it had already been loaded.
>
> That sounds bizarre.  A sync block that tests a field for non-null
> shouldn't tie things up much, I wouldn't think.

There were hundreds of threads all blocked on the same lock.
I think synchronization can become expensive under heavy contention,
regardless of how lightweight the code inside.

It's obviously not the root cause of the problem... the query
structure was very expensive (a range query covering most documents
that didn't get pulled out into a Filter), but it still could be an
area of improvement.

I'm going to try and see if I can duplicate it, then see what effect
removing the synchronization has.

>   Are you sure that one
> of the threads wasn't actually reading the index?

Yep.  We've seen the same thing with older versions of Lucene when
multiple threads tried to sort on the same field and there was massive
contention from everyone trying to generate the same entry.
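
One way to keep many threads from building the same entry is sketched here with modern Java's ConcurrentHashMap.computeIfAbsent, which applies the build function at most once per key. This illustrates the general technique and is not how Lucene handled it. The FieldSortCache class and its stand-in build step are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-field cache where each entry is built at most once. */
class FieldSortCache {
    private final ConcurrentHashMap<String, int[]> entries = new ConcurrentHashMap<>();
    final AtomicInteger builds = new AtomicInteger(); // build counter, for illustration only

    int[] get(String field) {
        // Threads asking for the same missing field wait for a single build
        // instead of each generating their own copy.
        return entries.computeIfAbsent(field, this::build);
    }

    private int[] build(String field) {
        builds.incrementAndGet();
        // Stand-in for the expensive step of reading every term for the field.
        int[] ordinals = new int[field.length()];
        for (int i = 0; i < ordinals.length; i++) {
            ordinals[i] = field.charAt(i);
        }
        return ordinals;
    }
}
```

Threads that arrive during the first build still wait for it, but once the entry exists, reads do not take a shared lock.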

> Or perhaps some other
> method also synchronizes on the same object?

Good question... I only checked TermInfosReader itself.

-Yonik


Re: TermInfosReader lazy term index reading

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> I ran across a situation where a great number of threads were blocked on
> ensureIndexIsRead(), even after it had already been loaded.

That sounds bizarre.  A sync block that tests a field for non-null 
shouldn't tie things up much, I wouldn't think.  Are you sure that one 
of the threads wasn't actually reading the index?  Or perhaps some other 
method also synchronizes on the same object?

Doug


Re: TermInfosReader lazy term index reading

Posted by Yonik Seeley <yo...@apache.org>.
On 2/2/07, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > What was the use-case behind loading the term index lazily?
> > I'm having a hard time figuring out what one would do with an
> > IndexReader that doesn't involve a term lookup somehow.
>
> Index merging only iterates through terms.

Ah, that makes sense.

I ran across a situation where a great number of threads were blocked on
ensureIndexIsRead(), even after it had already been loaded.  I was
wondering if it was worth trying to get rid of the sync block.  It
wouldn't totally fix the issue, but it might improve things.

One could signal the SegmentReader to lazy-load or not, and then the
sync block could be moved inside an "if" that only executed if lazy
loading was on (or it could also be overridden in a subclass to do
nothing if lazy loading was off).
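
A rough sketch of that idea, using a hypothetical stand-in class rather than the real TermInfosReader. The `lazy` flag, the class name, and the toy term array are assumptions for illustration only.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for a reader that owns a term index.
 * This is not Lucene's TermInfosReader; names and data are illustrative.
 */
class TermIndexHolder {
    private final boolean lazy;   // assumed flag supplied by the owner (e.g. a segment reader)
    private String[] indexTerms;  // stand-in for the in-memory term index
    final AtomicInteger loads = new AtomicInteger(); // load counter, for illustration only

    TermIndexHolder(boolean lazy) {
        this.lazy = lazy;
        if (!lazy) {
            // Eager mode: load once, before the object is shared with other threads.
            // Assumes the holder is then safely published to those threads.
            indexTerms = loadIndex();
        }
    }

    private String[] loadIndex() {
        loads.incrementAndGet();
        return new String[] {"apple", "banana", "cherry"};
    }

    private void ensureIndexIsRead() {
        if (!lazy) {
            return; // eager mode never takes the lock
        }
        synchronized (this) { // lazy mode keeps the null check under the lock
            if (indexTerms == null) {
                indexTerms = loadIndex();
            }
        }
    }

    int indexSize() {
        ensureIndexIsRead();
        return indexTerms.length;
    }
}
```

A searcher would construct the holder with lazy set to false and pay the load cost up front, while a merge-only reader would pass true and might never load the index at all.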

-Yonik


Re: TermInfosReader lazy term index reading

Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> What was the use-case behind loading the term index lazily?
> I'm having a hard time figuring out what one would do with an
> IndexReader that doesn't involve a term lookup somehow.

Index merging only iterates through terms.

Doug
