
Posted to solr-user@lucene.apache.org by Jim Murphy <ji...@pobox.com> on 2008/10/06 20:10:35 UTC

Index updates blocking readers: To Multicore or not?

We have a farm of several master-slave pairs, all managing a single very large
"logical" index sharded across the master-slaves.  We notice on the slaves,
after an rsync update, that all queries are blocked while the index is being
committed, sometimes resulting in unacceptable service times.  I'm looking at
ways we can manage these "update burps".

Question #1: Is there anything obvious I can tweak in the configuration to
mitigate these multi-second blocking updates?  Our indexes are 40GB, 20M
documents each.  Rsync updates run every 5 minutes, several hundred KB per
update.

Question #2: I'm considering setting up each slave with multiple Solr cores.
The 2 indexes per instance would be nearly identical copies but "A" would be
read from while "B" is being updated, then they would swap.  I'll have to
figure out how to rsync these 2 indexes properly but if I can get the
commits to happen to the offline index then I suspect my queries could
proceed unblocked.  
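
For reference, the two-core layout described in Question #2 might be declared
roughly as follows.  This is only a sketch, assuming the legacy solr.xml
multicore format that shipped with Solr 1.3; the core names and instance
directories are hypothetical.

```xml
<!-- solr.xml: two cores on one slave instance (names are hypothetical) -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- "live" serves queries; "staging" receives the rsync'd update -->
    <core name="live" instanceDir="coreA" />
    <core name="staging" instanceDir="coreB" />
  </cores>
</solr>
```

With a layout like this, the swap itself would presumably go through the
CoreAdmin SWAP action, e.g. a request to
/admin/cores?action=SWAP&core=live&other=staging once the staging core has
committed and warmed.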

Is this the wrong tree to be barking up?  Any other thoughts? 

Thanks in advance,

Jim



-- 
View this message in context: http://www.nabble.com/Index-updates-blocking-readers%3A-To-Multicore-or-not--tp19843098p19843098.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Index updates blocking readers: To Multicore or not?

Posted by Jim Murphy <ji...@pobox.com>.

We shred the RSS into individual items, then create Solr XML documents to
insert.  Solr is an easy choice for us over straight Lucene since it adds
the server infrastructure that we would mostly be writing ourselves - caching,
data types, master/slave replication.

We looked at Nutch too - but that was before my time.

Jim




Re: Index updates blocking readers: To Multicore or not?

Posted by John Martyniak <jo...@beforedawn.com>.

Thank you, that is good information, as that is kind of the way that I am
leaning.

So when you fetch the content from RSS, does that get rendered to an
XML document that Solr indexes?

Also, what were a couple of the decision points for using Solr as opposed
to using Nutch, or even straight Lucene?

-John

On Oct 22, 2008, at 11:22 AM, Jim Murphy wrote:

> We index RSS content using our own homegrown distributed spiders - not
> using Nutch.  We use Ruby processes to do the feed fetching and XML
> shredding, and Amazon SQS to queue up work packets to insert into our
> Solr cluster.
>
> Sorry can't be of more help.

Re: Index updates blocking readers: To Multicore or not?

Posted by Jim Murphy <ji...@pobox.com>.

We index RSS content using our own homegrown distributed spiders - not using
Nutch.  We use Ruby processes to do the feed fetching and XML shredding, and
Amazon SQS to queue up work packets to insert into our Solr cluster.

Sorry can't be of more help.


Re: Index updates blocking readers: To Multicore or not?

Posted by John Martyniak <jo...@beforedawn.com>.

Jim,

This is an off-topic question.

But for your 30M documents, did you fetch those from external web  
sites (Whole Web Search)?  Or are they internal documents?  If they  
are external what method did you use to fetch them and which spider?

I am in the process of deciding between using Nutch for whole web  
indexing, Solr + Spider?, or Nutch + Solr, etc.

Thank you in advance for your insight into this issue.

-John

On Oct 22, 2008, at 10:55 AM, Jim Murphy wrote:

> Thanks Yonik,
>
> I have more information...
>
> 1. We do indeed have large indexes: 40GB on disk, 30M documents - and this
> is just a test server; we have 8 of these in parallel.
>
> 2. The performance problem I was seeing followed replication, on the first
> query against a new searcher.  It turns out we hadn't configured our index
> warming queries very well, so we replaced the various "solr rocks" type
> queries with one that was better suited to our data - and saw no
> improvement.  The problem was that replication completed and a new searcher
> was created and registered, but the first query would take 10-20 seconds to
> complete.  Thereafter it took <200 milliseconds for similar non-cached
> queries.
>
> The profiler showed that building the FieldSortedHitQueue was taking all
> the time.  Our warming query did not include a sort, but our queries
> commonly do.  Once we added the sort parameter, our warming query started
> taking the 10-20 seconds prior to registering the searcher.  After that,
> the first query on the new searcher took the expected 200ms.
>
> LESSON LEARNED: warm your caches! And if a sort is involved in your
> queries, incorporate that sort in your warming query!  Add a warming query
> for each kind of sort that you expect to do.

Re: Index updates blocking readers: To Multicore or not?

Posted by Jim Murphy <ji...@pobox.com>.

Thanks Yonik, 

I have more information...

1. We do indeed have large indexes: 40GB on disk, 30M documents - and this
is just a test server; we have 8 of these in parallel.

2. The performance problem I was seeing followed replication, on the first
query against a new searcher.  It turns out we hadn't configured our index
warming queries very well, so we replaced the various "solr rocks" type
queries with one that was better suited to our data - and saw no
improvement.  The problem was that replication completed and a new searcher
was created and registered, but the first query would take 10-20 seconds to
complete.  Thereafter it took <200 milliseconds for similar non-cached
queries.

The profiler showed that building the FieldSortedHitQueue was taking all the
time.  Our warming query did not include a sort, but our queries commonly
do.  Once we added the sort parameter, our warming query started taking the
10-20 seconds prior to registering the searcher.  After that, the first
query on the new searcher took the expected 200ms.

LESSON LEARNED: warm your caches! And if a sort is involved in your queries,
incorporate that sort in your warming query!  Add a warming query for each
kind of sort that you expect to do.
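
To illustrate the lesson above, a warming setup along these lines could be
expressed in solrconfig.xml roughly as follows.  This is a minimal sketch, not
the configuration actually used here: the query term and the sort field names
(pub_date, title_sort) are hypothetical placeholders, and each sort that
production queries use would get its own entry.

```xml
<!-- solrconfig.xml, inside the <query> section.
     Query text and field names below are hypothetical. -->

<!-- Runs against every new searcher (e.g. after replication and commit)
     before it is registered and starts serving queries. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- One warming query per kind of sort used by real traffic -->
    <lst>
      <str name="q">rss</str>
      <str name="sort">pub_date desc</str>
      <str name="rows">10</str>
    </lst>
    <lst>
      <str name="q">rss</str>
      <str name="sort">title_sort asc</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>

<!-- Same idea for the very first searcher opened at startup. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">rss</str>
      <str name="sort">pub_date desc</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>

<!-- If no searcher is registered yet, make requests wait for warming
     to finish instead of hitting a cold searcher. -->
<useColdSearcher>false</useColdSearcher>
```

The point of the sorted entries is that the per-field sort data gets loaded
during warming, so the 10-20 second cost is paid before the searcher is
registered rather than by the first user query.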


Re: Index updates blocking readers: To Multicore or not?

Posted by Yonik Seeley <yo...@apache.org>.

On Mon, Oct 6, 2008 at 2:10 PM, Jim Murphy <ji...@pobox.com> wrote:
> We have a farm of several Master-Slave pairs all managing a single very large
> "logical" index sharded across the master-slaves.  We notice on the slaves,
> after an rsync update, as the index is being committed that all queries are
> blocked sometimes resulting in unacceptable service times.  I'm looking at
> ways we can manage these "update burps".

Updates should never block queries.
What version of Solr are you using?
Is it possible that your indexes are so big, opening a new index in
the background causes enough of the old index to be flushed from OS
cache, causing big slowdowns?

-Yonik

