Posted to solr-user@lucene.apache.org by "scott.chu" <sc...@udngroup.com> on 2016/06/07 02:00:42 UTC

Concern about large amount of daily updates

We recently plan to replace an old-school Lucene setup that holds 50M docs with SolrCloud, but the daily update volume, according to the responsible colleague, could be around 100 thousand docs. Its data source is a bunch of MySQL tables. When implementing the updating workflow, what should I do so that updating docs takes a reasonable amount of time? Currently what I have in mind is:

1. Use atomic updates to avoid unnecessary full-document updates.
2. Run multiple instances of my updating process, each updating a different range of docs (sketched below).

Are there other things I can do to help with this? Are there any suggestions or experiences for preparing appropriate h/w, e.g. CPU or RAM?
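
For item 2, a minimal sketch of what I have in mind, assuming our MySQL tables have a numeric primary key; indexRange is a hypothetical helper that would read its slice of ids from MySQL and post the docs to Solr:

// Sketch only: splitting an id range across worker threads (approach 2).
// The range and worker count are assumptions; indexRange is hypothetical.
import java.util.concurrent.*;

public class ParallelUpdate {
    public static void main(String[] args) throws Exception {
        long minId = 0, maxId = 100_000;  // the day's changed-id range (assumption)
        int workers = 4;
        long slice = (maxId - minId) / workers + 1;

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final long lo = minId + w * slice;
            final long hi = Math.min(lo + slice, maxId + 1);
            pool.submit(() -> indexRange(lo, hi));  // each worker owns one slice
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void indexRange(long lo, long hi) {
        // hypothetical: SELECT ... WHERE id >= lo AND id < hi, then post to Solr
    }
}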

scott.chu,scott.chu@udngroup.com
2016/6/7 (Tue)

Re(2): Concern about large amount of daily updates

Posted by "scott.chu" <sc...@udngroup.com>.
Thanks! I thought it was a fairly large amount. Looks like I need more experience with SolrCloud. Yes, I also believe reading from MySQL could be a bottleneck in the workflow. I'll pass your suggestions to my colleagues. Thanks again!


scott.chu,scott.chu@udngroup.com
2016/6/8 (Wed)
----- Original Message ----- 
From: Erick Erickson 
To: solr-user ; scott (self) 
CC: 
Date: 2016/6/8 (Wed) 09:37
Subject: Re: Concern about large amount of daily updates


Atomic updates are really a full document re-index under the covers. What happens is that the stored fields are all read from disk, your updates are overlaid, and the entire document is re-indexed. From Solr's perspective, this is probably actually _more_ work than just having the document resent completely. 

100K documents each _day_ is a pretty small update load, actually. Indexing Wiki docs on my laptop, I can get 3-4 _thousand_ docs a second, so a day's worth of your updates is roughly half a minute of pure indexing time. 

I believe you should start by simply re-indexing entire documents and go to more complex solutions if (and only if) that performs poorly. My bet is that you'll wind up spending more time getting the documents from your system of record, and Solr will hardly notice the indexing load. 

Much depends on how you index, of course. If you're sending docs for the ExtractingRequestHandler to process, you'll be putting the extraction load on your servers; you might consider moving that processing to a separate client. 

Best, 
Erick 

On Mon, Jun 6, 2016 at 7:00 PM, scott.chu <sc...@udngroup.com> wrote: 
> 
> We recently plan to replace an old-school Lucene setup that holds 50M docs with SolrCloud, but the daily update volume, according to the responsible colleague, could be around 100 thousand docs. Its data source is a bunch of MySQL tables. When implementing the updating workflow, what should I do so that updating docs takes a reasonable amount of time? Currently what I have in mind is: 
> 
> 1. Use atomic updates to avoid unnecessary full-document updates. 
> 2. Run multiple instances of my updating process, each updating a different range of docs (sketched below). 
> 
> Are there other things I can do to help with this? Are there any suggestions or experiences for preparing appropriate h/w, e.g. CPU or RAM? 
> 
> scott.chu,scott.chu@udngroup.com 
> 2016/6/7 (Tue) 



Re: Concern about large amount of daily updates

Posted by Erick Erickson <er...@gmail.com>.
Atomic updates are really a full document re-index under the covers. What happens is that the stored fields are all read from disk, your updates are overlaid, and the entire document is re-indexed. From Solr's perspective, this is probably actually _more_ work than just having the document resent completely.
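
A minimal SolrJ sketch of the two update styles, just to make the contrast concrete (the URL and field names are illustrative, not from your setup):

// Sketch only: full-document update vs. atomic update in SolrJ.
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UpdateStyles {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/news").build()) {
            // Full-document update: the client sends every field.
            SolrInputDocument full = new SolrInputDocument();
            full.addField("id", "doc-1");
            full.addField("title", "New title");
            full.addField("body", "New body ...");
            solr.add(full);

            // Atomic update: the client sends only the changed field, but Solr
            // still reads all other stored fields from disk and re-indexes the
            // whole document (all fields must be stored for this to work).
            SolrInputDocument partial = new SolrInputDocument();
            partial.addField("id", "doc-1");
            partial.addField("title", Collections.singletonMap("set", "New title"));
            solr.add(partial);
        }
    }
}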

100K documents each _day_ is a pretty small update load, actually. Indexing Wiki docs on my laptop, I can get 3-4 _thousand_ docs a second, so a day's worth of your updates is roughly half a minute of pure indexing time.

I believe you should start by simply re-indexing entire documents and go to more complex solutions if (and only if) that performs poorly. My bet is that you'll wind up spending more time getting the documents from your system of record, and Solr will hardly notice the indexing load.
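
A minimal sketch of that plain full-document path, assuming a hypothetical articles table and an illustrative Solr URL; batching the adds and letting commitWithin handle commits keeps the Solr-side overhead low:

// Sketch only: re-sending whole documents from MySQL in batches.
// Table, field names, and URLs are assumptions, not the poster's real schema.
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FullReindex {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                 "jdbc:mysql://dbhost/newsdb", "user", "pass");
             SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/news").build();
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT id, title, body FROM articles WHERE updated_at >= CURDATE()")) {
            List<SolrInputDocument> batch = new ArrayList<>();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getLong("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("body", rs.getString("body"));
                batch.add(doc);
                if (batch.size() == 1000) {      // batch to cut HTTP overhead
                    solr.add(batch, 60_000);     // commitWithin 60 seconds
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) solr.add(batch, 60_000);
        }
    }
}

The commitWithin of 60 seconds avoids a hard commit per batch; tune it to how fresh the index needs to be.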

Much depends on how you index, of course. If you're sending docs for the ExtractingRequestHandler to process, you'll be putting the extraction load on your servers; you might consider moving that processing to a separate client.
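
A minimal sketch of doing the extraction on a separate client with Tika, so Solr only ever sees plain fields (file handling, URL, and field names are illustrative):

// Sketch only: client-side Tika extraction instead of ExtractingRequestHandler.
import java.io.InputStream;
import java.nio.file.*;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ClientSideExtract {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        BodyContentHandler text = new BodyContentHandler(-1); // no size limit
        Metadata meta = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            new AutoDetectParser().parse(in, text, meta); // Tika runs here, not on Solr
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getFileName().toString()); // illustrative id
        doc.addField("body", text.toString());             // extracted plain text
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/news").build()) {
            solr.add(doc); // Solr only indexes fields; extraction cost stayed client-side
        }
    }
}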

Best,
Erick

On Mon, Jun 6, 2016 at 7:00 PM, scott.chu <sc...@udngroup.com> wrote:
>
> We recently plan to replace an old-school Lucene setup that holds 50M docs with SolrCloud, but the daily update volume, according to the responsible colleague, could be around 100 thousand docs. Its data source is a bunch of MySQL tables. When implementing the updating workflow, what should I do so that updating docs takes a reasonable amount of time? Currently what I have in mind is:
>
> 1. Use atomic updates to avoid unnecessary full-document updates.
> 2. Run multiple instances of my updating process, each updating a different range of docs (sketched below).
>
> Are there other things I can do to help with this? Are there any suggestions or experiences for preparing appropriate h/w, e.g. CPU or RAM?
>
> scott.chu,scott.chu@udngroup.com
> 2016/6/7 (Tue)