Posted to solr-user@lucene.apache.org by Charles Wardell <ch...@bcsolution.com> on 2011/04/26 20:32:29 UTC

Question on Batch process

I am sure that this question has been asked a few times, but I can't seem to find the sweet spot for indexing.

I have about 100,000 files, each containing 1,000 XML documents, ready to be posted to Solr. My desire is to have it index as quickly as possible; once that completes, the daily stream of adds will be small in comparison.

The individual documents are small: essentially web postings from the net. Title, postPostContent, date. 

What would be the ideal configuration for ramBufferSizeMB, mergeFactor, maxBufferedDocs, etc.?

My machine is a quad-core with hyper-threading, so it shows up as 8 CPUs in top.
I have 16 GB of available RAM.


Thanks in advance.
Charlie
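
For context, the settings asked about here live in solrconfig.xml (the <indexDefaults> section in the Solr 1.4/3.x layout). A sketch using the values suggested later in this thread; treat them as a hypothetical starting point, not a recommendation:

```xml
<!-- solrconfig.xml fragment (Solr 1.4/3.x layout); values are a starting point only -->
<indexDefaults>
  <!-- Flush in-memory documents to a new segment once the RAM buffer hits 512 MB -->
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <!-- 10 is the default; raising it (20, 30) defers merges but needs a higher ulimit -n -->
  <mergeFactor>10</mergeFactor>
  <!-- maxBufferedDocs is ignored when ramBufferSizeMB is set; leave it out -->
  <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
</indexDefaults>
```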

Re: Question on Batch process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Charles,

Maybe the question to ask is why you are committing at all?  Do you need 
somebody to see index changes while you are indexing?  If not, commit just at 
the end.  And optimize if you won't touch the index for a while.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
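
In client terms, the advice above is: issue no commits while streaming adds, then exactly one commit (and an optional optimize) after the last document. A minimal sketch of that sequencing; the SolrClient interface here is a hypothetical stand-in for whatever actually talks to Solr (SolrJ, curl, post.sh), not a real API:

```java
import java.util.List;

public class BulkIndexer {
    /** Hypothetical stand-in for a real Solr client; only the call order matters here. */
    interface SolrClient {
        void add(String doc);
        void commit();
        void optimize();
    }

    /** Send every document with no intermediate commits, then commit once at the end. */
    static void indexAll(SolrClient solr, List<List<String>> batches) {
        for (List<String> batch : batches) {
            for (String doc : batch) {
                solr.add(doc);   // no commit here; nobody needs to search mid-load
            }
        }
        solr.commit();           // single commit after the last add
        solr.optimize();         // only worthwhile if the index won't change for a while
    }
}
```

The same sequencing applies however the documents are sent: the expensive part to avoid is a commit per batch.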



----- Original Message ----
> From: Charles Wardell <ch...@bcsolution.com>
> To: solr-user@lucene.apache.org
> Sent: Wed, April 27, 2011 7:51:20 PM
> Subject: Re: Question on Batch process
> 
> Thank you for your response. I did not make the StreamingUpdate application
> yet, but I did change the other settings that you mentioned. It gave me a
> huge boost in indexing speed. (I am still using post.sh but hope to change
> that soon).
> 
> One thing I noticed is the indexing speed was incredibly fast last night,
> but today the commits are taking so long. Is this to be expected?
> 
> -- 
> Best Regards,
> 
> Charles Wardell
> Blue Chips Technology, Inc.
> www.bcsolution.com
> 
> On Wednesday, April 27, 2011 at 6:15 PM, Otis Gospodnetic wrote: 
> > Hi Charles,
> > 
> > Yes, the threads I was referring to are in the context of the
> > client/indexer, so one of the params for StreamingUpdateSolrServer.
> > post.sh/jar are just there because they are handy. Don't use them for
> > production.
> > 
> > It's impossible to tell how long indexing of 100M documents may take.
> > They could be very big or very small. You could perform very light or no
> > analysis or heavy analysis. They could contain 1 or 100 fields. :)
> > 
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> > 
> > ----- Original Message ----
> > > From: Charles Wardell <ch...@bcsolution.com>
> > > To: solr-user@lucene.apache.org
> > > Sent: Tue, April 26, 2011 8:01:28 PM
> > > Subject: Re: Question on Batch process
> > > 
> > > Thank you Otis.
> > > Without trying to appear too stupid: when you refer to having the
> > > params matching your # of CPU cores, you are talking about the # of
> > > threads I can spawn with the StreamingUpdateSolrServer object?
> > > Up until now, I have been just utilizing post.sh or post.jar. Are these
> > > capable of that or do I need to write some code to collect a bunch of
> > > files into the buffer and send it off?
> > > 
> > > Also, do you have a sense for how long it should take to index 100,000
> > > files, or in my case 100,000,000 documents?
> > > StreamingUpdateSolrServer
> > > public StreamingUpdateSolrServer(String solrServerUrl, int queueSize,
> > > int threadCount) throws MalformedURLException
> > > 
> > > Thanks again,
> > > Charlie
> > > 
> > > -- 
> > > Best Regards,
> > > 
> > > Charles Wardell
> > > Blue Chips Technology, Inc.
> > > www.bcsolution.com
> > > 
> > > On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
> > > > Charlie,
> > > > 
> > > > How's this:
> > > > * -Xmx2g
> > > > * ramBufferSizeMB 512
> > > > * mergeFactor 10 (default, but you could up it to 20, 30, if
> > > > ulimit -n allows)
> > > > * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> > > > * use StreamingUpdateSolrServer (with params matching your number of
> > > > CPU cores) or send batches of say 1000 docs with the other SolrServer
> > > > impl using N threads (N=# of your CPU cores)
> > > > 
> > > > Otis
> > > > ----
> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > Lucene ecosystem search :: http://search-lucene.com/
> > > > 
> > > > ----- Original Message ----
> > > > > From: Charles Wardell <ch...@bcsolution.com>
> > > > > To: solr-user@lucene.apache.org
> > > > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > > > Subject: Question on Batch process
> > > > > 
> > > > > I am sure that this question has been asked a few times, but I
> > > > > can't seem to find the sweet spot for indexing.
> > > > > 
> > > > > I have about 100,000 files, each containing 1,000 XML documents,
> > > > > ready to be posted to Solr. My desire is to have it index as
> > > > > quickly as possible; once that completes, the daily stream of adds
> > > > > will be small in comparison.
> > > > > 
> > > > > The individual documents are small: essentially web postings from
> > > > > the net. Title, postPostContent, date.
> > > > > 
> > > > > What would be the ideal configuration for ramBufferSizeMB,
> > > > > mergeFactor, maxBufferedDocs, etc.?
> > > > > 
> > > > > My machine is a quad-core with hyper-threading, so it shows up as
> > > > > 8 CPUs in top.
> > > > > I have 16 GB of available RAM.
> > > > > 
> > > > > Thanks in advance.
> > > > > Charlie
> > 
> 

Re: Question on Batch process

Posted by Charles Wardell <ch...@bcsolution.com>.
Thank you for your response. I did not make the StreamingUpdate application yet, but I did change the other settings that you mentioned. It gave me a huge boost in indexing speed. (I am still using post.sh but hope to change that soon).

One thing I noticed is the indexing speed was incredibly fast last night, but today the commits are taking so long. Is this to be expected?



-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Wednesday, April 27, 2011 at 6:15 PM, Otis Gospodnetic wrote: 
> Hi Charles,
> 
> Yes, the threads I was referring to are in the context of the client/indexer, so 
> one of the params for StreamingUpdateSolrServer.
> post.sh/jar are just there because they are handy. Don't use them for 
> production.
> 
> It's impossible to tell how long indexing of 100M documents may take. They 
> could be very big or very small. You could perform very light or no analysis or 
> heavy analysis. They could contain 1 or 100 fields. :)
> 
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
> > From: Charles Wardell <ch...@bcsolution.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, April 26, 2011 8:01:28 PM
> > Subject: Re: Question on Batch process
> > 
> > Thank you Otis.
> > Without trying to appear too stupid: when you refer to having the params 
> > matching your # of CPU cores, you are talking about the # of threads I can 
> > spawn with the StreamingUpdateSolrServer object?
> > Up until now, I have been just utilizing post.sh or post.jar. Are these 
> > capable of that or do I need to write some code to collect a bunch of files 
> > into the buffer and send it off?
> > 
> > Also, do you have a sense for how long it should take to index 100,000 files 
> > or in my case 100,000,000 documents?
> > StreamingUpdateSolrServer
> > public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int 
> > threadCount) throws MalformedURLException
> > 
> > Thanks again,
> > Charlie
> > 
> > -- 
> > Best Regards,
> > 
> > Charles Wardell
> > Blue Chips Technology, Inc.
> > www.bcsolution.com
> > 
> > On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
> > > Charlie,
> > > 
> > > How's this:
> > > * -Xmx2g
> > > * ramBufferSizeMB 512
> > > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n
> > > allows)
> > > * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> > > * use StreamingUpdateSolrServer (with params matching your number of CPU
> > > cores) or send batches of say 1000 docs with the other SolrServer impl
> > > using N threads (N=# of your CPU cores)
> > > 
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > > 
> > > ----- Original Message ----
> > > > From: Charles Wardell <ch...@bcsolution.com>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > > Subject: Question on Batch process
> > > > 
> > > > I am sure that this question has been asked a few times, but I can't
> > > > seem to find the sweet spot for indexing.
> > > > 
> > > > I have about 100,000 files, each containing 1,000 XML documents,
> > > > ready to be posted to Solr. My desire is to have it index as quickly
> > > > as possible; once that completes, the daily stream of adds will be
> > > > small in comparison.
> > > > 
> > > > The individual documents are small: essentially web postings from the
> > > > net. Title, postPostContent, date.
> > > > 
> > > > What would be the ideal configuration for ramBufferSizeMB,
> > > > mergeFactor, maxBufferedDocs, etc.?
> > > > 
> > > > My machine is a quad-core with hyper-threading, so it shows up as 8
> > > > CPUs in top.
> > > > I have 16 GB of available RAM.
> > > > 
> > > > Thanks in advance.
> > > > Charlie
> 

Re: Question on Batch process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Charles,

Yes, the threads I was referring to are in the context of the client/indexer, so 
one of the params for StreamingUpdateSolrServer.
post.sh/jar are just there because they are handy.  Don't use them for 
production.

It's impossible to tell how long indexing of 100M documents may take.  They 
could be very big or very small.  You could perform very light or no analysis or 
heavy analysis.  They could contain 1 or 100 fields. :)

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Charles Wardell <ch...@bcsolution.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, April 26, 2011 8:01:28 PM
> Subject: Re: Question on Batch process
> 
> Thank you Otis.
> Without trying to appear too stupid: when you refer to having the params
> matching your # of CPU cores, you are talking about the # of threads I can
> spawn with the StreamingUpdateSolrServer object?
> Up until now, I have been just utilizing post.sh or post.jar. Are these
> capable of that or do I need to write some code to collect a bunch of files
> into the buffer and send it off?
> 
> Also, do you have a sense for how long it should take to index 100,000
> files, or in my case 100,000,000 documents?
> StreamingUpdateSolrServer
> public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int
> threadCount) throws MalformedURLException
> 
> Thanks again,
> Charlie
> 
> -- 
> Best Regards,
> 
> Charles Wardell
> Blue Chips Technology, Inc.
> www.bcsolution.com
> 
> On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
> > Charlie,
> > 
> > How's this:
> > * -Xmx2g
> > * ramBufferSizeMB 512
> > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n
> > allows)
> > * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> > * use StreamingUpdateSolrServer (with params matching your number of CPU
> > cores) or send batches of say 1000 docs with the other SolrServer impl
> > using N threads (N=# of your CPU cores)
> > 
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> > 
> > ----- Original Message ----
> > > From: Charles Wardell <ch...@bcsolution.com>
> > > To: solr-user@lucene.apache.org
> > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > Subject: Question on Batch process
> > > 
> > > I am sure that this question has been asked a few times, but I can't
> > > seem to find the sweet spot for indexing.
> > > 
> > > I have about 100,000 files, each containing 1,000 XML documents, ready
> > > to be posted to Solr. My desire is to have it index as quickly as
> > > possible; once that completes, the daily stream of adds will be small
> > > in comparison.
> > > 
> > > The individual documents are small: essentially web postings from the
> > > net. Title, postPostContent, date.
> > > 
> > > What would be the ideal configuration for ramBufferSizeMB, mergeFactor,
> > > maxBufferedDocs, etc.?
> > > 
> > > My machine is a quad-core with hyper-threading, so it shows up as 8
> > > CPUs in top.
> > > I have 16 GB of available RAM.
> > > 
> > > Thanks in advance.
> > > Charlie
> > 
> 

Re: Question on Batch process

Posted by Charles Wardell <ch...@bcsolution.com>.
Thank you Otis.
Without trying to appear too stupid: when you refer to having the params matching your # of CPU cores, you are talking about the # of threads I can spawn with the StreamingUpdateSolrServer object?
Up until now, I have been just utilizing post.sh or post.jar. Are these capable of that or do I need to write some code to collect a bunch of files into the buffer and send it off?

Also, do you have a sense for how long it should take to index 100,000 files, or in my case 100,000,000 documents?
StreamingUpdateSolrServer
public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount) throws MalformedURLException

Thanks again,
Charlie

-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
> Charlie,
> 
> How's this:
> * -Xmx2g
> * ramBufferSizeMB 512
> * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
> * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> * use StreamingUpdateSolrServer (with params matching your number of CPU cores) 
> or send batches of say 1000 docs with the other SolrServer impl using N threads 
> (N=# of your CPU cores)
> 
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
> > From: Charles Wardell <ch...@bcsolution.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, April 26, 2011 2:32:29 PM
> > Subject: Question on Batch process
> > 
> > I am sure that this question has been asked a few times, but I can't seem to 
> > find the sweetspot for indexing.
> > 
> > I have about 100,000 files each containing 1,000 xml documents ready to be 
> > posted to Solr. My desire is to have it index as quickly as possible and then 
> > once completed the daily stream of ADDs will be small in comparison.
> > 
> > The individual documents are small. Essentially web postings from the net. 
> > Title, postPostContent, date. 
> > 
> > 
> > What would be the ideal configuration? For RamBufferSize, mergeFactor, 
> > MaxbufferedDocs, etc..
> > 
> > My machine is a quad core hyper-threaded. So it shows up as 8 cpu's in TOP
> > I have 16GB of available ram.
> > 
> > 
> > Thanks in advance.
> > Charlie
> 
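
For what it's worth, the queueSize/threadCount pair in the constructor Charles quotes reflects how StreamingUpdateSolrServer is broadly described as working: adds go onto a bounded queue, and threadCount runner threads drain it and stream documents to Solr. A rough pure-JDK sketch of that queue-plus-workers shape, with the actual HTTP send replaced by a hypothetical callback (this is an illustration of the pattern, not the real class's code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class MiniStreamingUpdater {
    private static final String POISON = "\u0000EOF";  // shuts one runner down

    private final BlockingQueue<String> queue;
    private final Thread[] runners;

    /** queueSize/threadCount mirror the StreamingUpdateSolrServer constructor args. */
    MiniStreamingUpdater(int queueSize, int threadCount, Consumer<String> send) {
        queue = new ArrayBlockingQueue<>(queueSize);
        runners = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            runners[i] = new Thread(() -> {
                try {
                    while (true) {
                        String doc = queue.take();  // block until a doc is available
                        if (doc.equals(POISON)) break;
                        send.accept(doc);           // stand-in for the HTTP POST to Solr
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            runners[i].start();
        }
    }

    void add(String doc) throws InterruptedException {
        queue.put(doc);  // blocks (back-pressure) when the queue is full
    }

    void shutdown() throws InterruptedException {
        for (int i = 0; i < runners.length; i++) queue.put(POISON);  // one pill per runner
        for (Thread t : runners) t.join();
    }
}
```

This is why matching threadCount to CPU cores matters: each runner keeps one connection busy, and the bounded queue throttles the producer when the runners fall behind.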

Re: Question on Batch process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Charlie,

How's this:
* -Xmx2g
* ramBufferSizeMB 512
* mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
* ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
* use StreamingUpdateSolrServer (with params matching your number of CPU cores) 
or send batches of say 1000 docs with the other SolrServer impl using N threads 
(N=# of your CPU cores)

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Charles Wardell <ch...@bcsolution.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, April 26, 2011 2:32:29 PM
> Subject: Question on Batch process
> 
> I am sure that this question has been asked a few times, but I can't seem
> to find the sweet spot for indexing.
> 
> I have about 100,000 files, each containing 1,000 XML documents, ready to
> be posted to Solr. My desire is to have it index as quickly as possible;
> once that completes, the daily stream of adds will be small in comparison.
> 
> The individual documents are small: essentially web postings from the net.
> Title, postPostContent, date.
> 
> What would be the ideal configuration for ramBufferSizeMB, mergeFactor,
> maxBufferedDocs, etc.?
> 
> My machine is a quad-core with hyper-threading, so it shows up as 8 CPUs
> in top.
> I have 16 GB of available RAM.
> 
> Thanks in advance.
> Charlie
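
Otis's alternative, batches of say 1000 docs fanned out over N = number-of-cores threads, needs a small batching step on the client side. A sketch of just that logic in plain JDK Java; sendBatch is a hypothetical placeholder for whatever call actually posts a batch to Solr:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class Batcher {
    /** Split docs into consecutive batches of at most batchSize (e.g. 1000). */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    /** Fan the batches out over nThreads workers (N = # of CPU cores per the advice above). */
    static <T> void indexInParallel(List<List<T>> batches, int nThreads,
                                    Consumer<List<T>> sendBatch) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (List<T> batch : batches) {
            pool.submit(() -> sendBatch.accept(batch));  // stand-in for posting one batch
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);  // wait for all batches to finish
    }
}
```

On Charlie's 8-CPU box this would mean nThreads = 8, with one commit issued after indexInParallel returns rather than per batch.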