Posted to solr-user@lucene.apache.org by I-Chiang Chen <ic...@gmail.com> on 2012/03/22 02:37:48 UTC

Commit Strategy for SolrCloud when Talking about 200 million records.

We are currently experimenting with SolrCloud functionality in Solr 4.0.
The goal is to see if Solr 4.0 trunk in its current state is able to
handle roughly 200 million documents. The documents are not big: around 40
fields, no more than a KB each, and most of the fields are empty the
majority of the time.

The setup we have is 4 servers with 2 shards and 2 servers per shard. We are
running in Tomcat.

The questions are: given the approximate data volume, is it realistic to
expect the above setup to handle it? And given the number of documents,
should we commit every x documents or rely on auto commits?

-- 
-IC

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

Posted by Mark Miller <ma...@gmail.com>.
On Mar 23, 2012, at 12:49 PM, I-Chiang Chen wrote:

> Caused by: java.lang.OutOfMemoryError: Map failed

Hmm...looks like this is the key info here. 

- Mark Miller
lucidimagination.com
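
The stack traces later in the thread show the error coming out of MMapDirectory
when sun.nio.ch.FileChannelImpl.map() fails. On Linux that usually means the
process has run out of virtual address space (32-bit JVM) or has hit the
kernel's vm.max_map_count limit. A rough, Linux-only sketch for checking the
second possibility (class name and usage are hypothetical, not from the
thread; pass the Tomcat/Solr process id as the argument):

import java.io.BufferedReader;
import java.io.FileReader;

// Rough Linux-only diagnostic: compare how many memory mappings a process
// currently holds against the kernel's vm.max_map_count limit. MMapDirectory
// throws "Map failed" when FileChannel.map() can no longer create mappings.
public class MmapDiagnostic {

    private static int countLines(String path) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the Tomcat/Solr pid as the first argument; "self" means this JVM.
        String pid = args.length > 0 ? args[0] : "self";

        int mappings = countLines("/proc/" + pid + "/maps");

        BufferedReader limit = new BufferedReader(
                new FileReader("/proc/sys/vm/max_map_count"));
        String maxMapCount;
        try {
            maxMapCount = limit.readLine();
        } finally {
            limit.close();
        }

        System.out.println("current mappings: " + mappings);
        System.out.println("vm.max_map_count: " + maxMapCount);
    }
}

If the mapping count is close to the limit, raising vm.max_map_count via
sysctl, running a 64-bit JVM with more address space, or configuring a
non-mmap directory implementation are the commonly suggested workarounds.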












Re: Commit Strategy for SolrCloud when Talking about 200 million records.

Posted by I-Chiang Chen <ic...@gmail.com>.
We saw a couple of distinct errors; all machines in a shard are identical:

-On the leader of the shard
Mar 21, 2012 1:58:34 AM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode:
http://blah.blah.net:8983/solr/master2-slave1/:org.apache.solr.common.SolrException:
Map failed
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:488)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:319)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:300)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

followed by

Mar 21, 2012 1:58:52 AM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode:
http://blah.blah.net:8983/solr/master2-slave1/:org.apache.solr.common.SolrException:
java.io.IOException: Map failed
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:488)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:319)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:300)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

followed by

Mar 21, 2012 1:58:55 AM
org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
INFO: Could not tell a replica to recover
org.apache.solr.client.solrj.SolrServerException:
http://blah.blah.net:8983/solr
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:496)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:347)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:816)
at
org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:176)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:433)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:426)
... 21 more

followed by

Mar 21, 2012 3:56:11 AM org.apache.solr.common.SolrException log
SEVERE: SnapPull failed :org.apache.solr.common.SolrException: Index fetch
failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:361)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:298)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:138)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:336)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:208)
Caused by: java.io.FileNotFoundException:
/opt/apps/solrcloud/solr/data/index/_g0o.per (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:219)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.<init>(PerFieldPostingsFormat.java:262)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.<init>(PerFieldPostingsFormat.java:186)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:186)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:256)
at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:51)
at
org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReader(IndexWriter.java:494)
at
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:214)
at
org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2940)
at
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2931)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2904)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2873)
at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1105)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1069)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1033)
at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:128)
at
org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:60)
at
org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:473)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:499)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
... 4 more

followed by

Mar 21, 2012 3:56:11 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
SEVERE: Error while trying to recover
org.apache.solr.common.SolrException: Replication for recovery failed.
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:141)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:336)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:208)
Mar 21, 2012 3:56:11 AM org.apache.solr.update.UpdateLog dropBufferedUpdates
INFO: Dropping buffered updates FSUpdateLog{state=BUFFERING,
tlog=tlog{file=/opt/apps/solrcloud/solr/data/tlog/tlog.0000000000000001284
refcount=1}}
Mar 21, 2012 3:56:11 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
SEVERE: Recovery failed - trying again...
Mar 21, 2012 3:56:11 AM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: [master2] webapp=/solr path=/update params={wt=javabin&version=2} {}
0 1
Mar 21, 2012 3:56:11 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is
closed
at
org.apache.lucene.index.DocumentsWriter.ensureOpen(DocumentsWriter.java:195)
at
org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:280)
at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:361)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1533)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1505)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:175)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:56)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:358)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:455)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:261)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:97)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:135)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:433)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)

-On the non-leader machine of the shard
Mar 21, 2012 1:56:40 AM org.apache.solr.common.SolrException log
SEVERE: auto commit error...:org.apache.solr.common.SolrException: Error
opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1154)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:427)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
at
org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:293)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
at
org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:115)
at
org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsProducer(Lucene40PostingsFormat.java:84)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.visitOneFormat(PerFieldPostingsFormat.java:189)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.<init>(PerFieldPostingsFormat.java:280)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.<init>(PerFieldPostingsFormat.java:186)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:186)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:256)
at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:51)
at
org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReader(IndexWriter.java:494)
at
org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReadOnlyClone(IndexWriter.java:566)
at
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:95)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:366)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:258)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:243)
at
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:245)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1107)
... 10 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
... 29 more

On Thu, Mar 22, 2012 at 8:44 PM, I-Chiang Chen <ic...@gmail.com> wrote:

> At this time we are not leveraging the NRT functionality. This is the
> initial data load process where the idea is to just add all 200 million
> records first, then do a single commit at the end to make them searchable.
> We actually disabled auto commit at this time.
>
> We have tried leaving auto commit enabled during the initial data load
> process and ran into multiple issues that led to a botched loading process.
>
> On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller <ma...@gmail.com> wrote:
>
>>
>> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
>>
>> > We are currently experimenting with SolrCloud functionality in Solr 4.0.
>> > The goal is to see if Solr 4.0 trunk in its current state is able to
>> > handle roughly 200 million documents. The documents are not big: around
>> > 40 fields, no more than a KB each, and most of the fields are empty the
>> > majority of the time.
>> >
>> > The setup we have is 4 servers with 2 shards and 2 servers per shard. We
>> > are running in Tomcat.
>> >
>> > The questions are: given the approximate data volume, is it realistic to
>> > expect the above setup to handle it?
>>
>> So 100 million docs per machine essentially? Totally depends on the
>> hardware and what features you are using - but def in the realm of
>> possibility.
>>
>> > And given the number of documents, should we commit every x documents
>> > or rely on auto commits?
>>
>> The number of docs shouldn't really matter here. Do you need near real
>> time search?
>>
>> You should be able to commit about as frequently as you'd like with NRT
>> (eg every 1 second if you'd like) - either using soft auto commit or
>> commitWithin.
>>
>> Then you want to do a hard commit less frequently - every minute (or more
>> or less) with openSearcher=false.
>>
>> eg
>>
>>     <autoCommit>
>>       <maxTime>15000</maxTime>
>>       <openSearcher>false</openSearcher>
>>     </autoCommit>
>>
>> >
>> > --
>> > -IC
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> -IC
>



-- 
-IC

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

Posted by Markus Jelsma <ma...@openindex.io>.
We did some tests too with many millions of documents and auto-commit enabled.
It didn't take long for the indexer to stall, and in the meantime the number of
open files exploded to over 16k, then 32k.
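
For watching that open-file growth from outside Solr, here is a small
Linux-only sketch (hypothetical class name, not from this thread; pass the
Tomcat/Solr process id as the argument) that counts the entries under
/proc/<pid>/fd:

import java.io.File;

// Linux-only sketch: count the file descriptors a process currently has open
// by listing /proc/<pid>/fd. Useful for watching the kind of open-file growth
// described above during heavy indexing with frequent commits.
public class OpenFileCount {
    public static void main(String[] args) {
        // Pass the Tomcat/Solr pid as the first argument; "self" means this JVM.
        String pid = args.length > 0 ? args[0] : "self";
        File fdDir = new File("/proc/" + pid + "/fd");
        String[] fds = fdDir.list();
        if (fds == null) {
            System.err.println("cannot read " + fdDir + " (wrong pid or not Linux?)");
            return;
        }
        System.out.println("open file descriptors: " + fds.length);
    }
}

If the count keeps climbing toward the per-process limit, raising the
open-file ulimit for the Tomcat user is the usual first step.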

On Friday 23 March 2012 12:20:15 Mark Miller wrote:
> What issues? It really shouldn't be a problem.
> 
> On Mar 22, 2012, at 11:44 PM, I-Chiang Chen <ic...@gmail.com> wrote:
> > At this time we are not leveraging the NRT functionality. This is the
> > initial data load process where the idea is to just add all 200 million
> > records first, then do a single commit at the end to make them
> > searchable. We actually disabled auto commit at this time.
> > 
> > We have tried leaving auto commit enabled during the initial data load
> > process and ran into multiple issues that led to a botched loading
> > process.
> > 
> > On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller <ma...@gmail.com> 
wrote:
> >> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
> >>> We are currently experimenting with SolrCloud functionality in Solr
> >>> 4.0. The goal is to see if Solr 4.0 trunk in its current state is
> >>> able to handle roughly 200 million documents. The documents are not
> >>> big: around 40 fields, no more than a KB each, and most of the fields
> >>> are empty the majority of the time.
> >>>
> >>> The setup we have is 4 servers with 2 shards and 2 servers per shard.
> >>> We are running in Tomcat.
> >>>
> >>> The questions are: given the approximate data volume, is it realistic
> >>> to expect the above setup to handle it?
> >> 
> >> So 100 million docs per machine essentially? Totally depends on the
> >> hardware and what features you are using - but def in the realm of
> >> possibility.
> >> 
> >>> And given the number of documents, should we commit every x documents
> >>> or rely on auto commits?
> >> 
> >> The number of docs shouldn't really matter here. Do you need near real
> >> time search?
> >> 
> >> You should be able to commit about as frequently as you'd like with NRT
> >> (eg every 1 second if you'd like) - either using soft auto commit or
> >> commitWithin.
> >> 
> >> Then you want to do a hard commit less frequently - every minute (or
> >> more or less) with openSearcher=false.
> >> 
> >> eg
> >> 
> >>    <autoCommit>
> >>    
> >>      <maxTime>15000</maxTime>
> >>      <openSearcher>false</openSearcher>
> >>    
> >>    </autoCommit>
> >>> 
> >>> --
> >>> -IC
> >> 
> >> - Mark Miller
> >> lucidimagination.com

-- 
Markus Jelsma - CTO - Openindex

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

Posted by Mark Miller <ma...@gmail.com>.
What issues? It really shouldn't be a problem. 


On Mar 22, 2012, at 11:44 PM, I-Chiang Chen <ic...@gmail.com> wrote:

> At this time we are not leveraging the NRT functionality. This is the
> initial data load process where the idea is to just add all 200 million
> records first, then do a single commit at the end to make them searchable.
> We actually disabled auto commit at this time.
> 
> We have tried leaving auto commit enabled during the initial data load
> process and ran into multiple issues that led to a botched loading process.
> 
> On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller <ma...@gmail.com> wrote:
> 
>> 
>> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
>> 
>>> We are currently experimenting with SolrCloud functionality in Solr 4.0.
>>> The goal is to see if Solr 4.0 trunk in its current state is able to
>>> handle roughly 200 million documents. The documents are not big: around
>>> 40 fields, no more than a KB each, and most of the fields are empty the
>>> majority of the time.
>>>
>>> The setup we have is 4 servers with 2 shards and 2 servers per shard. We
>>> are running in Tomcat.
>>>
>>> The questions are: given the approximate data volume, is it realistic to
>>> expect the above setup to handle it?
>> 
>> So 100 million docs per machine essentially? Totally depends on the
>> hardware and what features you are using - but def in the realm of
>> possibility.
>> 
>>> And given the number of documents, should we commit every x documents
>>> or rely on auto commits?
>> 
>> The number of docs shouldn't really matter here. Do you need near real
>> time search?
>> 
>> You should be able to commit about as frequently as you'd like with NRT
>> (eg every 1 second if you'd like) - either using soft auto commit or
>> commitWithin.
>> 
>> Then you want to do a hard commit less frequently - every minute (or more
>> or less) with openSearcher=false.
>> 
>> eg
>> 
>>    <autoCommit>
>>      <maxTime>15000</maxTime>
>>      <openSearcher>false</openSearcher>
>>    </autoCommit>
>> 
>>> 
>>> --
>>> -IC
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> -IC

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

Posted by I-Chiang Chen <ic...@gmail.com>.
At this time we are not leveraging the NRT functionality. This is the
initial data load process where the idea is to just add all 200 million
records first, then do a single commit at the end to make them searchable.
We actually disabled auto commit at this time.

We have tried leaving auto commit enabled during the initial data load
process and ran into multiple issues that led to a botched loading process.
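
For reference, a minimal SolrJ sketch of that load pattern, assuming auto
commit is disabled in solrconfig.xml and using a hypothetical node URL and
field names (error handling and the real ~40 fields omitted):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // Hypothetical node URL and core name -- point this at one of the nodes.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/master1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (long i = 0; i < 200000000L; i++) {
            // Stand-in for the real ~40 fields per document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Long.toString(i));
            doc.addField("title", "document " + i);
            batch.add(doc);

            // Ship adds in batches to keep client memory bounded; no commit yet.
            if (batch.size() == 1000) {
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }

        // Single hard commit at the very end makes everything searchable at once.
        server.commit();
    }
}

The trade-off is that nothing becomes searchable until that final commit()
runs.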

On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller <ma...@gmail.com> wrote:

>
> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
>
> > We are currently experimenting with SolrCloud functionality in Solr 4.0.
> > The goal is to see if Solr 4.0 trunk in its current state is able to
> > handle roughly 200 million documents. The documents are not big: around
> > 40 fields, no more than a KB each, and most of the fields are empty the
> > majority of the time.
> >
> > The setup we have is 4 servers with 2 shards and 2 servers per shard. We
> > are running in Tomcat.
> >
> > The questions are: given the approximate data volume, is it realistic to
> > expect the above setup to handle it?
>
> So 100 million docs per machine essentially? Totally depends on the
> hardware and what features you are using - but def in the realm of
> possibility.
>
> > And given the number of documents, should we commit every x documents
> > or rely on auto commits?
>
> The number of docs shouldn't really matter here. Do you need near real
> time search?
>
> You should be able to commit about as frequently as you'd like with NRT
> (eg every 1 second if you'd like) - either using soft auto commit or
> commitWithin.
>
> Then you want to do a hard commit less frequently - every minute (or more
> or less) with openSearcher=false.
>
> eg
>
>     <autoCommit>
>       <maxTime>15000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>
> >
> > --
> > -IC
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


-- 
-IC

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

Posted by Mark Miller <ma...@gmail.com>.
On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:

> We are currently experimenting with SolrCloud functionality in Solr 4.0.
> The goal is to see if Solr 4.0 trunk in its current state is able to
> handle roughly 200 million documents. The documents are not big: around 40
> fields, no more than a KB each, and most of the fields are empty the
> majority of the time.
> 
> The setup we have is 4 servers with 2 shards and 2 servers per shard. We are
> running in Tomcat.
> 
> The questions are: given the approximate data volume, is it realistic to
> expect the above setup to handle it?

So 100 million docs per machine essentially? Totally depends on the hardware and what features you are using - but def in the realm of possibility.

> And given the number of documents, should we commit every x documents or
> rely on auto commits?

The number of docs shouldn't really matter here. Do you need near real time search?

You should be able to commit about as frequently as you'd like with NRT (eg every 1 second if you'd like) - either using soft auto commit or commitWithin.

Then you want to do a hard commit less frequently - every minute (or more or less) with openSearcher=false.

eg

     <autoCommit> 
       <maxTime>15000</maxTime> 
       <openSearcher>false</openSearcher> 
     </autoCommit>

> 
> -- 
> -IC

- Mark Miller
lucidimagination.com
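
To make the commitWithin half of that advice concrete, here is a hedged SolrJ
sketch (hypothetical node URL and field; the hard autoCommit with
openSearcher=false stays in solrconfig.xml as shown above):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical node URL and core name.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/master1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");

        // Ask Solr to make this add visible within ~1 second instead of
        // issuing explicit commits from the client. Server-side soft auto
        // commit (<autoSoftCommit> in solrconfig.xml) is the other option.
        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setCommitWithin(1000);
        req.process(server);

        // Durability is handled separately by the hard autoCommit with
        // openSearcher=false shown in the solrconfig.xml snippet above.
    }
}

commitWithin and soft auto commit control visibility; the infrequent hard
commit with openSearcher=false flushes segments without the cost of reopening
a searcher.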