Posted to dev@lucene.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2013/07/25 13:37:49 UTC
[jira] [Commented] (SOLR-5075) SolrCloud commit process is too time consuming, even if documents are light

    [ https://issues.apache.org/jira/browse/SOLR-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719527#comment-13719527 ] 

Otis Gospodnetic commented on SOLR-5075:
----------------------------------------

[~radu@wmds.ro] you should close this issue and ask on the solr-user mailing list.
                
> SolrCloud commit process is too time consuming, even if documents are light
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-5075
>                 URL: https://issues.apache.org/jira/browse/SOLR-5075
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, SolrCloud
>    Affects Versions: 4.1
>         Environment: SolrCloud 4.1, internal ZooKeeper, 16 shards, custom Java importer.
> Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192 GB RAM, 10 TB SSD and 50 TB SAS storage
>            Reporter: Radu Ghita
>              Labels: import, solrconfig.xml
>
> Our client's business model requires indexing a billion rows each month from MySQL into Solr within a short time frame. The documents are very light, but there are very many of them, and we need to reach speeds of around 80-100k/s. The built-in Solr indexer tops out at 40-50k/s, but its speed degrades as the hours go by, and after some hours (~12) it crashes.
> Therefore we have developed a custom Java importer that connects directly to MySQL and to SolrCloud via ZooKeeper, reads data from MySQL, creates documents and then imports them into Solr. This helps because we open ~50 threads, which speeds up the indexing process. We have optimized the MySQL queries (MySQL was the initial bottleneck) and now get speeds of over 100k/s, but as the index grows, Solr takes a very long time to add documents. I assume something in solrconfig.xml makes Solr stall and even block after 100 million documents have been indexed.
> Here is the Java code that creates the documents and then adds them to the Solr server:
> public void createDocuments() throws SQLException, SolrServerException, IOException
> {
>     App.logger.write("Creating documents..");
>     this.docs = new ArrayList<SolrInputDocument>();
>     App.logger.incrementNumberOfRows(this.size);
>     // Build one SolrInputDocument per row of the MySQL result set.
>     while (this.results.next())
>     {
>         this.docs.add(this.getDocumentFromResultSet(this.results));
>     }
>     // Close the result set before the statement that produced it.
>     this.results.close();
>     this.statement.close();
> }
>
> public void commitDocuments() throws SolrServerException, IOException
> {
>     App.logger.write("Committing..");
>     // Sends the whole list to Solr; no explicit commit is issued here.
>     App.solrServer.add(this.docs); // here it stays very long and then blocks
>     App.logger.incrementNumberOfRows(this.docs.size());
>     this.docs.clear();
> }
> I am also pasting the solrconfig.xml parameters that are relevant to this discussion:
> <maxIndexingThreads>128</maxIndexingThreads>
> <useCompoundFile>false</useCompoundFile>
> <ramBufferSizeMB>10000</ramBufferSizeMB>
> <maxBufferedDocs>1000000</maxBufferedDocs>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnce">20000</int>
>   <int name="segmentsPerTier">1000000</int>
>   <int name="maxMergeAtOnceExplicit">10000</int>
> </mergePolicy>
> <mergeFactor>100</mergeFactor>
> <termIndexInterval>1024</termIndexInterval>
> <autoCommit>
>   <maxTime>15000</maxTime>
>   <maxDocs>1000000</maxDocs>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> <autoSoftCommit>
>   <maxTime>2000000</maxTime>
> </autoSoftCommit>
> Thanks a lot for any answers, and excuse the long text; I'm new to this JIRA. If any other info is needed, please let me know.
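
The quoted commitDocuments() hands the entire accumulated list to a single add() call. One variation that may be worth trying while investigating is to send the list in fixed-size chunks, so that each request carries a bounded number of documents. The sketch below only shows the chunking itself. BatchingSketch, addInBatches and the sink parameter are hypothetical names, not part of SolrJ or of the reporter's importer, and nothing here is confirmed to resolve the reported slowdown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper; not part of SolrJ or of the importer quoted above. */
public class BatchingSketch {

    /**
     * Splits docs into consecutive chunks of at most batchSize elements and
     * passes each chunk to sink, in order. In the importer the sink would
     * wrap a call such as App.solrServer.add(batch) (assumption).
     */
    public static <T> void addInBatches(List<T> docs, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so the sink receives an independent list,
            // not a view onto the caller's list.
            sink.accept(new ArrayList<>(docs.subList(start, end)));
        }
    }
}
```

With five documents and a batch size of 2, the sink is called three times, with chunks of 2, 2 and 1 documents. Because Consumer cannot throw checked exceptions, a real sink would have to catch SolrServerException and IOException itself, or the parameter would need a custom functional interface that declares them.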
