Posted to dev@lucene.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2015/03/26 16:52:55 UTC

[jira] [Commented] (SOLR-6816) Review SolrCloud Indexing Performance.

    [ https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382086#comment-14382086 ] 

Timothy Potter commented on SOLR-6816:
--------------------------------------

Coming back to this discussion ...

I still think there is a need for a new optional parameter on an UpdateRequest that specifies that the request is a bulk add and that the client application knows all the docs in the request are either exact duplicates or new docs. You would use this parameter for high-volume indexing jobs such as from Hadoop, Spark, or log indexing applications. When this parameter is set to true (default is false of course), we can skip the version lookup on replicas in the {{versionAdd}} method of the {{DistributedUpdateProcessor}}, i.e.:

{code}
boolean bulkAdds = cmd.getReq().getParams().getBool(UpdateRequest.BULK_ADD, false);
if (!bulkAdds) {
  Long lastVersion = vinfo.lookupVersion(cmd.getIndexedId());
  if (lastVersion != null && Math.abs(lastVersion) >= versionOnUpdate) {
    // This update is a repeat, or was reordered.  We need to drop this update.
    return true;
  }
}
{code}

I didn't think the {{lookupVersion}} would be that much of an overhead, but my testing shows that it is, even when using docValues for the {{_version_}} field.
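To make the effect of the flag concrete, here is a minimal standalone sketch of the same decision outside of Solr. The class {{ReplicaVersionCheck}}, its in-memory map, and the {{lookups}} counter are hypothetical stand-ins for illustration only; they are not Solr classes, and the real lookup goes through {{vinfo.lookupVersion}} as shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the replica-side check; NOT Solr's actual code.
class ReplicaVersionCheck {
    // Simulates the per-document version store consulted by the lookup.
    private final Map<String, Long> versions = new HashMap<>();

    // Counts how many version lookups were actually performed.
    int lookups = 0;

    void record(String id, long version) {
        versions.put(id, version);
    }

    // Returns true when the update should be dropped as a repeat or reordered update.
    boolean shouldDrop(String id, long versionOnUpdate, boolean bulkAdds) {
        if (bulkAdds) {
            // The client vouches that docs are new or exact duplicates,
            // so the lookup is skipped and nothing is dropped here.
            return false;
        }
        lookups++;
        Long lastVersion = versions.get(id);
        return lastVersion != null && Math.abs(lastVersion) >= versionOnUpdate;
    }

    public static void main(String[] args) {
        ReplicaVersionCheck check = new ReplicaVersionCheck();
        check.record("doc1", 100L);
        System.out.println(check.shouldDrop("doc1", 90L, false));  // older update, lookup performed
        System.out.println(check.shouldDrop("doc1", 90L, true));   // bulk add, lookup skipped
        System.out.println("lookups performed: " + check.lookups);
    }
}
```

The trade-off in this sketch is the same as in the proposal: with the flag set, the replica no longer detects a repeated or reordered update, so it is only safe when the client can make the guarantee described above.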

Using this bulk add parameter, I'm seeing very good improvements when using replication. Specifically, here are the results I'm getting by making this simple change:

Indexing 9,992,262 docs (~1k in size) in a 3-shard collection with RF=2 (I'm using 6 r3.xlarge instances in EC2 so there is no contention between nodes, i.e. all replicas are on different servers):

* baseline branch5x: 758 seconds, ~13,182 docs per second
* branch5x with fix for SOLR-6820 (65536 version buckets): 710 seconds, ~14,074 docs per second
* branch5x with fix for SOLR-6820 and this bulkAdds parameter: 485 seconds, ~20,603 docs per second

That's a 56% increase in throughput over the baseline in branch5x! What's more, 20,603 docs per second is close to the throughput I was getting in the baseline without replication (23,401 docs per second).
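For anyone who wants to re-derive the figures above, the arithmetic can be sketched as follows. The class and method names are made up for this illustration; the inputs are the doc count and elapsed times reported above.

```java
// Hypothetical helper for re-deriving the reported throughput numbers.
class ThroughputMath {
    // Docs per second, rounded to the nearest whole doc.
    static long docsPerSecond(long docs, long seconds) {
        return Math.round((double) docs / seconds);
    }

    // Percentage increase of 'improved' over 'baseline', rounded to a whole percent.
    static long percentIncrease(long baseline, long improved) {
        return Math.round(100.0 * (improved - baseline) / baseline);
    }

    public static void main(String[] args) {
        long docs = 9_992_262L;
        long baseline = docsPerSecond(docs, 758);   // baseline branch5x
        long withFix = docsPerSecond(docs, 710);    // with fix for SOLR-6820
        long withBulk = docsPerSecond(docs, 485);   // with fix and bulkAdds
        System.out.println(baseline + " " + withFix + " " + withBulk);
        System.out.println(percentIncrease(baseline, withBulk) + "% over baseline");
    }
}
```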

I don't think using {{overwrite=false}} will work here though, because most apps still want basic duplicate checking on the leader to catch duplicate documents that get resent to Solr. For instance, imagine a Map/Reduce job that indexes into Solr ... if a task fails, then Hadoop usually retries that task a couple of times, meaning all docs in the block that failed will be sent again. If we use {{overwrite=false}}, then we'll end up with dupes in the index. This is why I think an additional parameter that lets client apps tell Solr they are doing bulk adds of new docs is required.
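The retry scenario can be illustrated with a toy model. This is only a sketch: the "index" below is a plain list of doc ids and the class {{OverwriteRetryDemo}} is hypothetical, so it says nothing about how Solr implements overwrites internally. It only shows why a resent block produces dupes when nothing replaces the earlier copies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a resent batch; NOT Solr's real index structures.
class OverwriteRetryDemo {
    // Sends the same batch 'attempts' times and returns the resulting toy index.
    static List<String> index(List<String> batch, int attempts, boolean overwrite) {
        List<String> index = new ArrayList<>();
        for (int attempt = 0; attempt < attempts; attempt++) {
            for (String id : batch) {
                if (overwrite) {
                    index.remove(id); // drop the existing doc with this id, if any
                }
                index.add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("a", "b", "c");
        // Two attempts model a failed task whose block is sent again.
        System.out.println("overwrite=true:  " + index(batch, 2, true).size() + " docs");
        System.out.println("overwrite=false: " + index(batch, 2, false).size() + " docs");
    }
}
```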

Lastly, I'm still working on a way to send fewer requests from leader to replica when using batches. Just increasing the poll queue time for CUSS in StreamingSolrClients imposes an unnecessary wait after the last doc in the batched request is processed. So I'm trying to devise a way for the entire batch of docs to be streamed to the replica without having this unnecessary wait after the last doc.
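The wait after the last doc can be shown with a generic polling consumer. This sketch uses only {{java.util.concurrent}} and a hypothetical class {{PollWaitDemo}}; it is not the CUSS implementation and only assumes that the consumer decides the batch is finished when a timed poll comes back empty.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Generic polling consumer; NOT the actual CUSS code.
class PollWaitDemo {
    // Drains the queue with a timed poll and returns the elapsed time in milliseconds.
    static long drainMillis(BlockingQueue<String> queue, long pollMillis) {
        long start = System.nanoTime();
        try {
            // The loop only ends once a poll times out, i.e. after waiting
            // the full poll time beyond the last doc.
            while (queue.poll(pollMillis, TimeUnit.MILLISECONDS) != null) {
                // A real consumer would stream the doc to the replica here.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and stop draining
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("doc1");
        queue.add("doc2");
        queue.add("doc3");
        // All docs are already queued, so nearly all of the elapsed time
        // is the final poll waiting for a doc that never arrives.
        System.out.println("elapsed ms: " + drainMillis(queue, 200));
    }
}
```

In this model a longer poll time lets more docs share one request, but every batch pays the full poll time once at the end, which is the wait described above.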

> Review SolrCloud Indexing Performance.
> --------------------------------------
>
>                 Key: SOLR-6816
>                 URL: https://issues.apache.org/jira/browse/SOLR-6816
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Priority: Critical
>         Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and low hanging fruit. We need to vet the performance and try to address any holes.
> Note: A common report is that adding any replication is very slow.


