
Posted to solr-user@lucene.apache.org by Zisis Tachtsidis <zi...@runbox.com> on 2015/11/28 19:58:51 UTC

Single-sharded SolrCloud vs Lucene indexing speed

I'm conducting some indexing experiments in SolrCloud and I want to confirm
my conclusions and ask for suggestions on how to improve performance.

My setup includes a single-sharded collection with 1 additional replica in
SolrCloud 5.3.1. I'm using SolrJ and the indexing speed refers to the actual
SolrJ call that adds the document. I've run some indexing tests and it seems
that Lucene indexing is as fast as or faster than Solr's in every case. In all
cases the same documents are sent to both Lucene and Solr, and the same
analysis is performed on them.

- 2 replicas, leader is a replica on a machine under heavy load => ~3x
slower than Lucene.
- 2 replicas, leader is a replica on a machine under light load => ~2x
slower than Lucene.
- 1 replica on a machine under light load => indexing speed similar to
Lucene.

Conclusions
(*) It seems that the slowest replica determines the indexing speed.
(*) It gets even worse if the slowest replica is the leader. That would make
sense if the leader forwards the request to the remaining replicas only after
it has finished indexing.

Regarding improvements
(*) I'm indexing fairly big documents (0.5 MB < DocSize < 1 MB), so batch
updates do not offer a significant performance gain.
(*) Would I see an improvement if I used a multi-sharded collection?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Single-sharded-SolrCloud-vs-Lucene-indexing-speed-tp4242568.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Single-sharded SolrCloud vs Lucene indexing speed

Posted by Erick Erickson <er...@gmail.com>.

Of course Lucene will be faster in all cases when replicas are
present. Solr is built on Lucene so any overhead at all that Solr adds
will cause the total round-trip to be slower.

Lucene doesn't have to concern itself with distributing updates to
replicas, for instance, as happens in your first two cases. The raw
overhead imposed by Solr itself is probably what your third case shows.

Yes, slowest replica determines indexing speed. To guarantee data
isn't lost, the process is:
> leader receives updates.
> leader indexes locally _and_ forwards docs to follower
> follower acks back to leader when the docs are written to tlog (at least).
> leader acks back to client.

If it were otherwise, the follower couldn't guarantee that it had all
updates, so that's an early design decision.
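
The ack sequence above can be sketched as a small latency model. This is
only an illustration, not Solr's actual code: the class AckModel, the method
clientLatencyMs, and the millisecond figures are all hypothetical. It assumes
the leader indexes locally while the followers work in parallel, and it
ignores network time and any leader-only work.

```java
import java.util.Arrays;

// Hypothetical model of the update flow described above; not Solr source.
public class AckModel {

    // The leader acks the client only after its own local indexing is done
    // AND every follower has acked, so the client waits for the slowest of
    // them. With no followers the latency is just the leader's own time.
    public static long clientLatencyMs(long leaderIndexMs, long... followerAckMs) {
        long slowestFollower = Arrays.stream(followerAckMs).max().orElse(0L);
        return Math.max(leaderIndexMs, slowestFollower);
    }

    public static void main(String[] args) {
        // Made-up timings, purely for illustration.
        System.out.println(clientLatencyMs(10));      // one replica: prints 10
        System.out.println(clientLatencyMs(10, 30));  // slow follower: prints 30
        System.out.println(clientLatencyMs(30, 10));  // slow leader: prints 30
    }
}
```

In this simplified model a slow leader and a slow follower cost the same. Any
extra slowdown seen with a slow leader would have to come from work the model
leaves out.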

If the slowest replica is the leader... Hmmm, forwarding updates to
the followers is done in parallel, but there is some additional work
done on the leader that the followers don't have to do. Possibly this
is what you're seeing?

Solr will scale nearly linearly with additional shards. SolrJ
(assuming you're using CloudSolrClient) routes documents up-front so
you get a significant amount of parallelization. Of course, this won't
be true if you only index one doc at a time from a single thread.
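
One way to avoid the one-doc-at-a-time, single-threaded pattern is to submit
documents from several client threads. Below is a minimal sketch using only
the JDK. The class ParallelIndexer, the method indexAll, and the DocSink
interface are hypothetical stand-ins; in a real client the sink would
presumably wrap a shared CloudSolrClient and call its add method, which is
left out here so the example runs without SolrJ on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical client-side helper; names are illustrative only.
public class ParallelIndexer {

    // Stand-in for the call that sends one document to Solr.
    public interface DocSink {
        void add(String doc) throws Exception;
    }

    // Submits each document to a fixed pool of worker threads, waits for
    // all of them, and returns how many were sent without an exception.
    public static int indexAll(List<String> docs, DocSink sink, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> pending = new ArrayList<>();
        for (String doc : docs) {
            Callable<Void> task = () -> {
                sink.add(doc);
                return null;
            };
            pending.add(pool.submit(task));
        }
        int sent = 0;
        for (Future<Void> f : pending) {
            try {
                f.get();  // blocks until this document's add call has returned
                sent++;
            } catch (ExecutionException e) {
                // This document failed; a real client would log or retry here.
            }
        }
        pool.shutdown();
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> docs = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4");
        int sent = indexAll(docs,
                doc -> System.out.println(Thread.currentThread().getName() + " sent " + doc),
                2);
        System.out.println("sent " + sent + " of " + docs.size());
    }
}
```

The thread count is something to tune by experiment. With large documents
like the ones described, a handful of threads may already saturate the
network or the slowest replica.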

Best,
Erick
