Posted to solr-user@lucene.apache.org by Edd Grant <ed...@eddgrant.com> on 2013/05/03 14:13:46 UTC

Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Hi all,

I have been playing with Solr Cloud recently and am enjoying the
distributed indexing capability.

At the moment my SolrCloud consists of 2 leaders and 2 replicas which are
fronted by an HAProxy instance. I want to maximise performance for indexing
and it occurred to me that the model I use for loadbalancing my indexing
requests may impact performance. i.e. am I likely to see better indexing
performance if I stick certain groups of requests to certain nodes vs
simply using a round robin approach?

I'll be doing some empirical testing to try and figure this out but was
wondering if there's any general guidance here? Or if anyone has any
experience of particularly good/bad configurations?

Many thanks,

Edd

-- 
Web: http://www.eddgrant.com
Email: edd@eddgrant.com
Mobile: +44 (0) 7861 394 543

Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Posted by Edd Grant <ed...@eddgrant.com>.

Aah I see - very useful. Thanks!


On 3 May 2013 15:49, Shawn Heisey <so...@elyograg.org> wrote:

> CloudSolrServer is part of the SolrJ (Java) API.  It incorporates a
> ZooKeeper client.  To initialize it, you don't tell it about your Solr
> servers; you give it the same ZooKeeper host information that you give
> to Solr when starting in cloud mode.  It always knows the current state
> of the cluster, so if you have a failure, it adjusts so that your
> queries and updates don't fail.  That also means that it will know when
> servers are added to or removed from the cloud.
>
>
> http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html
>
> Thanks,
> Shawn
>
>



Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Posted by Shawn Heisey <so...@elyograg.org>.

On 5/3/2013 8:35 AM, Edd Grant wrote:
> Thanks, that's exactly what I was worried about. If I take your suggested
> approach of using CloudSolrServer and the feeder learns which shard leader
> to target, then if the shard leader goes down midway through indexing
> I've lost my ability to index. Whereas if I make all updates via the
> HAProxy instance I've got HA, but at the cost of performance.
> 
> This has me wondering if it might be feasible to address each shard with a
> VIP? Then if the leader of the shard goes down and a replica is elected as
> the leader, the replica could also take over the VIP, so in essence we'd
> always be sending messages to the leader. Has anyone tried anything like
> this?

CloudSolrServer is part of the SolrJ (Java) API.  It incorporates a
ZooKeeper client.  To initialize it, you don't tell it about your Solr
servers; you give it the same ZooKeeper host information that you give
to Solr when starting in cloud mode.  It always knows the current state
of the cluster, so if you have a failure, it adjusts so that your
queries and updates don't fail.  That also means that it will know when
servers are added to or removed from the cloud.

http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html
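
To make the failover behaviour described above concrete, here is a
minimal toy model in plain Java. It is not SolrJ code and not how
CloudSolrServer is implemented: the class, its methods and the node URLs
are all hypothetical, and the "coordination service" is simulated by
calling onLeaderChange() by hand. It only illustrates why a client that
tracks cluster state keeps targeting the right node after a leader
election.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy model of a cluster-state-aware client. This is NOT SolrJ code: the
 * class, method and node names are hypothetical and only illustrate the idea.
 */
class ClusterAwareClientSketch {
    // shard name -> URL of the current leader, as last reported to the client
    private final Map<String, String> leaders = new ConcurrentHashMap<>();

    /** Called whenever the (simulated) coordination service reports a leader change. */
    void onLeaderChange(String shard, String leaderUrl) {
        leaders.put(shard, leaderUrl);
    }

    /** Returns the node an update for the given shard should be sent to right now. */
    String targetFor(String shard) {
        String leader = leaders.get(shard);
        if (leader == null) {
            throw new IllegalStateException("no known leader for " + shard);
        }
        return leader;
    }

    public static void main(String[] args) {
        ClusterAwareClientSketch client = new ClusterAwareClientSketch();
        client.onLeaderChange("shard1", "http://node1:8983/solr");
        System.out.println(client.targetFor("shard1"));

        // Simulate node1 failing and its replica on node3 being elected leader.
        client.onLeaderChange("shard1", "http://node3:8983/solr");
        System.out.println(client.targetFor("shard1"));
    }
}
```

In this sketch the feeder never hard-codes a leader, so a leader
election changes the target of the next update without any change on the
feeder side. The real client is documented at the link above.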

Thanks,
Shawn

Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Posted by Edd Grant <ed...@eddgrant.com>.

Thanks, that's exactly what I was worried about. If I take your suggested
approach of using CloudSolrServer and the feeder learns which shard leader
to target, then if the shard leader goes down midway through indexing
I've lost my ability to index. Whereas if I make all updates via the
HAProxy instance I've got HA, but at the cost of performance.

This has me wondering if it might be feasible to address each shard with a
VIP? Then if the leader of the shard goes down and a replica is elected as
the leader, the replica could also take over the VIP, so in essence we'd
always be sending messages to the leader. Has anyone tried anything like
this?

Cheers,

Edd


On 3 May 2013 15:22, Furkan KAMACI <fu...@gmail.com> wrote:

> If you index with CloudSolrServer, your client learns from ZooKeeper
> where each document belongs and sends it straight to that shard's leader.
> However, if you send documents some other way, they can arrive at any
> node and are then routed to the right place within the cluster. This
> extra routing step may cause unnecessary network traffic and add latency
> to indexing.

Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Posted by Furkan KAMACI <fu...@gmail.com>.

If you index with CloudSolrServer, your client learns from ZooKeeper
where each document belongs and sends it straight to that shard's leader.
However, if you send documents some other way, they can arrive at any
node and are then routed to the right place within the cluster. This
extra routing step may cause unnecessary network traffic and add latency
to indexing.
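
The extra hop can be illustrated with a small toy model in plain Java.
Everything in it is an assumption made for illustration: the hash-mod
router stands in for the cluster's real document routing, and the layout
(nodes 0 and 1 lead shards 0 and 1, nodes 2 and 3 are their replicas)
mirrors the 2 leaders + 2 replicas setup from the original question. It
counts how many documents arrive at a node that is not the leader of
their shard and so would need forwarding inside the cluster.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy comparison of two ways to deliver documents to a sharded cluster.
 * Not Solr code: the hash function and node layout are assumptions.
 */
class IndexingHopsSketch {

    /** Hypothetical document router: maps a document id to a shard index. */
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /**
     * Counts documents whose receiving node is not the leader of their shard,
     * i.e. documents that need one extra hop inside the cluster.
     */
    static int countForwarded(List<String> docIds, int[] leaderNodeOfShard, int[] receivingNode) {
        int forwarded = 0;
        for (int i = 0; i < docIds.size(); i++) {
            int shard = shardFor(docIds.get(i), leaderNodeOfShard.length);
            if (receivingNode[i] != leaderNodeOfShard[shard]) {
                forwarded++;
            }
        }
        return forwarded;
    }

    /** Round robin: document i goes to node i mod numNodes, regardless of its shard. */
    static int[] roundRobin(int numDocs, int numNodes) {
        int[] nodes = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            nodes[i] = i % numNodes;
        }
        return nodes;
    }

    /** Leader-aware: each document goes straight to the leader of its shard. */
    static int[] leaderAware(List<String> docIds, int[] leaderNodeOfShard) {
        int[] nodes = new int[docIds.size()];
        for (int i = 0; i < nodes.length; i++) {
            nodes[i] = leaderNodeOfShard[shardFor(docIds.get(i), leaderNodeOfShard.length)];
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<String> docIds = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6");
        int[] leaderNodeOfShard = {0, 1}; // nodes 0 and 1 lead shards 0 and 1; nodes 2 and 3 are replicas
        int numNodes = 4;
        System.out.println("round robin forwards:  "
                + countForwarded(docIds, leaderNodeOfShard, roundRobin(docIds.size(), numNodes)));
        System.out.println("leader-aware forwards: "
                + countForwarded(docIds, leaderNodeOfShard, leaderAware(docIds, leaderNodeOfShard)));
    }
}
```

In this model the leader-aware count is always zero by construction,
while the round robin count depends on the document ids. How much the
extra hops cost on a real cluster is something only measurement can
show.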

2013/5/3 Edd Grant <ed...@eddgrant.com>

> Hi,
>
> No, we're actually POSTing them over plain old HTTP. Our "feeder" process
> simply points at the HAProxy box and posts merrily away.
>
> Cheers,
>
> Edd

Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Posted by Edd Grant <ed...@eddgrant.com>.

Hi,

No, we're actually POSTing them over plain old HTTP. Our "feeder" process
simply points at the HAProxy box and posts merrily away.

Cheers,

Edd


On 3 May 2013 13:17, Furkan KAMACI <fu...@gmail.com> wrote:

> Do you use CloudSolrServer when you push documents into SolrCloud to be
> indexed?

Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud

Posted by Furkan KAMACI <fu...@gmail.com>.

Do you use CloudSolrServer when you push documents into SolrCloud to be
indexed?

2013/5/3 Edd Grant <ed...@eddgrant.com>

> Hi all,
>
> I have been playing with Solr Cloud recently and am enjoying the
> distributed indexing capability.
>
> At the moment my SolrCloud consists of 2 leaders and 2 replicas which are
> fronted by an HAProxy instance. I want to maximise performance for indexing
> and it occurred to me that the model I use for loadbalancing my indexing
> requests may impact performance. i.e. am I likely to see better indexing
> performance if I stick certain groups of requests to certain nodes vs
> simply using a round robin approach?
>
> I'll be doing some empirical testing to try and figure this out but was
> wondering if there's any general guidance here? Or if anyone has any
> experience of particularly good/bad configurations?
>
> Many thanks,
>
> Edd