Posted to solr-user@lucene.apache.org by Ganesh Sethuraman <ga...@gmail.com> on 2019/02/25 18:15:42 UTC

how to get high-availability for Solr csv update handler?

Hi

We are using SolrCloud 7.2.1. We use the Solr CSV update handler to do
bulk updates (several million docs) into multiple collections. When we
call the CSV update handler with curl from the command line (as below),
we point at a single Solr server. If that server goes down, this
approach can fail. Is there a way to send the write to the leader, as
SolrJ does, using simple curl commands?

In the request below, if SOLR1-SERVER is down for some reason, the
request will fail, even though the new leader, say SOLR2-SERVER, is up.

curl 'http://<<SOLR1-SERVER>>:8983/solr/my_collection/update?commit=true' \
  --data-binary @example/exampledocs/books.csv \
  -H 'Content-type:application/csv'

1. I can create a load balancer / ALB in front of Solr, but that still may
not identify the leader for efficiency.
2. I can write a SolrJ client to do the update, but I am not sure whether
I will get the efficiency of bulk update. I am also not sure it would be
as simple as curl.

Any best practices for this would be appreciated.

Regards
Ganesh

Re: how to get high-availability for Solr csv update handler?

Posted by Walter Underwood <wu...@wunderwood.org>.

We send batches of updates to a load balancer. The cluster gets the updates to the right leader with very little overhead. When we get an error, we resend the update batch. The load balancer will find a healthy node to receive it. This is simple, robust, and fast.

One handy tip: if a batch fails with a 400, we back off and resend it in batches of 1 document each so we can identify the bad one. This saves a ton of time trying to manually find the bad document.
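
A minimal Python sketch of this retry-and-isolate pattern. It assumes a hypothetical `post` callable (not part of any Solr client) that sends a list of documents to the load balancer URL and returns the HTTP status code. The retry count and the function names are illustrative only.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: takes a batch of docs, returns an HTTP status code.
Post = Callable[[Sequence[str]], int]


def post_with_retry(post: Post, docs: Sequence[str], max_retries: int) -> int:
    """Resend the batch until Solr answers 200 or 400; give up after max_retries."""
    for _ in range(max_retries):
        status = post(docs)
        if status in (200, 400):
            return status
        # Any other status is treated as transient (for example a node going
        # down); the load balancer is expected to pick a healthy node next time.
    raise RuntimeError(f"batch still failing after {max_retries} attempts")


def index_batch(post: Post, docs: Sequence[str], max_retries: int = 3) -> List[str]:
    """Send docs as one batch and return the documents Solr rejected."""
    if post_with_retry(post, docs, max_retries) == 200:
        return []
    # The batch got a 400: resend one document at a time to find the bad ones.
    return [doc for doc in docs
            if post_with_retry(post, [doc], max_retries) == 400]
```

In real use `post` would wrap an HTTP POST to the update handler behind the load balancer. A pause between retries could be added inside `post_with_retry` if the cluster needs time to recover.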

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

Re: how to get high-availability for Solr csv update handler?

Posted by Ganesh Sethuraman <ga...@gmail.com>.

Thanks for the details. We are looking at load balancers not for the
small performance improvement, but for high availability. The other
alternative is that, if the update fails on one server using curl, we
call another Solr server on error. I was looking to see whether there is
any other way to get the working leader from ZooKeeper before the update.
Is there a way to query ZooKeeper for this? I understand there is no
guarantee that the leader won't change during a large CSV file update,
but at least some protection during planned server restarts can be
managed.
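
The "call another Solr server on error" alternative could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a tested production client: the function names and host list are hypothetical, and the URL and Content-Type mirror the curl command from the original question.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Tuple


def post_csv(base_url: str, collection: str, csv_bytes: bytes,
             timeout: float = 60.0) -> int:
    """POST a CSV payload to one Solr node, like the curl command in this thread."""
    url = f"{base_url}/solr/{collection}/update?commit=true"
    request = urllib.request.Request(
        url, data=csv_bytes,
        headers={"Content-Type": "application/csv"}, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def post_with_failover(base_urls: Iterable[str],
                       send: Callable[[str], int]) -> Tuple[str, int]:
    """Try each node in order; return (node, status) for the first success."""
    last_error = None
    for base_url in base_urls:
        try:
            return base_url, send(base_url)
        except urllib.error.HTTPError as err:
            if err.code < 500:
                raise  # a 4xx means bad input; another node would reject it too
            last_error = err
        except OSError as err:  # connection refused, timeout, URLError
            last_error = err
    raise RuntimeError("no Solr node accepted the update") from last_error
```

A call might look like `post_with_failover(urls, lambda u: post_csv(u, "my_collection", data))`, where `urls` lists nodes such as `http://<<SOLR1-SERVER>>:8983`. If a node fails partway through a large upload, the whole file is resent to the next node.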

Regarding the SolrJ option, it certainly seems to be the best option. Is
there a Python Solr client that is leader-aware, the way the SolrJ (Java)
client is?
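
One way a Python script could approximate leader awareness without a ZooKeeper client is to ask any live node for the cluster state through the Collections API `CLUSTERSTATUS` action. The sketch below assumes the JSON layout that action returns (`cluster` / `collections` / `shards` / `replicas`, with `leader` given as the string `"true"`); check it against your Solr version. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Dict


def fetch_cluster_status(base_url: str, collection: str,
                         timeout: float = 10.0) -> dict:
    """Ask one live Solr node for the cluster state of a collection."""
    url = (f"{base_url}/solr/admin/collections"
           f"?action=CLUSTERSTATUS&collection={collection}&wt=json")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def shard_leaders(status: dict, collection: str) -> Dict[str, str]:
    """Map each shard name to the base_url of its active leader replica."""
    shards = status["cluster"]["collections"][collection]["shards"]
    leaders = {}
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            # Assumed layout: the leader replica carries "leader": "true".
            if replica.get("leader") == "true" and replica.get("state") == "active":
                leaders[shard_name] = replica["base_url"]
    return leaders
```

The lookup still needs at least one reachable node, and the leader can change between the lookup and the update, so it complements rather than replaces retrying on error.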

Regards,
Ganesh

Re: how to get high-availability for Solr csv update handler?

Posted by Shawn Heisey <ap...@elyograg.org>.

On 2/25/2019 11:15 AM, Ganesh Sethuraman wrote:
> We are using SolrCloud 7.2.1. We use the Solr CSV update handler to do
> bulk updates (several million docs) into multiple collections. When we
> call the CSV update handler with curl from the command line (as below),
> we point at a single Solr server. If that server goes down, this
> approach can fail. Is there a way to send the write to the leader, as
> SolrJ does, using simple curl commands?

The SolrJ client named CloudSolrClient is able to do this because it is 
a full ZooKeeper client that has instant access to the clusterstate 
maintained by your Solr servers.

To get that capability in any other client would require that the client 
is aware of the ZooKeeper ensemble in the same way.  Curl cannot do this.

> 
> In the request below, if SOLR1-SERVER is down for some reason, the
> request will fail, even though the new leader, say SOLR2-SERVER, is up.
> 
> curl 'http://<<SOLR1-SERVER>>:8983/solr/my_collection/update?commit=true' \
>   --data-binary @example/exampledocs/books.csv \
>   -H 'Content-type:application/csv'
> 
> 1. I can create a load balancer / ALB in front of Solr, but that still may
> not identify the leader for efficiency.

A load balancer won't be able to identify the leader unless it is 
capable of talking to ZooKeeper and knows how Solr represents data in 
ZK.  Have you measured the efficiency improvement that comes from 
sending to the leader?  If that improvement is small, it's probably not 
worth implementing something that talks to ZooKeeper.  I know there are
people who achieve very fast indexing rates without trying to send to
leaders ... I suspect that the improvement obtained by sending to
leaders is relatively small.

> 2. I can write a SolrJ client to do the update, but I am not sure whether
> I will get the efficiency of bulk update. I am also not sure it would be
> as simple as curl.

SolrJ is probably more efficient than something like curl, because it 
utilizes a compact binary format for data transfer in both directions, 
called javabin.  With curl, you would most likely be using a text format 
like json, xml, or csv.

SolrJ clients are fully thread-safe, which means you can use a single
instance to send updates in parallel with multiple threads.  That is the
best way to achieve good indexing performance with Solr.
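
For readers working outside Java, the same idea (one shared sender, several threads, one batch per request) can be sketched in Python. This is an analogue, not SolrJ: `send` is a hypothetical thread-safe callable that posts one batch and returns a result, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_in_parallel(docs: Sequence, send: Callable[[Sequence], int],
                      batch_size: int = 1000, workers: int = 4) -> List[int]:
    """Send batches concurrently through one shared sender.

    Results come back in batch order, one entry per batch.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches(docs, batch_size)))
```

The right number of workers depends on the cluster and is best found by measurement.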

Thanks,
Shawn