
Posted to solr-user@lucene.apache.org by Nick Chase <nc...@earthlink.net> on 2012/11/03 17:17:40 UTC

SolrCloud failover behavior

I think there's a change in the behavior of SolrCloud vs. what's in the 
wiki, but I was hoping someone could confirm for me.  I checked JIRA and 
there were a couple of issues requesting partial results if one server 
comes down, but that doesn't seem to be the issue here.  I also checked 
CHANGES.txt and don't see anything that seems to apply.

I'm running "Example B: Simple two shard cluster with shard replicas" 
from the wiki at https://wiki.apache.org/solr/SolrCloud and everything 
starts out as expected.  However, things get a little wonky when I get 
to the part about failover behavior.

I added data to the shard running on 7475.  If I kill 7500, a query to 
any of the other servers works fine.  But if I kill 7475, rather than 
getting zero results on a search to 8983 or 8900, I get a 503 error:

<response>
    <lst name="responseHeader">
       <int name="status">503</int>
       <int name="QTime">5</int>
       <lst name="params">
          <str name="q">*:*</str>
       </lst>
    </lst>
    <lst name="error">
       <str name="msg">no servers hosting shard:</str>
       <int name="code">503</int>
    </lst>
</response>
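
For anyone scripting this failover test, here is a minimal Python sketch that pulls the error code and message out of a response body like the one above. The function name solr_error is made up for illustration, and the sketch assumes the error always arrives as a top-level <lst name="error"> block, as it does in this one response.

```python
import xml.etree.ElementTree as ET

# Response body copied from the message above.
BODY = """<response>
    <lst name="responseHeader">
       <int name="status">503</int>
       <int name="QTime">5</int>
       <lst name="params">
          <str name="q">*:*</str>
       </lst>
    </lst>
    <lst name="error">
       <str name="msg">no servers hosting shard:</str>
       <int name="code">503</int>
    </lst>
</response>"""


def solr_error(body):
    """Return (code, msg) if the body has a top-level error block, else None.

    Hypothetical helper: it only looks for <lst name="error"> directly under
    the root element, which is an assumption based on the sample above.
    """
    root = ET.fromstring(body)
    error = root.find("lst[@name='error']")
    if error is None:
        return None
    code = error.find("int[@name='code']")
    msg = error.find("str[@name='msg']")
    return (
        int(code.text) if code is not None else None,
        msg.text if msg is not None else "",
    )


if __name__ == "__main__":
    print(solr_error(BODY))
```

A healthy response has no error block, so the helper returns None for it and the caller can treat any non-None result as a failed query.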

I don't see any errors in the consoles.

Also, if I kill 8983, which includes the Zookeeper server, everything 
dies, rather than just staying in a steady state; the other servers 
continually show:

Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
WARNING: Session 0x13ac6cf87890002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)

Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect

over and over again, and a call to any of the servers shows a connection 
error to 8983.

This is the current 4.0.0 release, running on Windows 7.

If this is the proper behavior and the wiki needs updating, fine; I just 
need to know.  Otherwise if anybody has any clues as to what I may be 
missing, I'd be grateful. :)

Thanks...

---  Nick

Re: SolrCloud failover behavior

Posted by Erick Erickson <er...@gmail.com>.

I was right for once <G>..

Thanks for updating the Wiki!

Erick


On Tue, Nov 6, 2012 at 9:42 AM, Nick Chase <nc...@earthlink.net> wrote:

> Thanks a million, Erick!  You're right about killing both nodes hosting
> the shard.  I'll get the wiki corrected.
>
> ----  Nick

Re: SolrCloud failover behavior

Posted by Nick Chase <nc...@earthlink.net>.

Thanks a million, Erick!  You're right about killing both nodes hosting 
the shard.  I'll get the wiki corrected.

----  Nick

On 11/3/2012 10:51 PM, Erick Erickson wrote:
> SolrCloud doesn't work unless every shard has at least one server that is
> up and running.
>
> I _think_ you might be killing both nodes that host one of the shards. The
> admin page has a link showing you the state of your cluster. So when this
> happens, does that page show both nodes for that shard being down?
>
> And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one ZK
> node, killing that will bring down the whole cluster. Which is why the
> usual recommendation is that ZK be run externally and usually an odd number
> of ZK nodes (three or more).
>
> Anyone can create a login and edit the Wiki, so any clarifications are
> welcome!
>
> Best
> Erick

Re: SolrCloud failover behavior

Posted by Erick Erickson <er...@gmail.com>.

SolrCloud doesn't work unless every shard has at least one server that is
up and running.

I _think_ you might be killing both nodes that host one of the shards. The
admin page has a link showing you the state of your cluster. So when this
happens, does that page show both nodes for that shard being down?
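
To make the rule concrete, here is a rough Python sketch of the check described above: queries can only be served while every shard still has at least one node up. The shard names and the port-to-shard layout are assumptions for illustration (two shards, each hosted on two of the four ports mentioned in this thread); nothing here is read from a real cluster or from the Solr API.

```python
# Hypothetical layout for illustration only: the shard names and the
# port-to-shard mapping are assumptions, not read from a real cluster.
SHARDS = {
    "shardA": {8983, 8900},
    "shardB": {7475, 7500},
}


def shards_without_live_node(shards, down):
    """Return the names of shards whose hosting nodes are all in `down`."""
    down = set(down)
    return sorted(name for name, nodes in shards.items() if nodes <= down)


def can_serve_queries(shards, down):
    """True only while every shard still has at least one live node."""
    return not shards_without_live_node(shards, down)


if __name__ == "__main__":
    # One node of a shard down: the other node still covers that shard.
    print(can_serve_queries(SHARDS, {7500}))
    # Both nodes of the same shard down: no server hosts that shard.
    print(can_serve_queries(SHARDS, {7500, 7475}))
    print(shards_without_live_node(SHARDS, {7500, 7475}))
```

Under that assumed layout, killing 7500 alone leaves the cluster queryable, while killing 7500 and then 7475 leaves one shard with no server at all, which matches the 503 described in the original message.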

And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one ZK
node, killing that will bring down the whole cluster. Which is why the
usual recommendation is that ZK be run externally and usually an odd number
of ZK nodes (three or more).
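
The quorum arithmetic behind that recommendation can be sketched in a few lines of Python. This is plain majority math, not a call into ZooKeeper; the function names are made up for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can be lost while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

A single ZK node tolerates zero failures, which is the situation in the original message. Three nodes tolerate one failure, and a fourth node adds nothing over three, which is why odd ensemble sizes are the usual choice.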

Anyone can create a login and edit the Wiki, so any clarifications are
welcome!

Best
Erick

