Posted to solr-user@lucene.apache.org by Tor Egil <tr...@gmail.com> on 2013/08/14 10:58:59 UTC

Clusterstate says "state:recovering", but Core says "I see state: null"?

Setup:
3 zk servers
3 solr 4.4 servers (1 shard with 2 replicas)

Every now and then Solr gets trapped recovering

Clusterstate says:


Leader says:


and the restarted replica says:

I've tried removing the data directory and restarting the replica, but it
ends up in the same loop. I have to kill the cluster and restart all servers
to recover from this, which isn't acceptable for the 24/7 uptime I need :-/

Any clue what's wrong?



--
View this message in context: http://lucene.472066.n3.nabble.com/Clusterstate-says-state-recovering-but-Core-says-I-see-state-null-tp4084504.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Michael Della Bitta <mi...@appinions.com>.
We build new collections nightly (identifiers change for us) and change
aliases once they're done. Easy and effective.
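
A minimal sketch of that nightly rotation, assuming the Collections API
CREATE and CREATEALIAS actions. The base URL, alias name, config name, and
shard/replica counts are hypothetical and not taken from the setup described
above. The script only builds the request URLs; it does not contact Solr.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed values for illustration only; none of these appear in the thread.
SOLR_BASE = "http://localhost:8983/solr"
ALIAS = "products"
CONFIG_NAME = "products_conf"


def collections_api_url(base, **params):
    """Build a Collections API URL from keyword parameters."""
    return f"{base}/admin/collections?{urlencode(params)}"


def nightly_urls(day):
    """Return (create_url, alias_url) for the collection built on `day`."""
    # The identifier changes every night, e.g. products_20130814.
    collection = f"{ALIAS}_{day:%Y%m%d}"
    create = collections_api_url(
        SOLR_BASE,
        action="CREATE",
        name=collection,
        numShards=1,            # assumed
        replicationFactor=2,    # assumed
        **{"collection.configName": CONFIG_NAME},
    )
    # Issued only after the new collection has been fully indexed.
    alias = collections_api_url(
        SOLR_BASE, action="CREATEALIAS", name=ALIAS, collections=collection
    )
    return create, alias


create_url, alias_url = nightly_urls(date(2013, 8, 14))
print(create_url)
print(alias_url)
```

Clients keep querying the alias name, so the change of identifier stays
invisible to them; the previous night's collection can be dropped once
nothing refers to it.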

Michael Della Bitta



On Wed, Aug 14, 2013 at 11:41 AM, Mark Miller <ma...@gmail.com> wrote:

>
> On Aug 14, 2013, at 10:04 AM, Tor Egil <tr...@gmail.com> wrote:
>
> > The name of the core "swap" is used because I would like to upload new
> > configs to the swap core, and then do an actual swap when the core is up
> > and running with new data….
>
> I would do this with two collections and a collection alias instead.
>
> - Mark
>
>

Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Mark Miller <ma...@gmail.com>.
On Aug 14, 2013, at 10:04 AM, Tor Egil <tr...@gmail.com> wrote:

> The name of the core "swap" is used because I would like to upload new
> configs to the swap core, and then do an actual swap when the core is up and
> running with new data….

I would do this with two collections and a collection alias instead.

- Mark
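
One way to picture the two-collection approach, as a hedged sketch: keep two
collections, load new data into whichever one the alias is not pointing at,
then repoint the alias with CREATEALIAS. The collection names, alias name,
and base URL below are hypothetical. The repoint_alias helper would talk to a
live node and is deliberately not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical names for illustration; none of them come from the thread.
SOLR_BASE = "http://localhost:8983/solr"
ALIAS = "live"
COLLECTIONS = ("data_a", "data_b")


def standby_collection(current):
    """Return the collection the alias is NOT currently pointing at."""
    if current not in COLLECTIONS:
        raise ValueError(f"unknown collection: {current}")
    first, second = COLLECTIONS
    return second if current == first else first


def create_alias_url(base, alias, collection):
    """Build the CREATEALIAS request that points `alias` at `collection`."""
    query = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{base}/admin/collections?{query}"


def repoint_alias(base, alias, collection, timeout=30):
    """Send the request to a running SolrCloud node (not called below)."""
    url = create_alias_url(base, alias, collection)
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


# The alias currently serves data_a, so new data goes into data_b first.
target = standby_collection("data_a")
print(create_alias_url(SOLR_BASE, ALIAS, target))
```

Unlike a core SWAP, this leaves both collections registered in ZooKeeper
under their own names; only the alias mapping changes.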


SOLUTION: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Tor Egil <tr...@gmail.com>.
Aliasing instead of swapping removed this problem!

DO NOT USE "SWAP" WHEN IN CLOUD MODE (solr 4.3)




unformatted: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Tor Egil <tr...@gmail.com>.
This time without formatting ;-)

This is from the leader log. (There are other statements in between, but I
think they are irrelevant.) First of all it reads the ZooKeeper state. This
happens now and then. Then I guess the replica says "Hey, I'm alive, please
start the recovery process":

[qtp689554095-19] INFO  org.apache.solr.common.cloud.ZkStateReader  - Updating cloud state from ZooKeeper...
[qtp689554095-17] INFO  org.apache.solr.servlet.SolrDispatchFilter  - [admin] webapp=null path=/admin/cores params={coreNodeName=10.231.188.127:8080_solr_swap&state=recovering&nodeName=10.231.188.127:8080_solr&action=PREPRECOVERY&checkLive=true&core=swap&wt=javabin&onlyIfLeader=true&version=2} status=400 QTime=120485
[qtp689554095-15] ERROR org.apache.solr.core.SolrCore  - org.apache.solr.common.SolrException: I was asked to wait on state recovering for 10.231.188.127:8080_solr but I still do not see the requested state. I see state: null live:false
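
A small sketch for pulling the parameters, status, and QTime out of the
PREPRECOVERY request line above. It assumes the hard-wrapped pieces of that
line belong together as shown in LOG_LINE (in particular that "12" and "0485"
form a single QTime value); the helper name is made up for this example.

```python
import re
from urllib.parse import parse_qs

# Request line from the leader log, rejoined on the assumption that the
# mail client only inserted line breaks and dropped no characters.
LOG_LINE = (
    "[admin] webapp=null path=/admin/cores "
    "params={coreNodeName=10.231.188.127:8080_solr_swap&state=recovering"
    "&nodeName=10.231.188.127:8080_solr&action=PREPRECOVERY&checkLive=true"
    "&core=swap&wt=javabin&onlyIfLeader=true&version=2} "
    "status=400 QTime=120485"
)


def parse_request_line(line):
    """Return (params, status, qtime_ms) from a request log line."""
    match = re.search(r"params=\{(.*?)\} status=(\d+) QTime=(\d+)", line)
    if match is None:
        raise ValueError("no params/status/QTime found in line")
    # parse_qs returns a list per key; each key occurs once here.
    params = {key: values[0] for key, values in parse_qs(match.group(1)).items()}
    return params, int(match.group(2)), int(match.group(3))


params, status, qtime_ms = parse_request_line(LOG_LINE)
print(params["action"], params["state"], status, qtime_ms / 1000.0)
```

If the rejoined value is right, the leader held this request for roughly two
minutes before answering with status 400, which would fit the "Read timed
out" that the replica reports.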

From the replica log, which is trying to recover. For some reason I get
"read timed out", as if Solr were dead (but it clearly isn't). Maybe I could
turn on debug logging to view the actual URL for the query?
Also, I can see the replica telling ZooKeeper it's alive and recovering:

[RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy  - Error while trying to recover. core=swap:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://10.231.188.126:8080/solr
...Caused by: java.net.SocketTimeoutException: Read timed out
[RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy  - Recovery failed - trying again... (1) core=swap
[RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  - Wait 4.0 seconds before trying to recover again (2)
[RecoveryThread] INFO  org.apache.solr.cloud.ZkController  - publishing core=swap state=recovering

I left this looping for a while in case it was just a bad synchronization,
but it never recovered.

I will use aliases instead of swapping from now on, since swapping could
lead to unstable situations (probably like this one...). Maybe it would be
appropriate to show a warning when the user clicks "Swap" in SolrCloud
mode. That would have saved you from my noob messages ;-)





Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Mark Miller <ma...@gmail.com>.
On Aug 14, 2013, at 10:24 AM, Shawn Heisey <so...@elyograg.org> wrote:

> 
> I'm not sure what happens if you swap cores in a SolrCloud environment.
> It's possible that this kind of swapping could lead to a very unstable
> system.

Yeah, it's totally not supported. I've threatened to make a JIRA issue about throwing an exception on this and one or two other core admin commands. We might want to support them in some way at some point, but currently, swap is def no good in SolrCloud.

- Mark

Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/14/2013 8:04 AM, Tor Egil wrote:
> This is from the leader log. (There are other statements in between, but I
> think they are irrelevant.) First of all it reads the ZooKeeper state. This
> happens now and then. Then I guess the replica says "Hey, I'm alive, please
> start the recovery process":
> 
> 
> From the replica log, which is trying to recover. For some reason I get
> "read timed out", as if Solr were dead (but it clearly isn't). Maybe I could
> turn on debug logging to view the actual URL for the query?
> Also, I can see the replica telling ZooKeeper it's alive and recovering.
> 
> 
> I left this looping for a while in case it was just a bad synchronization,
> but it never recovers.
> 
> The name of the core "swap" is used because I would like to upload new
> configs to the swap core, and then do an actual swap when the core is up and
> running with new data....

I'm not sure what happens if you swap cores in a SolrCloud environment.
It's possible that this kind of swapping could lead to a very unstable
system.

FYI: Whatever Nabble forum feature you are using to change the font does
not translate over to the mailing list.  The altered text simply gets
deleted.

What you see above is what list subscribers (including Mark and me) can
see without visiting the Nabble website.  The canonical copy of this
discussion is the mailing list and not Nabble.  Many experienced users
read this in a text email program that doesn't have any graphical
browser capability, so even though there is a link at the bottom of your
messages to see the topic on Nabble, it isn't useful for those users.

Thanks,
Shawn


Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Tor Egil <tr...@gmail.com>.
This is from the leader log. (There are other statements in between, but I
think they are irrelevant.) First of all it reads the ZooKeeper state. This
happens now and then. Then I guess the replica says "Hey, I'm alive, please
start the recovery process":


From the replica log, which is trying to recover. For some reason I get
"read timed out", as if Solr were dead (but it clearly isn't). Maybe I could
turn on debug logging to view the actual URL for the query?
Also, I can see the replica telling ZooKeeper it's alive and recovering.


I left this looping for a while in case it was just a bad synchronization,
but it never recovers.

The name of the core "swap" is used because I would like to upload new
configs to the swap core, and then do an actual swap when the core is up and
running with new data....




Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Tor Egil <tr...@gmail.com>.
Mark, just to be sure: can you see the "raw text" formatted text in my
original (and last) post?
It was left out of the text you quoted...




Re: Clusterstate says "state:recovering", but Core says "I see state: null"?

Posted by Mark Miller <ma...@gmail.com>.
What do the cluster state and leader say?

Anything interesting you can pull from the logs?

- Mark

On Aug 14, 2013, at 4:58 AM, Tor Egil <tr...@gmail.com> wrote:

> Setup:
> 3 zk servers
> 3 solr 4.4 servers (1 shard with 2 replicas)
> 
> Every now and then Solr gets trapped recovering
> 
> Clusterstate says:
> 
> 
> Leader says:
> 
> 
> and the restarted replica says:
> 
> I've tried removing the data directory and restarting the replica, but it
> ends up in the same loop. I have to kill the cluster and restart all servers
> to recover from this, which isn't acceptable for the 24/7 uptime I need :-/
> 
> Any clue what's wrong?
> 
> 
> 