Posted to dev@lucene.apache.org by "Steven Bower (JIRA)" <ji...@apache.org> on 2015/05/15 17:52:59 UTC
[jira] [Created] (SOLR-7550) PeerSync fails if a replica returns 500 error
Steven Bower created SOLR-7550:
----------------------------------
Summary: PeerSync fails if a replica returns 500 error
Key: SOLR-7550
URL: https://issues.apache.org/jira/browse/SOLR-7550
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.2, 4.8.1
Environment: linux
Reporter: Steven Bower
Priority: Critical
In a 4-node cluster, we stopped a node and then started it back up. Before the node restarted, an invalid schema change was made, so when the node came back up its core could not load. While the node was in this state, the leader was restarted as well, leaving two nodes in this bad state. When the remaining two nodes attempted to become leader and ran PeerSync, they received a 500 error from the failed-to-start cores and were unable to become leader. Eventually the remaining two nodes ended up in the "recovery_failed" state and the cluster went offline.
Some logs:
{noformat}
2015-05-14 17:03:20.712 INFO ShardLeaderElectionContext [main-EventThread] - Running the leader process for shard shard1
2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] - Checking if I should try and be the leader.
2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] - My last published State was Active, it's okay to be the leader.
2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] - I may be the new leader - try and sync
2015-05-14 17:03:20.720 WARN RecoveryStrategy [main-EventThread] - Stopping recovery for zkNodeName=host-a2:12345_solr_xxxxcore=xxxx
2015-05-14 17:03:23.220 INFO SyncStrategy [main-EventThread] - Sync replicas to http://host-a2:12345/solr/xxxx/
2015-05-14 17:03:23.221 INFO PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr START replicas=[http://host-b1:12345/solr/xxxx/, http://host-a1:12345/solr/xxxx_shard1/] nUpdates=100
2015-05-14 17:03:23.238 INFO PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr Received 96 versions from http://host-b1:12345/solr/xxxx/
2015-05-14 17:03:23.239 INFO PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr Our versions are newer. ourLowThreshold=1501178223728263172 otherHigh=1501178223745040385
2015-05-14 17:03:23.385 WARN PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr exception talking to http://host-a1:12345/solr/xxxx_shard1/, failed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'xxxx_shard1' is not available due to init failure: Could not load conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType "text_split_colon": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'xxxx_shard1' is not available due to init failure: Could not load conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType "some_field_type": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:299)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
...
...
...
{noformat}
The error handling looks brittle: it tolerates connection issues and 503 and 404 errors, but any other error causes leader election to fail in a cluster that has a node in a bad state.
If simply adding support for 500 errors is seen as the best approach, that is a simple fix and I can put up a patch quickly.
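To make the proposal concrete, here is a minimal sketch of the status-code check described above. It is not the actual PeerSync code: the class and method names (PeerErrorTolerance, toleratedToday, toleratedWithProposal) are hypothetical, and connection failures, which are also tolerated today, are left out of the model.

```java
/**
 * Hypothetical illustration only; this is NOT the PeerSync implementation.
 * It models which HTTP status codes from a replica are treated as
 * "replica is down, carry on" rather than "sync failed".
 */
final class PeerErrorTolerance {

    private PeerErrorTolerance() {
    }

    /** Behaviour as described in this report: only 404 and 503 are tolerated. */
    static boolean toleratedToday(int httpStatus) {
        return httpStatus == 404 || httpStatus == 503;
    }

    /** Proposed behaviour: additionally tolerate 500 (e.g. a core that failed to init). */
    static boolean toleratedWithProposal(int httpStatus) {
        return httpStatus == 500 || toleratedToday(httpStatus);
    }

    public static void main(String[] args) {
        // Print the decision for each status code discussed in the report.
        for (int status : new int[] {404, 500, 503}) {
            System.out.println(status
                    + " today=" + toleratedToday(status)
                    + " proposed=" + toleratedWithProposal(status));
        }
    }
}
```

With a check along these lines, a 500 from a core that failed to load would be treated like an unreachable replica instead of failing the sync. Whether 500 should be tolerated unconditionally, or only for specific causes such as a core init failure, is an open question for the patch.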
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)