Posted to dev@lucene.apache.org by "Steven Bower (JIRA)" <ji...@apache.org> on 2015/05/15 17:52:59 UTC
[jira] [Created] (SOLR-7550) PeerSync fails if a replica returns 500 error

Steven Bower created SOLR-7550:
----------------------------------

             Summary: PeerSync fails if a replica returns 500 error
                 Key: SOLR-7550
                 URL: https://issues.apache.org/jira/browse/SOLR-7550
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.10.2, 4.8.1
         Environment: linux
            Reporter: Steven Bower
            Priority: Critical


In a 4-node cluster, we stopped one node and then started it back up. Before the node came back up, an invalid schema change had been made, so when it restarted its core could not load. While the node was in this state the leader was restarted as well, leaving two nodes in this bad state. When the remaining two nodes attempted to become leader and ran PeerSync, they got a 500 error back from the failed-to-start cores and were not able to become leader. Eventually both remaining nodes ended up in the "recovery_failed" state and the cluster went offline.

Some logs:

{noformat}
2015-05-14 17:03:20.712 INFO  ShardLeaderElectionContext [main-EventThread] - Running the leader process for shard shard1
2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - Checking if I should try and be the leader.
2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - My last published State was Active, it's okay to be the leader.
2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - I may be the new leader - try and sync
2015-05-14 17:03:20.720 WARN  RecoveryStrategy [main-EventThread] - Stopping recovery for zkNodeName=host-a2:12345_solr_xxxxcore=xxxx
2015-05-14 17:03:23.220 INFO  SyncStrategy [main-EventThread] - Sync replicas to http://host-a2:12345/solr/xxxx/
2015-05-14 17:03:23.221 INFO  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr START replicas=[http://host-b1:12345/solr/xxxx/, http://host-a1:12345/solr/xxxx_shard1/] nUpdates=100
2015-05-14 17:03:23.238 INFO  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr  Received 96 versions from http://host-b1:12345/solr/xxxx/
2015-05-14 17:03:23.239 INFO  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr  Our versions are newer. ourLowThreshold=1501178223728263172 otherHigh=1501178223745040385
2015-05-14 17:03:23.385 WARN  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr  exception talking to http://host-a1:12345/solr/xxxx_shard1/, failed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'xxxx_shard1' is not available due to init failure: Could not load conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType "text_split_colon": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'xxxx_shard1' is not available due to init failure: Could not load conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType "some_field_type": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml
	at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:299)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  ...
  ...
  ...
{noformat}

It looks as though the error handling is a bit brittle: it tolerates connection issues and 503 and 404 errors, but any other error from a node in a bad state causes leader election to fail for the whole cluster.

If simply adding support for 500 errors is seen as the best approach, that is a simple fix and I can put up a patch quickly.
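
To illustrate the kind of change being proposed, here is a minimal, self-contained sketch. It is not the actual PeerSync code: the class name PeerFailurePolicy, the method countsAsUnreachable, and the convention of passing -1 when no HTTP response was received are all hypothetical, and treating only java.net.ConnectException as a "connection issue" is an assumption made to keep the example small. The idea is simply that a 500 from a replica whose core failed to load would be classed with the failures that are already tolerated.

```java
import java.net.ConnectException;

/**
 * Hypothetical helper (NOT Solr's real PeerSync code): decides whether a
 * failed request to a replica should be treated as "replica unavailable"
 * rather than as a failed sync.
 */
final class PeerFailurePolicy {

    private PeerFailurePolicy() {}

    /**
     * @param error      exception raised while talking to the replica, or null
     * @param httpStatus HTTP status returned by the replica, or -1 if there was
     *                   no HTTP response (assumed convention for this sketch)
     */
    static boolean countsAsUnreachable(Throwable error, int httpStatus) {
        // Connection-level failure anywhere in the cause chain.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectException) {
                return true;
            }
        }
        // 404 and 503 are the statuses described above as already tolerated;
        // 500 is the proposed addition for cores that failed to initialize.
        return httpStatus == 404 || httpStatus == 500 || httpStatus == 503;
    }

    public static void main(String[] args) {
        System.out.println(countsAsUnreachable(null, 500));
        System.out.println(countsAsUnreachable(null, 400));
        System.out.println(countsAsUnreachable(
                new RuntimeException(new ConnectException("refused")), -1));
    }
}
```

Whether every 500 should be tolerated, or only those caused by a core init failure, is a design question this sketch does not answer.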



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
