Posted to solr-user@lucene.apache.org by Kai 'wusel' Siering <wu...@uu.org> on 2017/09/28 23:11:11 UTC

How to recover from failed SPLITSHARD?

Hi,

this is with SolrCloud 6.5.1 on Ubuntu 16.04 LTS and OpenJDK 8, 4 Solr instances in cloud mode, external ZK.

I tried to split my collection's shard1 (500 GB) with SPLITSHARD, and it kind of worked. After more than 8 hours the new shards left the "construction" state and entered "recovery" :( About 12 hours after that, Out of Memory errors with "could not create thread" occurred. Node 10.10.10.162 took over leadership of shard1, but since we still saw errors on searches, I stopped Solr on 10.10.10.161, raised the heap from 24G to 31G and rebooted the system, just in case (it was also a good time to install the latest patches).

10.10.10.161 came back, and shard1, shard1_0 and shard1_1 started recovery. Unfortunately 10.10.10.162, the leader for shard2, which was being split as well, then hit "something": solr.log was no longer being updated and the UI stopped working, so in the end I stopped Solr there as well (it stopped instantly) and rebooted. Now both nodes are running with a 31G Java heap, shard1 and shard2 are in sync, and I am trying to clean up before retrying.

Of shard2, only a shard2_0 without any replicas was left over, and DELETESHARD cleaned it up.

But shard1 had shard1_0 and shard1_1, each with two replicas. DELETESHARD errored out, so I ran DELETEREPLICA on all of them. This worked, but "parts of" shard1_0 and shard1_1 are still there and I cannot delete them:

$ wget -q -O - 'http://10.10.10.162:8983/solr/admin/collections?wt=json&action=CLUSTERSTATUS' | jq
[…]
          "shard1_0": {
            "range": "80000000-bfffffff",
            "state": "recovery_failed",
            "replicas": {}
          },
          "shard1_1": {
            "parent": "shard1",
            "shard_parent_node": "10.10.10.161:8983_solr",
            "range": "c0000000-ffffffff",
            "state": "recovery_failed",
            "shard_parent_zk_session": "98682039611162624",
            "replicas": {}
          }
[…]
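
For a quicker overview than the full CLUSTERSTATUS dump, the shard states can be condensed with a jq filter. This is only a sketch: the function name shard_states is made up for illustration, and it assumes the usual CLUSTERSTATUS layout with shards under cluster.collections.<name>.shards.

```shell
# Hypothetical helper: reads CLUSTERSTATUS JSON on stdin and prints
# one line per shard: name, state, number of replicas (tab-separated).
# $1 = collection name
shard_states() {
  jq -r --arg c "$1" '
    .cluster.collections[$c].shards
    | to_entries[]
    | "\(.key)\t\(.value.state)\t\(.value.replicas | length)"'
}
```

It would be fed from the same request as above, e.g. wget -q -O - 'http://10.10.10.162:8983/solr/admin/collections?wt=json&action=CLUSTERSTATUS' | shard_states collection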


$ wget -O - 'http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection'
--2017-09-29 01:01:16--  http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection
Connecting to 10.10.10.161:8983... connected.
HTTP request sent, awaiting response... 400 Bad Request
2017-09-29 01:01:16 ERROR 400: Bad Request.
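
By default wget throws away the body of an error response, so the output above does not show the reason Solr gives for the 400. A sketch of how the body could be captured, assuming a wget build that supports --content-on-error; the two function names are made up for illustration:

```shell
# Hypothetical helper: build the Collections API URL for DELETESHARD.
# $1 = host:port, $2 = collection, $3 = shard
deleteshard_url() {
  printf 'http://%s/solr/admin/collections?action=DELETESHARD&collection=%s&shard=%s&wt=json\n' \
    "$1" "$2" "$3"
}

# Hypothetical helper: issue the request and print the response body
# even when the server answers with an HTTP error status.
deleteshard_verbose() {
  wget -q -O - --content-on-error "$(deleteshard_url "$@")"
}
```

Usage would be: deleteshard_verbose 10.10.10.161:8983 collection shard1_1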

Any hint on how to fix this is appreciated ;)

Regards,
-kai