Posted to solr-user@lucene.apache.org by David Santamauro <da...@gmail.com> on 2014/01/31 16:13:45 UTC

shard1 gone missing ...

Hi,

I have a strange situation. I created a collection with 4 nodes 
(separate servers, numShards=4) and then proceeded to index data ... all 
seemed well until this morning, when I had to reboot one of 
the nodes.

After reboot, the node I rebooted went into recovery mode! This is 
completely illogical as there is 1 shard per node (no replicas).

What could have possibly happened to 1) trigger a recovery and 2) make 
the node think it has a replica to recover from?

The graph on the Solr admin page shows that shard1 disappeared and that 
the rebooted server now appears in a recovering state under the server 
that is home to shard2.

I then looked at clusterstate.json and it confirms that shard1 is 
completely missing and shard2 now has a replica. ... I'm baffled, 
confused, dismayed.

Versions:
Solr 4.4 (4 nodes running in a Tomcat container)
zookeeper-3.4.5 (5-node ensemble)

Oh, and I'm assuming shard1 is completely corrupt.

I'd really appreciate any insight.

David

PS I have a copy of all the shards backed up. Is there a way to possibly 
rsync shard1 back into place and "fix" clusterstate.json manually?

Re: shard1 gone missing ... (upgrade to 4.6.1)

Posted by David Santamauro <da...@gmail.com>.
Mark, I am testing the upgrade and indexing gives me this error:

914379 [http-apr-8080-exec-4] ERROR org.apache.solr.core.SolrCore  ? org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe0 (at char #1, byte #-1)

... and a bunch of these

request: http://xx.xx.xx.xx/col1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxx.xx.xx.xx%3A8080%2Fcol1%2F&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
1581335 [updateExecutor-1-thread-7] ERROR org.apache.solr.update.StreamingSolrServers  ? error
org.apache.solr.common.SolrException: Bad Request


Nothing else in the process chain has changed. Does this have anything 
to do with the deprecated warnings:

WARN  org.apache.solr.handler.UpdateRequestHandler  ? Using deprecated class: XmlUpdateRequestHandler -- replace with UpdateRequestHandler
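
For reference, the swap that warning is asking for would look roughly like this in solrconfig.xml (assuming the /update handler is declared there; this is only the pattern, not my actual config):

  <!-- before -->
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

  <!-- after, per the warning -->
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />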

thanks

David


On 01/31/2014 11:22 AM, Mark Miller wrote:
>
>
> On Jan 31, 2014, at 11:15 AM, David Santamauro <da...@gmail.com> wrote:
>
>> On 01/31/2014 10:22 AM, Mark Miller wrote:
>>
>>> I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. You can follow the progress in the CHANGES file we update for each release.
>>
>> Can I do a drop-in replacement of 4.4.0 ?
>>
>>
>
> It should be a drop-in replacement. For some that use deep APIs in plugins, you might have to make a couple of small changes to your code.
>
> Always best to do a test with a copy of your index, but for most, it should be a drop-in replacement.
>
> - Mark
>
> http://about.me/markrmiller
>


Re: shard1 gone missing ...

Posted by Mark Miller <ma...@gmail.com>.

On Jan 31, 2014, at 11:15 AM, David Santamauro <da...@gmail.com> wrote:

> On 01/31/2014 10:22 AM, Mark Miller wrote:
> 
>> I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. You can follow the progress in the CHANGES file we update for each release.
> 
> Can I do a drop-in replacement of 4.4.0 ?
> 
> 

It should be a drop-in replacement. For some that use deep APIs in plugins, you might have to make a couple of small changes to your code.

Always best to do a test with a copy of your index, but for most, it should be a drop-in replacement.

- Mark

http://about.me/markrmiller

Re: shard1 gone missing ...

Posted by David Santamauro <da...@gmail.com>.
On 01/31/2014 10:22 AM, Mark Miller wrote:

> I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. You can follow the progress in the CHANGES file we update for each release.

Can I do a drop-in replacement of 4.4.0 ?



Re: shard1 gone missing ...

Posted by Mark Miller <ma...@gmail.com>.
Would probably need to see some logs to have an idea of what happened.

Would also be nice to see the after state of zk in a text dump.

You should be able to fix it. As long as you have the index on disk, just make sure it is where it is expected and manually update clusterstate.json. It would be good to take a look at the logs first, though, and see if they tell us anything.
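
Roughly, something like this (the host, backup location, and data path below are placeholders, not taken from your setup):

  # text dump of the current cluster state from one of the ZooKeeper nodes
  ./zkCli.sh -server xx.xx.xx.xx:2181 get /clusterstate.json

  # with the node stopped, copy the backed-up shard1 index back to wherever
  # that core's dataDir points
  rsync -av /backup/shard1/data/ /path/to/core/data/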

I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. You can follow the progress in the CHANGES file we update for each release.

I wrote a little about the 4.6.1 as it relates to SolrCloud here: https://plus.google.com/+MarkMillerMan/posts/CigxUPN4hbA

- Mark

http://about.me/markrmiller

On Jan 31, 2014, at 10:13 AM, David Santamauro <da...@gmail.com> wrote:

> 
> Hi,
> 
> I have a strange situation. I created a collection with 4 nodes (separate servers, numShards=4) and then proceeded to index data ... all seemed well until this morning, when I had to reboot one of the nodes.
> 
> After reboot, the node I rebooted went into recovery mode! This is completely illogical as there is 1 shard per node (no replicas).
> 
> What could have possibly happened to 1) trigger a recovery and 2) make the node think it has a replica to recover from?
> 
> The graph on the Solr admin page shows that shard1 disappeared and that the rebooted server now appears in a recovering state under the server that is home to shard2.
> 
> I then looked at clusterstate.json and it confirms that shard1 is completely missing and shard2 now has a replica. ... I'm baffled, confused, dismayed.
> 
> Versions:
> Solr 4.4 (4 nodes running in a Tomcat container)
> zookeeper-3.4.5 (5-node ensemble)
> 
> Oh, and I'm assuming shard1 is completely corrupt.
> 
> I'd really appreciate any insight.
> 
> David
> 
> PS I have a copy of all the shards backed up. Is there a way to possibly rsync shard1 back into place and "fix" clusterstate.json manually?


Re: shard1 gone missing ...

Posted by Mark Miller <ma...@gmail.com>.
<solr persistent="false"

You have to set that to true. When a core starts up, it’s assigned a coreNodeName. That is persisted in solr.xml.

This will happen every time you restart with persistent=false.

As far as fixing it goes: yes, you simply want to recreate shard1 and remove the replica info.

You would also need to add coreNodeName="node1:8080_x_col1" to the <core> tag.

That is how it will match up in ZK and not create a new replica.
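
Against the solr.xml you posted, the relevant pieces would end up looking roughly like this (the coreNodeName value shown is just the naming pattern from your clusterstate; double-check it against what ZK actually has for that node):

<solr persistent="true" zkHost="xx.xx.xx.xx:2181,...">
  <cores adminPath="/admin/cores" ... >
    <core name="c1"
          collection="col1"
          coreNodeName="node1:8080_x_col1"
          instanceDir="/dir/x"
          config="solrconfig.xml"
          dataDir="/dir/x/data/y"
    />
  </cores>
</solr>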

- Mark

http://about.me/markrmiller

On Jan 31, 2014, at 11:11 AM, David Santamauro <da...@gmail.com> wrote:

> 
> There is nothing of note in the ZooKeeper logs. My solr.xml (sanitized for privacy) is identical on all 4 nodes.
> 
> <solr persistent="false" zkHost="xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181">
>  <cores adminPath="/admin/cores"
>         host="${host:}"
>         hostPort="8080"
>         hostContext="${hostContext:/x}"
>         zkClientTimeout="${zkClientTimeout:15000}"
>         defaultCoreName="c1"
>         shareSchema="true" >
> 
>     <core name="c1"
>           collection="col1"
>           instanceDir="/dir/x"
>           config="solrconfig.xml"
>           dataDir="/dir/x/data/y"
>     />
>  </cores>
> </solr>
> 
> I don't specify coreNodeName or a genericCoreNodeNames default value ... should I?
> 
> The Tomcat log is basically just a replay of what happened.
> 
> 16443 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.core.CoreContainer  ? registering core: ...
> 
> # this is, I think, what you are talking about above with the new coreNodeName
> 16444 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.cloud.ZkController  ? Register replica - core:c1 address:http://xx.xx.xx.xx:8080/x collection: col1 shard:shard4
> 
> 16453 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.client.solrj.impl.HttpClientUtil  ? Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
> 
> 16505 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.cloud.ZkController  ? We are http://node1:8080/x and leader is http://node2:8080/x
> 
> Then it just starts replicating.
> 
> If there is anything specific I should be grokking for in these logs, let me know.
> 
> Also, given that my clusterstate.json now looks like this:
> 
> assume:
>  node1=xx.xx.xx.1
>  node2=xx.xx.xx.2
> 
> "shard4":{
>        "range":"20000000-3fffffff",
>        "state":"active",
>        "replicas":{
>          "node2:8080_x_col1":{
>            "state":"active",
>            "core":"c1",
>            "node_name":"node2:8080_x",
>            "base_url":"http://node2:8080/x",
>            "leader":"true"},
> **** this should not be a replica of shard2 but its own shard1
>          "node1:8080_x_col1":{
>            "state":"recovering",
>            "core":"c1",
>            "node_name":"node1:8080_x",
>            "base_url":"http://node1:8080/x"}},
> 
> Can I just recreate shard1
> 
> "shard1":{
> ***** NOTE: range is assumed based on ranges of other nodes
>        "range":"0-1fffffff",
>        "state":"active",
>        "replicas":{
>          "node1:8080_x_col1":{
>            "state":"active",
>            "core":"c1",
>            "node_name":"node1:8080_x",
>            "base_url":"http://node1:8080/x",
>            "leader":"true"}},
> 
> ... and then remove the replica ..
> "shard4":{
>        "range":"20000000-3fffffff",
>        "state":"active",
>        "replicas":{
>          "node2:8080_x_col1":{
>            "state":"active",
>            "core":"c1",
>            "node_name":"node2:8080_x",
>            "base_url":"http://node2:8080/x",
>            "leader":"true"}},
> 
> That would be great...
> 
> thanks for your help
> 
> David
> 


Re: shard1 gone missing ...

Posted by David Santamauro <da...@gmail.com>.
On 01/31/2014 10:35 AM, Mark Miller wrote:
>
>
>
> On Jan 31, 2014, at 10:31 AM, Mark Miller <ma...@gmail.com> wrote:
>
>> Seems unlikely by the way. Sounds like what probably happened is that for some reason it thought when you restarted the shard that you were creating it with numShards=2 instead of 1.
>
> No, that’s not right. Sorry.
>
> It must have got assigned a new core node name. numShards would still have to be seen as 1 for it to try and be a replica. Brain lapse.
>
> Are you using a custom coreNodeName or taking the default? Can you post your solr.xml so we can see your genericCoreNodeNames and coreNodeName settings?
>
> One possibility is that you got assigned a coreNodeName, but for some reason it was not persisted in solr.xml.
>
> - Mark
>
> http://about.me/markrmiller
>

There is nothing of note in the ZooKeeper logs. My solr.xml (sanitized 
for privacy) is identical on all 4 nodes.

<solr persistent="false" 
zkHost="xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181">
   <cores adminPath="/admin/cores"
          host="${host:}"
          hostPort="8080"
          hostContext="${hostContext:/x}"
          zkClientTimeout="${zkClientTimeout:15000}"
          defaultCoreName="c1"
          shareSchema="true" >

      <core name="c1"
            collection="col1"
            instanceDir="/dir/x"
            config="solrconfig.xml"
            dataDir="/dir/x/data/y"
      />
   </cores>
</solr>

I don't specify coreNodeName or a genericCoreNodeNames default value 
... should I?

The Tomcat log is basically just a replay of what happened.

16443 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.core.CoreContainer  ? registering core: ...

# this is, I think, what you are talking about above with the new coreNodeName
16444 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.cloud.ZkController  ? Register replica - core:c1 address:http://xx.xx.xx.xx:8080/x collection: col1 shard:shard4

16453 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.client.solrj.impl.HttpClientUtil  ? Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false

16505 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.cloud.ZkController  ? We are http://node1:8080/x and leader is http://node2:8080/x

Then it just starts replicating.

If there is anything specific I should be grokking for in these logs, let 
me know.

Also, given that my clusterstate.json now looks like this:

assume:
   node1=xx.xx.xx.1
   node2=xx.xx.xx.2

"shard4":{
         "range":"20000000-3fffffff",
         "state":"active",
         "replicas":{
           "node2:8080_x_col1":{
             "state":"active",
             "core":"c1",
             "node_name":"node2:8080_x",
             "base_url":"http://node2:8080/x",
             "leader":"true"},
**** this should not be a replica of shard2 but its own shard1
           "node1:8080_x_col1":{
             "state":"recovering",
             "core":"c1",
             "node_name":"node1:8080_x",
             "base_url":"http://node1:8080/x"}},

Can I just recreate shard1

"shard1":{
***** NOTE: range is assumed based on ranges of other nodes
         "range":"0-1fffffff",
         "state":"active",
         "replicas":{
           "node1:8080_x_col1":{
             "state":"active",
             "core":"c1",
             "node_name":"node1:8080_x",
             "base_url":"http://node1:8080/x",
             "leader":"true"}},

... and then remove the replica ..
"shard4":{
         "range":"20000000-3fffffff",
         "state":"active",
         "replicas":{
           "node2:8080_x_col1":{
             "state":"active",
             "core":"c1",
             "node_name":"node2:8080_x",
             "base_url":"http://node2:8080/x",
             "leader":"true"}},

That would be great...
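
If that's the right approach, I assume I could push the hand-edited clusterstate.json back with the plain ZooKeeper client, roughly like this (host is a placeholder; quoting / one-lining of the JSON omitted for brevity):

  ./zkCli.sh -server xx.xx.xx.xx:2181
  [zk: ...] get /clusterstate.json
  [zk: ...] set /clusterstate.json '<corrected one-line JSON>'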

thanks for your help

David


Re: shard1 gone missing ...

Posted by Mark Miller <ma...@gmail.com>.


On Jan 31, 2014, at 10:31 AM, Mark Miller <ma...@gmail.com> wrote:

> Seems unlikely by the way. Sounds like what probably happened is that for some reason it thought when you restarted the shard that you were creating it with numShards=2 instead of 1.

No, that’s not right. Sorry.

It must have got assigned a new core node name. numShards would still have to be seen as 1 for it to try and be a replica. Brain lapse.

Are you using a custom coreNodeName or taking the default? Can you post your solr.xml so we can see your genericCoreNodeNames and coreNodeName settings?

One possibility is that you got assigned a coreNodeName, but for some reason it was not persisted in solr.xml.

- Mark

http://about.me/markrmiller

Re: shard1 gone missing ...

Posted by Mark Miller <ma...@gmail.com>.

On Jan 31, 2014, at 10:13 AM, David Santamauro <da...@gmail.com> wrote:

> Oh, and I'm assuming shard1 is completely corrupt.

Seems unlikely by the way. Sounds like what probably happened is that for some reason it thought when you restarted the shard that you were creating it with numShards=2 instead of 1.

In that case, a new entry in ZK would be created. The first entry *could* still look like it was active (we did not always try to publish a DOWN state on a clean shutdown), and the node it is on would appear live (because it is).

In this case, the node would try to recover from itself.

I actually wouldn't expect that to easily corrupt the index, though. It's easy enough to check: simply try starting a Solr instance against it and take a look.
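
Another low-risk way to look at it, if you'd rather not point a running Solr at it first, is Lucene's CheckIndex tool; the jar and index paths below are just examples for a 4.4 install:

  java -cp lucene-core-4.4.0.jar org.apache.lucene.index.CheckIndex /path/to/shard1/data/index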

- Mark

http://about.me/markrmiller