Posted to solr-user@lucene.apache.org by Annette Newton <an...@servicetick.com> on 2012/12/04 13:37:25 UTC

Replication error and Shard Inconsistencies..

Hi all,

 

I have a quite weird issue with SolrCloud. I have a 4 shard, 2 replica
setup. Yesterday one of the nodes lost communication with the cluster,
which resulted in it trying to run replication; the replication failed,
and that has left me with a shard (Shard 4) that has 2,833,940 documents
on the leader and 409,837 on the follower - obviously a big discrepancy,
and it means queries return differing results depending on which of
these nodes serves the data. There is no indication of a problem on the
admin site other than the big discrepancy in the number of documents.
They are all marked as active etc.

 

So I thought I would force replication to happen again by stopping and
starting Solr (probably the wrong thing to do), but this resulted in no
change. So I turned off that node and replaced it with a new one. In
ZooKeeper, live nodes doesn't list that machine, but it is still shown
as active in ClusterState.json; I have attached images showing this.
This means the new node hasn't replaced the old node but is now a replica
on Shard 1! Also, that node doesn't appear to have replicated Shard 1's
data anyway - it never got marked as replicating or anything.
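
A quick way to compare the two views is ZooKeeper's own CLI - a minimal
sketch, assuming the stock zkCli.sh and an illustrative ensemble address
(prefix the paths with your chroot, e.g. /solr, if you use one):

  zkCli.sh -server zk1:2181
  ls /live_nodes           # ephemeral znodes: only nodes with a live session appear
  get /clusterstate.json   # persistent state: a dead node can linger here as "active"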

 

How do I clear the ZooKeeper state without taking down the entire
SolrCloud setup?  How do I force a node to replicate from the others in
the shard?
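
On 4.0, one blunt way to force a copy was to point the lagging replica's
replication handler at the leader with fetchindex - a sketch with
hypothetical host and core names, assuming the handler is mapped at
/replication:

  curl "http://replica-host:8983/solr/collection1/replication?command=fetchindex&masterUrl=http://leader-host:8983/solr/collection1/replication"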

 

Thanks in advance.

 

Annette Newton

 

 


RE: FW: Replication error and Shard Inconsistencies..

Posted by Annette Newton <an...@servicetick.com>.
Hi,

The file descriptor count is always quite low. At the moment, after heavy usage for a few days, file descriptor counts are between 100 and 150, and I don't have any errors in the logs. My worry at the moment is all the CLOSE_WAIT connections I am seeing. This is particularly true on the boxes marked as leaders; the replicas have a few, but nowhere near as many.
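
A quick way to tally both from the shell - a sketch assuming a Linux box;
the pgrep pattern is illustrative and assumes a single Jetty start.jar
process:

  # CLOSE_WAIT sockets grouped by remote peer
  netstat -tan | awk '$6 == "CLOSE_WAIT" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

  # open file descriptors held by the Solr JVM
  ls /proc/$(pgrep -f start.jar)/fd | wc -l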

Thanks for the response.



Re: FW: Replication error and Shard Inconsistencies..

Posted by Andre Bois-Crettez <an...@kelkoo.com>.
Not sure, but maybe you are running out of file descriptors?
On each Solr instance, look at the "Dashboard" admin page; there is a
bar with "File Descriptor Count".

However, if that were the case, I would expect to see lots of errors in
the Solr logs...
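
To cross-check the dashboard figure from the shell - a sketch assuming
Linux, with 1234 standing in for the Solr JVM's PID:

  grep "open files" /proc/1234/limits   # the per-process limit
  ls /proc/1234/fd | wc -l              # descriptors currently open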

André



FW: Replication error and Shard Inconsistencies..

Posted by Annette Newton <an...@servicetick.com>.
Sorry to bombard you - final update of the day...

One thing I have noticed is that we have a lot of connections between
the Solr boxes sitting in CLOSE_WAIT, and they hang around for ages.


FW: Replication error and Shard Inconsistencies..

Posted by Annette Newton <an...@servicetick.com>.
Update:

I did a full restart of the SolrCloud setup: stopped all the instances,
cleared down ZooKeeper, and started them up individually. I then removed
the index from one of the replicas, restarted Solr, and it replicated OK.
So I'm wondering whether this is something that develops over a period of time.
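
Roughly that sequence as shell steps, for reference - the index path and
the way Solr is stopped and started here are illustrative:

  # on the affected replica:
  # 1. stop the Solr instance (however it is managed on the box)
  rm -rf /var/solr/collection1/data/index   # illustrative path - drop the stale index
  # 2. start it again; the empty core then pulls a full copy from the leader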

Also, just to let you know, I changed the schema a couple of times and
reloaded the cores on all instances prior to the problem. I don't know
if this could have contributed to the problem.

Thanks.


RE: Replication error and Shard Inconsistencies..

Posted by Annette Newton <an...@servicetick.com>.
Hi Mark,

Thanks so much for the reply.

We are using the release version of 4.0.

It's very strange: replication appears to be underway, but no files are
being copied across. I have attached the log from the new node that I
tried to bring up, along with the schema and config we are using.
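
One thing worth capturing alongside the logs: the replication handler
reports its own view of generation, version, and transfer progress -
host and core names here are hypothetical:

  curl "http://replica-host:8983/solr/collection1/replication?command=details&wt=json"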

I think it's probably something weird with our config, so I'm going to play
around with it today.  If I make any progress I'll send an update.

Thanks again.



Re: Replication error and Shard Inconsistencies..

Posted by Mark Miller <ma...@gmail.com>.
Hey Annette, 

Are you using Solr 4.0 final, or a build from the 4.x or 5.x branch?
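
The exact build a node runs can be read off the instance itself - the
URL is illustrative:

  curl "http://solr-host:8983/solr/admin/info/system?wt=json"   # reports solr-spec-version / solr-impl-version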

Do you have the logs for when the replica tried to catch up to the leader?

Stopping and starting the node is actually a fine thing to do. Perhaps you can try it again and capture the logs.

If a node is not listed as live but is in the clusterstate, that is fine; it shouldn't be consulted. To remove it, you either have to unload it with the core admin API, or you can manually delete its registered state under the node-states node that the Overseer looks at.
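
Two sketches of the above - the core name, ZooKeeper address, and
especially the znode path are hypothetical, so list the tree before
deleting anything:

  # CoreAdmin unload of the stale registration (core name illustrative)
  curl "http://solr-host:8983/solr/admin/cores?action=UNLOAD&core=collection1_shard4_replica2"

  # or inspect ZooKeeper by hand and remove the node's registered state
  zkCli.sh -server zk1:2181
  ls /                                      # find where Solr keeps its state (chroot may apply)
  delete /node_states/solr-host:8983_solr   # path is a guess - verify with ls first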

Also, it would be useful to see the logs of the new node coming up… there should be info about what happens when it tries to replicate.

It almost sounds like replication is just not working for your setup at all and that you have to tweak some configuration. You shouldn't see these nodes as active then, though - so we should get to the bottom of this.

- Mark
