Posted to solr-user@lucene.apache.org by Arcadius Ahouansou <ar...@menelic.com> on 2015/09/08 02:46:44 UTC

SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Hello.

In one of our test environments, we have a SolrCloud cluster of 8 SolrCloud
nodes and an ensemble of 5 ZooKeeper nodes.
We have only 2 collections, and all SolrCloud nodes are identical and have
a single replica of each collection.

I noticed that when I shut down one of the Solr nodes and refresh the
SolrCloud admin UI, the Cloud->Graph view immediately shows the node/shards
as Gone/down (in gray), which is what I expected.

Now, when I go through the UI to the Tree view and browse under individual
collections, the file state.json shows all nodes as "active" or up. I
expected this to show "down". This is the main issue here.


I looked into ZK for the state.json file and all nodes are marked as
active in state.json on ZK as well.
So, it seems the overseer is not writing to ZK?


Note that when I use the API
/solr/admin/collections?action=CLUSTERSTATUS, I get the expected result,
i.e. 1 host is down.

When I do
/solr/admin/collections?action=OVERSEERSTATUS
there is no failed operation shown.

For now, we have noticed this issue in one of our test environments.
When I deploy a local cluster on my machine, I cannot reproduce this stale
state.json issue.

Any idea or hint about what could be causing this would be much appreciated.

Thank you.


Arcadius.

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Mark Miller <ma...@gmail.com>.
Perhaps there is something preventing a clean shutdown. Shutdown makes a
best-effort attempt to publish DOWN for all the local cores.

Otherwise, yes, it's a little bit annoying, but the full state is a combination
of the state entry and whether the live node for that replica exists or not.
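
A minimal sketch of that combination, in plain Java with no Solr dependency.
The class name EffectiveState, the method effectiveState, and the node names
are hypothetical and used only for illustration; real code would take the
recorded state from state.json and the live-node set from /live_nodes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EffectiveState {

    // Combine the state recorded in state.json with live-node membership.
    // If the hosting node is not in the live-node set, the recorded state is
    // discarded and the replica is treated as "down".
    public static String effectiveState(String recordedState, String nodeName,
                                        Set<String> liveNodes) {
        return liveNodes.contains(nodeName) ? recordedState : "down";
    }

    public static void main(String[] args) {
        // Hypothetical node names; only host1 still has a live-node entry.
        Set<String> liveNodes = new HashSet<>(Arrays.asList("host1:8983_solr"));

        System.out.println(effectiveState("active", "host1:8983_solr", liveNodes)); // prints "active"
        System.out.println(effectiveState("active", "host2:8983_solr", liveNodes)); // prints "down"
    }
}
```

The second call mirrors the situation in this thread: the entry still says
"active", but the node is gone from the live-node set, so the replica should
be read as down.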

- Mark

On Wed, Sep 9, 2015 at 1:50 AM Arcadius Ahouansou <ar...@menelic.com>
wrote:

> Thank you Tomás for pointing to the JavaDoc
>
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
>
> The Javadoc is quite clear. So this stale state.json is not an issue after
> all.
>
> However, it's very confusing that when a node goes down, state.json may be
> updated for 1 collection while it remains stale in the other collection.
> Also in our case, the node did not crash as per the JavaDoc... it was a
> normal server stop/shut-down.
> We may need to review our shut-down process and see whether things change.
>
> Thank you very much Erick and Tomás for your valuable help... much
> appreciated.
>
> Arcadius.
-- 
- Mark
about.me/markrmiller

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Arcadius Ahouansou <ar...@menelic.com>.
Thank you Tomás for pointing to the JavaDoc
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE

The Javadoc is quite clear. So this stale state.json is not an issue after
all.

However, it's very confusing that when a node goes down, state.json may be
updated for 1 collection while it remains stale in the other collection.
Also in our case, the node did not crash as per the JavaDoc... it was a
normal server stop/shut-down.
We may need to review our shut-down process and see whether things change.

Thank you very much Erick and Tomás for your valuable help... much
appreciated.

Arcadius.


On 8 September 2015 at 18:28, Erick Erickson <er...@gmail.com>
wrote:

> bq: You were probably referring to state.json
>
> yep, I'm never sure whether people are on the old or new ZK versions.
>
> OK, With Tomás' comment, I think it's explained... although confusing.
>
> WDYT?
>



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Erick Erickson <er...@gmail.com>.
bq: You were probably referring to state.json

yep, I'm never sure whether people are on the old or new ZK versions.

OK, With Tomás' comment, I think it's explained... although confusing.

WDYT?


On Tue, Sep 8, 2015 at 10:03 AM, Arcadius Ahouansou
<ar...@menelic.com> wrote:
> Hello Erick.
>
> Yes,
>
> 1> liveNodes has N nodes listed (correctly): Correct, liveNodes is always
> right.
>
> 2> clusterstate.json has N+M nodes listed as "active": clusterstate.json is
> always empty as it's no longer being "used" in 5.3. You were
> probably referring to state.json which is in individual collections. Yes,
> that one reflects the wrong value, i.e. N+M
>
> 3> using the collection API to get CLUSTERSTATUS always returns the correct
> value N
>
> 4> The front-end code in cloud.js displays the right colour when
> nodes go down because it checks for the live node
>
> The problem is only with state.json under certain circumstances.
>
> Thanks.

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Arcadius Ahouansou <ar...@menelic.com>.
Hello Erick.

Yes,

1> liveNodes has N nodes listed (correctly): Correct, liveNodes is always
right.

2> clusterstate.json has N+M nodes listed as "active": clusterstate.json is
always empty as it's no longer being "used" in 5.3. You were
probably referring to state.json which is in individual collections. Yes,
that one reflects the wrong value, i.e. N+M

3> using the collection API to get CLUSTERSTATUS always returns the correct
value N

4> The front-end code in cloud.js displays the right colour when
nodes go down because it checks for the live node

The problem is only with state.json under certain circumstances.

Thanks.

On 8 September 2015 at 17:51, Erick Erickson <er...@gmail.com>
wrote:

> Arcadius:
>
> Hmmm. It may take a while for the cluster state to change, but I'm
> assuming that this state persists for minutes/hours/days.
>
> So to recap: If you dump the entire ZK node from the root, you have
> 1> liveNodes has N nodes listed (correctly)
> 2> clusterstate.json has N+M nodes listed as "active"
>
> Doesn't sound right to me, but I'll have to let people who are deep
> into that code speculate from here.
>
> Best,
> Erick



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
I believe this is expected in the current code. From Replica.State javadoc:


  /**
   * The replica's state. In general, if the node the replica is hosted on is
   * not under {@code /live_nodes} in ZK, the replica's state should be
   * discarded.
   */
  public enum State {

    /**
     * The replica is ready to receive updates and queries.
     * <p>
     * <b>NOTE</b>: when the node the replica is hosted on crashes, the
     * replica's state may remain ACTIVE in ZK. To determine if the replica is
     * truly active, you must also verify that its {@link Replica#getNodeName()
     * node} is under {@code /live_nodes} in ZK (or use
     * {@link ClusterState#liveNodesContain(String)}).
     * </p>
     */
    ACTIVE,
...
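
To make the Javadoc note concrete, here is a rough sketch that lists replicas
still recorded as "active" although their node is missing from the live-node
set. It is plain Java and does not use SolrJ; the class StaleReplicaCheck, the
method staleActive, and the replica and node names are all hypothetical. The
two maps are assumed to hold the "state" and "node_name" values from a
collection's state.json, keyed by replica name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StaleReplicaCheck {

    // Returns the replicas recorded as "active" whose hosting node is not live.
    //   recordedState: replica name -> state string as recorded in state.json
    //   hostNode:      replica name -> node_name as recorded in state.json
    //   liveNodes:     node names currently present under /live_nodes
    public static List<String> staleActive(Map<String, String> recordedState,
                                           Map<String, String> hostNode,
                                           Set<String> liveNodes) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> entry : recordedState.entrySet()) {
            boolean recordedActive = "active".equals(entry.getValue());
            boolean nodeLive = liveNodes.contains(hostNode.get(entry.getKey()));
            if (recordedActive && !nodeLive) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Hypothetical data: two replicas recorded as active, one node gone.
        Map<String, String> recordedState = new LinkedHashMap<>();
        recordedState.put("core_node1", "active");
        recordedState.put("core_node2", "active");

        Map<String, String> hostNode = new LinkedHashMap<>();
        hostNode.put("core_node1", "host1:8983_solr");
        hostNode.put("core_node2", "host2:8983_solr");

        Set<String> liveNodes = new HashSet<>(Arrays.asList("host1:8983_solr"));

        System.out.println(staleActive(recordedState, hostNode, liveNodes)); // prints [core_node2]
    }
}
```

An empty result means the recorded states and the live nodes agree; a
non-empty one corresponds to the "N+M listed as active, N live" situation
described in this thread.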

On Tue, Sep 8, 2015 at 9:51 AM, Erick Erickson <er...@gmail.com>
wrote:

> Arcadius:
>
> Hmmm. It may take a while for the cluster state to change, but I'm
> assuming that this state persists for minutes/hours/days.
>
> So to recap: If you dump the entire ZK node from the root, you have
> 1> liveNodes has N nodes listed (correctly)
> 2> clusterstate.json has N+M nodes listed as "active"
>
> Doesn't sound right to me, but I'll have to let people who are deep
> into that code speculate from here.
>
> Best,
> Erick

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Erick Erickson <er...@gmail.com>.
Arcadius:

Hmmm. It may take a while for the cluster state to change, but I'm
assuming that this state persists for minutes/hours/days.

So to recap: If you dump the entire ZK node from the root, you have
1> liveNodes has N nodes listed (correctly)
2> clusterstate.json has N+M nodes listed as "active"

Doesn't sound right to me, but I'll have to let people who are deep
into that code speculate from here.

Best,
Erick

On Tue, Sep 8, 2015 at 1:13 AM, Arcadius Ahouansou <ar...@menelic.com> wrote:
> On Sep 8, 2015 6:25 AM, "Erick Erickson" <er...@gmail.com> wrote:
>>
>> Perhaps the browser cache? What happens if you, say, use
>> Zookeeper client tools to bring down the cluster state in
>> question? Or perhaps just refresh the admin UI when showing
>> the cluster status....
>>
>
> Hello Erick.
>
> Thank you very much for answering.
> I did use the ZooInspector tool to check the state.json in all 5 ZK nodes
> and they are all out of date and identical to what I get through the tree
> view in the Solr admin UI.
>
> Looking at the source code of cloud.js, which correctly displays nodes as
> "gone" in the graph view, it calls the endpoint /zookeeper?wt=json and
> relies on the live nodes to mark a node as down instead of state.json.
>
> Thanks.
>
>> Shot in the dark,
>> Erick
>>
>> On Mon, Sep 7, 2015 at 6:09 PM, Arcadius Ahouansou <ar...@menelic.com>
> wrote:
>> > We are running the latest Solr 5.3.0
>> >
>> > Thanks.

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Arcadius Ahouansou <ar...@menelic.com>.
On Sep 8, 2015 6:25 AM, "Erick Erickson" <er...@gmail.com> wrote:
>
> Perhaps the browser cache? What happens if you, say, use
> Zookeeper client tools to bring down the cluster state in
> question? Or perhaps just refresh the admin UI when showing
> the cluster status....
>

Hello Erick.

Thank you very much for answering.
I did use the ZooInspector tool to check the state.json in all 5 ZK nodes
and they are all out of date and identical to what I get through the tree
view in the Solr admin UI.

Looking at the source code of cloud.js, which correctly displays nodes as
"gone" in the graph view, it calls the endpoint /zookeeper?wt=json and
relies on the live nodes to mark a node as down instead of state.json.

Thanks.

> Shot in the dark,
> Erick
>
> On Mon, Sep 7, 2015 at 6:09 PM, Arcadius Ahouansou <ar...@menelic.com>
wrote:
> > We are running the latest Solr 5.3.0
> >
> > Thanks.

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Erick Erickson <er...@gmail.com>.
Perhaps the browser cache? What happens if you, say, use
Zookeeper client tools to bring down the cluster state in
question? Or perhaps just refresh the admin UI when showing
the cluster status....

Shot in the dark,
Erick

On Mon, Sep 7, 2015 at 6:09 PM, Arcadius Ahouansou <ar...@menelic.com> wrote:
> We are running the latest Solr 5.3.0
>
> Thanks.

Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

Posted by Arcadius Ahouansou <ar...@menelic.com>.
We are running the latest Solr 5.3.0

Thanks.