Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2022/03/08 22:02:00 UTC

[jira] [Comment Edited] (SOLR-16089) DOWN replica causes missing data on Cloud>Nodes admin ui screen for unrelated nodes

    [ https://issues.apache.org/jira/browse/SOLR-16089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503193#comment-17503193 ] 

Chris M. Hostetter edited comment on SOLR-16089 at 3/8/22, 10:01 PM:
---------------------------------------------------------------------

Steps to reproduce...

1. Create the example cloud cluster with some data and shut down "node2"...
{noformat}
./bin/solr -e cloud -noprompt
./bin/post -c gettingstarted example/exampledocs/books.csv
./bin/solr stop -p 7574
{noformat}
(At this point, if you load the admin UI, the Cloud>Nodes screen will function as designed – it will recognize that some of the replicas in the CLUSTERSTATUS are on nodes that are not alive, and that node will be highlighted in RED w/o causing any problems viewing the data from the other node)

2. Pick a replica on node2 and jack it up so that even when the node is restarted, the replica can't even start recovery
{noformat}
rm -rf example/cloud/node2/solr/gettingstarted_shard2_*/data/*
chmod a-w example/cloud/node2/solr/gettingstarted_shard2_*/data/
./bin/solr start -cloud -p 7574 -s example/cloud/node2/solr -z localhost:9983
{noformat}
 * Now the cluster status will say we have a down replica on a live node
 * The metrics API response will not list the down core at all
 * The absence of data from the DOWN replica (may) cause the Disk Usage column to be completely empty for both nodes, depending on what order the nodes were returned in by the metrics API
 ** see attached screenshot


was (Author: hossman):
Steps to reproduce...

1. Create the example cloud cluster with some data and shut down "node2"...

{noformat}
./bin/solr -e cloud -noprompt
./bin/post -c gettingstarted example/exampledocs/books.csv
./bin/solr stop -p 7574
{noformat}

(At this point, if you load the admin UI, the Cloud>Nodes screen will function as designed -- it will recognize that some of the replicas in the CLUSTERSTATUS are on nodes that are not alive, and that node will be highlighted in RED w/o causing any problems viewing the data from the other node)

2. Pick a replica on node2 and jack it up so that even when the node is restarted, the replica can't even start recovery

{noformat}
rm -rf example/cloud/node2/solr/gettingstarted_shard2_*/data/*
chmod a-w example/cloud/node2/solr/gettingstarted_shard2_*/data/
./bin/solr start -cloud -p 7574 -s example/cloud/node2/solr -z localhost:9983
{noformat}

* Now the cluster status will say we have a down replica on a live node
* The metrics API response will not list the down core at all
* The absence of data from the DOWN node (may) cause the Disk Usage column to be completely empty for both nodes, depending on what order the nodes were returned in by the metrics API
** see attached screenshot




> DOWN replica causes missing data on Cloud>Nodes admin ui screen for unrelated nodes
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-16089
>                 URL: https://issues.apache.org/jira/browse/SOLR-16089
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Admin UI
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-16089.screenshot.png
>
>
> If a node is in {{/live_nodes}}, but a replica hosted on that node is {{DOWN}} (or missing from the Metrics API response section for that node for any other reason), this breaks a brittle assumption in the {{cloud.js}} logic that generates the data structure used to power the {{/solr/#/~cloud?view=nodes}} screen.
> The current assumption is that _any_ replica found in the {{CLUSTERSTATUS}} response, hosted on a live_node, will be found in the Metrics API response -- when this is not true, the JavaScript throws a {{TypeError}} while looping over the Metrics API response, leaving the data structure it was building incomplete.
> This means that some or all of the nodes visible on the resulting admin UI screen will be missing data from the metrics API (most notably in the Disk Usage column). Whether a given node is affected depends on whether it came "after" the node hosting the problematic replica in the (effectively) random order in which the metrics API returns node details.
> There is nothing obvious in the UI to indicate that a particular node/replica is having a problem -- making it particularly hard to identify why columns like Disk Usage are "blank".
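
To illustrate the failure mode described above, here is a minimal sketch. It is not the actual {{cloud.js}} code: the data shapes, core names, and function names ({{replicasByNode}}, {{metricsByNode}}, {{buildBrittle}}, {{buildTolerant}}) are all hypothetical simplifications. It assumes the metrics data arrives keyed by node, with the node hosting the DOWN replica listed first. The brittle version stops at the first missing core and loses every node from that point on; the tolerant version skips the missing core and records it.

```javascript
// Hypothetical, simplified data; NOT the real structures used by cloud.js.
// Replicas that CLUSTERSTATUS reports for each live node.
const replicasByNode = {
  "localhost:8983_solr": ["shard1_replica_n1", "shard2_replica_n4"],
  "localhost:7574_solr": ["shard1_replica_n2", "shard2_replica_n6"],
};

// Per-core metrics keyed by node. The DOWN core (shard2_replica_n6) is
// absent, and its node happens to be returned first.
const metricsByNode = {
  "localhost:7574_solr": {
    shard1_replica_n2: { sizeInBytes: 1000 },
  },
  "localhost:8983_solr": {
    shard1_replica_n1: { sizeInBytes: 1000 },
    shard2_replica_n4: { sizeInBytes: 2000 },
  },
};

// Brittle: assumes every replica has a metrics entry. The try/catch only
// stands in for the point where an uncaught error would abort the loop.
function buildBrittle(replicas, metrics) {
  const nodes = {};
  let error = null;
  try {
    for (const node of Object.keys(metrics)) {
      let total = 0;
      for (const core of replicas[node]) {
        // Throws TypeError when metrics[node][core] is undefined.
        total += metrics[node][core].sizeInBytes;
      }
      nodes[node] = { diskUsage: total };
    }
  } catch (e) {
    error = e;
  }
  return { nodes, error };
}

// Tolerant: skips cores with no metrics and records them, so the
// remaining cores and nodes still get their data.
function buildTolerant(replicas, metrics) {
  const nodes = {};
  for (const node of Object.keys(metrics)) {
    let total = 0;
    const missing = [];
    for (const core of replicas[node] || []) {
      const coreMetrics = metrics[node][core];
      if (coreMetrics === undefined) {
        missing.push(core);
        continue;
      }
      total += coreMetrics.sizeInBytes;
    }
    nodes[node] = { diskUsage: total, missing };
  }
  return nodes;
}

const brittle = buildBrittle(replicasByNode, metricsByNode);
const tolerant = buildTolerant(replicasByNode, metricsByNode);
console.log(Object.keys(brittle.nodes).length, brittle.error instanceof TypeError);
console.log(JSON.stringify(tolerant));
```

With this ordering the brittle version ends up with no node data at all, which matches the "Disk Usage empty for both nodes" symptom. Keeping a {{missing}} list per node is one possible way to let the UI flag the affected replica instead of silently showing blank columns.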


