You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Varun Thacker (JIRA)" <ji...@apache.org> on 2018/04/30 02:47:00 UTC

[jira] [Comment Edited] (SOLR-12291) OverseerCollectionMessageHandler sliceCmd assumes only one replica exists on each node

    [ https://issues.apache.org/jira/browse/SOLR-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458279#comment-16458279 ] 

Varun Thacker edited comment on SOLR-12291 at 4/30/18 2:46 AM:
---------------------------------------------------------------

Here's a simple way to reproduce this:

 

Start Solr with
{code:java}
bin/solr start -e gettingstarted -noprompt{code}
Call collection reload with async
{code:java}
http://localhost:8983/solr/admin/collections?action=reload&name=gettingstarted&async=1{code}
Call request status
{code:java}
http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=1{code}
Here's the output ( it's a miracle our users are able to make some sense of this response ;) ) 

 
{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":5},
  "success":{
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":1}},
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":58}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":59}}},
  "167115373362154":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373362154 webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=58"},
  "167115373632133":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373632133 webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=0"},
  "status":{
    "state":"completed",
    "msg":"found [1] in completed tasks"}}{code}
Notice how we only track 2 task id's ..It should be 4 because the gettingstarted example creates a 2X2 collection.

 

Also if you check the logs there were 4 tasks but the overseer didn't track 2 of them 
{code:java}
~/solr-7.3.0$ grep "/admin/cores" example/cloud/node1/logs/solr.log | grep -i reload

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard1_replica_n2&async=167115373148039&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=1

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=0

~/solr-7.3.0$ grep "/admin/cores" example/cloud/node2/logs/solr.log | grep -i reload

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard1_replica_n1&async=167115372921360&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=59

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=58{code}


was (Author: varunthacker):
Here's a simple way to reproduce this:

 

Start Solr with
{code:java}
bin/solr start -e gettingstarted -noprompt{code}
Call collection reload with async
{code:java}
http://localhost:8983/solr/admin/collections?action=reload&name=gettingstarted&async=1{code}
Call request status
{code:java}
http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=1{code}
Here's the output

 
{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":5},
  "success":{
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":1}},
    "192.168.0.6:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":58}},
    "192.168.0.6:7574_solr":{
      "responseHeader":{
        "status":0,
        "QTime":59}}},
  "167115373362154":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373362154 webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=58"},
  "167115373632133":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 167115373632133 webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=0"},
  "status":{
    "state":"completed",
    "msg":"found [1] in completed tasks"}}{code}
Notice how we only track 2 task id's ..It should be 4 because the gettingstarted example creates a 2X2 collection.

 

Also if you check the logs there were 4 tasks but the overseer didn't track 2 of them 
{code:java}
~/solr-7.3.0$ grep "/admin/cores" example/cloud/node1/logs/solr.log | grep -i reload

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard1_replica_n2&async=167115373148039&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=1

INFO  - 2018-04-30 02:40:05.614; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n6&async=167115373632133&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=0

~/solr-7.3.0$ grep "/admin/cores" example/cloud/node2/logs/solr.log | grep -i reload

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard1_replica_n1&async=167115372921360&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=59

INFO  - 2018-04-30 02:40:05.672; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/cores params={core=gettingstarted_shard2_replica_n4&async=167115373362154&qt=/admin/cores&action=RELOAD&wt=javabin&version=2} status=0 QTime=58{code}

> OverseerCollectionMessageHandler sliceCmd assumes only one replica exists on each node
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-12291
>                 URL: https://issues.apache.org/jira/browse/SOLR-12291
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore, SolrCloud
>            Reporter: Varun Thacker
>            Priority: Major
>
> The OverseerCollectionMessageHandler sliceCmd assumes only one replica exists on one node
> When multiple replicas of a slice are on the same node we only track one replica's async request. This happens because the async requestMap's key is "node_name"
> I discovered this when [~alabax] shared some logs of a restore issue, where the second replica got added before the first replica had completed it's restorecore action.
> While looking at the logs I noticed that the overseer never called REQUESTSTATUS for the restorecore action , almost as if it had missed tracking that particular async request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org