Posted to solr-user@lucene.apache.org by Shawn Heisey <ap...@elyograg.org> on 2018/10/12 15:28:08 UTC

Something odd with async request status for BACKUP operation on Collections API

I'm working on reproducing a problem reported via the IRC channel.

Started a test cloud with Solr 7.5.0 on Windows 10, initially with two 
nodes, then again with three nodes.

Command to create a collection:

bin\solr create -c test2 -shards 30 -replicationFactor 2

I dropped these URLs into a browser, so URL encoding was handled 
automatically.  I'm sure the URL that starts the backup wouldn't work 
as-is with curl, because it includes characters that need encoding.

Backup URL:

http://localhost:8983/solr/admin/collections?action=BACKUP&name=test2.3&collection=test2&location=C:\Users\elyograg\Downloads\solrbackups&async=sometag

Request status URL:

http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=sometag
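
(For anyone who wants to script this instead of pasting URLs into a 
browser: a rough SolrJ sketch like the one below should be equivalent to 
the two URLs above, and it sidesteps the URL-encoding problem because the 
client encodes the parameters itself.  I have not run this exact code; the 
collection name, backup name, and location simply mirror what's in the 
URLs.)

// Rough SolrJ sketch (untested) of the same backup + status check.
// SolrJ encodes the request parameters, so the Windows path needs no
// manual escaping.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.RequestStatusResponse;

public class AsyncBackupCheck {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Same as action=BACKUP&name=test2.3&collection=test2&location=...&async=sometag
      CollectionAdminRequest.Backup backup =
          CollectionAdminRequest.backupCollection("test2", "test2.3")
              .setLocation("C:\\Users\\elyograg\\Downloads\\solrbackups");
      String requestId = backup.processAsync("sometag", client);

      // Same as action=REQUESTSTATUS&requestid=sometag
      RequestStatusResponse status =
          CollectionAdminRequest.requestStatus(requestId).process(client);
      System.out.println("state: " + status.getRequestStatus());
      System.out.println(status.getResponse());
    }
  }
}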

Here's the raw JSON response from the status URL:
{
   "responseHeader":{
     "status":0,
     "QTime":3},
   "success":{
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":2}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":2}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":1}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":35}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":1}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":1}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":33}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":34}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":40}},
     "192.168.56.1:8984_solr":{
       "responseHeader":{
         "status":0,
         "QTime":2}},
     "192.168.56.1:8984_solr":{
       "responseHeader":{
         "status":0,
         "QTime":2}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8984_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8984_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:7574_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":0}},
     "192.168.56.1:8983_solr":{
       "responseHeader":{
         "status":0,
         "QTime":1}}},
   "sometag135341573915254":{
     "responseHeader":{
       "status":0,
       "QTime":0},
     "STATUS":"completed",
     "Response":"TaskId: sometag135341573915254 webapp=null 
path=/admin/cores 
params={core=test2_shard9_replica_n34&async=sometag135341573915254&qt=/admin/cores&name=shard9&action=BACKUPCORE&location=file:///C:/Users/elyograg/Downloads/solrbackups/test2.3&wt=javabin&version=2} 
status=0 QTime=0"},
   "sometag135341570605052":{
     "responseHeader":{
       "status":0,
       "QTime":0},
     "STATUS":"completed",
     "Response":"TaskId: sometag135341570605052 webapp=null 
path=/admin/cores 
params={core=test2_shard1_replica_n1&async=sometag135341570605052&qt=/admin/cores&name=shard1&action=BACKUPCORE&location=file:///C:/Users/elyograg/Downloads/solrbackups/test2.3&wt=javabin&version=2} 
status=0 QTime=0"},
   "sometag135341570647962":{
     "responseHeader":{
       "status":0,
       "QTime":0},
     "STATUS":"completed",
     "Response":"TaskId: sometag135341570647962 webapp=null 
path=/admin/cores 
params={core=test2_shard7_replica_n26&async=sometag135341570647962&qt=/admin/cores&name=shard7&action=BACKUPCORE&location=file:///C:/Users/elyograg/Downloads/solrbackups/test2.3&wt=javabin&version=2} 
status=0 QTime=0"},
   "status":{
     "state":"completed",
     "msg":"found [sometag] in completed tasks"}}


As you can see, only 3 (out of 30) shards are mentioned in the response. 
When I did the same test on a 2-node cloud example, there were only 2 
shards in the response.

Should all 30 shards have been in the response? Is there a bug here?

If I make the request without the async parameter, the response doesn't 
contain ANY shard information at all. Because this is an empty 
collection, the backup is fast. I expected detailed information to be in 
the response.  Is that worth an issue in Jira?

Side note: In the status response, the individual shard info that IS 
present doesn't indicate what node handled the CoreAdmin call.  That 
would be useful information to include.

Thanks,
Shawn


Re: Something odd with async request status for BACKUP operation on Collections API

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/14/2018 10:39 PM, Shalin Shekhar Mangar wrote:
> The responses are collected by node so subsequent responses from the same
> node overwrite previous responses. Definitely a bug. Please open an issue.

Done.

https://issues.apache.org/jira/browse/SOLR-12867

Thanks,
Shawn


Re: Something odd with async request status for BACKUP operation on Collections API

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
The responses are collected by node so subsequent responses from the same
node overwrite previous responses. Definitely a bug. Please open an issue.
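
To make the mechanism concrete, here's a toy sketch (simplified, not the 
actual overseer code): if the per-shard responses are collected into a map 
keyed only by the node name, each put() for the same node replaces the 
previous entry, so only the last response from every node survives.

// Toy illustration of the overwrite; not the real Solr code path.
import java.util.LinkedHashMap;
import java.util.Map;

public class PerNodeOverwrite {
  public static void main(String[] args) {
    Map<String, String> responsesByNode = new LinkedHashMap<>();

    // Several shard backups complete on the same node;
    // each put() replaces the previous response stored under that key.
    responsesByNode.put("192.168.56.1:8983_solr", "shard1 BACKUPCORE response");
    responsesByNode.put("192.168.56.1:8983_solr", "shard4 BACKUPCORE response");
    responsesByNode.put("192.168.56.1:8983_solr", "shard9 BACKUPCORE response");
    responsesByNode.put("192.168.56.1:7574_solr", "shard2 BACKUPCORE response");

    // Only one entry per node is left, so a 30-shard backup spread over
    // 3 nodes can report at most 3 detailed task responses.
    System.out.println(responsesByNode);
    // {192.168.56.1:8983_solr=shard9 BACKUPCORE response,
    //  192.168.56.1:7574_solr=shard2 BACKUPCORE response}
  }
}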

On Mon, Oct 15, 2018 at 6:24 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/14/2018 6:25 PM, damienk@gmail.com wrote:
> > I had an issue with async backup on solr 6.5.1 reporting that the backup
> > was complete when clearly it was not. I was using 12 shards across 6
> nodes.
> > I only noticed this issue when one shard was much larger than the others.
> > There were no answers here
> > http://lucene.472066.n3.nabble.com/async-backup-td4342776.html
>
> One detail I thought I had written but isn't there:  The backup did
> fully complete -- all 30 shards were in the backup location.  Not a lot
> in each shard backup -- the collection was empty.  It would be easy
> enough to add a few thousand documents to the collection before doing
> the backup.
>
> If the backup process reports that it's done before it's ACTUALLY done,
> that's a bad thing.  It's hard to say whether that problem is related to
> the problem I described.  Since I haven't dived into the code, I cannot
> say for sure, but it honestly would not surprise me to find they are
> connected.  Every time I try to understand Collections API code, I find
> it extremely difficult to follow.
>
> I'm sorry that you never got resolution on your problem.  Do you know
> whether that is still a problem in 7.x?  Setting up a reproduction where
> one shard is significantly larger than the others will take a little bit
> of work.
>
> > I was focusing on the STATUS returned from the REQUESTSTATUS command, but
> > looking again now I can see a response from only 6 shards, and each shard
> > is from a different node. So this fits with what you're seeing. I assume
> > your shards 1, 7, 9 are all on different nodes.
>
> I did not actually check, and the cloud example I was using isn't around
> any more, but each of the shards in the status response was PROBABLY on
> a separate node.  The cloud example was 3 nodes.  It's an easy enough
> scenario to replicate, and I provided enough details for anyone to do it.
>
> The person on IRC that reported this problem had a cluster of 15 nodes,
> and the status response had ten shards (out of 30) mentioned.  It was
> shards 1-9 and shard 20.  The suspicion is that there's something
> hard-coded that limits it to 10 responses ... because without that, I
> would expect the number of shards in the response to match the number of
> nodes.
>
> Thanks,
> Shawn
>
>

-- 
Regards,
Shalin Shekhar Mangar.

Re: Something odd with async request status for BACKUP operation on Collections API

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/14/2018 6:25 PM, damienk@gmail.com wrote:
> I had an issue with async backup on solr 6.5.1 reporting that the backup
> was complete when clearly it was not. I was using 12 shards across 6 nodes.
> I only noticed this issue when one shard was much larger than the others.
> There were no answers here
> http://lucene.472066.n3.nabble.com/async-backup-td4342776.html

One detail I thought I had written but isn't there:  The backup did 
fully complete -- all 30 shards were in the backup location.  Not a lot 
in each shard backup -- the collection was empty.  It would be easy 
enough to add a few thousand documents to the collection before doing 
the backup.
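
(If anyone wants to reproduce this with real index data, an untested SolrJ 
sketch along these lines should be enough to pad the collection before 
starting the backup.  The zkHost assumes the embedded ZooKeeper from the 
cloud example, and title_s assumes the _default configset's *_s dynamic 
field.)

// Untested sketch: add a few thousand throwaway documents to test2
// so the backup takes a measurable amount of time.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FillTestCollection {
  public static void main(String[] args) throws Exception {
    // zkHost for the embedded ZooKeeper started by the cloud example
    // (Solr port + 1000).
    try (SolrClient client = new CloudSolrClient.Builder(
             Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
      List<SolrInputDocument> docs = new ArrayList<>();
      for (int i = 0; i < 5000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title_s", "filler document " + i);
        docs.add(doc);
      }
      client.add("test2", docs);
      client.commit("test2");
    }
  }
}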

If the backup process reports that it's done before it's ACTUALLY done, 
that's a bad thing.  It's hard to say whether that problem is related to 
the problem I described.  Since I haven't dived into the code, I cannot 
say for sure, but it honestly would not surprise me to find they are 
connected.  Every time I try to understand Collections API code, I find 
it extremely difficult to follow.

I'm sorry that you never got resolution on your problem.  Do you know 
whether that is still a problem in 7.x?  Setting up a reproduction where 
one shard is significantly larger than the others will take a little bit 
of work.

> I was focusing on the STATUS returned from the REQUESTSTATUS command, but
> looking again now I can see a response from only 6 shards, and each shard
> is from a different node. So this fits with what you're seeing. I assume
> your shards 1, 7, 9 are all on different nodes.

I did not actually check, and the cloud example I was using isn't around 
any more, but each of the shards in the status response was PROBABLY on 
a separate node.  The cloud example was 3 nodes.  It's an easy enough 
scenario to replicate, and I provided enough details for anyone to do it.

The person on IRC that reported this problem had a cluster of 15 nodes, 
and the status response had ten shards (out of 30) mentioned.  It was 
shards 1-9 and shard 20.  The suspicion is that there's something 
hard-coded that limits it to 10 responses ... because without that, I 
would expect the number of shards in the response to match the number of 
nodes.

Thanks,
Shawn


Re: Something odd with async request status for BACKUP operation on Collections API

Posted by da...@gmail.com.
Hi Shawn,

I had an issue with async backup on solr 6.5.1 reporting that the backup
was complete when clearly it was not. I was using 12 shards across 6 nodes.
I only noticed this issue when one shard was much larger than the others.
There were no answers here
http://lucene.472066.n3.nabble.com/async-backup-td4342776.html

I was focusing on the STATUS returned from the REQUESTSTATUS command, but
looking again now I can see a response from only 6 shards, and each shard
is from a different node. So this fits with what you're seeing. I assume
your shards 1, 7, 9 are all on different nodes.

HTH,
Damien.


On Sat, 13 Oct 2018 at 02:28, Shawn Heisey <ap...@elyograg.org> wrote:

> I'm working on reproducing a problem reported via the IRC channel.
>
> Started a test cloud with 7.5.0. Initially with two nodes, then again
> with 3 nodes.  Did this on Windows 10.
>
> Command to create a collection:
>
> bin\solr create -c test2 -shards 30 -replicationFactor 2
>
> For these URLs, I dropped them into a browser, so URL encoding was
> handled automatically.  I'm sure the URL to start the backup wouldn't
> work as-is with curl because it includes characters that need encoding.
>
> Backup URL:
>
>
> http://localhost:8983/solr/admin/collections?action=BACKUP&name=test2.3&collection=test2&location=C:\Users\elyograg\Downloads\solrbackups&async=sometag
>
> Request status URL:
>
>
> http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=sometag
>
> Here's the raw JSON response from the status URL:
> [raw JSON status response snipped; identical to the listing earlier in
> this thread]
>
>
> As you can see, only 3 (out of 30) shards are mentioned in the response.
> When I did the same test on a 2-node cloud example, there were only 2
> shards in the response.
>
> Should all 30 shards have been in the response? Is there a bug here?
>
> If I make the request without the async parameter, the response doesn't
> contain ANY shard information at all. Because this is an empty
> collection, the backup is fast. I expected detailed information to be in
> the response.  Is that worth an issue in Jira?
>
> Side note: In the status response, the individual shard info that IS
> present doesn't indicate what node handled the CoreAdmin call.  That
> would be useful information to include.
>
> Thanks,
> Shawn
>
>