Posted to users@solr.apache.org by Jordan Diehl <jo...@hitachivantara.com> on 2021/09/20 18:28:41 UTC

Potential Bug: Incremental backup attempts fail after a shard split operation has completed

Hello,

I was going to open a Solr bug, but I saw the message saying I should discuss this via another channel first. I have been using the incremental backup API on Solr 8.9.0, and while testing in our product we would occasionally get into a state where all subsequent backup attempts failed. After some triage we found that this happened to any collection that had undergone a shard split operation. If we ran a backup, completed a shard split, and then attempted another backup, the second backup would fail with a NoSuchFileException whose message was a path containing the backup ID of that second backup.


Steps to reproduce:

  *   Create a new collection with no associated backups
  *   Run a backup for this collection

     *   /admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/path/to/my/shared/drive

  *   Run a shard split operation

     *   /admin/collections?action=SPLITSHARD&collection=myCollectionName&shard=shardID

  *   Attempt another backup
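
The three requests above can be sketched as a small script that only builds the URLs. This is a minimal illustration, not a tested reproduction: the base URL `http://localhost:8983/solr` and the shard name `shard1` are assumptions, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL of the form .../admin/collections?action=..."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


def repro_urls(base_url, collection, shard, backup_name, location):
    """Return the requests in order: backup, shard split, backup again."""
    backup = collections_api_url(
        base_url, "BACKUP",
        name=backup_name, collection=collection, location=location,
    )
    split = collections_api_url(
        base_url, "SPLITSHARD", collection=collection, shard=shard,
    )
    return [backup, split, backup]


if __name__ == "__main__":
    # Assumed values; replace with your own host, collection, and location.
    for url in repro_urls(
        "http://localhost:8983/solr",
        "myCollectionName", "shard1",
        "myBackupName", "/path/to/my/shared/drive",
    ):
        print(url)
```

Each URL can then be requested in turn, waiting for the shard split to finish before sending the second backup request.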


Expected Outcome:

* If this operation is being blocked intentionally, then I would expect an informative error message explaining why it failed. Otherwise I would expect the backup to complete successfully.


Actual Outcome:

* The backup operation fails with a NoSuchFileException.

NOTE: In the exception message below, the number in the name of the missing file (here zk_backup_1) corresponds to the backup attempt currently in progress.

{
  "responseHeader":{
    "status":500,
    "QTime":54},
  "failure":{
    "MYIPADDRESS:31018_solr":"org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException:Error from server at null: Error handling 'BACKUPCORE' action"},
  "Operation backup caused exception:":"java.nio.file.NoSuchFileException:java.nio.file.NoSuchFileException: /opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
  "exception":{
    "msg":"/opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
    "rspCode":-1},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"/opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
    "trace":"org.apache.solr.common.SolrException: /opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1\n\tat org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:65)\n\tat org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:301)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:257)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)\n\tat org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:836)\n\tat org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:800)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:545)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357)\n\tat org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)\n\tat org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)\n\tat org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)\n\tat 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)\n\tat java.lang.Thread.run(Thread.java:748)\n",

    "code":500}}
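
For anyone scripting around this failure, a response like the one above can be checked for the missing zk_backup_N path. The sketch below is only an illustration: the function name is hypothetical, it assumes the path appears under "exception" / "msg" as it does in this response, and the sample is a trimmed copy of the response rather than the full body.

```python
import json
import re


def missing_backup_point(response_text):
    """Return (path, backup_number) if the response names a missing
    zk_backup_N path, otherwise None."""
    body = json.loads(response_text)
    msg = body.get("exception", {}).get("msg", "")
    match = re.search(r"zk_backup_(\d+)$", msg)
    if match is None:
        return None
    return msg, int(match.group(1))


if __name__ == "__main__":
    # Trimmed copy of the failing response, keeping only the fields used here.
    sample = json.dumps({
        "responseHeader": {"status": 500, "QTime": 54},
        "exception": {
            "msg": "/opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
            "rspCode": -1,
        },
    })
    print(missing_backup_point(sample))
```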




I tried a few workarounds, but none of them let me run another backup for the collection.


Workaround attempt 1:

  *   Used the API to delete the backup

  *   Used the API to purge unused backup files

  *   Restarted Solr

  *   Attempted another backup

  *   Encountered the same failure
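
The delete and purge calls in this attempt might look like the URLs built below. This is a hedged sketch: the message does not say which requests were actually sent, so the `DELETEBACKUP` action and its `backupId`, `maxNumBackupPoints`, and `purgeUnused` parameters are taken from my reading of the Solr 8.9 Collections API and should be checked against the reference guide. The base URL and helper name are assumptions.

```python
from urllib.parse import urlencode


def delete_backup_url(base_url, name, location, **selector):
    """Build a DELETEBACKUP URL.

    This sketch asks for exactly one selector, for example backupId="0",
    maxNumBackupPoints="1", or purgeUnused="true" (assumed parameter names).
    """
    if len(selector) != 1:
        raise ValueError("pass exactly one selector parameter")
    query = urlencode(
        {"action": "DELETEBACKUP", "name": name, "location": location, **selector}
    )
    return f"{base_url}/admin/collections?{query}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed host and port
    print(delete_backup_url(base, "myBackupName", "/path/to/my/shared/drive", backupId="0"))
    print(delete_backup_url(base, "myBackupName", "/path/to/my/shared/drive", purgeUnused="true"))
```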


Workaround attempt 2:

  *   Deleted all files in my Solr backup mount location

  *   Restarted Solr

  *   Attempted another backup

  *   Encountered the same failure


Re: Potential Bug: Incremental backup attempts fail after a shard split operation has completed

Posted by Cassandra Targett <ca...@gmail.com>.
I didn’t develop this feature, but I know some of how it was designed and developed, and I believe that omitting support for backups after a shard split wasn’t intentional. I think it may simply have been overlooked as a use case.

I’m going to guess that the cause is that the shard names changed during the shard split. Since the new incremental backup copies one replica from each shard, it needs to track the shard names in addition to the collection name. After a shard split the shard names change, so how can it know that “shard1” is now “shard1_0” and “shard1_1”? If this is the case, I agree that the error is not helpful.
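
To make that guess concrete, here is a tiny sketch comparing the shard names recorded at the time of an earlier backup with the names after a split. It only illustrates the hypothesis and says nothing about how Solr actually tracks shards. The function name is hypothetical, and the example assumes the parent shard no longer appears in the current list.

```python
def shard_name_drift(backed_up_shards, current_shards):
    """Report shard names that disappeared or appeared between two backups."""
    before, after = set(backed_up_shards), set(current_shards)
    return {
        "missing": sorted(before - after),  # names the earlier backup knew about
        "new": sorted(after - before),      # names the earlier backup never saw
    }


if __name__ == "__main__":
    # shard1 was split into shard1_0 and shard1_1; shard2 is unchanged.
    print(shard_name_drift(["shard1", "shard2"], ["shard1_0", "shard1_1", "shard2"]))
```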

If you specify `incremental=false`, are you able to get it to succeed? I know that defeats the purpose here, but I’m wondering whether it unblocks you. If you’re totally blocked on backups for this collection no matter what, that would be helpful to know.
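
A request with that parameter could be built roughly as follows. This is a sketch under assumptions: the base URL is a placeholder, the helper name is hypothetical, and the other parameters simply mirror the BACKUP call from the original report.

```python
from urllib.parse import urlencode


def backup_url(base_url, collection, backup_name, location, incremental=True):
    """Build a BACKUP URL, adding incremental=false only when requested."""
    params = {
        "action": "BACKUP",
        "name": backup_name,
        "collection": collection,
        "location": location,
    }
    if not incremental:
        # Ask for the older, non-incremental backup format.
        params["incremental"] = "false"
    return f"{base_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    print(backup_url(
        "http://localhost:8983/solr",  # assumed host and port
        "myCollectionName", "myBackupName", "/path/to/my/shared/drive",
        incremental=False,
    ))
```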

I also think you should go ahead and file this in Jira as a bug. And thank you for the nicely detailed explanation of the problem.

Cassandra