Posted to issues@solr.apache.org by "Roy Perkins (Jira)" <ji...@apache.org> on 2021/06/14 19:21:00 UTC

[jira] [Comment Edited] (SOLR-15371) Backups randomly fail sometimes

    [ https://issues.apache.org/jira/browse/SOLR-15371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363163#comment-17363163 ] 

Roy Perkins edited comment on SOLR-15371 at 6/14/21, 7:20 PM:
--------------------------------------------------------------

Like I said, it can be hard to reproduce.  It seems to happen when the host I run the backup from is the leader of the shard whose backup fails.  Below is some output from my backup script:
{code:json}
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "success":{
    "solrmulti03.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "solrmulti08.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "solrmulti01.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":4}},
    "solrmulti04.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":14}},
    "solrmulti04.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"completed",
      "Response":"TaskId: 100034112630053395656 webapp=null path=/admin/cores params={core=search_shard2_replica_n4&async=100034112630053395656&qt=/admin/cores&name=shard2&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=14"},
    "solrmulti03.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"completed",
      "Response":"TaskId: 100034112630053446666 webapp=null path=/admin/cores params={core=search_shard3_replica_n29&async=100034112630053446666&qt=/admin/cores&name=shard3&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"},
    "solrmulti08.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"completed",
      "Response":"TaskId: 100034112630053465731 webapp=null path=/admin/cores params={core=search_shard4_replica_n23&async=100034112630053465731&qt=/admin/cores&name=shard4&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"}},
  "100034112630053395656":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 100034112630053395656 webapp=null path=/admin/cores params={core=search_shard2_replica_n4&async=100034112630053395656&qt=/admin/cores&name=shard2&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=14"},
  "100034112630053446666":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 100034112630053446666 webapp=null path=/admin/cores params={core=search_shard3_replica_n29&async=100034112630053446666&qt=/admin/cores&name=shard3&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"},
  "100034112630053465731":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 100034112630053465731 webapp=null path=/admin/cores params={core=search_shard4_replica_n23&async=100034112630053465731&qt=/admin/cores&name=shard4&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"},
  "100034112630053492379":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"failed",
    "Response":"Failed to backup core=search_shard1_replica_n25 because org.apache.solr.common.SolrException: Directory to contain snapshots doesn't exist: file:///mnt/solr_backups/search/search-06-14-2021. Note that Backup/Restore of a SolrCloud collection requires a shared file system mounted at the same path on all nodes!"},
  "failure":{
    "solrmulti01.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"failed",
      "Response":"Failed to backup core=search_shard1_replica_n25 because org.apache.solr.common.SolrException: Directory to contain snapshots doesn't exist: file:///mnt/solr_backups/search/search-06-14-2021. Note that Backup/Restore of a SolrCloud collection requires a shared file system mounted at the same path on all nodes!"}},
  "status":{
    "state":"failed",
    "msg":"found [1000] in failed tasks"}}
{code}
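
To see at a glance which cores failed in a response like the one above, the error messages can be filtered with standard text tools. This is only a minimal sketch: the function name `failed_cores` is hypothetical, and it assumes the response text contains the literal phrase "Failed to backup core=" exactly as shown in the output above.

```shell
# Reads a BACKUP / REQUESTSTATUS response on stdin and prints each core
# named in a "Failed to backup core=..." message, one per line.
# The same failure appears under both the task id and the "failure" key,
# so duplicates are collapsed with sort -u.
failed_cores() {
  grep -o 'Failed to backup core=[^ ]*' | sed 's/^.*core=//' | sort -u
}
```

Piping the saved response through `failed_cores` would then list only the affected cores, which makes it easier to compare the failing shard against the node the backup was started from.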



> Backups randomly fail sometimes
> -------------------------------
>
>                 Key: SOLR-15371
>                 URL: https://issues.apache.org/jira/browse/SOLR-15371
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore
>    Affects Versions: 8.5.2, 8.8.2
>            Reporter: Roy Perkins
>            Priority: Major
>
> Hi, we have an issue where sometimes one shard fails to back up due to what might be a race condition between creating the folder and starting the backup.  When this happens, we have to restart the first server in a shard to get the backup to succeed again.  The cluster backs up to a shared NFS mount.  Four times out of five the backup completes without issues (another collection on the same servers is backed up later in the morning and succeeds).  Below is the error I get.
> {code:java}
> "Response":"Failed to backup core=slprod_shard4_replica_n6 because org.apache.solr.common.SolrException: Directory to contain snapshots doesn't exist: file:///mnt/solr_backups/slprod/slprod-04-25-2021. Note that Backup/Restore of a SolrCloud collection requires a shared file system mounted at the same path on all nodes!"},
> {code}
> And below is the line I use to run the backup (with the bash variables set earlier in the script):
> {code:bash}
> curl -s "http://localhost:8983/solr/admin/collections?action=BACKUP&name=${COLLECTION}-${DATE}&collection=${COLLECTION}&location=${BACKUP_PATH}&async=1000"
> {code}
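
Because the BACKUP call above is asynchronous, the result has to be read back with the Collections API `REQUESTSTATUS` action. The following is a rough sketch of how a script might do that, not the reporter's actual script: the function names, the `SOLR_URL` variable, the per-run numeric request id, and the polling interval are all assumptions made for illustration.

```shell
# Base URL of the node the script talks to (assumed default).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build a numeric request id from the current time and the shell's PID,
# so the status read back later belongs to this run.
make_request_id() {
  printf '%s%s' "$(date +%s)" "$$"
}

# Read a REQUESTSTATUS response on stdin and print the overall state
# (e.g. completed, failed, running). Whitespace is stripped first so the
# match works on pretty-printed JSON.
request_state() {
  tr -d '\n ' | sed -n 's/.*"status":{"state":"\([a-z]*\)".*/\1/p'
}

# Submit the backup. $1 = collection, $2 = date stamp,
# $3 = backup location, $4 = request id.
start_backup() {
  curl -s "${SOLR_URL}/admin/collections?action=BACKUP&name=$1-$2&collection=$1&location=$3&async=$4"
}

# Poll until the request completes or fails; give up after 120 attempts.
# $1 = request id. Returns 0 on success, 1 otherwise.
wait_for_backup() {
  attempts=0
  while [ "$attempts" -lt 120 ]; do
    state="$(curl -s "${SOLR_URL}/admin/collections?action=REQUESTSTATUS&requestid=$1&wt=json" | request_state)"
    case "$state" in
      completed) return 0 ;;
      failed|notfound) return 1 ;;
    esac
    attempts=$((attempts + 1))
    sleep 5
  done
  return 1
}
```

A run would call `make_request_id` once, pass the result to `start_backup`, and then call `wait_for_backup` with the same id. This only reports the failure more reliably; it does not address the missing-directory error itself.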


