Posted to issues@solr.apache.org by "Artem Abeleshev (Jira)" <ji...@apache.org> on 2022/01/06 01:35:00 UTC

[jira] [Comment Edited] (SOLR-15842) Collection Backup Status doesn't calculate de IndexSizeMb correctly.

    [ https://issues.apache.org/jira/browse/SOLR-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466714#comment-17466714 ] 

Artem Abeleshev edited comment on SOLR-15842 at 1/6/22, 1:34 AM:
-----------------------------------------------------------------

Hi, Ricardo! Thanks for raising this issue.

In short:

Unfortunately, the problem cannot be worked around, because of its nature: the results of the shard backup requests are not included in the task object that is used for async tracking.

Details:

Async backup requests work as follows. When a request reaches Solr, Solr checks whether an _async_ parameter is provided. If it is, the action is queued and a response is returned immediately, without waiting for the backup to complete. The queued action is then processed, and its result is placed in a distributed map in ZooKeeper (you can inspect these maps in the Solr Admin web interface).

For a collection backup, a request is also sent to each of the shards to back up the index files (each shard backs up its own index files). These shard requests are also sent as async and are handled in a similar way: the action is submitted to an executor service, and a response is returned immediately, without waiting for the shard backup to complete. Once the submitted action has been processed, its result is stored in a local tracking map in the form of a {_}org.apache.solr.handler.admin.CoreAdminHandler.TaskObject{_}:

{_}org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(SolrQueryRequest, SolrQueryResponse){_}:
{code:java}
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
      ...
      final String taskId = req.getParams().get(CommonAdminParams.ASYNC);
      final TaskObject taskObject = new TaskObject(taskId);
      ...
      if(taskId != null) {
      ...
        addTask(RUNNING, taskObject);
      }
      ...
      CoreAdminOperation op = opMap.get(req.getParams().get(ACTION, STATUS.toString()).toLowerCase(Locale.ROOT));
      ...
      final CallInfo callInfo = new CallInfo(this, req, rsp, op);
      ...
      if (taskId == null) {
        callInfo.call();
      } else {
        ...
        parallelExecutor.execute(() -> {
          boolean exceptionCaught = false;
          try {
            callInfo.call();
            taskObject.setRspObject(callInfo.rsp);
          } catch (Exception e) {
            exceptionCaught = true;
            taskObject.setRspObjectFromException(e);
          } finally {
            removeTask("running", taskObject.taskId);
            if (exceptionCaught) {
              addTask("failed", taskObject, true);
            } else {
              addTask("completed", taskObject, true);
            }
          }
        });
        ...
      }
      ...
    }
{code}
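To make the submit-then-poll pattern above easier to follow, here is a minimal, self-contained Java sketch of the same idea. It is not Solr code: _AsyncTracker_, _submit_ and _status_ are hypothetical names, and plain maps stand in for Solr's running/completed/failed tracking maps. The work is queued on an executor, the caller returns at once, and a later status poll can only see whatever string the task chose to store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of the async "submit, return immediately, poll later" pattern.
// AsyncTracker, submit() and status() are illustrative names, not Solr APIs.
class AsyncTracker {
  // taskId -> "running", "completed" or "failed" (plays the role of the tracking maps)
  private final Map<String, String> states = new ConcurrentHashMap<>();
  // taskId -> the only thing a later status request will be able to return
  private final Map<String, String> stored = new ConcurrentHashMap<>();
  private final ExecutorService executor = Executors.newFixedThreadPool(2);

  // Like the async branch of handleRequestBody: mark as running, queue the work, return at once.
  void submit(String taskId, Supplier<String> work) {
    states.put(taskId, "running");
    executor.execute(() -> {
      String finalState = "completed";
      try {
        stored.put(taskId, work.get());
      } catch (RuntimeException e) {
        finalState = "failed";
        stored.put(taskId, String.valueOf(e.getMessage()));
      } finally {
        states.put(taskId, finalState);
      }
    });
  }

  // What a status poll sees: the state plus whatever the task stored.
  String status(String taskId) {
    return states.getOrDefault(taskId, "notfound") + ": " + stored.getOrDefault(taskId, "");
  }

  // Waits for queued tasks so their results are visible to status().
  void shutdown() {
    executor.shutdown();
    try {
      executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

The key design point is that _status_ can only report what was put into the stored map when the task finished; if only a log line is stored, nothing else can be recovered later.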
The sender then polls the results in the tracking map using backup status requests. The problem you've raised in this issue lies in the _setRspObject_ method call and in how the async response is stored. After the _call_ method of _CallInfo_ has executed, the response contains all the backup results. These results are then supposed to be placed in the _TaskObject_ by calling {_}setRspObject{_}. But let's check what happens there:
{code:java}
  public void setRspObject(SolrQueryResponse rspObject) {
    this.rspInfo = rspObject.getToLogAsString("TaskId: " + this.taskId);
  }
{code}
The results held in _SolrQueryResponse_ are completely ignored here. Instead, a string consisting of the shard request's async ID and the entire content of the _toLog_ list of the _SolrQueryResponse_ is stored as the tracking result. This is what each shard returns to the sender as the result of an index backup:
{code:json}
{
    "responseHeader": {
        "status": 0,
        "QTime": 1
    },
    "STATUS": "completed",
    "Response": "TaskId: 10402421194574306142 webapp=null path=/admin/cores params={core=techproducts_shard1_replica_n2&async=10402421194574306142&qt=/admin/cores&name=shard1&shardBackupId=md_shard1_0&action=BACKUPCORE&location=file:///path/to/my/shared/drive/mybackup/techproducts&incremental=true&wt=javabin&version=2} status=0 QTime=57"
}
{code}
Note the _Response_ value:
{code:json}
"Response": "TaskId: 10402421194574306142 webapp=null path=/admin/cores params={core=techproducts_shard1_replica_n2&async=10402421194574306142&qt=/admin/cores&name=shard1&shardBackupId=md_shard1_0&action=BACKUPCORE&location=file:///path/to/my/shared/drive/mybackup/techproducts&incremental=true&wt=javabin&version=2} status=0 QTime=57"
{code}
and compare it to the _response_ value of a sync request result:
{code:json}
"response": [
    "startTime",
    "2021-12-24T14:20:32.021Z",
    "indexFileCount",
    21,
    "uploadedIndexFileCount",
    21,
    "indexSizeMB",
    0.006,
    "uploadedIndexFileMB",
    0.006,
    "shard",
    "shard1",
    "endTime",
    "2021-12-24T14:20:32.396Z",
    "shardBackupId",
    "md_shard1_0"
]
{code}
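The following Java sketch models the two outcomes with plain maps to make the loss concrete. _FakeResponse_, _LogOnlyTask_ and _StructuredTask_ are hypothetical stand-ins, not Solr classes, and the log format is only a rough approximation of _getToLogAsString_. _LogOnlyTask_ mirrors the quoted _setRspObject_ and keeps only a log-style string; _StructuredTask_ is an illustrative contrast that keeps the structured values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for SolrQueryResponse: structured values plus "toLog" entries.
// FakeResponse, LogOnlyTask and StructuredTask are hypothetical names, not Solr classes.
class FakeResponse {
  final Map<String, Object> values = new LinkedHashMap<>(); // e.g. the backup statistics
  final Map<String, Object> toLog = new LinkedHashMap<>();  // e.g. path, params, status

  // Rough approximation of a getToLogAsString-style method: prefix plus toLog entries only.
  String toLogAsString(String prefix) {
    StringBuilder sb = new StringBuilder(prefix);
    toLog.forEach((k, v) -> sb.append(' ').append(k).append('=').append(v));
    return sb.toString();
  }
}

// Mirrors setRspObject as quoted above: only the log string survives.
class LogOnlyTask {
  final String taskId;
  String rspInfo;

  LogOnlyTask(String taskId) { this.taskId = taskId; }

  void setRspObject(FakeResponse rsp) { rspInfo = rsp.toLogAsString("TaskId: " + taskId); }

  Map<String, Object> statusPayload() {
    Map<String, Object> out = new LinkedHashMap<>();
    out.put("STATUS", "completed");
    out.put("Response", rspInfo); // a String; the structured values are gone
    return out;
  }
}

// Hypothetical contrast: keep the structured values so a poller could read them.
class StructuredTask {
  Map<String, Object> values;

  void setRspObject(FakeResponse rsp) { values = new LinkedHashMap<>(rsp.values); }

  Map<String, Object> statusPayload() {
    Map<String, Object> out = new LinkedHashMap<>();
    out.put("STATUS", "completed");
    out.put("response", values);
    return out;
  }
}
```

With the log-only variant, _indexSizeMB_ never appears in the payload at all, so no amount of parsing on the sender side can recover it.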
Once the sender has obtained the results from all the shards, it tries to aggregate them:

{_}org.apache.solr.cloud.api.collections.BackupCmd.aggregateResults(NamedList, String, BackupManager, BackupProperties, Collection<Slice>){_}:
{code:java}
  private NamedList aggregateResults(NamedList results, String collectionName,
                                    BackupManager backupManager,
                                    BackupProperties backupProps,
                                    Collection<Slice> slices) {
    NamedList<Object> aggRsp = new NamedList<>();
    aggRsp.add("collection", collectionName);
    aggRsp.add("numShards", slices.size());
    aggRsp.add("backupId", backupManager.getBackupId().id);
    aggRsp.add("indexVersion", backupProps.getIndexVersion());
    aggRsp.add("startTime", backupProps.getStartTime());

    double indexSizeMB = 0;
    NamedList shards = (NamedList) results.get("success");
    for (int i = 0; i < shards.size(); i++) {
      NamedList shardResp = (NamedList)((NamedList)shards.getVal(i)).get("response");
      if (shardResp == null)
        continue;
      indexSizeMB += (double) shardResp.get("indexSizeMB");
    }
    aggRsp.add("indexSizeMB", indexSizeMB);
    return aggRsp;
  }
{code}
As you can see, if a _success_ block is found, the method iterates over all its entries and tries to extract the _response_ value from each one. For async requests this extraction fails, for at least two reasons:
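 - the shard async response contains the key _Response_, not _response_
 - the value of _Response_ in the shard async response is a _String_, not a _NamedList_

Because the lookup of _response_ returns null for every async entry, each entry is skipped by the _continue_ statement and _indexSizeMB_ stays at 0.0, which is exactly what the status response in this issue shows.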
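
The aggregation loop can be reproduced outside Solr to see the effect directly. The sketch below is a simplified re-creation that assumes plain _java.util.Map_ in place of Solr's _NamedList_; _AggregateDemo_ and its helper methods are hypothetical names. Like the original, it looks up the exact key _response_ and skips entries where that lookup returns null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified re-creation of the loop in BackupCmd.aggregateResults.
// Assumption: java.util.Map stands in for NamedList; all names here are hypothetical.
class AggregateDemo {
  static double sumIndexSizeMB(List<Map<String, Object>> successEntries) {
    double indexSizeMB = 0;
    for (Map<String, Object> entry : successEntries) {
      // Exact, case-sensitive lookup, as with get("response") in the original.
      Map<?, ?> shardResp = (Map<?, ?>) entry.get("response");
      if (shardResp == null) {
        continue; // async entries only have "Response", so they end up here
      }
      indexSizeMB += ((Number) shardResp.get("indexSizeMB")).doubleValue();
    }
    return indexSizeMB;
  }

  // Shape of a sync shard entry: "response" holds structured values.
  static Map<String, Object> syncEntry(double sizeMB) {
    Map<String, Object> response = new LinkedHashMap<>();
    response.put("indexSizeMB", sizeMB);
    Map<String, Object> entry = new LinkedHashMap<>();
    entry.put("response", response);
    return entry;
  }

  // Shape of an async shard entry: "Response" holds only a log string.
  static Map<String, Object> asyncEntry(String logLine) {
    Map<String, Object> entry = new LinkedHashMap<>();
    entry.put("STATUS", "completed");
    entry.put("Response", logLine);
    return entry;
  }
}
```

Fed the two sync values from the issue (0.026 and 0.133), _sumIndexSizeMB_ returns roughly 0.159; fed only async-shaped entries, it returns 0.0.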

For a clearer picture of the whole process, see the article I wrote about collection backup: [Code Anatomy: Solr Collection Backup|https://tyoma.hashnode.dev/code-anatomy-solr-collection-backup]


> Collection Backup Status doesn't calculate de IndexSizeMb correctly.
> --------------------------------------------------------------------
>
>                 Key: SOLR-15842
>                 URL: https://issues.apache.org/jira/browse/SOLR-15842
>             Project: Solr
>          Issue Type: Bug
>          Components: Backup/Restore, SolrCloud
>    Affects Versions: 8.10, 8.10.1
>            Reporter: Ricardo Ruiz Maldonado
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When [backing up|#backup] a collection with either the *S3 Repository* or the {*}LocalFileSystemRepository{*}, if I provide the *async* parameter and then check the status of the backup with the [REQUESTSTATUS|#requeststatus] endpoint, the *indexSizeMB* parameter is always 0, even if the backup finishes successfully.
> If I do a *sync* backup and wait until it finishes, then the *indexSizeMB* parameter has the right value.
> Here are some examples of the responses for each case:
> h3. S3 Backup (Sync)
> {code:java}
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":30640},
>   "success":{
>     "10.9.21.42:8983_solr":{
>       "responseHeader":{
>         "status":0,
>         "QTime":5857},
>       "response":[
>         "startTime","2021-12-09T03:16:14.944860Z",
>         "indexFileCount",18,
>         "uploadedIndexFileCount",18,
>         "indexSizeMB",0.026,
>         "uploadedIndexFileMB",0.026,
>         "shard","shard2",
>         "endTime","2021-12-09T03:16:20.694631Z",
>         "shardBackupId","md_shard2_0"]},
>     "10.9.21.42:8983_solr":{
>       "responseHeader":{
>         "status":0,
>         "QTime":5891},
>       "response":[
>         "startTime","2021-12-09T03:16:14.951702Z",
>         "indexFileCount",18,
>         "uploadedIndexFileCount",18,
>         "indexSizeMB",0.133,
>         "uploadedIndexFileMB",0.133,
>         "shard","shard1",
>         "endTime","2021-12-09T03:16:20.735084Z",
>         "shardBackupId","md_shard1_0"]}},
>   "response":[
>     "collection","collection",
>     "numShards",2,
>     "backupId",0,
>     "indexVersion","8.10.1",
>     "startTime","2021-12-09T03:16:14.381680Z",
>     "indexSizeMB",0.159]}{code}
> h3. S3 Backup (async)
> {code:java}
> {
>     "responseHeader": {
>         "status": 0,
>         "QTime": 4
>     },
>     "success": {
>         "10.9.21.42:8983_solr": {
>             "responseHeader": {
>                 "status": 0,
>                 "QTime": 2
>             }
>         },
>         "10.9.21.42:8983_solr": {
>             "responseHeader": {
>                 "status": 0,
>                 "QTime": 3
>             }
>         },
>         "10.9.21.42:8983_solr": {
>             "responseHeader": {
>                 "status": 0,
>                 "QTime": 0
>             },
>             "STATUS": "completed",
>             "Response": "TaskId: backup120415121643240950269884 webapp=null path=/admin/cores params={core=collectionName_shard2_replica_n4&async=backup120415121643240950269884&qt=/admin/cores&name=shard2&shardBackupId=md_shard2_1&action=BACKUPCORE&location=s3:/b39587e3-c296-4634-b8e2-7ff1e94e6a69/index/backup.1/collectionName/&incremental=true&repository=s3&prevShardBackupId=md_shard2_0&wt=javabin&version=2} status=0 QTime=2"
>         },
>         "10.9.21.42:8983_solr": {
>             "responseHeader": {
>                 "status": 0,
>                 "QTime": 0
>             },
>             "STATUS": "completed",
>             "Response": "TaskId: backup120415121643240950730312 webapp=null path=/admin/cores params={core=collectionName_shard1_replica_n1&async=backup120415121643240950730312&qt=/admin/cores&name=shard1&shardBackupId=md_shard1_1&action=BACKUPCORE&location=s3:/b39587e3-c296-4634-b8e2-7ff1e94e6a69/index/backup.1/collectionName/&incremental=true&repository=s3&prevShardBackupId=md_shard1_0&wt=javabin&version=2} status=0 QTime=3"
>         }
>     },
>     "backup120415121643240950269884": {
>         "responseHeader": {
>             "status": 0,
>             "QTime": 0
>         },
>         "STATUS": "completed",
>         "Response": "TaskId: backup120415121643240950269884 webapp=null path=/admin/cores params={core=collectionName_shard2_replica_n4&async=backup120415121643240950269884&qt=/admin/cores&name=shard2&shardBackupId=md_shard2_1&action=BACKUPCORE&location=s3:/b39587e3-c296-4634-b8e2-7ff1e94e6a69/index/backup.1/collectionName/&incremental=true&repository=s3&prevShardBackupId=md_shard2_0&wt=javabin&version=2} status=0 QTime=2"
>     },
>     "backup120415121643240950730312": {
>         "responseHeader": {
>             "status": 0,
>             "QTime": 0
>         },
>         "STATUS": "completed",
>         "Response": "TaskId: backup120415121643240950730312 webapp=null path=/admin/cores params={core=collectionName_shard1_replica_n1&async=backup120415121643240950730312&qt=/admin/cores&name=shard1&shardBackupId=md_shard1_1&action=BACKUPCORE&location=s3:/b39587e3-c296-4634-b8e2-7ff1e94e6a69/index/backup.1/collectionName/&incremental=true&repository=s3&prevShardBackupId=md_shard1_0&wt=javabin&version=2} status=0 QTime=3"
>     },
>     "response": [
>         "collection",
>         "collectionName",
>         "numShards",
>         2,
>         "backupId",
>         1,
>         "indexVersion",
>         "8.10.1",
>         "startTime",
>         "2021-12-04T06:12:52.540773Z",
>         "indexSizeMB",
>         0.0
>     ],
>     "status": {
>         "state": "completed",
>         "msg": "found [backup12041512] in completed tasks"
>     }
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
