Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2019/10/29 17:53:00 UTC
[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

    [ https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962282#comment-16962282 ] 

Chris M. Hostetter commented on SOLR-13872:
-------------------------------------------


My first inclination is that we'll need to:
* replace {{getLatestCommit()}} with something that (atomically) does a {{reserveAndGetLatestCommit()}} type operation
* make {{saveCommitPoint(gen)}} smart enough to error if the specified generation is already deleted, or if it's less than the latest commit and "unknown" in the current set of commits (ie: as of the last time {{updateCommitPoints(..)}} was called)
* add a lot of synchronization between the methods that "reserve" a commit and the {{IndexDeletionPolicy}} abstraction methods that can result in a commit being deleted -- we need to make sure that if someone is reserving a commit there is no thread-safety/race condition if the IndexWriter is concurrently asking (us to ask) the delegate {{IndexDeletionPolicy}} to delete that commit.
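
A minimal sketch of how those three bullets could fit together, assuming one shared lock for both the reservation and deletion paths. Everything here is hypothetical: {{CommitReservations}}, {{tryDelete}} and the reference-counting scheme are illustrative stand-ins, not the actual {{IndexDeletionPolicyWrapper}} code, and generations are modeled as plain longs instead of Lucene commit objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical bookkeeping class; NOT Solr's IndexDeletionPolicyWrapper. */
class CommitReservations {
  // Generations in the current set of commits, as of the last updateCommitPoints call.
  private final TreeSet<Long> knownGenerations = new TreeSet<>();
  // Generation -> number of outstanding reservations.
  private final Map<Long, Integer> reserved = new HashMap<>();

  /** Replaces the known set of commits (analogous in spirit to updateCommitPoints(..)). */
  synchronized void updateCommitPoints(List<Long> generations) {
    knownGenerations.clear();
    knownGenerations.addAll(generations);
  }

  /** Looks up the latest known commit and reserves it under the same lock. */
  synchronized long reserveAndGetLatestCommit() {
    if (knownGenerations.isEmpty()) {
      throw new IllegalStateException("no commits known yet");
    }
    long latest = knownGenerations.last();
    reserved.merge(latest, 1, Integer::sum);
    return latest;
  }

  /** Errors if gen is unknown and older than the latest known commit. */
  synchronized void saveCommitPoint(long gen) {
    boolean known = knownGenerations.contains(gen);
    boolean olderThanLatest = !knownGenerations.isEmpty() && gen < knownGenerations.last();
    if (!known && olderThanLatest) {
      throw new IllegalStateException("generation " + gen + " is unknown or already deleted");
    }
    // Assumption: an unknown generation NEWER than the latest is accepted as-is.
    reserved.merge(gen, 1, Integer::sum);
  }

  /** Drops one reservation; the entry disappears when the count reaches zero. */
  synchronized void releaseCommitPoint(long gen) {
    reserved.computeIfPresent(gen, (g, n) -> n > 1 ? Integer.valueOf(n - 1) : null);
  }

  /** Deletion path: refuses to forget a generation that is currently reserved. */
  synchronized boolean tryDelete(long gen) {
    if (reserved.containsKey(gen)) {
      return false;
    }
    knownGenerations.remove(gen);
    return true;
  }
}
```

Because every method takes the same monitor, a reservation and a deletion of the same generation cannot interleave inside this class. That says nothing about files the IndexWriter removes outside of it, which is the ordering concern discussed next.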

...but while all of those things are almost certainly necessary, i'm not sure they are sufficient -- in particular we need to be careful about the order of operations: currently IDPW doesn't invoke its {{updateCommitPoints(...)}} method (to modify its internal state) until *after* it's delegated the onInit/onCommit calls from the IndexWriter... which means depending on where/how we synchronize, and where/how exactly we "mark" reservations, we might still wind up in a situation where we tell a caller we've reserved a commit, right after it's deleted, but before we *KNOW* it's deleted.
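
The ordering hazard can be shown with a deliberately simplified, single-threaded simulation. This is only a sketch under stated assumptions: {{StaleReservationDemo}} and all of its methods are hypothetical, the "delegate" keeps just the newest commit, and "files on disk" are modeled as a set of generation numbers rather than real index files.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical simulation of the stale-state window; not real Solr/Lucene code. */
class StaleReservationDemo {
  private final Set<Long> filesOnDisk = new HashSet<>(); // generations whose files still exist
  private final Set<Long> reserved = new HashSet<>();    // generations callers have reserved
  private long latestKnown = -1;                         // the wrapper's internal view

  /** Step 1 of a commit: the delegate keeps only the newest (or reserved) generations. */
  void delegateOnCommit(List<Long> commits) {
    long newest = commits.get(commits.size() - 1);
    filesOnDisk.add(newest);
    filesOnDisk.removeIf(gen -> gen != newest && !reserved.contains(gen));
  }

  /** Step 2 of a commit: only now does the wrapper refresh its own state. */
  void updateCommitPoints(List<Long> commits) {
    latestKnown = commits.get(commits.size() - 1);
  }

  /** A reservation made from the wrapper's (possibly stale) view of the latest commit. */
  long reserveLatest() {
    reserved.add(latestKnown);
    return latestKnown;
  }

  boolean filesExist(long gen) {
    return filesOnDisk.contains(gen);
  }
}
```

If generation 4 is committed and recorded, then {{delegateOnCommit(List.of(4L, 5L))}} runs and {{reserveLatest()}} is called before {{updateCommitPoints(List.of(4L, 5L))}}, the caller is handed generation 4 even though {{filesExist(4)}} is already false. In a real multi-threaded setting that gap between step 1 and step 2 is where a backup thread could slip in.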

Adding to my headaches is confusion about "named snapshots": the way {{IndexDeletionPolicyWrapper}} depends on (and consults) {{SolrSnapshotMetaDataManager}} to know if some commits should be saved (even if not reserved), and if/how the thread safety of "naming" these snapshots works (or should work).

It's weird to me that the relationship isn't reversed ... i would expect {{SolrSnapshotMetaDataManager}} to call {{IDPW.saveCommitPoint(gen)}} / {{IDPW.releaseCommitPoint(gen)}} instead of each {{IndexCommitWrapper}} asking the {{SolrSnapshotMetaDataManager}} if it can be deleted ... i keep second guessing why it works that way, what i might be misunderstanding, and what thread safety issues i might not be thinking about as a result. ... i need to spend more time wrapping my head around these "named snapshots", their lifecycles, and all of the code paths that touch them.

----

Also, a quick followup to the previously suggested workaround...

In reading up on the code, i realized that even though all of the underlying code paths using IndexDeletionPolicyWrapper have these thread safety issues, the SolrDeletionPolicy configured workaround should work "better" when using the SolrCloud collection API "BACKUP" action, or the (undocumented) CoreAdmin "BACKUP" action instead of using the {{/replication}} handler as i was in my testing for a single core.

The Core/Collection BACKUP actions *attempt* to reserve the current commit before the start of the file copying – meaning that instead of needing to configure large enough values of {{maxCommitsToKeep}} or {{maxCommitAge}} to ensure that the commit is reserved for the entire duration of the backup process (and all of the underlying disk IO), you can (in theory) use smaller values because you only need to configure them to reserve a commit long enough for the SnapShooter code to have a chance to start and do its own reservation.

>  Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13872
>                 URL: https://issues.apache.org/jira/browse/SOLR-13872
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: index_churn.pl
>
>
> SOLR-11616 purports to fix a bug in Solr's backup functionality that causes 'NoSuchFileException' errors when attempting to back up an index while it is undergoing indexing (and segment merging).
> Although SOLR-11616 is marked with "Fix Version: 7.2" it's pretty easy to demonstrate that this bug still exists on master, branch_8x, and even in 7.2 - so it seems less like the current problem is a "regression" and more that the original fix didn't work.
> ----
> The crux of the problem seems to be concurrency bugs in if/how a commit is "reserved" before attempting to copy the files in that commit to the backup location.  
> A possible workaround, discussed in more depth in the comments below, is to update {{solrconfig.xml}} to explicitly configure the {{SolrDeletionPolicy}} with either the {{maxCommitsToKeep}} or {{maxCommitAge}} options to ensure the commits are kept around long enough for the backup to be created.


