Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2018/07/13 11:53:00 UTC

[jira] [Comment Edited] (OAK-7246) Improve cleanup of locally copied index files

    [ https://issues.apache.org/jira/browse/OAK-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542944#comment-16542944 ] 

Vikas Saurabh edited comment on OAK-7246 at 7/13/18 11:52 AM:
--------------------------------------------------------------

Discussed this a bit with [~tmueller] and [~chetanm] to find a simpler and safer approach to fix clean up. We need this constraint to hold - only files which would no longer be required should be deleted.

Given how files are created on local disk (due to CoR or CoW), we'd (probably fairly) assume these:
* filesystems have a trustworthy notion of time that gets put in creation timestamp
* creation timestamps of local files would follow the same sequence as revisions being observed(CoR)/created(CoW)
* files created before any of the files visible to a closing reader won't be required in the future (it's basically a corollary of the point above - but is still important to state explicitly)

Given these, a simple approach for clean up could be that a closing reader, while being closed, schedules deletion of the files resulting from the following steps:
# get timestamps of local files which are visible on its revision - set max of these timestamps as T ~max~
# get set of local files as S ~local~
# subtract files visible on its revision from S ~local~ to get S ~candidate~
# filter S ~candidate~ by "keep if {{creation timestamp}} < T ~max~" to get S ~delete~
# schedule files listed in S ~delete~ to be deleted
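
As a rough illustration, the selection in steps 1 to 4 could be sketched in Java as below. {{LocalIndexCleanupSketch}} and {{selectForDeletion}} are hypothetical names, not existing Oak classes; the map of creation timestamps is assumed to have been read from the local index directory beforehand, and scheduling the actual deletion (step 5) is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of Oak. */
final class LocalIndexCleanupSketch {

    /**
     * @param localCreationTimes creation timestamp of every file currently in the
     *                           local index directory (the keys form S_local)
     * @param visibleToReader    names of the files visible at the closing reader's revision
     * @return names of the files that may be scheduled for deletion (S_delete)
     */
    static Set<String> selectForDeletion(Map<String, Long> localCreationTimes,
                                         Set<String> visibleToReader) {
        // Step 1: T_max is the newest creation timestamp among the visible files
        // that exist locally. If none exists locally, T_max stays at MIN_VALUE
        // and nothing gets selected.
        long tMax = Long.MIN_VALUE;
        for (String name : visibleToReader) {
            Long created = localCreationTimes.get(name);
            if (created != null) {
                tMax = Math.max(tMax, created);
            }
        }

        // Steps 2-4: walk S_local, skip files visible to the reader (S_candidate),
        // and keep only those created strictly before T_max (S_delete).
        Set<String> toDelete = new TreeSet<>();
        for (Map.Entry<String, Long> entry : localCreationTimes.entrySet()) {
            boolean visible = visibleToReader.contains(entry.getKey());
            if (!visible && entry.getValue() < tMax) {
                toDelete.add(entry.getKey());
            }
        }
        return toDelete;
    }
}
```

For example, if a, b, c, d, e, f were created locally at times 1 to 6 (made-up values), a reader closing on [a, b] would select nothing, while a reader closing on [c, d] would select only [a, b]; the newer files [e, f] are kept in both cases.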

With this approach, we can entirely get rid of the shared set maintained by writers. Similarly, we don't need to maintain {{localFileNames}} in {{CopyOnReadDirectory}} as a snapshot of files available locally when the reader was being opened.


was (Author: catholicon):
Discussed this a bit with [~tmueller] and [~chetanm] to find a simpler and safer approach to fix clean up. We need this constraint to hold - only files which would no longer be required should be deleted.

Given how files are created on local disk (due to CoR or CoW), we'd (probably fairly) assume these:
* creation timestamps of local files would follow the same sequence as revisions being observed(CoR)/created(CoW)
* files created before any of the files visible to a closing reader won't be required in the future (it's basically a corollary of the point above - but is still important to state explicitly)

Given these, a simple approach for clean up could be that a closing reader, while being closed, schedules deletion of the files resulting from the following steps:
# get timestamps of local files which are visible on its revision - set max of these timestamps as T ~max~
# get set of local files as S ~local~
# subtract files visible on its revision from S ~local~ to get S ~candidate~
# filter S ~candidate~ by "keep if {{creation timestamp}} < T ~max~" to get S ~delete~
# schedule files listed in S ~delete~ to be deleted

With this approach, we can entirely get rid of the shared set maintained by writers. Similarly, we don't need to maintain {{localFileNames}} in {{CopyOnReadDirectory}} as a snapshot of files available locally when the reader was being opened.

> Improve cleanup of locally copied index files
> ---------------------------------------------
>
>                 Key: OAK-7246
>                 URL: https://issues.apache.org/jira/browse/OAK-7246
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>
> This task is to re-think how we should clean up locally copied index files which are no longer in use.
> Current approach:
> # index writers, while creating index files, keep list of currently-being-written files
> ## this list is cleared when a new index writer comes into play
> # index tracker opens new index (at new revision) via observation
> ## while being opened, we also track current dir listing of the local index files
> # during opening new index, the tracker closes the old revision of index reader
> ## during this close, local files noted above during open are purged if ( they don't show up in remote view of the index && they aren't part of currently being written list by index writer)
> This approach, at least in the following timeline, would incur extra copying (and, as a side-effect, also open some index files directly off of the remote input stream during CoWs):
> # CoW1 creates [a, b]
> # CoW2 starts and creates [c, d], removes [a, b] from remote
> # CoR1 opens an index due to CoW1
> ## local-list-CoR1 = [a, b, c, d], remote-index-list=[a, b]
> # CoW2 finishes
> # CoW3 creates [e, f], removes [a,b] from remote
> ## CoW-currently-being-written-list=[e,f]
> # CoR2 opens due to CoW2
> ## local-list-CoR2=[a,b,c,d,e,f], remote-index-list=[c,d]
> # CoR1 closes
> ## deletes [c,d] as they aren't in its list of index files ([a,b]) AND aren't part of shared list ([e,f])
> Disclaimer: the timeline might be off a bit (haven't written a test yet), but the basic point is that a CoR could be working with an index file set and new files might have come in twice after that CoR, so the shared list doesn't have complete information about the newly written files.
> [~chetanm], can you please check the timeline above - I'd try to work on a test case in the meantime.
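
The purge decision in the last step of the timeline above can be reproduced with plain set operations. This is only a sketch of the rule as worded in the description ("not in remote view && not in currently-being-written list"); {{CurrentPurgeRuleSketch}} and {{purgedOnClose}} are hypothetical names and do not mirror the actual Oak code.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of the current purge rule; not Oak code. */
final class CurrentPurgeRuleSketch {

    /**
     * @param localAtOpen  local directory listing recorded when the reader was opened
     * @param remoteView   index files visible in the reader's remote view
     * @param beingWritten shared list of files currently being written by the index writer
     * @return files that the closing reader would purge under the current rule
     */
    static Set<String> purgedOnClose(Set<String> localAtOpen,
                                     Set<String> remoteView,
                                     Set<String> beingWritten) {
        Set<String> purged = new TreeSet<>(localAtOpen);
        purged.removeAll(remoteView);   // keep files that are part of the reader's index
        purged.removeAll(beingWritten); // keep files the current writer has listed
        return purged;
    }
}
```

With the values from the timeline for CoR1 (local list [a, b, c, d], remote list [a, b], shared list [e, f]), the result is [c, d]: those files are purged even though CoR2 still needs them.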


