Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2017/12/18 06:45:00 UTC

[jira] [Comment Edited] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

    [ https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294556#comment-16294556 ] 

Chetan Mehrotra edited comment on OAK-6353 at 12/18/17 6:44 AM:
----------------------------------------------------------------

With the new document order traversal based indexing, significant performance improvements were seen.

For a large repository (255M Mongo documents, 66M nodes under /content, and 4.2M assets), indexing previously completed in 13.66 h. With document order based indexing it completed in 3.469 h, about 3.9 times faster.

With this, the initially planned implementation is done. Specific issues can be opened later for further improvements. Possible future enhancements:

# Prefetch the previous documents before doing the Mongo traversal - This may reduce the time needed to resolve a NodeDocument to a NodeState
# Mongo query optimizations
## Avoid fetching nodes under hidden paths at all
## Only fetch those documents from Mongo which are under included paths - This can be done by using a JavaScript function
# Sorting optimization - Sort each batch in memory as nodes are being read and write out only the sorted files
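
The sorting optimization in the last item can be illustrated with a minimal sketch. This is not the Oak implementation: the function names ({{write_sorted_runs}}, {{merge_runs}}), the one-entry-per-line text format, and the plain lexicographic string order are all assumptions made for illustration. Each batch is sorted in memory as entries are read and written out as an already sorted run file, so only a merge step remains afterwards.

```python
import heapq
import os
from typing import Iterable, Iterator, List


def write_sorted_runs(entries: Iterable[str], batch_size: int, run_dir: str) -> List[str]:
    """Sort each batch in memory as entries arrive and write it as a sorted run file.

    Hypothetical helper: entries are assumed to be single-line strings
    (for example a path followed by serialized node state).
    """
    run_paths: List[str] = []
    batch: List[str] = []

    def flush() -> None:
        if not batch:
            return
        batch.sort()  # in-memory sort of the current batch only
        path = os.path.join(run_dir, f"run-{len(run_paths):05d}.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(entry + "\n" for entry in batch)
        run_paths.append(path)
        batch.clear()

    for entry in entries:
        batch.append(entry)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final, possibly smaller, batch
    return run_paths


def merge_runs(run_paths: List[str]) -> Iterator[str]:
    """Lazily merge the sorted run files into a single sorted stream."""
    files = [open(p, "r", encoding="utf-8") for p in run_paths]
    try:
        # Strip newlines before comparing so the merge order matches the sort order.
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
```

A caller would pass the stream of entries read from the store to {{write_sorted_runs}} and then iterate over {{merge_runs}}. Memory use is bounded by the batch size rather than by the repository size.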

*Usage*

This mode can be enabled for Mongo based setups via the CLI argument {{--doc-traversal-mode}}.

This indexing mode requires a considerable amount of local disk space to store all the NodeStates in JSON format. For a 200 GB Mongo repository it required 100 GB of local disk space to hold the NodeState JSON and to perform the external sort on it.
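
As a rough pre-flight check, the figures above can be turned into a disk space estimate. The sketch below is only an illustration: the 0.5 ratio comes from the single data point in this comment (100 GB for a 200 GB repository) and may not hold for other repositories, and the function names are hypothetical.

```python
import shutil

# Assumption: ratio derived from the one measurement quoted in this comment
# (about 100 GB of local disk for a 200 GB Mongo repository).
OBSERVED_DISK_TO_REPO_RATIO = 100 / 200


def estimate_required_bytes(repo_size_bytes: int,
                            ratio: float = OBSERVED_DISK_TO_REPO_RATIO) -> int:
    """Estimate the local disk space needed for the NodeState JSON and the external sort."""
    return int(repo_size_bytes * ratio)


def has_enough_space(work_dir: str, repo_size_bytes: int) -> bool:
    """Compare the free space on the volume holding work_dir with the estimate."""
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= estimate_required_bytes(repo_size_bytes)
```

For example, {{estimate_required_bytes(200 * 10**9)}} yields 100 * 10**9 bytes, matching the observation above.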

The documentation also needs to be updated.


> Use Document order traversal for reindexing performed on DocumentNodeStore setups
> ---------------------------------------------------------------------------------
>
>                 Key: OAK-6353
>                 URL: https://issues.apache.org/jira/browse/OAK-6353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.7.13, 1.8
>
>         Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442] that document order traversal can be faster than the current mode of path based traversal. Initial tests indicate that such a traversal can be an order of magnitude faster.
> So this task is meant to implement such an approach and see whether it can be a viable indexing mode for DocumentNodeStore based setups.


