Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2017/01/03 07:01:06 UTC

[jira] [Commented] (OAK-4808) Index external changes as part of NRT indexing

    [ https://issues.apache.org/jira/browse/OAK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794339#comment-15794339 ] 

Chetan Mehrotra commented on OAK-4808:
--------------------------------------

Had a discussion with [~catholicon]; we have the following proposals for implementing this.

h3. A - Diff per external observation

In this mode we would register an Observer which is only interested in external changes and runs {{IndexUpdate}} for each such change. This would be similar to how AsyncIndexUpdate works, except that it would be done for each external change. Here we have a choice of

# Running this diff synchronously
# Running this diff asynchronously via BackgroundObserver

To ensure true sync semantics we might need to run the diff part synchronously; adding the Lucene docs to the index can then be handled depending on the index type.

This can be optimized a bit by using the {{ChangeSet}} info to check whether any changed property is covered by some index definition. Note that even if only one of the changed properties matches, we still need to perform the complete diff.
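
A minimal sketch of that pre-check, to make the idea concrete. The types {{IndexedProperties}} and {{ExternalDiffFilter}} are hypothetical stand-ins and not Oak classes; the changed property names are passed as a plain set, and a null set is assumed to mean the change summary is unavailable (for example because it overflowed), in which case the diff cannot be skipped.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the property names covered by one index definition.
final class IndexedProperties {
    final String indexPath;
    final Set<String> propertyNames;

    IndexedProperties(String indexPath, Set<String> propertyNames) {
        this.indexPath = indexPath;
        this.propertyNames = propertyNames;
    }
}

// Hypothetical helper deciding whether an external change needs a diff at all.
final class ExternalDiffFilter {
    private ExternalDiffFilter() {
    }

    /**
     * Returns true if the complete diff has to be run for an external change.
     * A null set of changed names is treated as "unknown", so we must diff.
     */
    static boolean needsDiff(Set<String> changedPropertyNames, List<IndexedProperties> indexes) {
        if (changedPropertyNames == null) {
            return true;
        }
        for (IndexedProperties index : indexes) {
            // One matching property is enough: the complete diff is still required.
            if (!Collections.disjoint(index.propertyNames, changedPropertyNames)) {
                return true;
            }
        }
        return false;
    }
}
```

The check only decides whether the diff can be skipped; it does not narrow the diff itself, in line with the note above.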

*Pros*
Can be implemented in oak-lucene itself without requiring changes in other parts

*Cons*
Adds overhead in cluster deployments. In, say, a 2-3 node cluster setup it can happen that one of the cluster nodes adds lots of content and only a very small portion of that content is indexed. With this approach we would still be performing the diff and reading all those changes even though they would not be used.

h3. B1 - Record indexed doc path as part of JournalEntry

Here the approach would be similar to how {{ChangeSet}} is accumulated for each commit (OAK-5101) and then recorded as part of the JournalEntry in DocumentNodeStore. When the background write is done, this info is saved in the backend. Later, once the background read is done, the info is collected from all journal entries and made available to the observer via CommitInfo

# Hybrid index logic would add the indexed document path data to the CommitContext of the CommitInfo. This can make use of {{LuceneDocumentHolder}} wrapped with some common interface
# DocumentNodeStore would extract this data and add it to the JournalEntry, similar to what it does for {{ChangeSet}}
# Upon save of the JournalEntry this data is saved as JSON
# Upon read this data is collected from all JournalEntry instances and a combined indexed path set is exposed
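
The steps above can be sketched roughly as follows. {{JournalEntrySketch}} is a hypothetical stand-in, not the DocumentNodeStore JournalEntry class, and the JSON layout (a plain array of path strings) is an assumption made only for illustration. The sketch keeps everything in memory; see the points below about bounding this data.

```java
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a journal entry that also carries indexed paths.
final class JournalEntrySketch {
    private final SortedSet<String> indexedPaths = new TreeSet<>();

    // Steps 1-2: paths reported by the hybrid index for a commit are copied into the entry.
    void addIndexedPaths(Collection<String> paths) {
        indexedPaths.addAll(paths);
    }

    // Step 3: serialize as a JSON array of strings.
    // Only backslash and double quote are escaped in this sketch.
    String toJson() {
        StringBuilder sb = new StringBuilder("[");
        String separator = "";
        for (String path : indexedPaths) {
            sb.append(separator)
              .append('"')
              .append(path.replace("\\", "\\\\").replace("\"", "\\\""))
              .append('"');
            separator = ",";
        }
        return sb.append(']').toString();
    }

    // Step 4: on read, combine the indexed paths of all entries into one set.
    static SortedSet<String> combine(List<JournalEntrySketch> entries) {
        SortedSet<String> combined = new TreeSet<>();
        for (JournalEntrySketch entry : entries) {
            combined.addAll(entry.indexedPaths);
        }
        return combined;
    }
}
```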

*Points to consider*
# The indexed path data is in theory unbounded (though it should be a lot smaller than the set of actually modified paths). So this data needs to be backed by StringSort
# (optional) We may also record the index paths along with the indexed paths, i.e. the paths of the indexes which index that path. In that case
## This information needs to be "encoded" when a line is added to StringSort
## The comparator should ignore the encoded index path part and compare only the indexed path
## While sorting, the index path data of lines with the same indexed path should be merged
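
To illustrate the optional encoding, here is a small in-memory sketch. It does not use StringSort; a plain list sort stands in for it. The line format {{path|indexPath|indexPath}} and the class {{IndexedPathLines}} are assumptions made for illustration, and the sketch assumes the separator character never occurs in a node path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical line codec: "<indexedPath>|<indexPath>|<indexPath>..."
final class IndexedPathLines {
    // Assumed separator; must not occur in any node path.
    static final char SEP = '|';

    // Comparator that ignores the encoded index paths and compares only the indexed path.
    static final Comparator<String> BY_PATH = Comparator.comparing(IndexedPathLines::pathOf);

    private IndexedPathLines() {
    }

    static String encode(String path, Iterable<String> indexPaths) {
        StringBuilder sb = new StringBuilder(path);
        for (String indexPath : indexPaths) {
            sb.append(SEP).append(indexPath);
        }
        return sb.toString();
    }

    static String pathOf(String line) {
        int i = line.indexOf(SEP);
        return i < 0 ? line : line.substring(0, i);
    }

    static List<String> indexPathsOf(String line) {
        int i = line.indexOf(SEP);
        if (i < 0 || i + 1 == line.length()) {
            return Collections.emptyList();
        }
        return Arrays.asList(line.substring(i + 1).split(Pattern.quote(String.valueOf(SEP))));
    }

    // Sorts by indexed path, then merges adjacent lines that share the same path.
    static List<String> sortAndMerge(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(BY_PATH);
        List<String> merged = new ArrayList<>();
        String currentPath = null;
        TreeSet<String> currentIndexes = new TreeSet<>();
        for (String line : sorted) {
            String path = pathOf(line);
            if (currentPath != null && !path.equals(currentPath)) {
                merged.add(encode(currentPath, currentIndexes));
                currentIndexes = new TreeSet<>();
            }
            currentPath = path;
            currentIndexes.addAll(indexPathsOf(line));
        }
        if (currentPath != null) {
            merged.add(encode(currentPath, currentIndexes));
        }
        return merged;
    }
}
```

With an external sort the merge step would run while sorted runs are combined rather than over an in-memory list, but the comparator and the encoding would play the same roles.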

*Pros*
Minimizes the overhead quite a lot. The indexer would only be reading the indexed paths and hence would have a lot less overhead

*Cons*

* It would require changes in DocumentNodeStore
* We would need to define some {{JournalEntryComponent}} extension point which is then used in oak-lucene

h3. B2 - Record indexed doc path as part of JournalEntry TreeNode

This is a variant of B1 where, instead of maintaining a separate set, we encode this info in the TreeNode structure maintained by JournalEntry. Currently it records only whether a child node is modified/added/removed. With this change it would also record the name of the index path, or just a boolean flag, to indicate that this path is also indexed
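
A rough sketch of the boolean flag variant. {{ChangeTreeNode}} is a hypothetical stand-in and not the actual TreeNode class in JournalEntry; it only shows how an "indexed" flag could ride along with the changed-path tree so that indexed paths are not stored a second time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the changed-path tree kept by a journal entry.
final class ChangeTreeNode {
    private final Map<String, ChangeTreeNode> children = new TreeMap<>();
    private boolean indexed;

    // Records a changed path; the flag marks whether that path was also indexed.
    void record(String path, boolean isIndexed) {
        ChangeTreeNode node = this;
        for (String name : path.split("/")) {
            if (name.isEmpty()) {
                continue; // skip the empty segment before the leading slash
            }
            node = node.children.computeIfAbsent(name, k -> new ChangeTreeNode());
        }
        node.indexed |= isIndexed;
    }

    // Walks the tree and returns only the paths flagged as indexed.
    List<String> indexedPaths() {
        List<String> result = new ArrayList<>();
        collect("", result);
        return result;
    }

    private void collect(String path, List<String> result) {
        if (indexed) {
            result.add(path.isEmpty() ? "/" : path);
        }
        for (Map.Entry<String, ChangeTreeNode> child : children.entrySet()) {
            child.getValue().collect(path + "/" + child.getKey(), result);
        }
    }
}
```

Recording the index path name instead of a flag would replace the boolean with a small set of names per node.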

*Pros*
Reduces the overhead in JournalEntry storage by avoiding storing the indexed path info twice and recording just the delta

[~mreutegg] [~catholicon] Thoughts?

> Index external changes as part of NRT indexing
> ----------------------------------------------
>
>                 Key: OAK-4808
>                 URL: https://issues.apache.org/jira/browse/OAK-4808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.5.17, 1.6
>
>
> With OAK-4412, the NRT indexing support in the hybrid index case indexes only local changes. It would be useful to have an option to index external changes as well. In an ideal world
> # Async indexing is configured at a 5 sec interval
> # DocumentNodeStore background read is configured at a 1 sec interval
> So we can index 5 external changes before changes via async indexing are picked up. In the real world delays can happen in both parts, so having such changes indexed via NRT mode would be useful to reduce the latency in reflecting external changes in query results.
> This part would introduce a cost, though, as it would do a complete diff for each external change. So while implementing, care must be taken to do it on a best effort basis, i.e. if the queue becomes large then skip certain processing etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)