Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2016/06/21 10:47:58 UTC
[jira] [Commented] (OAK-2787) Faster multi threaded indexing for binary content

    [ https://issues.apache.org/jira/browse/OAK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341545#comment-15341545 ] 

Chetan Mehrotra commented on OAK-2787:
--------------------------------------

Another approach, suggested by [~tmueller]:

{quote}As an alternative, would it be possible to do text extraction as part of the datastore, before (or as part of) adding the binary? That would slow down upload a bit (would that be a problem?), but distribute the load (I assume binaries are uploaded on each cluster node concurrently). Basically, when the binary is there, the extracted text is there as well. We would need to extend the datastore API somewhat (for example, add a "TextExtractingDataStore" interface, and a wrapper around the existing datastore). But it would be simpler, as there is no need to maintain the state of extraction, and no need to coordinate and distribute the load. If it's not feasible, it would be good to know why and in what cases (maybe a hybrid approach can be used).
{quote}
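
To make the wrapper idea in the quote concrete, here is a rough sketch. It assumes a much simplified datastore contract: SimpleDataStore, InMemoryDataStore and ExtractingWrapper are hypothetical names and not the Jackrabbit DataStore API, and only the TextExtractingDataStore name is taken from the quote. The extractor is passed in as a plain function, and the extracted text is kept in memory purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class TextExtractingDataStoreSketch {

    // Simplified stand-in for a datastore; NOT the Jackrabbit DataStore API.
    interface SimpleDataStore {
        String addRecord(byte[] data);   // returns the record identifier
        byte[] getRecord(String id);
    }

    // Hypothetical extension, named after the interface proposed in the quote.
    interface TextExtractingDataStore extends SimpleDataStore {
        String getExtractedText(String id);
    }

    // Trivial in-memory store so that the sketch runs on its own.
    static final class InMemoryDataStore implements SimpleDataStore {
        private final Map<String, byte[]> records = new HashMap<>();

        @Override
        public String addRecord(byte[] data) {
            String id = "rec-" + records.size();
            records.put(id, data.clone());
            return id;
        }

        @Override
        public byte[] getRecord(String id) {
            byte[] data = records.get(id);
            return data == null ? null : data.clone();
        }
    }

    // Wrapper that extracts text while the binary is being added, so the
    // text exists as soon as the binary does and no extraction state is tracked.
    static final class ExtractingWrapper implements TextExtractingDataStore {
        private final SimpleDataStore delegate;
        private final Function<byte[], String> extractor;
        // Assumption: a real implementation would persist this next to the binary.
        private final Map<String, String> textById = new HashMap<>();

        ExtractingWrapper(SimpleDataStore delegate, Function<byte[], String> extractor) {
            this.delegate = delegate;
            this.extractor = extractor;
        }

        @Override
        public String addRecord(byte[] data) {
            String id = delegate.addRecord(data);
            textById.put(id, extractor.apply(data)); // this is the extra upload cost
            return id;
        }

        @Override
        public byte[] getRecord(String id) {
            return delegate.getRecord(id);
        }

        @Override
        public String getExtractedText(String id) {
            return textById.get(id);
        }
    }
}
```

In this shape the upload path pays for extraction once per binary, which is the trade-off mentioned in the quote.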

One possible way to implement this would be to:
# Have a new CommitHook which looks for new binary properties being added
# Have it perform text extraction on those binaries
# Store the extracted text as a hidden binary property on the same node
# Have LuceneIndexEditor look for this property

This should be easy to implement and can be enabled or disabled as per setup requirements.
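
A rough sketch of the hook steps above, to show the shape of the idea. It does not use the Oak CommitHook or NodeState APIs: a node is modelled as a plain map of property name to value, binaries are byte[] values, and TextExtractionHookSketch, HIDDEN_PREFIX and extractedTextFor are hypothetical names. The only Oak convention assumed is that hidden names start with ':'. The extractor is a pluggable function, and the one used in main is a stand-in that just decodes UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-in, NOT the Oak CommitHook API: a node is a plain map of
// property name -> value, and binary properties are byte[] values.
final class TextExtractionHookSketch {

    // Hypothetical prefix for the hidden property; hidden names start with ':'.
    static final String HIDDEN_PREFIX = ":extracted-";

    private final Function<byte[], String> extractor;

    TextExtractionHookSketch(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    // Steps 1-3: find binary properties that are new or changed in 'after',
    // extract their text and store it as a hidden binary property on the node.
    Map<String, Object> processCommit(Map<String, Object> before, Map<String, Object> after) {
        Map<String, Object> result = new HashMap<>(after);
        for (Map.Entry<String, Object> entry : after.entrySet()) {
            String name = entry.getKey();
            Object value = entry.getValue();
            if (!(value instanceof byte[]) || name.startsWith(":")) {
                continue; // not a binary, or already a hidden property
            }
            Object old = before.get(name);
            boolean unchanged = old instanceof byte[]
                    && Arrays.equals((byte[]) old, (byte[]) value);
            if (unchanged) {
                continue; // nothing new to extract
            }
            String text = extractor.apply((byte[]) value);
            result.put(HIDDEN_PREFIX + name, text.getBytes(StandardCharsets.UTF_8));
        }
        return result;
    }

    // Step 4: what an index editor could call instead of extracting text itself.
    static String extractedTextFor(Map<String, Object> node, String name) {
        Object value = node.get(HIDDEN_PREFIX + name);
        return value instanceof byte[]
                ? new String((byte[]) value, StandardCharsets.UTF_8)
                : null;
    }

    public static void main(String[] args) {
        // Stand-in extractor: treats the binary as UTF-8 text. A real setup
        // would plug in a proper text extraction library here.
        TextExtractionHookSketch hook =
                new TextExtractionHookSketch(b -> new String(b, StandardCharsets.UTF_8));
        Map<String, Object> after = new HashMap<>();
        after.put("jcr:data", "hello oak".getBytes(StandardCharsets.UTF_8));
        Map<String, Object> committed = hook.processCommit(new HashMap<>(), after);
        System.out.println(extractedTextFor(committed, "jcr:data"));
    }
}
```

Since the hook only adds a hidden property, enabling or disabling it would not change the visible content; when the property is absent the index editor would have to fall back to extracting the text itself.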

> Faster multi threaded indexing for binary content
> -------------------------------------------------
>
>                 Key: OAK-2787
>                 URL: https://issues.apache.org/jira/browse/OAK-2787
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: lucene
>            Reporter: Chetan Mehrotra
>
> With Lucene based indexing the indexing process is single threaded. This hampers the indexing of binary content, as on a multi-processor system only a single thread can be used to perform the indexing.
> [~ianeboston] suggested a possible approach [1] involving 2-phase indexing:
> # In the first phase, detect the nodes to be indexed and start the full text extraction of the binary content. After extraction, save the binary token stream back to the node as hidden data. In this phase the node properties can still be indexed, and a marker field would be added to indicate that the fulltext index is still pending
> # Later, in the 2nd phase, look for all such Lucene docs and update them with the saved token stream
> This would allow the text extraction logic to be decoupled from the Lucene indexing logic.
> [1] http://markmail.org/thread/2w5o4bwqsosb6esu


