Posted to oak-issues@jackrabbit.apache.org by "Ian Boston (JIRA)" <ji...@apache.org> on 2015/05/01 16:39:06 UTC

[jira] [Commented] (OAK-2787) Faster multi threaded indexing for binary content

    [ https://issues.apache.org/jira/browse/OAK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523260#comment-14523260 ] 

Ian Boston commented on OAK-2787:
---------------------------------

The description matches what I said. Tokenizing (text extraction) immutable content once and storing the result has the following trade-offs:

Pros:

* The cost is incurred once, which is important where it is expensive, as with remastered PDFs.
* Reindexing costs are greatly reduced.
* An item's properties can be indexed before tokenizing is complete, reducing latency between the repository and the index.
* The impact of performing tokenization can be controlled or offloaded.

Cons:

* Assumes content bodies are immutable.
* Slight increase in storage requirements to hold the tokenized stream.
* Potentially more latency where the tokenization process is intentionally resource constrained.
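
As a rough illustration of the "extract once, store the result" idea, here is a minimal sketch in Python. It is not Oak code: the ExtractedTextStore class, the slow_extract stand-in and the use of a SHA-256 digest as the key are all assumptions made for the example. It relies on the same assumption listed under Cons, namely that content bodies are immutable, so a digest of the bytes identifies the extracted text.

```python
import hashlib
from typing import Callable, Dict


class ExtractedTextStore:
    """Hypothetical store: caches extracted text keyed by a digest of the binary."""

    def __init__(self, extractor: Callable[[bytes], str]) -> None:
        self._extractor = extractor
        self._cache: Dict[str, str] = {}
        self.extractions = 0  # counts how often the expensive step really ran

    def text_for(self, binary: bytes) -> str:
        # Immutable content: identical bytes always yield the same key.
        key = hashlib.sha256(binary).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._extractor(binary)
            self.extractions += 1
        return self._cache[key]


def slow_extract(binary: bytes) -> str:
    # Stand-in for a real (expensive) text extractor; assumption for the sketch.
    return binary.decode("utf-8", errors="replace").lower()


store = ExtractedTextStore(slow_extract)
first = store.text_for(b"Hello PDF")   # initial indexing pays the extraction cost
second = store.text_for(b"Hello PDF")  # a simulated reindex reuses the stored text
```

After both calls, store.extractions is still 1, which is the reindexing saving described above. The extra storage is the cached text itself.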

> Faster multi threaded indexing for binary content
> -------------------------------------------------
>
>                 Key: OAK-2787
>                 URL: https://issues.apache.org/jira/browse/OAK-2787
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: lucene
>            Reporter: Chetan Mehrotra
>
> With Lucene-based indexing, the indexing process is single threaded. This hampers the indexing of binary content, as on a multi-processor system only a single thread can be used to perform the indexing.
> [~ianeboston] suggested a possible approach [1] involving two-phase indexing:
> # In the first phase, detect the nodes to be indexed and start the full text extraction of the binary content. After extraction, save the binary token stream back to the node as hidden data. In this phase the node properties can still be indexed, and a marker field would be added to indicate that the fulltext index is still pending.
> # Later, in the second phase, look for all such Lucene docs and update them with the saved token stream.
> This would allow the text extraction logic to be decoupled from the Lucene indexing logic.
> [1] http://markmail.org/thread/2w5o4bwqsosb6esu
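
The two phases in the quoted description could be sketched roughly as follows. This is an illustrative Python model only, not Oak or Lucene code: the index and the hidden node data are plain dictionaries, the marker field name ":fulltextPending" is made up, and plain extracted text stands in for the binary token stream.

```python
from concurrent.futures import ThreadPoolExecutor

PENDING = ":fulltextPending"  # hypothetical marker field name


def extract_text(binary: bytes) -> str:
    # Stand-in for real full text extraction; assumption for the sketch.
    return binary.decode("utf-8", errors="replace")


def phase_one(nodes, index, hidden_store, pool):
    """Index properties right away and run extraction on worker threads."""
    futures = {}
    for path, node in nodes.items():
        doc = dict(node["properties"])
        doc[PENDING] = True  # fulltext for this doc is still outstanding
        index[path] = doc
        futures[path] = pool.submit(extract_text, node["binary"])
    for path, future in futures.items():
        # Save the extraction result back as hidden data for the node.
        hidden_store[path] = future.result()


def phase_two(index, hidden_store):
    """Update every pending doc with its saved extraction result."""
    for path, doc in index.items():
        if doc.get(PENDING):
            doc["fulltext"] = hidden_store[path]
            del doc[PENDING]


nodes = {
    "/a": {"properties": {"title": "A"}, "binary": b"alpha text"},
    "/b": {"properties": {"title": "B"}, "binary": b"beta text"},
}
index, hidden = {}, {}
with ThreadPoolExecutor(max_workers=2) as pool:
    phase_one(nodes, index, hidden, pool)
pending_after_phase_one = sorted(p for p, d in index.items() if d.get(PENDING))
phase_two(index, hidden)
```

In this model the property fields are searchable as soon as phase one has written the doc, while extraction runs on as many threads as the pool allows. Phase two only touches docs that still carry the marker.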



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)