Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2015/04/20 11:39:59 UTC
[jira] [Created] (OAK-2787) Faster multi threaded indexing for binary content
Chetan Mehrotra created OAK-2787:
------------------------------------
Summary: Faster multi threaded indexing for binary content
Key: OAK-2787
URL: https://issues.apache.org/jira/browse/OAK-2787
Project: Jackrabbit Oak
Issue Type: Wish
Components: lucene
Reporter: Chetan Mehrotra
With Lucene-based indexing, the indexing process is single-threaded. This hampers the indexing of binary content, because on a multi-processor system only a single thread can be used to perform the indexing.
[~ianeboston] suggested a possible approach [1] involving two-phase indexing:
# In the first phase, detect the nodes to be indexed and start full-text extraction of the binary content. After extraction, save the binary's token stream back to the node as hidden data. In this phase the node properties can still be indexed, and a marker field would be added to indicate that the full-text index is still pending.
# Later, in the second phase, look for all such Lucene docs and update them with the saved token stream.
This would allow the text extraction logic to be decoupled from the Lucene indexing logic.
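A rough sketch of how the two phases could fit together is given here, purely for illustration. It uses only plain Java collections and an ExecutorService. The class TwoPhaseIndexingSketch, the Doc type, and the methods extractTokens, phaseOne and phaseTwo are hypothetical names and are not Oak or Lucene APIs. The in-memory "hidden" map stands in for the token stream saved back to the node, and whitespace splitting stands in for real text extraction.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names are Oak or Lucene APIs.
class TwoPhaseIndexingSketch {

    /** Stand-in for an index document (not a Lucene Document). */
    static final class Doc {
        final String path;
        boolean fulltextPending = true;              // marker field set in phase 1
        List<String> fulltextTokens = new ArrayList<>();

        Doc(String path) {
            this.path = path;
        }
    }

    /** Stand-in for text extraction: lower-case and split on whitespace. */
    static List<String> extractTokens(byte[] binary) {
        String text = new String(binary, StandardCharsets.UTF_8).trim().toLowerCase(Locale.ROOT);
        return text.isEmpty() ? new ArrayList<>() : Arrays.asList(text.split("\\s+"));
    }

    /**
     * Phase 1: add a document with the pending marker for every node and run
     * extraction on a thread pool. Returns path -> tokens, standing in for the
     * token stream saved as hidden data on the node.
     */
    static Map<String, List<String>> phaseOne(Map<String, byte[]> binaries,
                                              Map<String, Doc> index,
                                              int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<List<String>>> running = new LinkedHashMap<>();
            for (Map.Entry<String, byte[]> e : binaries.entrySet()) {
                // Node properties would be indexed here; full text is still pending.
                index.put(e.getKey(), new Doc(e.getKey()));
                running.put(e.getKey(), pool.submit(() -> extractTokens(e.getValue())));
            }
            Map<String, List<String>> hidden = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> e : running.entrySet()) {
                try {
                    hidden.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException(ex.getCause());
                }
            }
            return hidden;
        } finally {
            pool.shutdown();
        }
    }

    /** Phase 2: single-threaded pass that fills in documents still marked pending. */
    static int phaseTwo(Map<String, Doc> index, Map<String, List<String>> hidden) {
        int updated = 0;
        for (Doc doc : index.values()) {
            if (doc.fulltextPending && hidden.containsKey(doc.path)) {
                doc.fulltextTokens = hidden.get(doc.path);
                doc.fulltextPending = false;
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, byte[]> binaries = new LinkedHashMap<>();
        binaries.put("/content/a", "Hello Oak".getBytes(StandardCharsets.UTF_8));
        Map<String, Doc> index = new LinkedHashMap<>();
        Map<String, List<String>> hidden = phaseOne(binaries, index, 4);
        System.out.println("updated docs: " + phaseTwo(index, hidden));
    }
}
```

In this sketch only the extraction step runs on multiple threads, and the index itself is touched from a single thread in both phases, which is the decoupling the proposal aims for.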
[1] http://markmail.org/thread/2w5o4bwqsosb6esu
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)