Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2015/08/05 07:30:04 UTC

[jira] [Commented] (OAK-3092) Cache recently extracted text to avoid duplicate extraction

    [ https://issues.apache.org/jira/browse/OAK-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654843#comment-14654843 ] 

Chetan Mehrotra commented on OAK-3092:
--------------------------------------

On one of the systems the following logs can be seen:

{noformat}
04.08.2015 14:50:05.074 *INFO* [pool-9-thread-1] org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext Text extraction stats  60 (Time Taken 3 min, 1 sec, Bytes Read 93.1 MB, Extracted text size 647940) 
04.08.2015 14:53:30.923 *INFO* [pool-9-thread-3] org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext Text extraction stats  60 (Time Taken 3 min, 4 sec, Bytes Read 93.1 MB, Extracted text size 647940) 
04.08.2015 15:39:04.539 *INFO* [pool-9-thread-2] org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext Text extraction stats  48 (Time Taken 2 min, 1 sec, Bytes Read 35.5 MB, Extracted text size 2866840)
{noformat}

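This setup had aggregation rules defined, so the same binary might go through text extraction multiple times in a given indexing cycle. The first two entries report the same bytes read and the same extracted text size, which is consistent with the same binaries being extracted twice. Having the proposed cache would help here.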

> Cache recently extracted text to avoid duplicate extraction
> -----------------------------------------------------------
>
>                 Key: OAK-3092
>                 URL: https://issues.apache.org/jira/browse/OAK-3092
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.5
>
>
> Text can be extracted from the same binary multiple times in a given indexing cycle. This can happen for two reasons:
> # Multiple Lucene indexes indexing the same node - A system might have multiple Lucene indexes, e.g. a global Lucene index and an index for a specific nodeType. In a given indexing cycle the same file would be picked up by both index definitions, and both would extract the same text
> # Aggregation - With index time aggregation the same file gets picked up multiple times due to aggregation rules
> To avoid the wasted effort of duplicate text extraction from the same file in a given indexing cycle, it would be better to have an expiring cache which can hold on to extracted text content for some time. The cache should have the following features:
> # Limit on total size
> # Way to expire the content using [Timed Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction] - As the chances of the same file getting picked up are high only within a given indexing cycle, it would be better to expire the cache entries after some time to avoid hogging memory unnecessarily
> Such a cache would provide the following benefits:
> # Avoid duplicate text extraction - Text extraction is costly and has to be minimized on the critical path of {{indexEditor}}
> # Avoid expensive IO, especially if binary content is to be fetched from a remote {{BlobStore}}
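
The two requested features (a limit on total size and timed expiry) could be sketched roughly as below. This is only an illustration under stated assumptions, not the change made for this issue: the class name {{ExtractedTextCache}}, the use of a blob id string as key, and measuring size in characters are all hypothetical. It uses a plain JDK {{LinkedHashMap}} in access order instead of the Guava cache linked above, purely so the sketch is self-contained.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch: size-bounded, time-expiring cache of extracted text keyed by blob id. */
class ExtractedTextCache {
    private static final class Entry {
        final String text;
        final long writtenAtMillis;
        Entry(String text, long writtenAtMillis) {
            this.text = text;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long maxChars;      // assumed size limit, counted in characters
    private final long ttlMillis;     // assumed time-to-live after write
    private final LongSupplier clock; // injectable clock, e.g. System::currentTimeMillis
    // accessOrder=true: iteration starts at the least recently used entry
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>(16, 0.75f, true);
    private long totalChars;

    ExtractedTextCache(long maxChars, long ttlMillis, LongSupplier clock) {
        this.maxChars = maxChars;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached text, or null if absent or expired. */
    synchronized String get(String blobId) {
        Entry e = entries.get(blobId);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() - e.writtenAtMillis >= ttlMillis) {
            remove(blobId); // timed eviction: drop stale entry
            return null;
        }
        return e.text;
    }

    /** Stores extracted text, evicting least recently used entries to stay within the limit. */
    synchronized void put(String blobId, String text) {
        remove(blobId); // replace any previous entry for this blob
        if (text.length() > maxChars) {
            return; // larger than the whole budget, do not cache
        }
        entries.put(blobId, new Entry(text, clock.getAsLong()));
        totalChars += text.length();
        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (totalChars > maxChars && it.hasNext()) {
            totalChars -= it.next().getValue().text.length();
            it.remove();
        }
    }

    synchronized long size() {
        return totalChars;
    }

    private void remove(String blobId) {
        Entry old = entries.remove(blobId);
        if (old != null) {
            totalChars -= old.text.length();
        }
    }
}
```

A caller would check {{get(blobId)}} before running extraction and call {{put(blobId, text)}} afterwards, so that a second index definition or an aggregation rule hitting the same binary in the same cycle reuses the text. With Guava the same behaviour would presumably be configured through {{CacheBuilder}} with a maximum weight and an expiry setting.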



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)