Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2017/07/26 13:29:00 UTC

[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

    [ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101666#comment-16101666 ] 

Thomas Mueller commented on OAK-5519:
-------------------------------------

[~catholicon] and [~chetanm] I think we should try the "Memory of bad file" solution, if that's simple. 

I assume we could first write a test case that uses a "custom" Tika config, as documented in http://jackrabbit.apache.org/oak/docs/query/lucene.html#Tika_Config, custom in that it does nothing except throw an exception / error / out of memory error every time. Then check whether this runs into an endless loop. Then remember the file if it fails *three times* in a row. I think it would be better to wait for three failures, because the first one might be due to a non-repeatable problem (for example, out of memory caused by another thread).
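
To make the "three times in a row" idea concrete, here is a minimal sketch of such a memory. It is not existing Oak code: the class name BadBinaryMemory, its methods, and the use of a plain string blob id as the key are all assumptions made for illustration, and the state is held in memory only (a real solution would probably need to persist it).

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Oak): remembers binaries whose text
 * extraction failed several times in a row, so the indexer can skip them.
 */
class BadBinaryMemory {
    private final int maxConsecutiveFailures;
    // Current failure streak per blob id (assumed to be a stable string key).
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();
    // Blob ids that reached the threshold and should be skipped.
    private final Set<String> bad = ConcurrentHashMap.newKeySet();

    BadBinaryMemory(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Records a failed extraction; returns true once the binary is marked bad. */
    boolean recordFailure(String blobId) {
        int count = failures.merge(blobId, 1, Integer::sum);
        if (count >= maxConsecutiveFailures) {
            bad.add(blobId);
        }
        return bad.contains(blobId);
    }

    /** A success breaks the streak and clears any earlier "bad" mark. */
    void recordSuccess(String blobId) {
        failures.remove(blobId);
        bad.remove(blobId);
    }

    /** True if the indexer should skip text extraction for this binary. */
    boolean shouldSkip(String blobId) {
        return bad.contains(blobId);
    }
}
```

With new BadBinaryMemory(3), the indexer would call shouldSkip(id) before extraction, recordFailure(id) in the catch block around extraction, and recordSuccess(id) otherwise. Resetting the streak on success is what keeps a one-off failure (such as an out of memory error caused by another thread) from marking a file as bad.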

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>              Labels: resilience
>             Fix For: 1.8
>
>
> If text extraction is blocked (weird PDF), a blob cannot be found in the datastore, or any other error outside the scope of the indexer occurs while indexing one item from the repository, it currently halts the indexing (lane). Thus one item (that maybe isn't important to the users at all) can block the indexing of other, new content (that might be important to users), and it always requires manual intervention (which is also not easy and requires Oak experts).
> Instead, the item could be remembered in a known issue list, proper warnings given, and indexing continue. Maintenance operations should be available to come back to reindex these, or the indexer could automatically retry after some time. This would allow normal user activity to go on without manual intervention, and solving the problem (if it's isolated to some binaries) can be deferred.
> I think the line should probably be drawn for binary properties. Not sure if other JCR property types could trigger a similar issue, and if a failure in them might actually warrant a halt, as it could lead to an "incorrect" index, if these properties are important. But maybe the line is simply a try & catch around "full text extraction".


