
Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2018/05/25 10:44:00 UTC

[jira] [Resolved] (OAK-7353) oak-run tika extraction should support getting assistance from stored indexed data from a lucene index

     [ https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikas Saurabh resolved OAK-7353.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.3
                   1.10

Fixed in trunk at [r1832231|https://svn.apache.org/r1832231]. Also, documented at [r1832232|https://svn.apache.org/r1832232] (still to be published).

> oak-run tika extraction should support getting assistance from stored indexed data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-7353
>                 URL: https://issues.apache.org/jira/browse/OAK-7353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, oak-run
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>             Fix For: 1.10, 1.9.3
>
>
> oak-run supports pre-text-extraction \[0], which does a great job of extracting text in parallel so that it can be ingested later during indexing.
> But:
> * it still reaches out to the datastore, which, in the case of S3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want pre-extracted text is reindexing, say on an update of the index definition which won't impact extracted data from binaries (basically, updates which don't change the Tika configuration).
> In those cases, it's often possible that there is a version of indexed data from an older version of the index definition that can supply the extracted text (as its binary properties are indexed as stored fields).
> So, essentially, it would be nice to have Tika-based pre-text-extraction be able to consult an index and pick extracted text from there to fill up the text extraction store. Of course, if the index doesn't have data for a given binary, it should still fall back to extracting it.
> \[0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html
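
The lookup-then-fall-back flow requested in the description can be sketched roughly as follows. This is only an illustration of the idea, not the change committed in r1832231: the class name, the populate method, and both function parameters are hypothetical and are not Oak or oak-run APIs. The storedTextLookup function stands in for reading a stored field from an existing Lucene index, and the extractor function stands in for fetching the binary from the datastore and running Tika on it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: none of these names are Oak or oak-run APIs.
final class IndexAssistedExtractionSketch {

    private IndexAssistedExtractionSketch() {
    }

    /**
     * Fills a text store for the given binary ids. For each id, text already
     * stored in an index is preferred; the extractor runs only when the
     * lookup returns nothing for that id.
     */
    static Map<String, String> populate(
            List<String> binaryIds,
            Function<String, Optional<String>> storedTextLookup, // assumed: stored field from an older index
            Function<String, String> extractor) {                // assumed: datastore read plus Tika parse
        Map<String, String> textStore = new LinkedHashMap<>();
        for (String id : binaryIds) {
            // orElseGet is lazy, so the expensive extractor is skipped on an index hit.
            String text = storedTextLookup.apply(id)
                    .orElseGet(() -> extractor.apply(id));
            textStore.put(id, text);
        }
        return textStore;
    }
}
```

With this shape, a binary whose text is found in the index never touches the datastore, which is the saving the description is after for slow backends such as S3.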



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)