Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2018/05/25 10:44:00 UTC
[jira] [Resolved] (OAK-7353) oak-run tika extraction should support
getting assistance from stored indexed data from a lucene index
[ https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vikas Saurabh resolved OAK-7353.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.9.3, 1.10
Fixed in trunk at [r1832231|https://svn.apache.org/r1832231]. Also, documented at [r1832232|https://svn.apache.org/r1832232] (still to be published).
> oak-run tika extraction should support getting assistance from stored indexed data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
> Key: OAK-7353
> URL: https://issues.apache.org/jira/browse/OAK-7353
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene, oak-run
> Reporter: Vikas Saurabh
> Assignee: Vikas Saurabh
> Priority: Major
> Fix For: 1.10, 1.9.3
>
>
> oak-run supports pre-text-extraction \[0], which does a great job of extracting text in parallel so that it can be ingested later during indexing.
> But:
> * it still reaches out to the datastore, which, in the case of S3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want pre-extracted text is reindexing, say after an index definition update that doesn't affect the text extracted from binaries (basically, updates that don't change the Tika configuration).
> In those cases, an index built from the older version of the index definition can often supply the extracted text (as its binary properties are indexed as stored fields).
> So, essentially, it would be nice for Tika-based pre-text-extraction to be able to consult an index and pick extracted text from there to fill up the text extraction store. Of course, if the index doesn't have data for a given binary, it should still fall back to extracting it.
> \[0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html
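The lookup-then-fallback behaviour requested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code committed for OAK-7353: the class `IndexAssistedTextStore`, its methods, and the use of plain maps in place of a Lucene index and the text extraction store are all hypothetical, and the `extractor` function merely stands in for fetching a binary from the datastore and running Tika on it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names are Oak or oak-run APIs.
final class IndexAssistedTextStore {
    // Stand-in for stored fields of an existing index: blob id -> extracted text.
    private final Map<String, String> storedIndexText;
    // Stand-in for "read the binary from the datastore and run Tika on it".
    private final Function<String, String> extractor;
    // Stand-in for the pre-extracted text store that indexing reads later.
    private final Map<String, String> textStore = new HashMap<>();
    // Counts how often the expensive fallback path was taken.
    private int extractions = 0;

    IndexAssistedTextStore(Map<String, String> storedIndexText,
                           Function<String, String> extractor) {
        this.storedIndexText = storedIndexText;
        this.extractor = extractor;
    }

    // Fills the text store for one binary, preferring text already held by the index.
    String populate(String blobId) {
        String existing = textStore.get(blobId);
        if (existing != null) {
            return existing; // already populated, nothing to do
        }
        String text = storedIndexText.get(blobId);
        if (text == null) {
            // The index has no data for this binary: fall back to real extraction.
            extractions++;
            text = extractor.apply(blobId);
        }
        textStore.put(blobId, text);
        return text;
    }

    int extractionCount() {
        return extractions;
    }
}
```

With such a shape, a binary whose text is present in the old index never touches the datastore, and only binaries missing from the index pay the extraction cost.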
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)