Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2018/03/16 13:39:00 UTC

[jira] [Created] (OAK-7353) oak-run tika extraction should support getting assistance from stored indexed data from a lucene index

Vikas Saurabh created OAK-7353:
----------------------------------

             Summary: oak-run tika extraction should support getting assistance from stored indexed data from a lucene index
                 Key: OAK-7353
                 URL: https://issues.apache.org/jira/browse/OAK-7353
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: lucene, oak-run
            Reporter: Vikas Saurabh
            Assignee: Vikas Saurabh


oak-run supports pre-text-extraction \[0], which does a great job of extracting text in parallel so that it can be ingested later during indexing.

But:
* it still reaches out to the datastore, which, in the case of S3, can be very slow
* it still does extraction (duh!) - which is expensive

A common case where we want pre-extracted text is reindexing, say after an index definition update that doesn't affect the text extracted from binaries (basically, updates that don't change the tika configuration).
In those cases, there is often a version of the indexed data, built from an older version of the index definition, that can supply the extracted text (since its binary properties are indexed as stored fields).

So, essentially, it would be nice for tika-based pre-text-extraction to be able to consult an index and pick up extracted text from there to fill the text extraction store. Of course, if the index doesn't have data for a given binary, it should still fall back to extracting it.
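
The lookup-then-fall-back flow described above could be sketched roughly as follows. This is only an illustration and not oak-run or Oak API: the class name, the two function parameters (one standing in for a stored-field lookup in the lucene index, the other for tika extraction against the datastore), and the in-memory map standing in for the text extraction store are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of index-assisted pre-text-extraction; not Oak's actual API. */
class IndexAssistedPreExtractor {
    // Assumed stand-in for "look up stored extracted text for this blob in an existing index".
    private final Function<String, Optional<String>> indexLookup;
    // Assumed stand-in for "fetch the binary from the datastore and run tika on it".
    private final Function<String, String> extractor;
    // Assumed stand-in for the text extraction store, keyed by blob id.
    private final Map<String, String> textStore = new HashMap<>();

    int servedFromIndex = 0;
    int extracted = 0;

    IndexAssistedPreExtractor(Function<String, Optional<String>> indexLookup,
                              Function<String, String> extractor) {
        this.indexLookup = indexLookup;
        this.extractor = extractor;
    }

    /** Fills the store for one blob and returns the text that was stored. */
    String populate(String blobId) {
        String existing = textStore.get(blobId);
        if (existing != null) {
            return existing; // already in the store, nothing to do
        }
        Optional<String> fromIndex = indexLookup.apply(blobId);
        String text;
        if (fromIndex.isPresent()) {
            // Cheap path: no datastore access and no extraction.
            text = fromIndex.get();
            servedFromIndex++;
        } else {
            // Fallback: the index has no data for this binary, so extract it.
            text = extractor.apply(blobId);
            extracted++;
        }
        textStore.put(blobId, text);
        return text;
    }
}
```

In this sketch the expensive extractor runs only when the index lookup comes back empty, which is the case the fallback is meant to cover.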

\[0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)