Posted to oak-issues@jackrabbit.apache.org by "Matt Ryan (JIRA)" <ji...@apache.org> on 2019/06/20 14:37:00 UTC

[jira] [Created] (OAK-8421) Add oak-run option to dump extracted text for all binaries

Matt Ryan created OAK-8421:
------------------------------

             Summary: Add oak-run option to dump extracted text for all binaries
                 Key: OAK-8421
                 URL: https://issues.apache.org/jira/browse/OAK-8421
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: indexing, oak-run
    Affects Versions: 1.14.0
            Reporter: Matt Ryan


If you use {{oak-run}} to dump the extracted text from binary properties, the "generate" step skips inlined binaries and does not write them to the output CSV file.  As a result, the "extract" and "populate" steps, which both use this CSV, do not include the extracted text from those binaries in the dump.

It would be nice to add an option to the "generate" step that tells {{oak-run}} to also include inlined binaries in the CSV.  For this to work, the "extract" step would also need the node store parameter so that it could get the text from the node store when the binary is inlined.
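
A rough sketch of the proposed "generate" behavior, written in Python for brevity rather than in Oak's Java. Everything in it is an assumption made for illustration: {{BinaryRef}}, {{select_binaries}}, {{write_csv}}, the {{include_inlined}} flag and the three CSV columns are hypothetical names, not existing {{oak-run}} options or the actual CSV layout.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, TextIO


@dataclass(frozen=True)
class BinaryRef:
    """Hypothetical stand-in for a binary property found during traversal."""
    path: str      # path of the node holding the binary property
    blob_id: str   # identifier of the binary
    inlined: bool  # True if the binary is stored inline in the node store


def select_binaries(refs: Iterable[BinaryRef],
                    include_inlined: bool = False) -> Iterator[BinaryRef]:
    """Yield the binaries that "generate" should list in the CSV.

    With include_inlined=False this mirrors the behavior described in this
    issue (inlined binaries are skipped); True is the proposed option.
    """
    for ref in refs:
        if ref.inlined and not include_inlined:
            continue
        yield ref


def write_csv(refs: Iterable[BinaryRef], out: TextIO) -> None:
    """Write one row per binary; the last column is a hypothetical marker
    telling a later step whether to read from the blob store or node store."""
    writer = csv.writer(out)
    for ref in refs:
        source = "nodestore" if ref.inlined else "blobstore"
        writer.writerow([ref.blob_id, ref.path, source])


if __name__ == "__main__":
    found = [BinaryRef("/content/a", "blob-1", False),
             BinaryRef("/content/b", "inline-2", True)]
    buf = io.StringIO()
    write_csv(select_binaries(found, include_inlined=True), buf)
    print(buf.getvalue(), end="")
```

A marker column like this is one way the "extract" step could tell which rows require the node store parameter; whether the real CSV should carry such a column is an open design question.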

I'm not sure whether the "populate" step needs this too.  It tries to get the text directly from the index, so it depends on whether the extracted text of inlined binaries is also stored in the index.  I would assume it is, in which case the "populate" step would not need to be modified.

The {{oak-run}} documentation would also need to be updated; specifically this page:  [https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)