Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2018/11/05 14:45:57 UTC

[Tika Wiki] Update of "TikaExtractingEmbeddedCode" by TimothyAllison

The "TikaExtractingEmbeddedCode" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaExtractingEmbeddedCode

New page:
= Extracting Embedded VBA and JS =
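By default, Tika does not extract embedded VBA macros or JavaScript. To enable extraction, exclude each affected parser from the `DefaultParser` in `tika-config.xml` and declare it again with its extraction parameter set to `true`: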

{{{
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.html.HtmlParser"/>
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
            <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
            <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
        </parser>

        <parser class="org.apache.tika.parser.html.HtmlParser">
            <params>
                <param name="extractScripts" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractActions" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
            <params>
                <param name="extractMacros" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.microsoft.OfficeParser">
            <params>
                <param name="extractMacros" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
}}}

We encourage using the `RecursiveParserWrapper`, which makes the extracted data easier to interpret and makes the boundaries between the parent file and its embedded files explicit. It is available through the `-J` option in `tika-app` and the `/rmeta` endpoint in `tika-server`.
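
As a rough illustration of the `/rmeta` route, the following sketch sends a file to a running `tika-server` and prints the JSON response. It uses only the JDK's HTTP client (Java 11 or later). The class name `RmetaClient`, the helper `buildRmetaRequest`, and the server address `http://localhost:9998/` are assumptions made for this example, as is the premise that the server was started with the `tika-config.xml` shown above; adjust them to your setup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper class for this example; it is not part of Tika.
class RmetaClient {

    // Assumption: tika-server is running locally on port 9998.
    static final URI DEFAULT_SERVER = URI.create("http://localhost:9998/");

    /** Builds a PUT request that uploads the file's bytes to the /rmeta endpoint. */
    static HttpRequest buildRmetaRequest(URI server, Path file) throws IOException {
        return HttpRequest.newBuilder(server.resolve("/rmeta"))
                .header("Accept", "application/json")
                // ofFile throws FileNotFoundException if the file does not exist
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // The file to analyse is taken from the first command-line argument.
        Path file = Path.of(args[0]);
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildRmetaRequest(DEFAULT_SERVER, file),
                HttpResponse.BodyHandlers.ofString());
        // Print the raw JSON returned by the server.
        System.out.println(response.body());
    }
}
```

The response is expected to be a JSON array with one metadata object for the parent file and one for each embedded document, so extracted macros and scripts should show up as separate entries rather than being mixed into the parent's text.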