Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2018/11/05 14:45:57 UTC
[Tika Wiki] Update of "TikaExtractingEmbeddedCode" by TimothyAllison
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaExtractingEmbeddedCode" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaExtractingEmbeddedCode
New page:
= Extracting Embedded VBA and JS =
By default, Tika does not extract embedded VBA macros or JavaScript. To enable extraction, configure the relevant parsers in `tika-config.xml`:
{{{
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude the parsers that are configured individually below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.html.HtmlParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <!-- HTML: extract embedded scripts -->
    <parser class="org.apache.tika.parser.html.HtmlParser">
      <params>
        <param name="extractScripts" type="bool">true</param>
      </params>
    </parser>
    <!-- PDF: extract actions -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractActions" type="bool">true</param>
      </params>
    </parser>
    <!-- Microsoft Office (OOXML and older formats): extract macros -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="extractMacros" type="bool">true</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="extractMacros" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
}}}
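The config above follows one pattern throughout: every parser that gets its own `<params>` is also listed as a `<parser-exclude>` under `DefaultParser`. As a rough sanity check when editing such a file by hand, the sketch below reads a config and reports which parsers are configured and whether any of them is missing from the exclude list. It is not part of Tika; the function name `check_config` is made up for illustration, and it assumes the element layout shown above.

```python
import xml.etree.ElementTree as ET

DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser"


def check_config(xml_text):
    """Return (configured, missing) for a tika-config.xml document.

    configured: dict mapping parser class -> {param name: param value}
    missing:    sorted list of configured parser classes that are NOT
                excluded from DefaultParser
    Assumes the <properties>/<parsers>/<parser> layout shown above.
    """
    root = ET.fromstring(xml_text)
    excluded = set()
    configured = {}
    for parser in root.findall("./parsers/parser"):
        cls = parser.get("class")
        if cls == DEFAULT_PARSER:
            # Collect every class excluded from the default parser.
            for exclude in parser.findall("parser-exclude"):
                excluded.add(exclude.get("class"))
        else:
            # Collect the params set on an individually configured parser.
            configured[cls] = {
                param.get("name"): (param.text or "").strip()
                for param in parser.findall("./params/param")
            }
    missing = sorted(set(configured) - excluded)
    return configured, missing
```

To use it, read the file as bytes and pass the contents, for example `check_config(open("tika-config.xml", "rb").read())`. An empty `missing` list means the file follows the same pattern as the example.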
We recommend the `RecursiveParserWrapper`, which makes the extracted data easier to interpret by keeping the boundaries between the parent file and its embedded files explicit. It is available through the `-J` option in `tika-app` and the `/rmeta` endpoint in `tika-server`.
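As an illustration of the `/rmeta` route, here is a minimal client sketch using only the Python standard library. It rests on several assumptions not stated on this page: that `tika-server` was started with the config above and is reachable at a URL such as `http://localhost:9998`, that `/rmeta` accepts the file as the body of a PUT request and returns a JSON list with one metadata object per file (the parent first, then each embedded file), and that the keys `X-TIKA:embedded_resource_path`, `Content-Type` and `X-TIKA:content` are present. The function names are made up for illustration.

```python
import json
import urllib.request


def build_rmeta_request(server_url, data):
    """Build a PUT request that sends raw file bytes to /rmeta."""
    url = server_url.rstrip("/") + "/rmeta"
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "application/json"}
    )


def summarize(rmeta_json):
    """Return (path, content type, extracted text length) per entry.

    Assumes the response is a JSON list of metadata objects and that the
    parent file has no embedded resource path, so it is reported as "/".
    """
    rows = []
    for metadata in json.loads(rmeta_json):
        path = metadata.get("X-TIKA:embedded_resource_path", "/")
        content_type = metadata.get("Content-Type", "")
        content = metadata.get("X-TIKA:content", "")
        rows.append((path, content_type, len(content)))
    return rows


def fetch_rmeta(server_url, file_path):
    """Send one file to the server and return the summarized entries."""
    with open(file_path, "rb") as handle:
        request = build_rmeta_request(server_url, handle.read())
    with urllib.request.urlopen(request) as response:
        return summarize(response.read().decode("utf-8"))
```

A call such as `fetch_rmeta("http://localhost:9998", "report.docm")` (the file name is a placeholder) would then list the parent document followed by its embedded files. With macro extraction enabled, any extracted macros should appear as separate entries.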