Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/11/27 17:24:00 UTC
[jira] [Comment Edited] (TIKA-2507) xlsx takes more than 5 mins to parse in 1.16
[ https://issues.apache.org/jira/browse/TIKA-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267091#comment-16267091 ]
Tim Allison edited comment on TIKA-2507 at 11/27/17 5:23 PM:
-------------------------------------------------------------
Thank you for opening this issue. I'm sorry it took me so long to respond.
You're right. This xlsx file contains 564 copies of a chart. :P When I recommended opening an issue for this, I had forgotten that I had already parameterized chart/graph parsing with {{includeShapeBasedContent}}. If you set that to {{false}}, parsing this file drops back to < 1 second. With the default, {{true}}, I also see that parsing takes a really, really long time.
If you're calling Tika programmatically, you can turn this off with:
{noformat}
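import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
// Turn off shape-based content, which covers the chart/graph parsing.
OfficeParserConfig config = new OfficeParserConfig();
config.setIncludeShapeBasedContent(false);
ParseContext parseContext = new ParseContext();
parseContext.set(OfficeParserConfig.class, config);
// Then pass it along: parser.parse(stream, handler, metadata, parseContext);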
{noformat}
Or you can set this in your tika-config.xml with something like:
{noformat}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="includeShapeBasedContent" type="bool">false</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="includeShapeBasedContent" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
{noformat}
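To get a rough idea of whether a given workbook is likely to hit this slow path, one option is to count the chart parts inside the .xlsx package before handing it to Tika. The sketch below is not part of Tika: {{ChartPartCounter}} and {{countChartParts}} are hypothetical names, and it assumes the conventional Excel layout where each chart is stored as {{xl/charts/chartN.xml}}. Files written by other tools may name their parts differently, so treat the result as a hint only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper (not a Tika class): counts chart parts in an OOXML package. */
final class ChartPartCounter {

    private ChartPartCounter() {
    }

    /**
     * Counts zip entries that look like chart parts.
     * Assumption: charts live at xl/charts/chartN.xml, as Excel writes them.
     * Relationship files (xl/charts/_rels/...) and style/color parts are not counted.
     */
    static int countChartParts(InputStream ooxml) throws IOException {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("xl/charts/chart") && name.endsWith(".xml")) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

A caller could open the file with {{Files.newInputStream(...)}}, pass the stream to {{countChartParts}}, and only set {{includeShapeBasedContent}} to {{false}} when the count is large. Where to draw that line is a judgment call that this issue does not settle.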
> xlsx takes more than 5 mins to parse in 1.16
> --------------------------------------------
>
> Key: TIKA-2507
> URL: https://issues.apache.org/jira/browse/TIKA-2507
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.16
> Environment: started server with
> {noformat}
> java -jar tika-server-1.16.jar
> {noformat}
> Reporter: José Borges Ferreira
> Assignee: Tim Allison
> Attachments: Tika.1.16-killer.xlsx
>
>
> When sending an xlsx file with a lot of charts, the Tika server takes more than 5 minutes to process it on my 2.2 GHz MacBook Pro.
> In version 1.15 this takes less than a second. Looking at the changelog, I'm guessing this is related to some features introduced in 1.16, namely:
> # Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
> # Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
> I'm attaching the file.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)