Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/11/27 17:24:00 UTC

[jira] [Comment Edited] (TIKA-2507) xlsx takes more than 5 mins to parse in 1.16

    [ https://issues.apache.org/jira/browse/TIKA-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267091#comment-16267091 ] 

Tim Allison edited comment on TIKA-2507 at 11/27/17 5:23 PM:
-------------------------------------------------------------

Thank you for opening this issue.  I'm sorry it took me so long to respond.

You're right.  This xlsx file contains 564 copies of a chart. :P When I recommended opening an issue for this, I had forgotten that I had already parameterized chart/graph parsing with {{includeShapeBasedContent}}.  If you set that to {{false}}, parsing of this file goes back to < 1 second.  With the default, {{true}}, I also see that parsing takes a really, really long time.

If you're calling Tika programmatically, you can turn this off with: 

{noformat}
        OfficeParserConfig config = new OfficeParserConfig();
        config.setIncludeShapeBasedContent(false);
        ParseContext parseContext = new ParseContext();
        parseContext.set(OfficeParserConfig.class, config);
{noformat}
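
For context, here is a minimal sketch of how that {{ParseContext}} might be passed into an actual parse call. It assumes the Tika parsers jar is on the classpath; the class name {{ParseWithoutShapes}}, the helper names and the command-line path argument are hypothetical and only there for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not part of Tika itself.
public class ParseWithoutShapes {

    /** Builds a ParseContext that turns off shape-based (chart/diagram) content. */
    static ParseContext buildContext() {
        OfficeParserConfig config = new OfficeParserConfig();
        config.setIncludeShapeBasedContent(false);
        ParseContext parseContext = new ParseContext();
        parseContext.set(OfficeParserConfig.class, config);
        return parseContext;
    }

    /** Extracts the body text of the given file using the context above. */
    static String extractText(Path file) throws Exception {
        Parser parser = new AutoDetectParser();
        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata(), buildContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ParseWithoutShapes <file>");
            return;
        }
        long start = System.nanoTime();
        String text = extractText(Paths.get(args[0]));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(text.length() + " chars extracted in " + elapsedMs + " ms");
    }
}
```

Running this against the attached file with and without the {{setIncludeShapeBasedContent(false)}} line should make the difference in parse time visible.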

Or you can set this in your tika-config.xml with something like: 
{noformat}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
            <params>
                <param name="includeShapeBasedContent" type="bool">false</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.microsoft.OfficeParser">
            <params>
                <param name="includeShapeBasedContent" type="bool">false</param>
            </params>
        </parser>
    </parsers>
</properties>
{noformat}
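
Before pointing Tika at a config file like this, it can help to confirm that the file is well-formed and that the parameter really sits under each parser you expect. The following is a rough sketch using only the JDK's DOM parser. It does not call Tika, so it cannot prove that Tika will honor the setting, and the names {{ShapeParamCheck}} and {{shapeSettings}} are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper; it only inspects the XML and never touches Tika.
public class ShapeParamCheck {

    /**
     * Returns, for every parser element that sets includeShapeBasedContent,
     * the parser's class attribute mapped to the configured value.
     */
    static Map<String, String> shapeSettings(String configXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList params = doc.getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            if (!"includeShapeBasedContent".equals(param.getAttribute("name"))) {
                continue;
            }
            // Walk up to the enclosing <parser> element.
            Node node = param.getParentNode();
            while (node != null && !"parser".equals(node.getNodeName())) {
                node = node.getParentNode();
            }
            if (node instanceof Element) {
                result.put(((Element) node).getAttribute("class"),
                        param.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ShapeParamCheck <tika-config.xml>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        shapeSettings(xml).forEach((cls, value) -> System.out.println(cls + " -> " + value));
    }
}
```

For the config above, the expectation is one entry each for {{OOXMLParser}} and {{OfficeParser}}, both mapped to {{false}}.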



> xlsx takes more than 5 mins to parse in 1.16
> --------------------------------------------
>
>                 Key: TIKA-2507
>                 URL: https://issues.apache.org/jira/browse/TIKA-2507
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.16
>         Environment: started server with 
> {noformat}
> java -jar tika-server-1.16.jar
> {noformat}
>            Reporter: José Borges Ferreira
>            Assignee: Tim Allison
>         Attachments: Tika.1.16-killer.xlsx
>
>
> When sending an xlsx file with a lot of charts, the Tika server takes more than 5 min to process it on my 2.2 GHz MacBook Pro.
> In version 1.15 this takes less than a second. Looking at the changelog, I'm guessing this may be related to some features introduced in 1.16, namely:
> # Extract text from charts in .docx, .pptx, .xlsx and .xlsb (TIKA-2254).
> # Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb (TIKA-1945).
> I'm attaching the file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)