Posted to dev@tika.apache.org by "Michael McCandless (Created) (JIRA)" <ji...@apache.org> on 2011/10/14 20:28:12 UTC

[jira] [Created] (TIKA-753) Improve performance when parsing embedded Office docs

Improve performance when parsing embedded Office docs
-----------------------------------------------------

                 Key: TIKA-753
                 URL: https://issues.apache.org/jira/browse/TIKA-753
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 1.0




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-753) Improve performance when parsing embedded Office docs

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-753:
------------------------------------

    Attachment: TIKA-753.patch

Patch.
                



[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128924#comment-13128924 ] 

Michael McCandless commented on TIKA-753:
-----------------------------------------

OK I committed this; I'll leave it open so we remember to do the TODOs on next POI upgrade.
                



[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128145#comment-13128145 ] 

Michael McCandless commented on TIKA-753:
-----------------------------------------

Awesome, thanks Nick!

I'll add a TODO where we use Ole10Native to cut over to DirectoryNode once we upgrade POI.  Then, after committing, I guess we leave this issue open to remind us to go back and do these two TODOs...
                



[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127758#comment-13127758 ] 

Michael McCandless commented on TIKA-753:
-----------------------------------------

I noticed that parsing an embedded Office document is inefficient: we
take the NPOIFSFileSystem we had already parsed (from the full
document) and write the "sub-directory" containing the embedded
document to a temp file, only to re-parse it once we've recursed to
the inner detector/parser.

I worked out a patch that instead passes the sub-directory of the
embedded document directly to the inner detector/parser.

This gives a good speedup in my test case: I have a private test set
of 2,080 Word docs; parsing them (and their embedded docs) takes 16.1
sec on trunk and 10.7 sec with this patch, i.e. about 34% less time
(best of 10 runs).

The change has a few parts:

  * Changed all Office parsers so they can also take the document
    root (DirectoryNode) directly; this was straightforward (but
    touched a lot of sources) because internally these parsers were
    extracting that root anyway.

  * Fixed AbstractPOIFSExtractor to not do the serialization to a temp
    file and instead put the document's root on an otherwise empty
    (new byte[0]) TikaInputStream as the openContainer.

  * Fixed OfficeParser and POIFSContainerDetector to recognize a
    DirectoryNode on the incoming TikaInputStream, and parse/detect
    that directly.
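
As a rough illustration of the hand-off described in the three points
above, here is a minimal, self-contained sketch. It does not use the
real Tika or POI classes: ContainerAwareStream and DirNode are
hypothetical stand-ins for TikaInputStream (with its openContainer
slot) and DirectoryNode, and parse() is a hypothetical stand-in for
the check the parser/detector performs. It only shows the pattern,
not the actual patch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

class EmbeddedHandOffSketch {

    // Hypothetical stand-in for a POI DirectoryNode (an already-parsed sub-directory).
    static class DirNode {
        final String name;
        DirNode(String name) { this.name = name; }
    }

    // Hypothetical stand-in for TikaInputStream: a stream that can also
    // carry an already-opened container object alongside its bytes.
    static class ContainerAwareStream extends ByteArrayInputStream {
        private Object openContainer;
        ContainerAwareStream(byte[] data) { super(data); }
        void setOpenContainer(Object container) { this.openContainer = container; }
        Object getOpenContainer() { return openContainer; }
    }

    // Outer extractor side: no temp file, just an empty stream that
    // carries the embedded document's root as its open container.
    static ContainerAwareStream wrapEmbedded(DirNode root) {
        ContainerAwareStream stream = new ContainerAwareStream(new byte[0]);
        stream.setOpenContainer(root);
        return stream;
    }

    // Inner parser side: use the pre-parsed root if one is attached,
    // otherwise fall back to parsing the raw bytes.
    static String parse(InputStream in) {
        if (in instanceof ContainerAwareStream) {
            Object container = ((ContainerAwareStream) in).getOpenContainer();
            if (container instanceof DirNode) {
                return "direct:" + ((DirNode) container).name;
            }
        }
        return "reparsed-from-bytes";
    }

    public static void main(String[] args) {
        System.out.println(parse(wrapEmbedded(new DirNode("MBD0001"))));
        System.out.println(parse(new ByteArrayInputStream(new byte[] {1, 2, 3})));
    }
}
```

The point of the design is that the inner parser never has to
re-read bytes for a structure the outer parser already holds in
memory; the byte path stays as the fallback for ordinary streams.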

The one catch I hit was a failure in POIContainerExtractionTest, due
to already-fixed bug 51949 in POI (NPE on double-close of
ZipFileZipEntrySource); I added a workaround in
ParsingEmbeddedDocumentExtractor for this, with a TODO to remove the
workaround once POI releases and we upgrade.  It's important to remove
that because we are double-opening the ZIP archive now for embedded
OOXML docs...

I also converted a couple of if/else chains of string-equality checks
into HashMap lookups.
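
For readers unfamiliar with that kind of change, here is a generic
sketch of the conversion. The class name, method names, keys and
values below are illustrative assumptions only; they are not the
chains the patch actually touched.

```java
import java.util.HashMap;
import java.util.Map;

class TypeLookupSketch {

    // Before: a chain of string-equality checks, tested one after another.
    static String typeByChain(String entryName) {
        if ("WordDocument".equals(entryName)) {
            return "doc";
        } else if ("Workbook".equals(entryName)) {
            return "xls";
        } else if ("PowerPoint Document".equals(entryName)) {
            return "ppt";
        } else {
            return "unknown";
        }
    }

    // After: the same mapping built once and consulted with a single hash lookup.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("WordDocument", "doc");
        TYPES.put("Workbook", "xls");
        TYPES.put("PowerPoint Document", "ppt");
    }

    static String typeByMap(String entryName) {
        return TYPES.getOrDefault(entryName, "unknown");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"WordDocument", "Workbook", "Other"}) {
            System.out.println(name + " -> " + typeByChain(name) + " / " + typeByMap(name));
        }
    }
}
```

Both methods return the same result for every input; the map version
keeps the cost of a lookup roughly constant as more entries are added
and keeps the mapping in one place.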

                



[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128135#comment-13128135 ] 

Nick Burch commented on TIKA-753:
---------------------------------

Patch looks fine to me

I've added a static constructor on Ole10Native to create from a DirectoryNode, as well as POIFSFileSystem, which'll let that bit be tidied up later
                



[jira] [Resolved] (TIKA-753) Improve performance when parsing embedded Office docs

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-753.
-------------------------------------

    Resolution: Fixed

I opened TIKA-757 as the blanket issue for addressing TODOs on next POI upgrade so I think we can now resolve this one.
                

