Posted to dev@tika.apache.org by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org> on 2011/10/12 14:57:11 UTC

[jira] [Updated] (TIKA-751) Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc

     [ https://issues.apache.org/jira/browse/TIKA-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-751:
------------------------------------

    Attachment: TIKA-751.patch

Patch.
                
> Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc
> ----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-751
>                 URL: https://issues.apache.org/jira/browse/TIKA-751
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-751.patch
>
>
> I noticed some minor things in this method:
>   * It does unnecessary work (writes the tmpFile out) even when the
>     EmbeddedDocumentExtractor doesn't want to parse the file.
>   * It writes the tmpFile even in the OLE10_NATIVE case, where it is
>     never used (because we use a TikaInputStream from the in-RAM
>     byte[] instead).
> Also I fixed a typo in the method name (embeded -> embedded) -- is
> that OK?  It's a protected method, and a few of the office parsers
> invoke it.
> Finally, I cut over to TemporaryResources to track the possible
> tmpFile and open a TikaInputStream against it.
> Separately, it's currently inefficient that we must serialize a
> sub-dir (DirectoryEntry) in the NPOIFSFileSystem to a tmp file only to
> re-parse it back into an NPOIFSFileSystem in OfficeParser; I'd like to
> look into instead (somehow) passing the NPOIFSFileSystem's
> DirectoryEntry directly to OfficeParser... but that looks like a
> bigger change.
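
The control flow described above can be sketched roughly as follows. This is not the code from TIKA-751.patch: the Extractor interface and the TempFiles class are hypothetical stand-ins for Tika's EmbeddedDocumentExtractor and TemporaryResources, written with only the JDK so the sketch compiles on its own, and the inMemory flag stands in for the OLE10_NATIVE case. It only illustrates the two points from the list: ask the extractor first, and create the temp file only on the path that needs it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class EmbeddedDocSketch {

    /** Hypothetical stand-in for Tika's EmbeddedDocumentExtractor. */
    interface Extractor {
        boolean shouldParse(String name);
        void parse(InputStream stream, String name) throws IOException;
    }

    /** Hypothetical stand-in for TemporaryResources: tracks temp files and deletes them on close. */
    static final class TempFiles implements AutoCloseable {
        private final List<Path> files = new ArrayList<>();
        int created = 0;

        Path create() throws IOException {
            Path p = Files.createTempFile("embedded-", ".tmp");
            files.add(p);
            created++;
            return p;
        }

        @Override
        public void close() throws IOException {
            for (Path p : files) {
                Files.deleteIfExists(p);
            }
            files.clear();
        }
    }

    /**
     * Hands one embedded document to the extractor.
     * Returns the number of temp files that had to be created (0 or 1).
     */
    static int handleEmbedded(String name, byte[] content, boolean inMemory,
                              Extractor extractor) throws IOException {
        // Ask first: if the extractor declines, nothing is written to disk.
        if (!extractor.shouldParse(name)) {
            return 0;
        }
        try (TempFiles tmp = new TempFiles()) {
            if (inMemory) {
                // Analogue of the OLE10_NATIVE case: stream straight from the byte[].
                try (InputStream in = new ByteArrayInputStream(content)) {
                    extractor.parse(in, name);
                }
            } else {
                // Only this path needs the temp file; TempFiles deletes it on close.
                Path file = tmp.create();
                Files.write(file, content);
                try (InputStream in = Files.newInputStream(file)) {
                    extractor.parse(in, name);
                }
            }
            return tmp.created;
        }
    }
}
```

Keeping the temp-file bookkeeping in one closeable object means the cleanup happens in a single place even if parse() throws, which is the general idea behind tracking the file through TemporaryResources.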
