Posted to dev@tika.apache.org by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org> on 2011/10/12 14:57:11 UTC
[jira] [Updated] (TIKA-751) Small improvements to how embedded docs
are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc
[ https://issues.apache.org/jira/browse/TIKA-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-751:
------------------------------------
Attachment: TIKA-751.patch
Patch.
> Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc
> ----------------------------------------------------------------------------------------------------
>
> Key: TIKA-751
> URL: https://issues.apache.org/jira/browse/TIKA-751
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.0
>
> Attachments: TIKA-751.patch
>
>
> I noticed some minor things in this method:
> * It does too much work (writes the tmpFile out) even if the
> EmbeddedDocumentExtractor doesn't actually want to parse the
> file.
> * It writes the tmpFile when it won't use it in the OLE10_NATIVE
> case (because we use a TikaInputStream from the in-RAM byte[]
> instead).
> Also I fixed a typo in the method name (embeded -> embedded) -- is
> that OK? It's a protected method, and a few of the office parsers
> invoke it.
> Finally, I cut over to TemporaryResources to track the possible tmpFile
> and open a TikaInputStream against it.
> Separately, it's currently inefficient that we must serialize a sub-dir
> (DirectoryEntry) in the NPOIFSFileSystem to a tmp file only to re-parse
> it back to an NPOIFSFileSystem in OfficeParser; I'd like to look into
> instead (somehow) directly passing the NPOIFSFileSystem's DirectoryEntry
> to OfficeParser... but that looks like a bigger change.
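
The control flow described in the quoted description can be sketched roughly as below. This is a minimal, self-contained illustration, not the attached patch and not Tika's API: Extractor, Entry, TempFiles and CountingExtractor are hypothetical stand-ins (TempFiles only loosely mirrors the idea behind TemporaryResources). It shows the two savings: ask the extractor before doing any work, and skip the tmp file when the payload is already in RAM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class EmbeddedDocSketch {

    /** Hypothetical stand-in for an embedded-document extractor (not Tika's interface). */
    interface Extractor {
        boolean shouldParse(String name);
        void parse(InputStream stream, String name) throws IOException;
    }

    /** Hypothetical embedded entry; ole10Native means the payload is already a byte[] in RAM. */
    static final class Entry {
        final String name;
        final boolean ole10Native;
        final byte[] data;

        Entry(String name, boolean ole10Native, byte[] data) {
            this.name = name;
            this.ole10Native = ole10Native;
            this.data = data;
        }
    }

    /** Hypothetical tracker that deletes every temp file it handed out when closed. */
    static final class TempFiles implements AutoCloseable {
        private final List<Path> files = new ArrayList<>();
        int created = 0;

        Path create() throws IOException {
            Path p = Files.createTempFile("embedded-", ".tmp");
            files.add(p);
            created++;
            return p;
        }

        @Override
        public void close() throws IOException {
            for (Path p : files) {
                Files.deleteIfExists(p);
            }
            files.clear();
        }
    }

    /** Test helper: accepts or declines everything and counts the bytes it was given. */
    static final class CountingExtractor implements Extractor {
        final boolean accept;
        long bytesSeen = 0;

        CountingExtractor(boolean accept) {
            this.accept = accept;
        }

        @Override
        public boolean shouldParse(String name) {
            return accept;
        }

        @Override
        public void parse(InputStream stream, String name) throws IOException {
            byte[] buf = new byte[4096];
            int n;
            while ((n = stream.read(buf)) != -1) {
                bytesSeen += n;
            }
        }
    }

    /** Handles one embedded entry and returns how many temp files were created for it. */
    static int handleEmbedded(Entry entry, Extractor extractor) throws IOException {
        // Ask first: if the extractor declines, no stream and no temp file is needed.
        if (!extractor.shouldParse(entry.name)) {
            return 0;
        }
        try (TempFiles tmp = new TempFiles()) {
            if (entry.ole10Native) {
                // Payload is already in memory, so stream straight from the byte[].
                try (InputStream in = new ByteArrayInputStream(entry.data)) {
                    extractor.parse(in, entry.name);
                }
            } else {
                // Only this branch pays for a temp file; TempFiles removes it on close.
                Path file = tmp.create();
                Files.write(file, entry.data);
                try (InputStream in = Files.newInputStream(file)) {
                    extractor.parse(in, entry.name);
                }
            }
            return tmp.created;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {1, 2, 3, 4};
        CountingExtractor extractor = new CountingExtractor(true);
        int tmpFiles = handleEmbedded(new Entry("native.bin", true, payload), extractor);
        System.out.println("temp files created: " + tmpFiles);
    }
}
```

Routing temp-file creation through a single closeable tracker keeps cleanup in one place, which is the same motivation the description gives for moving to TemporaryResources.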
--
This message is automatically generated by JIRA.