Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2015/06/04 15:13:38 UTC

[jira] [Created] (TIKA-1648) Investigate Word .doc WMF/EMF/PICT attachments

Nick Burch created TIKA-1648:
--------------------------------

             Summary: Investigate Word .doc WMF/EMF/PICT attachments
                 Key: TIKA-1648
                 URL: https://issues.apache.org/jira/browse/TIKA-1648
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.9
            Reporter: Nick Burch


As spotted when working on TIKA-1644, many of the govdocs1 Word .doc files have embedded image resources which are coming through as WMF, EMF or PICT. In at least some cases, these files don't have the header that would normally be expected for that format, but do have a PDF header some tens or a few hundred bytes into the file. (Some of the files do come out correctly though, so the problem doesn't look universal.)
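
As a rough way to check an extracted resource for the symptom described above, something along the following lines could report how far into the bytes a PDF header ("%PDF-") first appears. This is only a sketch: the class name PdfHeaderProbe, the method findPdfHeaderOffset and the maxScan limit are made up for illustration and are not part of Tika or POI.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for investigation only; not part of Tika or POI.
public class PdfHeaderProbe {
    // The standard marker that opens a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" marker that lies entirely
     * within the first maxScan bytes of data, or -1 if there is none.
     */
    public static int findPdfHeaderOffset(byte[] data, int maxScan) {
        // Last start position at which the whole marker still fits in the scan window.
        int limit = Math.min(data.length, maxScan) - PDF_MAGIC.length;
        for (int i = 0; i <= limit; i++) {
            boolean match = true;
            for (int j = 0; j < PDF_MAGIC.length; j++) {
                if (data[i + j] != PDF_MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }
}
```

A result of 0 would mean the resource is simply a PDF; a small positive offset would match the "PDF header some way into the file" case, and -1 would mean no PDF header within the scanned range.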

It's possible that this is all as expected and normal. However, it's also possible that something in the POI code for pulling out the embedded resources is either truncating or failing to truncate the header, or is somehow otherwise failing to pull these out correctly. The result is that they aren't coming through quite as they should as embedded resources.

This is probably going to mean lots of time with the file format specs, some time creating slightly unusual test files containing attachments in these formats, and then finally looking at the govdocs1 ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)