Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/05/27 22:25:00 UTC

[jira] [Commented] (TIKA-3422) Excluding both WMFParser and EMFParser causes wmf instances NOT to appear at all

    [ https://issues.apache.org/jira/browse/TIKA-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352786#comment-17352786 ] 

Tim Allison commented on TIKA-3422:
-----------------------------------

I was recently surprised by the behavior of some parser exclusions in my own configs.  I'll take a look.

Unrelated note: if you exclude the EMFParser, you may also lose any files embedded in those EMFs.  IIRC, Excel for Mac used to (and may still) embed PDFs inside EMFs.

I added a mime-excluding metadata filter for just this reason.  At some point, I'll even document it. :D

Let me take a look... 
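
For reference, a config that excludes both parsers would typically look something like the sketch below. This is an assumption, not a copy of the attached tika-config_no_emf_or_wmf.xml: the EMFParser class name is taken from the X-Parsed-By output in the report, and WMFParser is assumed to live in the same package.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumed shape of a tika-config.xml that excludes both parsers. -->
<properties>
  <parsers>
    <!-- DefaultParser loads every parser it can find, minus the exclusions below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Package name assumed to match EMFParser's. -->
      <parser-exclude class="org.apache.tika.parser.microsoft.WMFParser"/>
      <!-- Class name as shown in the X-Parsed-By values in the report. -->
      <parser-exclude class="org.apache.tika.parser.microsoft.EMFParser"/>
    </parser>
  </parsers>
</properties>
```

Dropping the EMFParser parser-exclude line should give the WMF-only variant described in the report.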

> Excluding both WMFParser and EMFParser causes wmf instances NOT to appear at all
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-3422
>                 URL: https://issues.apache.org/jira/browse/TIKA-3422
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.26
>            Reporter: Josh Burchard
>            Priority: Major
>              Labels: EMFParser, WMFParser
>         Attachments: tika-config_no_emf_or_wmf.xml, tika-config_no_wmf.xml
>
>
> I was attempting to exclude embedded WMF and EMF files from being parsed, but I noticed that when I do so, only the EMF files are reported by Tika in the /rmeta/text output.
> As an experiment I created two tika-config.xml files. The first excludes only the WMFParser, and when my MSWord source doc is processed I see lines like this, as expected:
> {{"Content-Type":"image/wmf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.EmptyParser"]}}
> And there are the EMF files that were found and parsed by the EMFParser:
> {{"Content-Type":"image/emf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.EMFParser"]}}
>  
> A problem arises, though, when I try to exclude both WMFParser AND EMFParser. The WMF instances disappear entirely, and only the EMF instances are shown as being handled by the EmptyParser:
> {{"Content-Type":"image/emf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.EmptyParser"]}}
>  
> I think in the 2nd case BOTH types should be shown as being handled by the EmptyParser. I still want to know that the WMF files are in the container even though I'm not parsing them.
>  
> P.S. For whatever reason I can't upload the original Word doc that I'm testing with. Jira won't allow me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)