Posted to dev@tika.apache.org by "Johan van der Knijff (JIRA)" <ji...@apache.org> on 2017/09/08 15:11:00 UTC
[jira] [Comment Edited] (TIKA-2461) Wordperfect file identified as Quattro Pro document
[ https://issues.apache.org/jira/browse/TIKA-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158764#comment-16158764 ]
Johan van der Knijff edited comment on TIKA-2461 at 9/8/17 3:10 PM:
--------------------------------------------------------------------
Sure, here it is:
[PerfectOffice_MAIN_trunc|https://github.com/bitsgalore/shared/raw/master/PerfectOffice_MAIN_trunc]
Tika identifies it as:
{code}
application/vnd.wordperfect
{code}
That looks correct to me.
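For anyone who wants to check samples like these by hand, here is a minimal Java sketch that only inspects the first bytes of a file. It is not how Tika performs detection. The signature values (FF 57 50 43 for a WordPerfect header, D0 CF 11 E0 A1 B1 1A E1 for an OLE2 compound file) are assumptions taken from general format knowledge rather than from this issue, and the class name MagicSniffer is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class MagicSniffer {
    // Assumed signatures (general format knowledge, not stated in this issue).
    static final byte[] WPC = {(byte) 0xFF, 'W', 'P', 'C'};
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Coarse label based on the leading bytes only.
    static String classify(byte[] header) {
        if (startsWith(header, WPC)) {
            return "wordperfect-header";
        }
        if (startsWith(header, OLE2)) {
            return "ole2-container";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MagicSniffer <file>");
            return;
        }
        byte[] header;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read at most 8 bytes; shorter files yield a shorter array.
            header = in.readNBytes(8);
        }
        System.out.println(classify(header));
    }
}
```

A file that reports "ole2-container" is one that a container-based detector would need to open and inspect further, so this check alone cannot tell a wrapped WordPerfect document from a Quattro Pro one.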
was (Author: johanvanderknijff):
Sure, here it is:
[PerfectOffice_MAIN_trunc|https://github.com/bitsgalore/shared/raw/master/PerfectOffice_MAIN_trunc]
> Wordperfect file identified as Quattro Pro document
> ---------------------------------------------------
>
> Key: TIKA-2461
> URL: https://issues.apache.org/jira/browse/TIKA-2461
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.16
> Environment: Linux Mint 17
> Reporter: Johan van der Knijff
> Priority: Minor
>
> While running Tika 1.16 in detect mode over some legacy files from our repository system, I came across one file with a .wpd extension for which Tika reported the following MIME type:
>
> {code}
> application/x-quattro-pro; version=7-8
> {code}
> Opening the file in LibreOffice reveals that this is actually a WordPerfect document (I am not sure which version; the .WPD extension suggests WP 6 or later). I had a look at the Quattro Pro entry in tika-mimetypes.xml:
> {code}
> <mime-type type="application/x-quattro-pro">
>   <_comment>
>     Quattro Pro - Corel Spreadsheet (part of WordPerfect Office suite)
>   </_comment>
>   <!-- qp2 and wb3 are currently detected by POIFSContainerDetector
>        TODO: add detection for wb2 and wb1 -->
>   <glob pattern="*.qpw"/>
>   <glob pattern="*.wb1"/>
>   <glob pattern="*.wb2"/>
>   <glob pattern="*.wb3"/>
> </mime-type>
> {code}
> The entry contains only glob patterns, and its comment defers detection to POIFSContainerDetector, which suggests that the problem originates there.
> For legal reasons I cannot share the original file. However, I was able to create a derived file by truncating the original after 18 kB, and this derived file shows the same behaviour. The file is available at this link:
> [tika-identified-as-quattro-pro-truncated.wpd|https://github.com/bitsgalore/shared/raw/master/tika-identified-as-quattro-pro-truncated.wpd]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)