Posted to dev@tika.apache.org by "mungeol heo (JIRA)" <ji...@apache.org> on 2015/09/02 04:08:46 UTC

[jira] [Commented] (TIKA-330) Better HWP (Hangul Word Processor) detection pattern

    [ https://issues.apache.org/jira/browse/TIKA-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726598#comment-14726598 ] 

mungeol heo commented on TIKA-330:
----------------------------------

HWP files now come in two formats: HWP 3.0 and HWP 5.0.
The current signature string, "HWP Document File V", detects only HWP 3.0.
It should be changed to "HWP Document File" so that both versions of the format are detected.
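
To illustrate the proposal above, here is a minimal sketch of prefix-based signature matching. It is not Tika's detection code. The helper name starts_with and both sample headers are hypothetical, made up only to show how dropping the trailing " V" widens the match; real HWP files may lay out their headers differently.

```python
# Sketch only: compare the leading bytes of a buffer against a signature.
# This is NOT Tika's implementation; every name here is hypothetical.

OLD_SIGNATURE = b"HWP Document File V"  # pattern chosen in TIKA-330
NEW_SIGNATURE = b"HWP Document File"    # shorter pattern proposed in the comment


def starts_with(data: bytes, signature: bytes, offset: int = 0) -> bool:
    """Return True if `signature` appears in `data` at `offset`."""
    return data[offset:offset + len(signature)] == signature


# Made-up sample headers (assumptions for illustration, not real file dumps).
sample_with_v = b"HWP Document File V3.00 ...rest of file..."
sample_without_v = b"HWP Document File\x00\x00...rest of file..."

for name, sample in [("with V", sample_with_v), ("without V", sample_without_v)]:
    print(name,
          "old:", starts_with(sample, OLD_SIGNATURE),
          "new:", starts_with(sample, NEW_SIGNATURE))
```

The shorter pattern matches every buffer the longer one does, plus any that diverge after "File". The trade-off is that a shorter signature is less specific, so it would be worth checking it against other test files, as was done with test-outlook.msg in the original issue.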

> Better HWP (Hangul Word Processor) detection pattern
> ----------------------------------------------------
>
>                 Key: TIKA-330
>                 URL: https://issues.apache.org/jira/browse/TIKA-330
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
>
>
> The current magic byte pattern we have for the HWP (Hangul Word Processor, application/x-hwp) file format also matches our test-outlook.msg test file. I looked for a better detection pattern and found one in OpenOffice.org.
> The hwpfilter/source/hwpfile.cpp file suggests that all HWP files start with the signature string "HWP Document File V", so I'll change the detection pattern accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)