Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2007/11/12 03:23:50 UTC

[jira] Created: (TIKA-95) Pluggable magic header detectors

Pluggable magic header detectors
--------------------------------

                 Key: TIKA-95
                 URL: https://issues.apache.org/jira/browse/TIKA-95
             Project: Tika
          Issue Type: New Feature
            Reporter: Jukka Zitting
            Priority: Minor


Some file formats, such as MS Office files or specific XML schemas, don't have simple magic marker bytes that could be used to easily identify the document type. However, in many cases it would be possible to detect such formats with more complex parsing logic.

Also, some external libraries (like Sanselan, as mentioned in TIKA-92) contain their own magic header rules. Instead of duplicating such rules in Tika, it would be better if Tika could just invoke the existing external functionality.

To support these cases, Tika should provide a mechanism to plug in custom magic header detector components in addition to the traditional configured magic patterns.
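
One possible shape for such a mechanism is sketched below. This is only an illustration under stated assumptions, not a proposed or existing Tika API: the names MagicDetector, PrefixDetector, XhtmlDetector, ExternalDetector and DetectorChain are all hypothetical. The sketch puts a traditional byte pattern, a piece of custom parsing logic, and an adapter around an external library's own rules behind one common interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical plug-in point; this is NOT an existing Tika interface.
interface MagicDetector {
    // Returns a media type if the leading bytes are recognized, else empty.
    Optional<String> detect(byte[] header);
}

// Stand-in for a traditional configured pattern: fixed bytes at offset 0.
final class PrefixDetector implements MagicDetector {
    private final byte[] magic;
    private final String type;

    PrefixDetector(byte[] magic, String type) {
        this.magic = magic.clone();
        this.type = type;
    }

    @Override
    public Optional<String> detect(byte[] header) {
        if (header.length < magic.length) {
            return Optional.empty();
        }
        boolean match = Arrays.equals(Arrays.copyOf(header, magic.length), magic);
        return match ? Optional.of(type) : Optional.empty();
    }
}

// Custom logic for a format without fixed marker bytes: an XML document
// is classified by the namespace it declares (assumes UTF-8 compatible text).
final class XhtmlDetector implements MagicDetector {
    @Override
    public Optional<String> detect(byte[] header) {
        String text = new String(header, StandardCharsets.UTF_8);
        boolean xhtml = text.contains("<html")
                && text.contains("http://www.w3.org/1999/xhtml");
        return xhtml ? Optional.of("application/xhtml+xml") : Optional.empty();
    }
}

// Adapter for rules owned by an external library. The Function is a
// placeholder for that library's own call and returns null when unsure.
final class ExternalDetector implements MagicDetector {
    private final Function<byte[], String> guesser;

    ExternalDetector(Function<byte[], String> guesser) {
        this.guesser = guesser;
    }

    @Override
    public Optional<String> detect(byte[] header) {
        return Optional.ofNullable(guesser.apply(header));
    }
}

// Asks each registered detector in order; the first answer wins.
final class DetectorChain {
    private final List<MagicDetector> detectors = new ArrayList<>();

    void register(MagicDetector detector) {
        detectors.add(detector);
    }

    String detect(byte[] header) {
        for (MagicDetector detector : detectors) {
            Optional<String> type = detector.detect(header);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return "application/octet-stream"; // generic fallback
    }
}
```

In this sketch a caller would register detectors on a DetectorChain and pass it the first bytes of a document. Registration order decides precedence, so cheap byte patterns could run before more expensive parsing logic. How an external library such as Sanselan would actually be invoked is left open here; the Function argument only marks where that call would go.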

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-95) Pluggable magic header detectors

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-95:
----------------------------------

    Component/s: mime
