Posted to dev@tika.apache.org by jeremybmerrill <gi...@git.apache.org> on 2015/10/13 22:04:35 UTC

[GitHub] tika pull request: lower priority on magic for application/xhtml+x...

GitHub user jeremybmerrill opened a pull request:

    https://github.com/apache/tika/pull/58

    lower priority on magic for application/xhtml+xml 

    to avoid misdetecting xhtml-containing emails as XHTML docs.
    
    Some emails I have (happy to share if you want) contain XHTML as one part of a multipart message. Prior to this pull request, the priority on the `application/xhtml+xml` magic detector was 50, equal to the priority on the `message/rfc822` detector. Because of the relative position of the two detectors in `tika-mimetypes.xml`, the emails were incorrectly detected as XHTML documents.
    
    With this PR, by downgrading the priority of `application/xhtml+xml` to 40, the more-sensitive email magic detectors take precedence, causing the emails to be properly detected as `message/rfc822`.
    
    I have not run this through the govdocs tester or anything other than my own documents, so, full disclosure, this could cause false-negative XHTML detections elsewhere.
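
The effect described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not Tika's actual detection code: the `Rule` class, the `detect` method, the substring markers, and the sample email are all hypothetical, and the choice to break priority ties by list order (with the XHTML rule listed first) is an assumption made only to mirror the behavior reported in this pull request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of priority-ordered magic matching. Not Tika's API.
class MagicPrioritySketch {

    // Hypothetical stand-in for one magic rule: a media type, a priority,
    // and a marker string that must occur somewhere in the content.
    static final class Rule {
        final String type;
        final int priority;
        final String marker;

        Rule(String type, int priority, String marker) {
            this.type = type;
            this.priority = priority;
            this.marker = marker;
        }
    }

    // Hypothetical sample: an email whose body part contains XHTML.
    static final String EMAIL =
        "From: someone@example.com\n"
        + "Content-Type: multipart/alternative\n\n"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>hi</body></html>";

    // Rules in "file order"; the XHTML rule is listed first (an assumption).
    static List<Rule> rules(int xhtmlPriority) {
        return Arrays.asList(
            new Rule("application/xhtml+xml", xhtmlPriority, "http://www.w3.org/1999/xhtml"),
            new Rule("message/rfc822", 50, "From:"));
    }

    // Try rules from highest to lowest priority and return the first match.
    // List.sort is stable, so rules with equal priority keep their list order.
    static String detect(List<Rule> rulesInFileOrder, String content) {
        List<Rule> sorted = new ArrayList<>(rulesInFileOrder);
        sorted.sort((a, b) -> Integer.compare(b.priority, a.priority));
        for (Rule rule : sorted) {
            if (content.contains(rule.marker)) {
                return rule.type;
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // Equal priorities: the XHTML rule is tried first and matches.
        System.out.println(detect(rules(50), EMAIL)); // application/xhtml+xml
        // XHTML lowered to 40: the email rule is tried first and matches.
        System.out.println(detect(rules(40), EMAIL)); // message/rfc822
    }
}
```

In this model, lowering one priority is enough to change the outcome because both rules match the same content; only the order in which they are tried differs.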

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeremybmerrill/tika trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/58.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #58
    
----
commit 3e481ce547b2a6eda315f8b467811ea41d284ef7
Author: Jeremy B. Merrill <je...@nytimes.com>
Date:   2015-10-13T19:51:55Z

    lower priority on magic for application/xhtml+xml to avoid misdetecting xhtml-containing emails as XHTML docs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tika pull request: lower priority on magic for application/xhtml+x...

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/58

