Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2014/05/12 23:22:16 UTC

[jira] [Commented] (TIKA-1296) Add case insensitive matching for text/html mime type

    [ https://issues.apache.org/jira/browse/TIKA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995648#comment-13995648 ] 

Ken Krugler commented on TIKA-1296:
-----------------------------------

Hi Phil - thanks for bringing this up. I didn't even realize that stringignorecase was an option. Can you think of any reason why we wouldn't want to just change all of these HTML-related match values to use stringignorecase?

> Add case insensitive matching for text/html mime type
> -----------------------------------------------------
>
>                 Key: TIKA-1296
>                 URL: https://issues.apache.org/jira/browse/TIKA-1296
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.5
>            Reporter: Phil Lester
>
> Currently, tika-mimetypes.xml provides the matches for the mime type text/html (and possibly others) in a couple of different letter cases per element, so that varying HTML writing styles are matched. As of version 1.5 of Tika, matches can be made case insensitive using the "stringignorecase" type. This would allow some matches to be consolidated and would improve detection of poorly formed HTML that most browsers would render regardless of case.
> For example:
>       <match value="&lt;BODY" type="string" offset="0"/>
>       <match value="&lt;body" type="string" offset="0"/>
> could become:
>       <match value="&lt;BODY" type="stringignorecase" offset="0"/>
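
To illustrate why one case-insensitive match can replace the two case-sensitive ones above, here is a minimal Java sketch of the comparison semantics. It is not Tika's implementation: the class and method names (IgnoreCaseMagicSketch, matchesExact, matchesIgnoreCase) are hypothetical, and it assumes only ASCII letters are case-folded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, NOT part of Tika: it only illustrates the difference
// between an exact match and a case-insensitive match at a fixed offset.
final class IgnoreCaseMagicSketch {

    // Exact comparison, analogous in spirit to type="string".
    static boolean matchesExact(byte[] data, int offset, String pattern) {
        byte[] expected = pattern.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length - offset < expected.length) {
            return false; // not enough bytes at this offset
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    // Case-insensitive comparison, analogous in spirit to
    // type="stringignorecase". Assumption: only ASCII A-Z is folded.
    static boolean matchesIgnoreCase(byte[] data, int offset, String pattern) {
        byte[] expected = pattern.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length - offset < expected.length) {
            return false; // not enough bytes at this offset
        }
        for (int i = 0; i < expected.length; i++) {
            if (toLowerAscii(data[offset + i]) != toLowerAscii(expected[i])) {
                return false;
            }
        }
        return true;
    }

    // Maps ASCII uppercase letters to lowercase; leaves other bytes as-is.
    private static int toLowerAscii(byte b) {
        return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
    }

    public static void main(String[] args) {
        String[] samples = {"<BODY>", "<body>", "<Body>", "<div>"};
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(s
                    + " exact=" + matchesExact(bytes, 0, "<BODY")
                    + " ignorecase=" + matchesIgnoreCase(bytes, 0, "<BODY"));
        }
    }
}
```

In this sketch a mixed-case tag such as <Body> is accepted by the case-insensitive check but by neither of the two exact patterns, which is the kind of input the proposed change is meant to catch.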



--
This message was sent by Atlassian JIRA
(v6.2#6252)