Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/08/25 16:06:16 UTC

[jira] Created: (TIKA-497) HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata

HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata
------------------------------------------------------------------------------------------------------------------------------

                 Key: TIKA-497
                 URL: https://issues.apache.org/jira/browse/TIKA-497
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.7
            Reporter: Ken Krugler
            Assignee: Ken Krugler
            Priority: Minor
             Fix For: 0.8


With the current behavior, a single document can produce separate metadata entries named "Content-Type" and "content-type", because http-equiv attribute values in HTML pages often use incorrect capitalization.
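
To make the problem concrete, here is a minimal sketch using a plain Map rather than Tika's Metadata class. The HttpEquivNames class and its canonicalize method are hypothetical names invented for this illustration and are not part of Tika. The rule used (capitalize the first letter of each hyphen-separated segment) is an assumption that works for names like Content-Type but not for irregular ones such as ETag.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: shows how mixed-case http-equiv
// names create duplicate entries and one way to normalize them.
final class HttpEquivNames {
    private HttpEquivNames() {}

    // Assumed rule: lower-case everything, then upper-case the first letter
    // of each hyphen-separated segment ("content-type" -> "Content-Type").
    static String canonicalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        boolean upperNext = true;
        for (char c : name.trim().toLowerCase(Locale.ROOT).toCharArray()) {
            out.append(upperNext ? Character.toUpperCase(c) : c);
            upperNext = (c == '-');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Names as they might appear in two <meta http-equiv="..."> tags.
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("Content-Type", "text/html; charset=UTF-8");
        raw.put("content-type", "text/html; charset=UTF-8");

        // Normalizing the name before storing collapses the duplicates.
        Map<String, String> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            normalized.putIfAbsent(canonicalize(e.getKey()), e.getValue());
        }

        System.out.println(raw.size() + " entries before, "
                + normalized.size() + " after");
    }
}
```

Without normalization the map holds two entries for what is logically one key; with it, only "Content-Type" remains.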


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-497) HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902457#action_12902457 ] 

Jukka Zitting commented on TIKA-497:
------------------------------------

Instead of just fixing the capitalization, I'd argue that the HTML parser should specifically look for these kinds of well known metadata keys and automatically map such information to the applicable Metadata constants we already have. If there are multiple sources for a particular metadata entry (Content-Type is a perfect example), then reasonable heuristics should be used to merge the information.
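
A rough sketch of what that could look like, again with a plain Map standing in for Tika's Metadata. The WellKnownMetaKeys class, its lookup table, and the mergeMeta method are all hypothetical; the canonical strings are simply the standard HTTP header names. The merge rule shown (a value already present, for example from an HTTP response header, wins over the meta tag) is only one possible heuristic and is an assumption, not something decided in this issue.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Tika code: map well-known http-equiv names to
// canonical keys and merge values from multiple sources.
final class WellKnownMetaKeys {
    private WellKnownMetaKeys() {}

    // Illustrative subset of well-known keys, looked up case-insensitively.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("content-type", "Content-Type");
        CANONICAL.put("content-language", "Content-Language");
        CANONICAL.put("content-encoding", "Content-Encoding");
    }

    // Returns the canonical key for a well-known name, or the trimmed
    // original name when it is not in the table.
    static String canonicalKey(String name) {
        String trimmed = name.trim();
        String known = CANONICAL.get(trimmed.toLowerCase(Locale.ROOT));
        return known != null ? known : trimmed;
    }

    // Assumed merge heuristic: keep whatever value is already present and
    // only fill in the key when it is missing.
    static void mergeMeta(Map<String, String> metadata,
                          String httpEquiv, String content) {
        metadata.putIfAbsent(canonicalKey(httpEquiv), content);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Value supplied earlier, e.g. by the caller from an HTTP header.
        metadata.put("Content-Type", "text/html; charset=UTF-8");

        // Values found later in <meta http-equiv="..."> tags.
        mergeMeta(metadata, "content-type", "text/html; charset=ISO-8859-1");
        mergeMeta(metadata, "CONTENT-LANGUAGE", "en");

        System.out.println(metadata);
    }
}
```

Other heuristics are just as plausible, such as preferring the meta tag when the existing value carries no charset parameter; the lookup table keeps that decision in one place.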




[jira] Updated: (TIKA-497) HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-497:
-----------------------------------

    Component/s: parser

- classify




[jira] Updated: (TIKA-497) HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-497:
-----------------------------------

    Fix Version/s:     (was: 0.8)
                   0.9

- pushing out to 0.9 -- there's no patch for this yet and it's 0.8 release time

