Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/01/06 23:56:54 UTC

[jira] Created: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

Calls to Charset.isSupported() will throw exceptions for invalid charset names
------------------------------------------------------------------------------

                 Key: TIKA-359
                 URL: https://issues.apache.org/jira/browse/TIKA-359
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.5
            Reporter: Ken Krugler
            Assignee: Ken Krugler
             Fix For: 0.6


The HtmlParser and TXTParser code currently calls Charset.isSupported() to determine whether charset hint info (from meta tags or incoming metadata) names a supported charset.

But this method throws IllegalCharsetNameException for illegal (as opposed to merely unsupported) encoding names, which kills the parsing process.

What's needed is a wrapper that catches this exception and returns false.
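
A minimal sketch of such a wrapper, for illustration only. The class name SafeCharsetCheck is hypothetical and this is not the eventual Tika patch; it relies only on the standard java.nio.charset API and additionally treats a null name as unsupported (an assumption, since Charset.isSupported() rejects null with IllegalArgumentException).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class; the name is an assumption, not part of Tika.
final class SafeCharsetCheck {

    private SafeCharsetCheck() {
        // static utility, no instances
    }

    /**
     * Returns true only if the name is a legal charset name AND the
     * charset is available in this JVM. Never throws for bad input.
     */
    static boolean isSupported(String charsetName) {
        if (charsetName == null) {
            // Charset.isSupported(null) would throw IllegalArgumentException
            return false;
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // Name contains characters that are not legal in a charset
            // name (quotes, spaces, ...): treat it as unsupported.
            return false;
        }
    }
}
```

With this in place, a call such as SafeCharsetCheck.isSupported("\"utf-8\"") returns false instead of aborting the parse.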

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798090#action_12798090 ] 

Ken Krugler commented on TIKA-359:
----------------------------------

Given the junk that can be found inside of meta http-equiv tags for HTML documents, what's needed is a routine that tries to clean up the charset (removing junk like quotes), expands the set of aliases to handle common types (like cp-1252 vs. cp1252), and then returns null or a valid/normalized/supported charset name.

I've got the first cut of something like this in Bixo, which I'll turn into a utility routine/patch for Tika.
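
A rough sketch of the kind of clean-up routine described above, not the Bixo code itself. The class name CharsetNameCleaner, the method name clean, and the contents of the alias table are all assumptions made for illustration; only the java.nio.charset calls are real API.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; names and alias entries are illustrative only.
final class CharsetNameCleaner {

    // Extra aliases the JVM does not know about (assumed examples).
    private static final Map<String, String> EXTRA_ALIASES = new HashMap<String, String>();
    static {
        EXTRA_ALIASES.put("cp-1252", "cp1252");
    }

    private CharsetNameCleaner() {
        // static utility, no instances
    }

    /**
     * Returns the canonical name of a supported charset, or null if the
     * raw hint cannot be turned into one.
     */
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        // Strip whitespace, leading quotes, and trailing quotes/semicolons.
        String name = raw.trim().replaceAll("^[\"']+|[\"';]+$", "").trim();
        if (name.isEmpty()) {
            return null;
        }
        // Map known-bad spellings onto names the JVM understands.
        String alias = EXTRA_ALIASES.get(name.toLowerCase(Locale.ROOT));
        if (alias != null) {
            name = alias;
        }
        try {
            if (Charset.isSupported(name)) {
                // forName().name() yields the JVM's canonical spelling.
                return Charset.forName(name).name();
            }
        } catch (IllegalCharsetNameException e) {
            // Still illegal after clean-up: fall through and return null.
        }
        return null;
    }
}
```

Callers would then only need a null check on the result before handing the name to a decoder.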

[jira] Updated: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-359:
-----------------------------------

    Affects Version/s: 0.6
        Fix Version/s:     (was: 0.6)
                       0.7

- if there are no objections, I'd like to push this to 0.7 since I (for really real this time) am cutting an RC of Tika 0.6 tonight...

[jira] Commented: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852075#action_12852075 ] 

Ken Krugler commented on TIKA-359:
----------------------------------

Hi Chris,

Sorry for the delay - yes, go ahead and defer this to 0.8.

Thanks,

-- Ken

[jira] Updated: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-359:
-----------------------------------

    Affects Version/s: 0.7
        Fix Version/s:     (was: 0.7)
                       0.8

Hey Ken, are you OK with this going into 0.8? I'm going to try and cut an 0.7 RC within the next few hours. Let me know. I saw that you mentioned having something in Bixo for this but haven't seen a patch yet so thought it might be OK to turn this into a 0.8 issue.

Thanks!

Cheers,
Chris
