Posted to dev@tika.apache.org by "Piotr B. (JIRA)" <ji...@apache.org> on 2009/09/07 09:18:57 UTC

[jira] Created: (TIKA-274) CharsetDetector.setDeclaredEncoding has no effect

CharsetDetector.setDeclaredEncoding has no effect
-------------------------------------------------

                 Key: TIKA-274
                 URL: https://issues.apache.org/jira/browse/TIKA-274
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
            Reporter: Piotr B.


TXTParser.java contains the following code:

        // Use the declared character encoding, if available
        String encoding = metadata.get(Metadata.CONTENT_ENCODING);
        if (encoding != null) {
            detector.setDeclaredEncoding(encoding);
        }

But the feature does not seem to be implemented: calling setDeclaredEncoding has no effect.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-274) CharsetDetector.setDeclaredEncoding has no effect

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-274.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Hmm, good point. It looks like the feature was never implemented in the ICU4J code that we're using.

I modified the TXTParser code in revision 813624 so that the given encoding is now always used as the default when automatic encoding detection fails.
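
The fallback described above can be sketched roughly as follows. This is not the code from revision 813624; the class EncodingFallback, its methods, and the final UTF-8 default are hypothetical names and choices made for illustration. The detector is passed in as a plain function so the sketch does not depend on ICU4J.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of Tika or ICU4J. */
final class EncodingFallback {

    /** Resolves a charset name, or returns empty if it is null, malformed, or unsupported. */
    static Optional<Charset> lookup(String name) {
        if (name == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(name.trim()));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return Optional.empty();
        }
    }

    /**
     * Picks the charset used to decode the data: the detected one if detection
     * yields a usable name, otherwise the declared one, otherwise UTF-8.
     * The UTF-8 last resort is an assumption of this sketch.
     */
    static Charset choose(byte[] data, String declared, Function<byte[], String> detector) {
        Optional<Charset> detected = lookup(detector.apply(data));
        if (detected.isPresent()) {
            return detected.get();
        }
        return lookup(declared).orElse(StandardCharsets.UTF_8);
    }
}
```

In a parser, the declared value would come from metadata.get(Metadata.CONTENT_ENCODING), and the detector function would wrap whatever detection library is in use, returning null when it finds no match.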

This behavior could be further improved by making the encoding hint affect the detection code itself, for example when choosing between the highly similar ISO-8859-X character sets. Please file a new improvement issue if you have a concrete use case where this would be beneficial.
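
One possible shape for such an improvement, sketched under stated assumptions: the detector is assumed to produce a list of candidates with integer confidence scores, and the declared encoding wins whenever it is among the candidates and trails the best one by no more than a chosen margin. The class HintedSelection, the Candidate type, and the margin parameter are all hypothetical; nothing like this exists in the code discussed here.

```java
import java.util.List;

/** Hypothetical illustration of a hint-aware choice; not Tika or ICU4J code. */
final class HintedSelection {

    /** A detection candidate: a charset name plus a confidence score. */
    static final class Candidate {
        final String charset;
        final int confidence;

        Candidate(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the declared charset if it is a candidate whose confidence is
     * within the margin of the best candidate; otherwise the best candidate.
     * With no candidates at all, the declared value (possibly null) is returned.
     */
    static String pick(List<Candidate> candidates, String declared, int margin) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.confidence > best.confidence) {
                best = c;
            }
        }
        if (best == null) {
            return declared;
        }
        if (declared != null) {
            for (Candidate c : candidates) {
                boolean matchesHint = c.charset.equalsIgnoreCase(declared);
                if (matchesHint && best.confidence - c.confidence <= margin) {
                    return c.charset;
                }
            }
        }
        return best.charset;
    }
}
```

A small margin keeps the hint from overriding a clearly better match while still letting it break near-ties between close relatives such as ISO-8859-1 and ISO-8859-2.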
