Posted to dev@tika.apache.org by "Daniel Bonniot de Ruisselet (Created) (JIRA)" <ji...@apache.org> on 2012/02/24 15:44:50 UTC

[jira] [Created] (TIKA-868) TXT parser does not honour the specified encoding

TXT parser does not honour the specified encoding
-------------------------------------------------

                 Key: TIKA-868
                 URL: https://issues.apache.org/jira/browse/TIKA-868
             Project: Tika
          Issue Type: Bug
            Reporter: Daniel Bonniot de Ruisselet
             Fix For: 1.1


With input text "Indanyl", the encoding is recognized as IBM500, even when "UTF-8" is specified explicitly.

I would argue that detection should only be used when the declared information is incorrect (saving time and avoiding wrong detection), as proposed by Ken Krugler in TIKA-539.
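The policy proposed above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not Tika's implementation: the class `DeclaredCharsetFirst`, the methods `decodesCleanly` and `chooseCharset`, and the `detector` function are all hypothetical names, and "incorrect" is interpreted here as "the bytes do not decode cleanly under the declared charset".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch: trust the declared charset unless the bytes contradict it.
public class DeclaredCharsetFirst {

    /** Returns true if the bytes decode under the charset with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Uses the declared charset when it is present and consistent with the bytes;
     * only otherwise falls back to the (assumed, caller-supplied) detector.
     */
    public static Charset chooseCharset(byte[] data, Charset declared,
                                        Function<byte[], Charset> detector) {
        if (declared != null && decodesCleanly(data, declared)) {
            return declared;
        }
        return detector.apply(data);
    }

    public static void main(String[] args) {
        // Stand-in for a real statistical detector; always answers ISO-8859-1.
        Function<byte[], Charset> detector = bytes -> StandardCharsets.ISO_8859_1;

        // Pure ASCII input is valid UTF-8, so the declared charset is kept.
        byte[] ascii = "Indanyl".getBytes(StandardCharsets.US_ASCII);
        System.out.println(chooseCharset(ascii, StandardCharsets.UTF_8, detector));

        // A lone 0xE9 byte is malformed UTF-8, so the detector is consulted.
        byte[] notUtf8 = {(byte) 0xE9};
        System.out.println(chooseCharset(notUtf8, StandardCharsets.UTF_8, detector));
    }
}
```

With this ordering, a short ASCII input such as "Indanyl" never reaches the detector when UTF-8 is declared, which avoids both the detection cost and the wrong guess described in this report.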

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-868:
-----------------------------------

    Component/s: parser
    


[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-868:
-----------------------------------

    Fix Version/s:     (was: 1.1)
                   1.2

- push out to 1.2
                


[jira] [Closed] (TIKA-868) TXT parser does not honour the specified encoding

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-868.
----------------------------

    Resolution: Duplicate
      Assignee: Ken Krugler
    


[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-868:
-----------------------------------

    Fix Version/s:     (was: 1.2)
                   1.3

- push to 1.3
                


[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-868:
-----------------------------------


- push to 1.3
                


[jira] [Commented] (TIKA-868) TXT parser does not honour the specified encoding

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433282#comment-13433282 ] 

Ken Krugler commented on TIKA-868:
----------------------------------

Hi Daniel - using the latest Tika (trunk), I get back UTF-8 as the encoding if I pass in UTF-8 as the charset in the content type, via metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8"). If I don't set CONTENT_TYPE, I get back ISO-8859-1, which also seems like the right thing.
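For readers unfamiliar with how a charset is carried inside a content type string such as the one above, here is a minimal plain-Java sketch of pulling the `charset` parameter out of it. The class `ContentTypeCharset` and the method `declaredCharset` are hypothetical names for illustration; this is not the code Tika uses, and it handles only the simple `type/subtype; name=value` shape.

```java
import java.nio.charset.Charset;
import java.util.Optional;

// Hypothetical helper: extract the charset parameter from a content type string.
public class ContentTypeCharset {

    /** Returns the declared charset, or empty if none is present or the name is unknown. */
    public static Optional<Charset> declaredCharset(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names.
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(declaredCharset("text/plain; charset=UTF-8").isPresent());
        System.out.println(declaredCharset("text/plain").isPresent());
    }
}
```

An empty result corresponds to the second case in the comment above, where no charset is declared and the parser has to fall back to detection.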

                
