Posted to dev@tika.apache.org by "peter royal (Created) (JIRA)" <ji...@apache.org> on 2011/12/20 19:44:30 UTC

[jira] [Created] (TIKA-822) MediaType fails to parse charset that has quoted value

MediaType fails to parse charset that has quoted value
------------------------------------------------------

                 Key: TIKA-822
                 URL: https://issues.apache.org/jira/browse/TIKA-822
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.0
            Reporter: peter royal


If a MIME type is

text/html; charset="UTF-8"

the parsed charset value is incorrectly "UTF-8" (quotes included) rather than UTF-8.

A patch is available at https://github.com/osi/tika/commit/b77814874ebff8f412ebb2f2adc52c6465d603c4

I have a CLA on file.
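
For illustration only, here is a minimal sketch of the behaviour the report asks for. It is not the linked patch and not Tika's MediaType code: the class CharsetUnquoteSketch and the helpers unquote and charsetOf are hypothetical names, and the naive split on ';' and '=' is an assumption that only holds for the simple case described above.

```java
public class CharsetUnquoteSketch {

    // Hypothetical helper, not Tika's actual code: strips one pair of
    // surrounding double quotes from a parameter value, if present.
    public static String unquote(String value) {
        String v = value.trim();
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
            return v.substring(1, v.length() - 1);
        }
        return v;
    }

    // Returns the charset parameter of a Content-Type-style string,
    // or null if there is none. The split on ';' and '=' is naive and
    // does not cope with those characters inside a quoted string.
    public static String charsetOf(String mediaType) {
        String[] parts = mediaType.split(";");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue; // not a name=value pair
            }
            String name = parts[i].substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                return unquote(parts[i].substring(eq + 1));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=\"UTF-8\"")); // prints UTF-8
        System.out.println(charsetOf("text/html; charset=UTF-8"));     // prints UTF-8
    }
}
```

Both the quoted and the unquoted form yield the same charset value in this sketch.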

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173763#comment-13173763 ] 

Nick Burch commented on TIKA-822:
---------------------------------

Should we handle single quotes too? I don't think they're valid for HTTP, but they could potentially crop up in other situations.
                

        

[jira] [Commented] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "peter royal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173833#comment-13173833 ] 

peter royal commented on TIKA-822:
----------------------------------

Thanks!

                

        

[jira] [Resolved] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-822.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
    

        

[jira] [Commented] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173796#comment-13173796 ] 

Nick Burch commented on TIKA-822:
---------------------------------

OK, thanks for the info and the patch. I've added it, along with single quote support and a note about the outstanding issues for quoted strings, in r1221581.
                

        

[jira] [Commented] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "peter royal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173768#comment-13173768 ] 

peter royal commented on TIKA-822:
----------------------------------

The RFC for MIME isn't clear on whether single quotes make a valid quoted string. Overall, the parser needs a bit more work to be fully RFC-compliant (quoted strings can contain equals signs, for instance).

I was just trying to fix the simple case I came across. The JavaMail API generates quoted charset fields for text attachments, which is how I found this.
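
As a rough illustration of the quoted-string point above, the following sketch scans the parameters character by character and treats ';' and '=' as separators only when they occur outside a double-quoted string. It is an assumption-laden example, not Tika's parser and not the committed fix: QuotedParameterSketch, parseParameters and store are hypothetical names, single quotes are not handled, and the final trim also removes leading or trailing whitespace that was inside the quotes.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class QuotedParameterSketch {

    // Hypothetical sketch: collects name=value parameters that follow the
    // first ';'. Separators inside double quotes are kept as part of the
    // value, and a backslash inside quotes escapes the next character.
    public static Map<String, String> parseParameters(String mediaType) {
        Map<String, String> params = new LinkedHashMap<>();
        int semi = mediaType.indexOf(';');
        if (semi < 0) {
            return params; // no parameters at all
        }

        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        boolean inValue = false;  // true after the '=' of the current parameter
        boolean inQuotes = false; // true while inside a double-quoted value

        for (int i = semi + 1; i < mediaType.length(); i++) {
            char c = mediaType.charAt(i);
            if (inQuotes) {
                if (c == '\\' && i + 1 < mediaType.length()) {
                    value.append(mediaType.charAt(++i)); // escaped character
                } else if (c == '"') {
                    inQuotes = false; // closing quote
                } else {
                    value.append(c);  // includes '=' and ';'
                }
            } else if (c == '"' && inValue) {
                inQuotes = true;      // opening quote
            } else if (c == '=' && !inValue) {
                inValue = true;       // end of name, start of value
            } else if (c == ';') {
                store(params, name, value);
                inValue = false;
            } else if (inValue) {
                value.append(c);
            } else {
                name.append(c);
            }
        }
        store(params, name, value); // last parameter has no trailing ';'
        return params;
    }

    // Saves the current parameter (if it has a name) and resets the buffers.
    private static void store(Map<String, String> params,
                              StringBuilder name, StringBuilder value) {
        String key = name.toString().trim().toLowerCase(Locale.ROOT);
        if (!key.isEmpty()) {
            params.put(key, value.toString().trim());
        }
        name.setLength(0);
        value.setLength(0);
    }
}
```

With this approach an input such as text/plain; name="a=b;c.txt"; charset="UTF-8" produces two parameters, name and charset, rather than being cut at the embedded separators.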
                

        

[jira] [Commented] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "peter royal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173769#comment-13173769 ] 

peter royal commented on TIKA-822:
----------------------------------

The RFC for MIME isn't clear on whether single quotes make a valid quoted string. Overall, the parser needs a bit more work to be fully RFC-compliant (quoted strings can contain equals signs, for instance).

I was just trying to fix the simple case I came across. The JavaMail API generates quoted charset fields for text attachments, which is how I found this.
                

        

[jira] [Updated] (TIKA-822) MediaType fails to parse charset that has quoted value

Posted by "peter royal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peter royal updated TIKA-822:
-----------------------------

    Comment: was deleted

(was: The RFC for MIME isn't clear on whether single quotes make a valid quoted string. Overall, the parser needs a bit more work to be fully RFC-compliant (quoted strings can contain equals signs, for instance).

I was just trying to fix the simple case I came across. The JavaMail API generates quoted charset fields for text attachments, which is how I found this.)
    